Solutions: Database System Concepts, 7th ed.


CHAPTER 1
Introduction

Practice Exercises


1.1 This chapter has described several major advantages of a database system. What
are two disadvantages?
Answer:
Two disadvantages associated with database systems are listed below.

a. Setup of the database system requires more knowledge, money, skills, and
time.
b. The complexity of the database may result in poor performance.

1.2 List five ways in which the type declaration system of a language such as Java
or C++ differs from the data definition language used in a database.
Answer:
a. Executing an action in the DDL results in the creation of an object in the
database; in contrast, a programming language type declaration is simply
an abstraction used in the program.
b. Database DDLs allow consistency constraints to be specified, which
programming language type systems generally do not allow. These include
domain constraints and referential integrity constraints.
c. Database DDLs support authorization, giving different access rights to
different users. Programming language type systems do not provide such
protection (at best, they protect attributes in a class from being accessed
by methods in another class).
d. Programming language type systems are usually much richer than the SQL
type system. Most databases support only basic types such as different
types of numbers and strings, although some databases do support some
complex types such as arrays and objects.
e. A database DDL is focused on specifying types of attributes of relations;
in contrast, a programming language allows objects and collections of
objects to be created.
1.3 List six major steps that you would take in setting up a database for a particular
enterprise.
Answer:
Six major steps in setting up a database for a particular enterprise are:
• Define the high-level requirements of the enterprise (this step generates a
document known as the system requirements specification).
• Define a model containing all appropriate types of data and data
relationships.
• Define the integrity constraints on the data.
• Define the physical level.
• For each known problem to be solved on a regular basis (e.g., tasks to be
carried out by clerks or web users), define a user interface to carry out the
task, and write the necessary application programs to implement the user
interface.
• Create/initialize the database.
1.4 Suppose you want to build a video site similar to YouTube. Consider each of the
points listed in Section 1.2 as disadvantages of keeping data in a file-processing
system. Discuss the relevance of each of these points to the storage of actual
video data, and to metadata about the video, such as title, the user who uploaded
it, tags, and which users viewed it.
Answer:
• Data redundancy and inconsistency. This would be relevant to metadata to
some extent, although not to the actual video data, which are not updated.
There are very few relationships here, and none of them can lead to
redundancy.
• Difficulty in accessing data. If video data are only accessed through a few
predefined interfaces, as is done in video sharing sites today, this will not
be a problem. However, if an organization needs to find video data based
on specific search conditions (beyond simple keyword queries), and metadata
were stored in files, it would be hard to find relevant data without writing
application programs. Using a database would be important for the task of
finding data.
• Data isolation. Since data are not usually updated, but instead newly
created, data isolation is not a major issue. Even the task of keeping track of
who has viewed what videos is (conceptually) append only, again making
isolation not a major issue. However, if authorization is added, there may
be some issues of concurrent updates to authorization information.
• Integrity problems. It seems unlikely there are significant integrity
constraints in this application, except for primary keys. If the data are
distributed, there may be issues in enforcing primary key constraints. Integrity
problems are probably not a major issue.
• Atomicity problems. When a video is uploaded, metadata about the video
and the video should be added atomically; otherwise there would be an
inconsistency in the data. An underlying recovery mechanism would be
required to ensure atomicity in the event of failures.
• Concurrent-access anomalies. Since data are not updated, concurrent-access
anomalies would be unlikely to occur.
• Security problems. These would be an issue if the system supported
authorization.
1.5 Keyword queries used in web search are quite different from database queries.
List key differences between the two, in terms of the way the queries are specified
and in terms of what is the result of a query.
Answer:
Queries used in the web are specified by providing a list of keywords with no
specific syntax. The result is typically an ordered list of URLs, along with snippets
of information about the content of the URLs. In contrast, database queries
have a specific syntax allowing complex queries to be specified. In the
relational world, the result of a query is always a table.
CHAPTER 2
Introduction to the Relational Model

Practice Exercises


2.1 Consider the employee database of Figure 2.17. What are the appropriate
primary keys?
Answer:
The appropriate primary keys are shown below:

employee(person_name, street, city), primary key: person_name
works(person_name, company_name, salary), primary key: person_name
company(company_name, city), primary key: company_name

employee (ID, person_name, street, city)
works (ID, company_name, salary)
company (company_name, city)

Figure 2.17 Employee database.

2.2 Consider the foreign-key constraint from the dept_name attribute of instructor to
the department relation. Give examples of inserts and deletes to these relations
that can cause a violation of the foreign-key constraint.
Answer:

• Inserting a tuple:
(10111, Ostrom, Economics, 110000)
into the instructor table, where the department table does not have the
department Economics, would violate the foreign-key constraint.
• Deleting the tuple:
(Biology, Watson, 90000)
from the department table, where at least one student or instructor tuple
has dept_name as Biology, would violate the foreign-key constraint.
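The two violations described in the answer to Exercise 2.2 can be reproduced on a small in-memory database. The following is a minimal sketch using Python's sqlite3 module, not part of the book's solution: the table definitions are cut down to the columns that matter, the instructor row for Crick is illustrative data, and the helper name violates is hypothetical. SQLite only enforces foreign keys after the pragma shown is issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "create table department (dept_name text primary key, building text, budget numeric)"
)
conn.execute(
    """create table instructor (
           ID text primary key,
           name text,
           dept_name text references department(dept_name),
           salary numeric)"""
)
conn.execute("insert into department values ('Biology', 'Watson', 90000)")
# Illustrative row: one instructor who belongs to the Biology department.
conn.execute("insert into instructor values ('76766', 'Crick', 'Biology', 72000)")


def violates(statement):
    """Return True if running the statement raises an integrity error."""
    try:
        conn.execute(statement)
        return False
    except sqlite3.IntegrityError:
        return True


# No Economics department exists, so this insert is rejected.
insert_violation = violates(
    "insert into instructor values ('10111', 'Ostrom', 'Economics', 110000)"
)
# An instructor still references Biology, so this delete is rejected.
delete_violation = violates("delete from department where dept_name = 'Biology'")
```

Both flags end up True; with the pragma left off, SQLite would accept both statements silently.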
2.3 Consider the time_slot relation. Given that a particular time slot can meet more
than once in a week, explain why day and start_time are part of the primary key
of this relation, while end_time is not.
Answer:
The attributes day and start_time are part of the primary key since a particular
class will most likely meet on several different days and may even meet more
than once in a day. However, end_time is not part of the primary key since a
particular class that starts at a particular time on a particular day cannot end at
more than one time.
2.4 In the instance of instructor shown in Figure 2.1, no two instructors have the
same name. From this, can we conclude that name can be used as a superkey
(or primary key) of instructor?
Answer:

No. For this possible instance of the instructor table the names are unique, but
in general this may not always be the case (unless the university has a rule that
two instructors cannot have the same name, which is a rather unlikely scenario).
2.5 What is the result of first performing the Cartesian product of student and
advisor, and then performing a selection operation on the result with the predicate
s_id = ID? (Using the symbolic notation of relational algebra, this query can be
written as $\sigma_{s\_id = ID}(student \times advisor)$.)

Answer:

The result attributes include all attributes of student followed by all
attributes of advisor. The tuples in the result are as follows: for each student who
has an advisor, the result has a row containing that student's attributes, followed
by an s_id attribute identical to the student's ID attribute, followed by the i_id
attribute containing the ID of the student's advisor.
Students who do not have an advisor will not appear in the result. A student
who has more than one advisor will appear a corresponding number of times
in the result.
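The behaviour described in the answer to Exercise 2.5 can be checked with a few lines of Python that treat relations as lists of dictionaries. This is only a sketch: the three students and three advisor tuples are made-up sample data, chosen so that one student has no advisor and one has two.

```python
from itertools import product

# Hypothetical sample instances (not the book's full data set).
student = [
    {"ID": "00128", "name": "Zhang"},
    {"ID": "12345", "name": "Shankar"},
    {"ID": "70557", "name": "Snow"},      # has no advisor
]
advisor = [
    {"s_id": "00128", "i_id": "45565"},
    {"s_id": "12345", "i_id": "10101"},
    {"s_id": "12345", "i_id": "45565"},   # second advisor for Shankar
]

# Cartesian product: every student tuple paired with every advisor tuple.
cartesian = [{**s, **a} for s, a in product(student, advisor)]

# Selection with the predicate s_id = ID.
result = [t for t in cartesian if t["s_id"] == t["ID"]]
```

The product has 9 tuples, of which 3 survive the selection: one for Zhang, two for Shankar, and none for Snow.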
2.6 Consider the employee database of Figure 2.17. Give an expression in the
relational algebra to express each of the following queries:
a. Find the name of each employee who lives in city “Miami”.
b. Find the name of each employee whose salary is greater than $100000.
c. Find the name of each employee who lives in “Miami” and whose salary
is greater than $100000.

Answer:

a. $\Pi_{person\_name}(\sigma_{city = \text{``Miami''}}(employee))$
b. $\Pi_{person\_name}(\sigma_{salary > 100000}(employee \bowtie works))$
c. $\Pi_{person\_name}(\sigma_{city = \text{``Miami''} \wedge salary > 100000}(employee \bowtie works))$

branch (branch_name, branch_city, assets)
customer (ID, customer_name, customer_street, customer_city)
loan (loan_number, branch_name, amount)
borrower (ID, loan_number)
account (account_number, branch_name, balance)
depositor (ID, account_number)

Figure 2.18 Bank database.
2.7 Consider the bank database of Figure 2.18. Give an expression in the relational
algebra for each of the following queries:
a. Find the name of each branch located in “Chicago”.
b. Find the ID of each borrower who has a loan in branch “Downtown”.

Answer:

a. $\Pi_{branch\_name}(\sigma_{branch\_city = \text{``Chicago''}}(branch))$
b. $\Pi_{ID}(\sigma_{branch\_name = \text{``Downtown''}}(borrower \bowtie_{borrower.loan\_number = loan.loan\_number} loan))$
2.8 Consider the employee database of Figure 2.17. Give an expression in the
relational algebra to express each of the following queries:
a. Find the ID and name of each employee who does not work for “BigBank”.
b. Find the ID and name of each employee who earns at least as much as
every employee in the database.

Answer:

a. To find employees who do not work for BigBank, we first find all those
who do work for BigBank. Those are exactly the employees not part of the
desired result. We then use set difference to find the set of all employees
minus those employees that should not be in the result.

$\Pi_{ID, person\_name}(employee) - \Pi_{ID, person\_name}(employee \bowtie_{employee.ID = works.ID} (\sigma_{company\_name = \text{``BigBank''}}(works)))$

b. We use the same approach as in part a by first finding those employees
who do not earn the highest salary, or, said differently, for whom some
other employee earns more. Since this involves comparing two employee
salary values, we need to reference the employee relation twice and
therefore use renaming.

$\Pi_{ID, person\_name}(employee) - \Pi_{A.ID, A.person\_name}(\rho_A(employee) \bowtie_{A.salary < B.salary} \rho_B(employee))$

Here employee stands for $employee \bowtie works$, since salary is an attribute
of works in Figure 2.17.

2.9 The division operator of relational algebra, “$\div$”, is defined as follows. Let r(R)
and s(S) be relations, and let $S \subseteq R$; that is, every attribute of schema S is
also in schema R. Given a tuple t, let t[S] denote the projection of tuple t on
the attributes in S. Then $r \div s$ is a relation on schema $R - S$ (that is, on the
schema containing all attributes of schema R that are not in schema S). A tuple
t is in $r \div s$ if and only if both of two conditions hold:

• t is in $\Pi_{R-S}(r)$
• For every tuple $t_s$ in s, there is a tuple $t_r$ in r satisfying both of the following:
a. $t_r[S] = t_s[S]$
b. $t_r[R - S] = t$

Given the above definition:

a. Write a relational algebra expression using the division operator to find
the IDs of all students who have taken all Comp. Sci. courses. (Hint:
project takes to just ID and course_id, and generate the set of all Comp.
Sci. course_ids using a select expression, before doing the division.)
b. Show how to write the above query in relational algebra, without using
division. (By doing so, you would have shown how to define the division
operation using the other relational algebra operations.)

Answer:

a. $\Pi_{ID}(\Pi_{ID, course\_id}(takes) \div \Pi_{course\_id}(\sigma_{dept\_name = \text{'Comp. Sci.'}}(course)))$
b. The required expression is as follows:

$r \leftarrow \Pi_{ID, course\_id}(takes)$
$s \leftarrow \Pi_{course\_id}(\sigma_{dept\_name = \text{'Comp. Sci.'}}(course))$
$\Pi_{ID}(takes) - \Pi_{ID}((\Pi_{ID}(takes) \times s) - r)$

In general, let r(R) and s(S) be given, with $S \subseteq R$. Then we can express
the division operation using basic relational algebra operations as follows:

$r \div s = \Pi_{R-S}(r) - \Pi_{R-S}((\Pi_{R-S}(r) \times s) - \Pi_{R-S,S}(r))$

To see that this expression is true, we observe that $\Pi_{R-S}(r)$ gives us all
tuples t that satisfy the first condition of the definition of division. The
expression on the right side of the set difference operator,

$\Pi_{R-S}((\Pi_{R-S}(r) \times s) - \Pi_{R-S,S}(r))$,

serves to eliminate those tuples that fail to satisfy the second condition of
the definition of division. Let us see how it does so. Consider $\Pi_{R-S}(r) \times s$.
This relation is on schema R, and pairs every tuple in $\Pi_{R-S}(r)$ with every
tuple in s. The expression $\Pi_{R-S,S}(r)$ merely reorders the attributes of r.
Thus, $(\Pi_{R-S}(r) \times s) - \Pi_{R-S,S}(r)$ gives us those pairs of tuples from
$\Pi_{R-S}(r)$ and s that do not appear in r. If a tuple $t_j$ is in

$\Pi_{R-S}((\Pi_{R-S}(r) \times s) - \Pi_{R-S,S}(r))$,

then there is some tuple $t_s$ in s that does not combine with tuple $t_j$ to form
a tuple in r. Thus, $t_j$ holds a value for attributes $R - S$ that does not appear
in $r \div s$. It is these values that we eliminate from $\Pi_{R-S}(r)$.
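The construction of division from projection, Cartesian product, and set difference can be traced step by step with Python sets. This is a sketch for the special case used in the exercise, where r holds (ID, course_id) pairs and s holds course_id values; the function name divide and the sample data are assumptions made for illustration.

```python
def divide(r, s):
    """Compute r ÷ s for r a set of (id, course) pairs and s a set of courses,
    using only projection, Cartesian product, and set difference."""
    candidates = {x for (x, _) in r}                      # Π_{R−S}(r)
    all_pairs = {(x, y) for x in candidates for y in s}   # Π_{R−S}(r) × s
    missing = all_pairs - r                               # pairs that do not appear in r
    disqualified = {x for (x, _) in missing}              # Π_{R−S} of the missing pairs
    return candidates - disqualified


# Hypothetical Π_{ID, course_id}(takes): student B lacks CS-190,
# student C has also taken a course outside Comp. Sci.
takes_pairs = {
    ("A", "CS-101"), ("A", "CS-190"),
    ("B", "CS-101"),
    ("C", "CS-101"), ("C", "CS-190"), ("C", "BIO-101"),
}
# Hypothetical set of all Comp. Sci. course_ids.
comp_sci_courses = {"CS-101", "CS-190"}

result = divide(takes_pairs, comp_sci_courses)
```

Here result is {"A", "C"}: student B is disqualified by the missing pair ("B", "CS-190"). Dividing by an empty s disqualifies nobody, so every candidate is returned.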
CHAPTER 3
Introduction to SQL

Practice Exercises

3.1 Write the following queries in SQL, using the university schema. (We suggest
you actually run these queries on a database, using the sample data that we
provide on the web site of the book, db-book.com. Instructions for setting up
a database, and loading sample data, are provided on the above web site.)

a. Find the titles of courses in the Comp. Sci. department that have 3 credits.
b. Find the IDs of all students who were taught by an instructor named
Einstein; make sure there are no duplicates in the result.
c. Find the highest salary of any instructor.
d. Find all instructors earning the highest salary (there may be more than
one with the same salary).
e. Find the enrollment of each section that was offered in Fall 2017.
f. Find the maximum enrollment, across all sections, in Fall 2017.
g. Find the sections that had the maximum enrollment in Fall 2017.

Answer:

a. Find the titles of courses in the Comp. Sci. department that have 3 credits.

    select title
    from course
    where dept_name = 'Comp. Sci.' and credits = 3

b. Find the IDs of all students who were taught by an instructor named
Einstein; make sure there are no duplicates in the result.
This query can be answered in several different ways. One way is as
follows.

    select distinct takes.ID
    from takes, instructor, teaches
    where takes.course_id = teaches.course_id and
          takes.sec_id = teaches.sec_id and
          takes.semester = teaches.semester and
          takes.year = teaches.year and
          teaches.ID = instructor.ID and
          instructor.name = 'Einstein'

c. Find the highest salary of any instructor.

    select max(salary)
    from instructor

d. Find all instructors earning the highest salary (there may be more than
one with the same salary).

    select ID, name
    from instructor
    where salary = (select max(salary) from instructor)

e. Find the enrollment of each section that was offered in Fall 2017.

    select course_id, sec_id,
           (select count(ID)
            from takes
            where takes.year = section.year
              and takes.semester = section.semester
              and takes.course_id = section.course_id
              and takes.sec_id = section.sec_id)
           as enrollment
    from section
    where semester = 'Fall'
      and year = 2017

Note that if the result of the subquery is empty, the aggregate function
count returns a value of 0.
One way of writing the query might appear to be:

    select takes.course_id, takes.sec_id, count(ID)
    from section, takes
    where takes.course_id = section.course_id
      and takes.sec_id = section.sec_id
      and takes.semester = section.semester
      and takes.year = section.year
      and takes.semester = 'Fall'
      and takes.year = 2017
    group by takes.course_id, takes.sec_id

But note that if a section does not have any students taking it, it would
not appear in the result. One way of ensuring such a section appears with
a count of 0 is to use the outer join operation, covered in Chapter 4.
f. Find the maximum enrollment, across all sections, in Fall 2017.
One way of writing this query is as follows:

    select max(enrollment)
    from (select count(ID) as enrollment
          from section, takes
          where takes.year = section.year
            and takes.semester = section.semester
            and takes.course_id = section.course_id
            and takes.sec_id = section.sec_id
            and takes.semester = 'Fall'
            and takes.year = 2017
          group by takes.course_id, takes.sec_id)

As an alternative to using a nested subquery in the from clause, it is
possible to use a with clause, as illustrated in the answer to the next part of
this question.
A subtle issue in the above query is that if no section had any
enrollment, the answer would be empty, not 0. We can use the alternative using
a subquery, from the previous part of this question, to ensure the count is
0 in this case.
g. Find the sections that had the maximum enrollment in Fall 2017.
The following answer uses a with clause, simplifying the query.

    with sec_enrollment as (
        select takes.course_id, takes.sec_id, count(ID) as enrollment
        from section, takes
        where takes.year = section.year
          and takes.semester = section.semester
          and takes.course_id = section.course_id
          and takes.sec_id = section.sec_id
          and takes.semester = 'Fall'
          and takes.year = 2017
        group by takes.course_id, takes.sec_id)
    select course_id, sec_id
    from sec_enrollment
    where enrollment = (select max(enrollment) from sec_enrollment)

It is also possible to write the query without the with clause, but the
subquery to find enrollment would then appear twice in the query.
While it is not incorrect to add distinct in the count, it is not necessary in light
of the primary key constraint on takes.
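The difference between the two formulations in part e of Exercise 3.1 shows up as soon as a section has no students. The sketch below runs both queries with Python's sqlite3 module on a deliberately tiny instance; the tables keep only the columns the queries touch, and the rows are made-up sample data in which CS-347 has no enrollment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table section (course_id text, sec_id text, semester text, year integer);
create table takes (ID text, course_id text, sec_id text, semester text,
                    year integer, grade text);
insert into section values ('CS-101', '1', 'Fall', 2017),
                           ('CS-347', '1', 'Fall', 2017);
insert into takes values ('00128', 'CS-101', '1', 'Fall', 2017, 'A'),
                         ('12345', 'CS-101', '1', 'Fall', 2017, 'C');
""")

# Scalar-subquery version: every Fall 2017 section appears, empty ones with 0.
scalar_subquery = conn.execute("""
    select course_id, sec_id,
           (select count(ID)
            from takes
            where takes.year = section.year
              and takes.semester = section.semester
              and takes.course_id = section.course_id
              and takes.sec_id = section.sec_id) as enrollment
    from section
    where semester = 'Fall' and year = 2017
    order by course_id
""").fetchall()

# Join-and-group version: sections without students drop out of the join.
join_group_by = conn.execute("""
    select takes.course_id, takes.sec_id, count(ID)
    from section, takes
    where takes.course_id = section.course_id
      and takes.sec_id = section.sec_id
      and takes.semester = section.semester
      and takes.year = section.year
      and takes.semester = 'Fall'
      and takes.year = 2017
    group by takes.course_id, takes.sec_id
    order by takes.course_id
""").fetchall()
```

On this instance the first query lists CS-347 with an enrollment of 0, while the second returns only the CS-101 row.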

3.2 Suppose you are given a relation grade_points(grade, points) that provides a
conversion from letter grades in the takes relation to numeric scores; for example,
an “A” grade could be specified to correspond to 4 points, an “A-” to 3.7 points,
a “B+” to 3.3 points, a “B” to 3 points, and so on. The grade points earned by a
student for a course offering (section) is defined as the number of credits for the
course multiplied by the numeric points for the grade that the student received.
Given the preceding relation, and our university schema, write each of the
following queries in SQL. You may assume for simplicity that no takes tuple has
the null value for grade.
a. Find the total grade points earned by the student with ID 12345, across
all courses taken by the student.
b. Find the grade point average (GPA) for the above student, that is, the total
grade points divided by the total credits for the associated courses.
c. Find the ID and the grade-point average of each student.
d. Now reconsider your answers to the earlier parts of this exercise under
the assumption that some grades might be null. Explain whether your
solutions still work and, if not, provide versions that handle nulls properly.

Answer:

a. Find the total grade points earned by the student with ID 12345, across
all courses taken by the student.

    select sum(credits * points)
    from takes, course, grade_points
    where takes.grade = grade_points.grade
      and takes.course_id = course.course_id
      and ID = 12345

In the above query, a student who has not taken any course would not
have any tuples, whereas we would expect to get 0 as the answer. One way
of fixing this problem is to use the outer join operation, which we study
later in Chapter 4. Another way to ensure that we get 0 as the answer is
via the following query:

    (select sum(credits * points)
     from takes, course, grade_points
     where takes.grade = grade_points.grade
       and takes.course_id = course.course_id
       and ID = 12345)
    union
    (select 0
     from student
     where ID = 12345 and
           not exists (select * from takes where ID = 12345))
b. Find the grade point average (GPA) for the above student, that is, the total
grade points divided by the total credits for the associated courses.

    select sum(credits * points) / sum(credits) as GPA
    from takes, course, grade_points
    where takes.grade = grade_points.grade
      and takes.course_id = course.course_id
      and ID = 12345

As before, a student who has not taken any course would not appear in
the above result; we can ensure that such a student appears in the result by
using the modified query from the previous part of this question. However,
an additional issue in this case is that the sum of credits would also be 0,
resulting in a divide-by-zero condition. In fact, the only meaningful way
of defining the GPA in this case is to define it as null. We can ensure that
such a student appears in the result with a null GPA by adding the following
union clause to the above query.

    union
    (select null as GPA
     from student
     where ID = 12345 and
           not exists (select * from takes where ID = 12345))
c. Find the ID and the grade-point average of each student.

    select ID, sum(credits * points) / sum(credits) as GPA
    from takes, course, grade_points
    where takes.grade = grade_points.grade
      and takes.course_id = course.course_id
    group by ID

Again, to handle students who have not taken any course, we would have
to add the following union clause:

    union
    (select ID, null as GPA
     from student
     where not exists (select * from takes where takes.ID = student.ID))

d. Now reconsider your answers to the earlier parts of this exercise under
the assumption that some grades might be null. Explain whether your
solutions still work and, if not, provide versions that handle nulls properly.
The queries listed above all include a test of equality on grade between
grade_points and takes. Thus, for any takes tuple with a null grade, that
student's course would be eliminated from the rest of the computation
of the result. As a result, the credits of such courses would be eliminated
also, and thus the queries would return the correct answer even if some
grades are null.
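The argument in part d of Exercise 3.2 can be checked on a small instance in which one takes tuple has a null grade. The sketch below uses Python's sqlite3 module; the tables are reduced to the columns the query needs, and the three courses and two grade_points rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table course (course_id text primary key, credits integer);
create table grade_points (grade text primary key, points real);
create table takes (ID text, course_id text, grade text);
insert into course values ('CS-101', 4), ('CS-190', 4), ('CS-315', 3);
insert into grade_points values ('A', 4.0), ('B', 3.0);
insert into takes values ('12345', 'CS-101', 'A'),
                         ('12345', 'CS-190', 'B'),
                         ('12345', 'CS-315', null);  -- grade not yet assigned
""")

# Query from part c. The equality test on grade never succeeds for a null
# grade, so CS-315 contributes neither grade points nor credits.
gpa = conn.execute("""
    select ID, sum(credits * points) / sum(credits) as GPA
    from takes, course, grade_points
    where takes.grade = grade_points.grade
      and takes.course_id = course.course_id
    group by ID
""").fetchall()
```

The graded courses give (4 × 4.0 + 4 × 3.0) / (4 + 4) = 3.5, and that is the value the query reports; the 3 credits of the ungraded course do not enter the denominator.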
3.3 Write the following inserts, deletes, or updates in SQL, using the university
schema.
a. Increase the salary of each instructor in the Comp. Sci. department by
10%.
b. Delete all courses that have never been offered (i.e., do not occur in the
section relation).
c. Insert every student whose tot_cred attribute is greater than 100 as an
instructor in the same department, with a salary of $10,000.

Answer:

a. Increase the salary of each instructor in the Comp. Sci. department by
10%.

    update instructor
    set salary = salary * 1.10
    where dept_name = 'Comp. Sci.'

b. Delete all courses that have never been offered (that is, do not occur in
the section relation).

    delete from course
    where course_id not in
        (select course_id from section)

c. Insert every student whose tot_cred attribute is greater than 100 as an
instructor in the same department, with a salary of $10,000.

    insert into instructor
    select ID, name, dept_name, 10000
    from student
    where tot_cred > 100

person (driver_id, name, address)
car (license_plate, model, year)
accident (report_number, year, location)
owns (driver_id, license_plate)
participated (report_number, license_plate, driver_id, damage_amount)

Figure 3.17 Insurance database.
3.4 Consider the insurance database of Figure 3.17, where the primary keys are
underlined. Construct the following SQL queries for this relational database.
a. Find the total number of people who owned cars that were involved in
accidents in 2017.
b. Delete all year-2010 cars belonging to the person whose ID is 12345.

Answer:

a. Find the total number of people who owned cars that were involved in
accidents in 2017.
Note: This is not the same as the total number of accidents in 2017. We
must count people with several accidents only once. Furthermore, note
that the question asks for owners, and it might be that the owner of the
car was not the driver actually involved in the accident.

    select count(distinct person.driver_id)
    from accident, participated, person, owns
    where accident.report_number = participated.report_number
      and owns.driver_id = person.driver_id
      and owns.license_plate = participated.license_plate
      and year = 2017

b. Delete all year-2010 cars belonging to the person whose ID is 12345.

    delete from car
    where year = 2010 and license_plate in
        (select license_plate
         from owns o
         where o.driver_id = 12345)

Note: The owns, accident and participated records associated with the
deleted cars still exist.

3.5 Suppose that we have a relation marks(ID, score) and we wish to assign grades
to students based on the score as follows: grade F if score < 40, grade C if 40
≤ score < 60, grade B if 60 ≤ score < 80, and grade A if 80 ≤ score. Write SQL
queries to do the following:

a. Display the grade for each student, based on the marks relation.
b. Find the number of students with each grade.

Answer:

a. Display the grade for each student, based on the marks relation.

    select ID,
           case
               when score < 40 then 'F'
               when score < 60 then 'C'
               when score < 80 then 'B'
               else 'A'
           end
    from marks

b. Find the number of students with each grade.

    with grades as
    (
        select ID,
               case
                   when score < 40 then 'F'
                   when score < 60 then 'C'
                   when score < 80 then 'B'
                   else 'A'
               end as grade
        from marks
    )
    select grade, count(ID)
    from grades
    group by grade

As an alternative, the with clause can be removed, and instead the
definition of grades can be made a subquery of the main query.
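The case expression in the answer to Exercise 3.5 relies on its when branches being tried in order, so a score of 40 falls through the first test and is caught by the second. A short sqlite3 sketch with boundary values makes this visible; the six marks rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table marks (ID text primary key, score integer)")
# Sample scores chosen to sit on and around the grade boundaries.
conn.executemany(
    "insert into marks values (?, ?)",
    [("s1", 35), ("s2", 40), ("s3", 59), ("s4", 60), ("s5", 80), ("s6", 95)],
)

# Part a: grade for each student.
per_student = conn.execute("""
    select ID,
           case
               when score < 40 then 'F'
               when score < 60 then 'C'
               when score < 80 then 'B'
               else 'A'
           end
    from marks
    order by ID
""").fetchall()

# Part b: number of students with each grade.
counts = conn.execute("""
    with grades as (
        select ID,
               case
                   when score < 40 then 'F'
                   when score < 60 then 'C'
                   when score < 80 then 'B'
                   else 'A'
               end as grade
        from marks
    )
    select grade, count(ID)
    from grades
    group by grade
    order by grade
""").fetchall()
```

Note that a grade with no students does not appear in counts at all, since group by only produces groups that have at least one row.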
3.6 The SQL like operator is case sensitive (in most systems), but the lower()
function on strings can be used to perform case-insensitive matching. To show how,
write a query that finds departments whose names contain the string “sci” as a
substring, regardless of the case.
Answer:

    select dept_name
    from department
    where lower(dept_name) like '%sci%'

3.7 Consider the SQL query

    select p.a1
    from p, r1, r2
    where p.a1 = r1.a1 or p.a1 = r2.a1

Under what conditions does the preceding query select values of p.a1 that are
either in r1 or in r2? Examine carefully the cases where either r1 or r2 may be
empty.
Answer:
The query selects those values of p.a1 that are equal to some value of r1.a1 or
r2.a1 if and only if both r1 and r2 are non-empty. If one or both of r1 and r2 are
empty, the Cartesian product of p, r1 and r2 is empty, hence the result of the
query is empty. If p itself is empty, the result is empty.
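The effect of an empty relation in the from clause of Exercise 3.7 is easy to observe directly. The sketch below, using Python's sqlite3 module with made-up single-column tables, runs the query once with r2 empty and once after a single unrelated value has been inserted into r2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table p (a1 integer);
create table r1 (a1 integer);
create table r2 (a1 integer);
insert into p values (1), (2);
insert into r1 values (1);
""")

QUERY = "select p.a1 from p, r1, r2 where p.a1 = r1.a1 or p.a1 = r2.a1"

# r2 is empty, so the Cartesian product p x r1 x r2 has no tuples at all.
with_empty_r2 = conn.execute(QUERY).fetchall()

# Any tuple in r2, even one matching nothing in p, makes the product non-empty.
conn.execute("insert into r2 values (5)")
with_nonempty_r2 = conn.execute(QUERY).fetchall()
```

With r2 empty the result is empty even though p.a1 = 1 matches r1; once r2 holds the value 5, the query returns the single value 1.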
3.8 Consider the bank database of Figure 3.18, where the primary keys are
underlined. Construct the following SQL queries for this relational database.

branch (branch_name, branch_city, assets)
customer (ID, customer_name, customer_street, customer_city)
loan (loan_number, branch_name, amount)
borrower (ID, loan_number)
account (account_number, branch_name, balance)
depositor (ID, account_number)

Figure 3.18 Banking database.

a. Find the ID of each customer of the bank who has an account but not a
loan.
b. Find the ID of each customer who lives on the same street and in the same
city as customer 12345.
c. Find the name of each branch that has at least one customer who has an
account in the bank and who lives in “Harrison”.

Answer:

a. Find the ID of each customer of the bank who has an account but not a
loan.

    (select ID
     from depositor)
    except
    (select ID
     from borrower)

b. Find the ID of each customer who lives on the same street and in the same
city as customer 12345.

    select F.ID
    from customer as F, customer as S
    where F.customer_street = S.customer_street
      and F.customer_city = S.customer_city
      and S.ID = 12345

c. Find the name of each branch that has at least one customer who has an
account in the bank and who lives in “Harrison”.

    select distinct branch_name
    from account, depositor, customer
    where customer.ID = depositor.ID
      and depositor.account_number = account.account_number
      and customer_city = 'Harrison'

3.9 Consider the relational database of Figure 3.19, where the primary keys are
underlined. Give an expression in SQL for each of the following queries.

a. Find the ID, name, and city of residence of each employee who works for
“First Bank Corporation”.
b. Find the ID, name, and city of residence of each employee who works for
“First Bank Corporation” and earns more than $10000.
c. Find the ID of each employee who does not work for “First Bank
Corporation”.
d. Find the ID of each employee who earns more than every employee of
“Small Bank Corporation”.
e. Assume that companies may be located in several cities. Find the name
of each company that is located in every city in which “Small Bank
Corporation” is located.
f. Find the name of the company that has the most employees (or
companies, in the case where there is a tie for the most).
g. Find the name of each company whose employees earn a higher salary,
on average, than the average salary at “First Bank Corporation”.

employee (ID, person_name, street, city)
works (ID, company_name, salary)
company (company_name, city)
manages (ID, manager_id)

Figure 3.19 Employee database.

Answer:

a. Find the ID, name, and city of residence of each employee who works for
“First Bank Corporation”.

    select e.ID, e.person_name, city
    from employee as e, works as w
    where w.company_name = 'First Bank Corporation' and
          w.ID = e.ID
b. Find the ID, name, and city of residence of each employee who works for
“First Bank Corporation” and earns more than $10000.

    select ID, person_name, city
    from employee
    where ID in
        (select ID
         from works
         where company_name = 'First Bank Corporation' and salary > 10000)

This could also be written in the style of the answer to part a.
c. Find the ID of each employee who does not work for “First Bank
Corporation”.

    select ID
    from works
    where company_name <> 'First Bank Corporation'

If one allows people to appear in employee without appearing also in
works, the solution is slightly more complicated. An outer join as
discussed in Chapter 4 could be used as well.

    select ID
    from employee
    where ID not in
        (select ID
         from works
         where company_name = 'First Bank Corporation')

d. Find the ID of each employee who earns more than every employee of
“Small Bank Corporation”.

    select ID
    from works
    where salary > all
        (select salary
         from works
         where company_name = 'Small Bank Corporation')

If people may work for several companies and we wish to consider the total
earnings of each person, the problem is more complex. But note that the
fact that ID is the primary key for works implies that this cannot be the
case.
e. Assume that companies may be located in several cities. Find the name
of each company that is located in every city in which “Small Bank
Corporation” is located.

    select S.company_name
    from company as S
    where not exists ((select city
                       from company
                       where company_name = 'Small Bank Corporation')
                      except
                      (select city
                       from company as T
                       where S.company_name = T.company_name))

f. Find the name of the company that has the most employees (or
companies, in the case where there is a tie for the most).

    select company_name
    from works
    group by company_name
    having count(distinct ID) >= all
        (select count(distinct ID)
         from works
         group by company_name)

g. Find the name of each company whose employees earn a higher salary,
on average, than the average salary at “First Bank Corporation”.

    select company_name
    from works
    group by company_name
    having avg(salary) > (select avg(salary)
                          from works
                          where company_name = 'First Bank Corporation')

3.10 Consider the relational database of Figure 3.19. Give an expression in SQL for
ea h of the following:
a. Modify the database so that the employee whose ID is 12345 now lives
in “Newtown”.
b. Give ea h manager of “First Bank Corporation” a 10 per ent raise unless
the salary be omes greater than $100000; in su h ases, give only a 3
per ent raise.
24 Chapter 3 Introdu tion to SQL

Answer:

a. Modify the database so that the employee whose ID is 12345 now lives
in “Newtown”.
update employee
set city = 'Newtown'
where ID = 12345
b. Give each manager of “First Bank Corporation” a 10 percent raise unless
the salary becomes greater than $100000; in such cases, give only a 3
percent raise.
update works T
set T.salary = T.salary * 1.03
where T.ID in (select manager_id
               from manages)
      and T.salary * 1.1 > 100000
      and T.company_name = 'First Bank Corporation'

update works T
set T.salary = T.salary * 1.1
where T.ID in (select manager_id
               from manages)
      and T.salary * 1.1 <= 100000
      and T.company_name = 'First Bank Corporation'

The above updates would give different results if executed in the opposite
order. We give below a safer solution using the case statement.
update works T
set T.salary = T.salary *
    (case
        when (T.salary * 1.1 > 100000) then 1.03
        else 1.1
     end)
where T.ID in (select manager_id
               from manages) and
      T.company_name = 'First Bank Corporation'
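The order dependence of the two separate updates can be checked with a small script. This is only a sketch using Python's built-in sqlite3 module; the one-row tables, the starting salary of 90000, and the helper name run are assumptions made for illustration. SQLite does not accept a table alias on the column in the set clause, so the statements are written without the alias T.

```python
import sqlite3

COND = ("where ID in (select manager_id from manages) "
        "and company_name = 'First Bank Corporation'")
RAISE_3 = f"update works set salary = salary * 1.03 {COND} and salary * 1.1 > 100000"
RAISE_10 = f"update works set salary = salary * 1.1 {COND} and salary * 1.1 <= 100000"
CASE = ("update works set salary = salary * "
        "(case when salary * 1.1 > 100000 then 1.03 else 1.1 end) " + COND)

def run(statements, salary):
    """Apply the statements, in order, to a one-manager database; return the salary."""
    con = sqlite3.connect(":memory:")
    con.execute("create table works (ID integer primary key, company_name text, salary real)")
    con.execute("create table manages (ID integer, manager_id integer)")
    con.execute("insert into works values (1, 'First Bank Corporation', ?)", (salary,))
    con.execute("insert into manages values (2, 1)")  # employee 1 is a manager
    for statement in statements:
        con.execute(statement)
    (result,) = con.execute("select salary from works where ID = 1").fetchone()
    con.close()
    return round(result, 2)

in_order = run([RAISE_3, RAISE_10], 90000)   # order used in the answer
reversed_order = run([RAISE_10, RAISE_3], 90000)
with_case = run([CASE], 90000)
print(in_order, reversed_order, with_case)
```

In the reversed order the manager first receives the 10 percent raise and then also qualifies for the 3 percent raise, which is the problem the case version avoids.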
CHAPTER
4
Intermediate SQL

Practice Exercises


4.1 Consider the following SQL query that seeks to find a list of titles of all courses
taught in Spring 2017 along with the name of the instructor.

select name, title
from instructor natural join teaches natural join section natural join course
where semester = 'Spring' and year = 2017

What is wrong with this query?

Answer:
Although the query is syntactically correct, it does not compute the expected
answer because dept_name is an attribute of both course and instructor. As a
result of the natural join, results are shown only when an instructor teaches a
course in her or his own department.
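The effect can be reproduced on a tiny instance. The sketch below uses Python's sqlite3 module with simplified table definitions and a single illustrative instructor who teaches outside her or his own department; the repaired query, which joins course on course_id only, is one possible fix and is an assumption rather than part of the original answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table instructor (ID text, name text, dept_name text, salary integer);
create table teaches (ID text, course_id text, sec_id text, semester text, year integer);
create table section (course_id text, sec_id text, semester text, year integer);
create table course (course_id text, title text, dept_name text, credits integer);
insert into instructor values ('12345', 'Gauss', 'Physics', 10000);
insert into teaches values ('12345', 'EE321', '1', 'Spring', 2017);
insert into section values ('EE321', '1', 'Spring', 2017);
insert into course values ('EE321', 'Magnetism', 'Elec. Eng.', 6);
""")

# Query from the exercise: the last natural join also equates dept_name.
flawed = con.execute("""
    select name, title
    from instructor natural join teaches natural join section natural join course
    where semester = 'Spring' and year = 2017
""").fetchall()

# Possible repair: join course on course_id only.
fixed = con.execute("""
    select name, title
    from instructor natural join teaches natural join section
         join course using (course_id)
    where semester = 'Spring' and year = 2017
""").fetchall()

print(flawed, fixed)
```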
4.2 Write the following queries in SQL:

a. Display a list of all instructors, showing each instructor's ID and the
number of sections taught. Make sure to show the number of sections as 0 for
instructors who have not taught any section. Your query should use an
outer join, and should not use subqueries.
b. Write the same query as in part a, but using a scalar subquery and not
using outer join.
c. Display the list of all course sections offered in Spring 2018, along with
the ID and name of each instructor teaching the section. If a section has
more than one instructor, that section should appear as many times in
the result as it has instructors. If a section does not have any instructor,
it should still appear in the result with the instructor name set to “-”.
d. Display the list of all departments, with the total number of instructors
in each department, without using subqueries. Make sure to show
departments that have no instructors, and list those departments with an
instructor count of zero.

Answer:

a. Display a list of all instructors, showing each instructor's ID and the
number of sections taught. Make sure to show the number of sections as 0 for
instructors who have not taught any section. Your query should use an
outer join, and should not use subqueries.
select ID, count(sec_id) as Number_of_sections
from instructor natural left outer join teaches
group by ID
The above query should not be written using count(*) since that would
count null values also. It could be written using any attribute from teaches
which does not occur in instructor, which would be correct although it
may be confusing to the reader. (Attributes that occur in instructor would
not be null even if the instructor has not taught any section.)
b. Write the same query as above, but using a scalar subquery, and not using
outer join.
select ID,
    (select count(*)
     from teaches T where T.ID = I.ID) as Number_of_sections
from instructor I

c. Display the list of all course sections offered in Spring 2018, along with
the ID and name of each instructor teaching the section. If a section has
more than one instructor, that section should appear as many times in
the result as it has instructors. If a section does not have any instructor,
it should still appear in the result with the instructor name set to “-”.
select course_id, sec_id, ID,
       decode(name, null, '-', name) as name
from (section natural left outer join teaches)
     natural left outer join instructor
where semester = 'Spring' and year = 2018

The query may also be written using the coalesce operator, by replacing
decode(..) with coalesce(name, '-'). A more complex version of the query
can be written using union of join result with another query that uses a
subquery to find courses that do not match; refer to Exercise 4.3.

d. Display the list of all departments, with the total number of instructors
in each department, without using subqueries. Make sure to show
departments that have no instructors, and list those departments with an
instructor count of zero.
select dept_name, count(ID)
from department natural left outer join instructor
group by dept_name
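The remark in part a about count(*) can be seen on a two-instructor instance. The following is a sketch using Python's sqlite3 module; the reduced instructor table and the sample rows are assumptions for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table instructor (ID text primary key, name text);
create table teaches (ID text, course_id text, sec_id text, semester text, year integer);
insert into instructor values ('10101', 'Srinivasan'), ('33456', 'Gold');
insert into teaches values ('10101', 'CS-101', '1', 'Fall', 2017),
                           ('10101', 'CS-315', '1', 'Spring', 2018);
""")

QUERY = """
    select ID, {agg} as number_of_sections
    from instructor natural left outer join teaches
    group by ID
    order by ID
"""
# count(sec_id) skips the null-padded row of an instructor with no sections.
with_attr = con.execute(QUERY.format(agg="count(sec_id)")).fetchall()
# count(*) counts that padded row, so the idle instructor is reported with 1.
with_star = con.execute(QUERY.format(agg="count(*)")).fetchall()
print(with_attr, with_star)
```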

4.3 Outer join expressions can be computed in SQL without using the SQL outer
join operation. To illustrate this fact, show how to rewrite each of the following
SQL queries without using the outer join expression.
a. select * from student natural left outer join takes
b. select * from student natural full outer join takes

Answer:

a. select * from student natural left outer join takes
can be rewritten as:
select * from student natural join takes
union
select ID, name, dept_name, tot_cred, null, null, null, null, null
from student S1 where not exists
    (select ID from takes T1 where T1.ID = S1.ID)
b. select * from student natural full outer join takes
can be rewritten as:
(select * from student natural join takes)
union
(select ID, name, dept_name, tot_cred, null, null, null, null, null
 from student S1
 where not exists
    (select ID from takes T1 where T1.ID = S1.ID))
union
(select ID, null, null, null, course_id, sec_id, semester, year, grade
 from takes T1
 where not exists
    (select ID from student S1 where T1.ID = S1.ID))
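The rewriting in part a can be compared against the built-in operator on a small instance. This is a sketch using Python's sqlite3 module; the two students and their takes rows are illustrative assumptions, and the comparison uses sets because union eliminates duplicates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table student (ID text primary key, name text, dept_name text, tot_cred integer);
create table takes (ID text, course_id text, sec_id text, semester text,
                    year integer, grade text);
insert into student values ('00128', 'Zhang', 'Comp. Sci.', 102),
                           ('70557', 'Snow', 'Physics', 0);
insert into takes values ('00128', 'CS-101', '1', 'Fall', 2017, 'A'),
                         ('00128', 'CS-347', '1', 'Fall', 2017, 'A-');
""")

direct = set(con.execute("select * from student natural left outer join takes"))

rewritten = set(con.execute("""
    select * from student natural join takes
    union
    select ID, name, dept_name, tot_cred, null, null, null, null, null
    from student S1 where not exists
        (select ID from takes T1 where T1.ID = S1.ID)
"""))

print(direct == rewritten, len(direct))
```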
4.4 Suppose we have three relations r(A, B), s(B, C), and t(B, D), with all attributes
declared as not null.
a. Give instances of relations r, s, and t such that in the result of
(r natural left outer join s) natural left outer join t
attribute C has a null value but attribute D has a non-null value.
b. Are there instances of r, s, and t such that the result of

r natural left outer join (s natural left outer join t)

has a null value for C but a non-null value for D? Explain why or why not.

Answer:

a. Consider r = (a, b), s = (b1, c1), t = (b, d). The given expression would
give (a, b, null, d).
b. No. Since s natural left outer join t is computed first, the absence of nulls in
both s and t implies that each tuple of that intermediate result can have D null,
but C can never be null. In the final result, a tuple of r either matches such a
tuple, in which case C is non-null, or matches none, in which case both C and D
are null.
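The instance from part a can be run directly. The sketch below uses Python's sqlite3 module; because joins associate to the left, the part a expression is written without parentheses, and the part b expression wraps the inner join in a subquery, which is an assumption about how best to express the grouping in SQLite.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table r (A text not null, B text not null);
create table s (B text not null, C text not null);
create table t (B text not null, D text not null);
insert into r values ('a', 'b');
insert into s values ('b1', 'c1');
insert into t values ('b', 'd');
""")

# (r natural left outer join s) natural left outer join t
part_a = con.execute(
    "select * from r natural left outer join s natural left outer join t"
).fetchall()

# r natural left outer join (s natural left outer join t)
part_b = con.execute("""
    select * from r natural left outer join
        (select * from s natural left outer join t)
""").fetchall()

print(part_a, part_b)
```

On this instance the second expression leaves both C and D null, consistent with the argument in part b.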

4.5 Testing SQL queries: To test if a query specified in English has been correctly
written in SQL, the SQL query is typically executed on multiple test databases,
and a human checks if the SQL query result on each test database matches the
intention of the specification in English.

a. In Section 4.1.1 we saw an example of an erroneous SQL query which was
intended to find which courses had been taught by each instructor; the
query computed the natural join of instructor, teaches, and course, and as
a result it unintentionally equated the dept_name attribute of instructor and
course. Give an example of a dataset that would help catch this particular
error.
b. When creating test databases, it is important to create tuples in referenced
relations that do not have any matching tuple in the referencing relation
for each foreign key. Explain why, using an example query on the
university database.
c. When creating test databases, it is important to create tuples with null
values for foreign-key attributes, provided the attribute is nullable (SQL
allows foreign-key attributes to take on null values, as long as they are not
part of the primary key and have not been declared as not null). Explain
why, using an example query on the university database.

Hint: Use the queries from Exercise 4.2.

Answer:

a. Consider the case where a professor in the Physics department teaches
an Elec. Eng. course. Even though there is a valid corresponding entry in
teaches, it is lost in the natural join of instructor, teaches, and course, since
the instructor's department name does not match the department name
of the course. A dataset corresponding to the same is:

instructor = {(12345, 'Gauss', 'Physics', 10000)}
teaches = {(12345, 'EE321', 1, 'Spring', 2017)}
course = {('EE321', 'Magnetism', 'Elec. Eng.', 6)}

b. The query in question 4.2(a) is a good example for this. Instructors who
have not taught a single course should have number of sections as 0 in
the query result. (Many other similar examples are possible.)
c. Consider the query
select * from teaches natural join instructor;
In this query, we would lose some sections if teaches.ID is allowed to be
null and such tuples exist. If, just because teaches.ID is a foreign key to
instructor, we did not create such a tuple, the error in the above query
would not be detected.


4.6 Show how to define the view student_grades (ID, GPA) giving the grade-point
average of each student, based on the query in Exercise 3.2; recall that we used
a relation grade_points(grade, points) to get the numeric points associated with
a letter grade. Make sure your view definition correctly handles the case of null
values for the grade attribute of the takes relation.
Answer:
We should not add credits for courses with a null grade; further, to correctly
handle the case where a student has not completed any course, we should make
sure we don't divide by zero, and should instead return a null value.
We break the query into a subquery that finds sum of credits and sum of
credit-grade-points, taking null grades into account. The outer query divides the
above to get the average, taking care of divide by zero.

create view student_grades(ID, GPA) as
select ID, credit_points / decode(credit_sum, 0, null, credit_sum)
from ((select ID, sum(decode(grade, null, 0, credits)) as credit_sum,
              sum(decode(grade, null, 0, credits*points)) as credit_points
       from (takes natural join course) natural left outer join grade_points
       group by ID)
      union
      select ID, null, null
      from student
      where ID not in (select ID from takes))

The view defined above takes care of null grades by considering the credit points
to be 0 and not adding the corresponding credits in credit_sum.
employee (ID, person_name, street, city)
works (ID, company_name, salary)
company (company_name, city)
manages (ID, manager_id)

Figure 4.12 Employee database.

The query above ensures that a student who has not taken any course with
non-null credits, and has credit_sum = 0, gets a GPA of null. This avoids the
division by zero, which would otherwise have resulted.
In systems that do not support decode, an alternative is the case construct.
Using case, the solution would be written as follows:

create view student_grades(ID, GPA) as
select ID, credit_points / (case when credit_sum = 0 then null
                                 else credit_sum end)
from ((select ID, sum (case when grade is null then 0
                            else credits end) as credit_sum,
              sum (case when grade is null then 0
                        else credits*points end) as credit_points
       from (takes natural join course) natural left outer join grade_points
       group by ID)
      union
      select ID, null, null
      from student
      where ID not in (select ID from takes))

An alternative way of writing the above query would be to use student natural
left outer join gpa, in order to consider students who have not taken any course.
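The case-based view can be exercised on a small instance that covers the three situations discussed: graded courses, only null grades, and no courses at all. This is a sketch using Python's sqlite3 module; the reduced schemas (takes without section attributes) and the sample rows are assumptions for illustration, and the join is written without parentheses since joins associate to the left.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table student (ID text primary key, name text);
create table course (course_id text primary key, credits integer);
create table takes (ID text, course_id text, grade text);
create table grade_points (grade text primary key, points real);
insert into student values ('1', 'Ann'), ('2', 'Bob'), ('3', 'Cy');
insert into course values ('C1', 4), ('C2', 4);
insert into grade_points values ('A', 4.0), ('B', 3.0);
-- Ann: one A, one B, one ungraded; Bob: only ungraded; Cy: no courses.
insert into takes values ('1', 'C1', 'A'), ('1', 'C2', 'B'), ('1', 'C2', null),
                         ('2', 'C1', null);

create view student_grades(ID, GPA) as
select ID, credit_points / (case when credit_sum = 0 then null
                                 else credit_sum end)
from (select ID,
             sum(case when grade is null then 0 else credits end) as credit_sum,
             sum(case when grade is null then 0
                      else credits * points end) as credit_points
      from takes natural join course natural left outer join grade_points
      group by ID
      union
      select ID, null, null
      from student
      where ID not in (select ID from takes));
""")

gpa = dict(con.execute("select ID, GPA from student_grades"))
print(gpa)
```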

4.7 Consider the employee database of Figure 4.12. Give an SQL DDL definition
of this database. Identify referential-integrity constraints that should hold, and
include them in the DDL definition.
Answer:
Please see Figure 4.101.
Note that alternative data types are possible. Other choices for not null
attributes may be acceptable.
create table employee
    (ID numeric(6,0),
     person_name char(20),
     street char(30),
     city char(30),
     primary key (ID))

create table works
    (ID numeric(6,0),
     company_name char(15),
     salary integer,
     primary key (ID),
     foreign key (ID) references employee,
     foreign key (company_name) references company)

create table company
    (company_name char(15),
     city char(30),
     primary key (company_name))

create table manages
    (ID numeric(6,0),
     manager_id numeric(6,0),
     primary key (ID),
     foreign key (ID) references employee,
     foreign key (manager_id) references employee(ID))

Figure 4.101 Figure for Exercise 4.7.

4.8 As discussed in Section 4.4.8, we expect the constraint “an instructor cannot
teach sections in two different classrooms in a semester in the same time slot”
to hold.
a. Write an SQL query that returns all (instructor, section) combinations that
violate this constraint.
b. Write an SQL assertion to enforce this constraint (as discussed in
Section 4.4.8, current generation database systems do not support such
assertions, although they are part of the SQL standard).

Answer:

a. Query:
select ID, name, sec_id, semester, year, time_slot_id,
       count(distinct building, room_number)
from instructor natural join teaches natural join section
group by (ID, name, sec_id, semester, year, time_slot_id)
having count(distinct building, room_number) > 1

Note that the distinct keyword is required above. This is to allow two
different sections to run concurrently in the same time slot, taught
by the same instructor, without being reported as a constraint violation.
b. Query:
create assertion check not exists
    (select ID, name, sec_id, semester, year, time_slot_id,
            count(distinct building, room_number)
     from instructor natural join teaches natural join section
     group by (ID, name, sec_id, semester, year, time_slot_id)
     having count(distinct building, room_number) > 1)

4.9 SQL allows a foreign-key dependency to refer to the same relation, as in the
following example:

create table manager
    (employee_ID char(20),
     manager_ID char(20),
     primary key (employee_ID),
     foreign key (manager_ID) references manager(employee_ID)
         on delete cascade)

Here, employee_ID is a key to the table manager, meaning that each employee
has at most one manager. The foreign-key clause requires that every manager
also be an employee. Explain exactly what happens when a tuple in the relation
manager is deleted.

Answer:
The tuples of all employees of the manager, at all levels, get deleted as well! This
happens in a series of steps. The initial deletion will trigger deletion of all the
tuples corresponding to direct employees of the manager. These deletions will
in turn cause deletions of second-level employee tuples, and so on, till all direct
and indirect employee tuples are deleted.
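The chain of deletions can be observed in SQLite, which applies the cascade level by level. The following is a sketch using Python's sqlite3 module; the employee names are made up, and the pragma line is needed because SQLite enforces foreign keys only when that setting is enabled.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("pragma foreign_keys = on")  # SQLite ignores foreign keys unless enabled
con.execute("""
    create table manager
        (employee_ID char(20),
         manager_ID char(20),
         primary key (employee_ID),
         foreign key (manager_ID) references manager(employee_ID)
             on delete cascade)
""")
con.executemany("insert into manager values (?, ?)", [
    ("ceo", None),        # top of the hierarchy, no manager
    ("alice", "ceo"),
    ("bob", "alice"),     # direct employee of alice
    ("erin", "bob"),      # indirect (second-level) employee of alice
    ("carol", None),
    ("dave", "carol"),
])

# Deleting alice's tuple removes bob's tuple, which in turn removes erin's.
con.execute("delete from manager where employee_ID = 'alice'")
remaining = sorted(row[0] for row in con.execute("select employee_ID from manager"))
print(remaining)
```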
4.10 Given the relations a(name, address, title) and b(name, address, salary), show
how to express a natural full outer join b using the full outer-join operation with
an on condition rather than using the natural join syntax. This can be done using
the coalesce operation. Make sure that the result relation does not contain two
copies of the attributes name and address and that the solution is correct even
if some tuples in a and b have null values for attributes name or address.
Answer:

select coalesce(a.name, b.name) as name,
       coalesce(a.address, b.address) as address,
       a.title,
       b.salary
from a full outer join b on a.name = b.name and
                            a.address = b.address

4.11 Operating systems usually offer only two types of authorization control for data
files: read access and write access. Why do database systems offer so many kinds
of authorization?
Answer: There are many reasons; we list a few here. One might wish to allow
a user only to append new information without altering old information. One
might wish to allow a user to access a relation but not change its schema. One
might wish to limit access to aspects of the database that are not technically
data access but instead impact resource utilization, such as creating an index.
4.12 Suppose a user wants to grant select access on a relation to another user. Why
should the user include (or not include) the clause granted by current role in the
grant statement?

Answer: Both cases give the same authorization at the time the statement
is executed, but the long-term effects differ. If the grant is done based on the
role, then the grant remains in effect even if the user who performed the grant
leaves and that user's account is terminated. Whether that is a good or bad idea
depends on the specific situation, but usually granting through a role is more
consistent with a well-run enterprise.
4.13 Consider a view v whose definition references only relation r.
• If a user is granted select authorization on v, does that user need to have
select authorization on r as well? Why or why not?
• If a user is granted update authorization on v, does that user need to have
update authorization on r as well? Why or why not?
• Give an example of an insert operation on a view v to add a tuple t that is
not visible in the result of select * from v. Explain your answer.

Answer:

• No. This allows a user to be granted access to only part of relation r.
• Yes. A valid update issued using view v must update r for the update to be
stored in the database.
• Any tuple t compatible with the schema for v but not satisfying the where
clause in the definition of view v is a valid example. One such example
appears in Section 4.2.4.
CHAPTER
5
Advanced SQL
Practice Exercises
5.1 Consider the following relations for a company database:

• emp (ename, dname, salary)
• mgr (ename, mname)

and the Java code in Figure 5.20, which uses the JDBC API. Assume that the
userid, password, machine name, etc. are all okay. Describe in concise English
what the Java program does. (That is, produce an English sentence like “It finds
the manager of the toy department,” not a line-by-line description of what each
Java statement does.)
Answer:

It prints out the manager of “dog,” that manager's manager, etc., until we reach
a manager who has no manager (presumably, the CEO, who most certainly is a
cat). Note: If you try to run this, use your own Oracle ID and password.
5.2 Write a Java method using JDBC metadata features that takes a ResultSet as
an input parameter and prints out the result in tabular form, with appropriate
names as column headings.
Answer:

Please see Figure 5.101.
import java.sql.*;
public class Mystery {
    public static void main(String[] args) {
        String q = "select mname from mgr where ename = ?";
        try (
            Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:star/X@//edgar.cse.lehigh.edu:1521/XE");
            PreparedStatement stmt = con.prepareStatement(q);
        )
        {
            String empName = "dog";
            boolean more;
            ResultSet result;
            do {
                stmt.setString(1, empName);
                result = stmt.executeQuery();
                more = result.next();
                if (more) {
                    empName = result.getString("mname");
                    System.out.println(empName);
                }
            } while (more);
            // stmt and con are closed automatically by try-with-resources
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Figure 5.20 Java code for Exercise 5.1 (using Oracle JDBC).

5.3 Suppose that we wish to find all courses that must be taken before some given
course. That means finding not only the prerequisites of that course, but
prerequisites of prerequisites, and so on. Write a complete Java program using JDBC
that:

• Takes a course_id value from the keyboard.
• Finds prerequisites of that course using an SQL query submitted via JDBC.
• For each course returned, finds its prerequisites and continues this process
iteratively until no new prerequisite courses are found.
• Prints out the result.
For this exercise, do not use a recursive SQL query, but rather use the iterative
approach described previously. A well-developed solution will be robust to the
error case where a university has accidentally created a cycle of prerequisites
(that is, for example, course A is a prerequisite for course B, course B is a
prerequisite for course C, and course C is a prerequisite for course A).

public static void printTable(ResultSet result) throws SQLException {
    ResultSetMetaData metadata = result.getMetaData();
    int num_cols = metadata.getColumnCount();
    // Column headings, separated by tabs
    for (int i = 1; i <= num_cols; i++) {
        System.out.print(metadata.getColumnName(i) + '\t');
    }
    System.out.println();
    // One output line per tuple in the result
    while (result.next()) {
        for (int i = 1; i <= num_cols; i++) {
            System.out.print(result.getString(i) + '\t');
        }
        System.out.println();
    }
}

Figure 5.101 Java method using JDBC for Exercise 5.2.

Answer:
Please see Figure 5.102.
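The iterative approach asked for in Exercise 5.3 can also be sketched compactly. The version below is not the Java program of the figure; it is a Python illustration using the sqlite3 module, with a hypothetical helper all_prereqs and made-up prereq rows, that expands a frontier of newly found courses and ignores courses already seen so that a cycle cannot cause an endless loop.

```python
import sqlite3

def all_prereqs(con, course_id):
    """Return all direct and indirect prerequisites of course_id (no recursive SQL)."""
    found = set()
    frontier = {course_id}
    while frontier:
        marks = ", ".join("?" for _ in frontier)
        rows = con.execute(
            f"select prereq_id from prereq where course_id in ({marks})",
            sorted(frontier))
        # Only courses not seen before are expanded next; this handles cycles.
        frontier = {row[0] for row in rows} - found
        found |= frontier
    return found

con = sqlite3.connect(":memory:")
con.execute("create table prereq (course_id text, prereq_id text)")
con.executemany("insert into prereq values (?, ?)", [
    ("CS-347", "CS-319"), ("CS-319", "CS-101"), ("BIO-301", "BIO-101"),
    # An accidental cycle: A before B, B before C, C before A.
    ("B", "A"), ("C", "B"), ("A", "C"),
])
print(sorted(all_prereqs(con, "CS-347")))
print(sorted(all_prereqs(con, "C")))
```

With the cycle present, the course itself shows up among its own prerequisites, which is one way such an error could be detected.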
5.4 Describe the circumstances in which you would choose to use embedded SQL
rather than SQL alone or only a general-purpose programming language.
Answer:

Writing queries in SQL is typically much easier than coding the same queries
in a general-purpose programming language. However, not all kinds of queries
can be written in SQL. Also, nondeclarative actions such as printing a report,
interacting with a user, or sending the results of a query to a graphical user
interface cannot be done from within SQL. Under circumstances in which we want
the best of both worlds, we can choose embedded SQL or dynamic SQL, rather
than using SQL alone or using only a general-purpose programming language.
5.5 Show how to enforce the constraint “an instructor cannot teach two different
sections in a semester in the same time slot.” using a trigger (remember that the
constraint can be violated by changes to the teaches relation as well as to the
section relation).

Answer:
Please see Figure 5.103.
5.6 Consider the bank database of Figure 5.21. Let us define a view branch_cust as
follows:

import java.sql.*;
import java.util.Scanner;
import java.util.Arrays;
public class AllCoursePrereqs {
    public static void main(String[] args) {
        try (
            Connection con = DriverManager.getConnection
                ("jdbc:oracle:thin:@edgar0.cse.lehigh.edu:1521:cse241","star","pw");
            Statement s = con.createStatement();
        ){
            String q;
            String c;
            ResultSet result;
            int maxCourse = 0;
            q = "select count(*) as C from course";
            result = s.executeQuery(q);
            if (!result.next()) System.out.println("Unexpected empty result.");
            else maxCourse = Integer.parseInt(result.getString("C"));
            int numCourse = 0, oldNumCourse = -1;
            String[] prereqs = new String[maxCourse];
            Scanner krb = new Scanner(System.in);
            System.out.print("Input a course id (number): ");
            String course = krb.next();
            String courseString = "" + '\'' + course + '\'';
            while (numCourse != oldNumCourse) {
                for (int i = oldNumCourse + 1; i < numCourse; i++) {
                    courseString += ", " + '\'' + prereqs[i] + '\'';
                }
                oldNumCourse = numCourse;
                q = "select prereq_id from prereq where course_id in ("
                    + courseString + ")";
                result = s.executeQuery(q);
                while (result.next()) {
                    c = result.getString("prereq_id");
                    boolean found = false;
                    for (int i = 0; i < numCourse; i++)
                        found |= prereqs[i].equals(c);
                    if (!found) prereqs[numCourse++] = c;
                }
                courseString = "" + '\'' + prereqs[oldNumCourse] + '\'';
            }
            Arrays.sort(prereqs, 0, numCourse);
            System.out.print("The courses that must be taken prior to "
                + course + " are: ");
            for (int i = 0; i < numCourse; i++)
                System.out.print((i==0 ? " " : ", ") + prereqs[i]);
            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Figure 5.102 Complete Java program using JDBC for Exercise 5.3.

create trigger onesec before insert on section
referencing new row as nrow
for each row
when (nrow.time_slot_id in (
        select time_slot_id
        from teaches natural join section
        where ID in (
            select ID
            from teaches natural join section
            where sec_id = nrow.sec_id and course_id = nrow.course_id and
                  semester = nrow.semester and year = nrow.year
        )))
begin
    rollback
end;

create trigger oneteach before insert on teaches
referencing new row as nrow
for each row
when (exists (
        select time_slot_id
        from teaches natural join section
        where ID = nrow.ID
        intersect
        select time_slot_id
        from section
        where sec_id = nrow.sec_id and course_id = nrow.course_id and
              semester = nrow.semester and year = nrow.year
     ))
begin
    rollback
end;

Figure 5.103 Trigger code for Exercise 5.5.

create view branch_cust as
    select branch_name, customer_name
    from depositor, account
    where depositor.account_number = account.account_number


branch (branch_name, branch_city, assets)
customer (customer_name, customer_street, customer_city)
loan (loan_number, branch_name, amount)
borrower (customer_name, loan_number)
account (account_number, branch_name, balance)
depositor (customer_name, account_number)

Figure 5.21 Banking database for Exercise 5.6.

Suppose that the view is materialized; that is, the view is computed and stored.
Write triggers to maintain the view, that is, to keep it up-to-date on insertions
to depositor or account. It is not necessary to handle deletions or updates. Note
that, for simplicity, we have not required the elimination of duplicates.
Answer:

Please see Figure 5.22.
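The maintenance idea can be tried out in SQLite, whose trigger dialect differs from the SQL standard syntax used in this manual: row triggers are the default, and the inserted row is referred to as new. The sketch below uses Python's sqlite3 module; the trigger names and sample rows are assumptions, and branch_cust is modeled as an ordinary table standing in for the materialized view.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table account (account_number text primary key, branch_name text, balance integer);
create table depositor (customer_name text, account_number text);
create table branch_cust (branch_name text, customer_name text);

create trigger branch_cust_via_depositor after insert on depositor
begin
    insert into branch_cust
        select branch_name, new.customer_name
        from account
        where account.account_number = new.account_number;
end;

create trigger branch_cust_via_account after insert on account
begin
    insert into branch_cust
        select new.branch_name, customer_name
        from depositor
        where depositor.account_number = new.account_number;
end;
""")

con.execute("insert into account values ('A-101', 'Downtown', 500)")
con.execute("insert into depositor values ('Johnson', 'A-101')")
# Here the depositor row arrives before its account row.
con.execute("insert into depositor values ('Hayes', 'A-102')")
con.execute("insert into account values ('A-102', 'Perryridge', 400)")

maintained = sorted(con.execute("select * from branch_cust"))
recomputed = sorted(con.execute("""
    select branch_name, customer_name
    from depositor, account
    where depositor.account_number = account.account_number
"""))
print(maintained == recomputed, maintained)
```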
create trigger insert_into_branch_cust_via_depositor
after insert on depositor
referencing new row as inserted
for each row
insert into branch_cust
    select branch_name, inserted.customer_name
    from account
    where inserted.account_number = account.account_number

create trigger insert_into_branch_cust_via_account
after insert on account
referencing new row as inserted
for each row
insert into branch_cust
    select inserted.branch_name, customer_name
    from depositor
    where depositor.account_number = inserted.account_number

Figure 5.22 Trigger code for Exercise 5.6.

5.7 Consider the bank database of Figure 5.21. Write an SQL trigger to carry out
the following action: On delete of an account, for each customer-owner of the
account, check if the owner has any remaining accounts, and if she does not,
delete her from the depositor relation.
Answer:

create trigger check-delete-trigger after delete on account
referencing old row as orow
for each row
begin
    delete from depositor
    where depositor.customer_name not in
        (select customer_name from depositor
         where account_number <> orow.account_number)
end

5.8 Given a relation S(student, subject, marks), write a query to find the top 10
students by total marks, by using SQL ranking. Include all students tied for the final
spot in the ranking, even if that results in more than 10 total students.
Answer:

select *
from (
    select student, total, rank() over (order by (total) desc) as t_rank
    from (
        select student, sum(marks) as total
        from S group by student
    )
)
where t_rank <= 10
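The treatment of ties can be seen on a small instance. This sketch uses Python's sqlite3 module and assumes a SQLite build with window-function support (version 3.25 or later); the cutoff is a parameter so that a top-3 query over five made-up students shows a tie for the final spot.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table S (student text, subject text, marks integer)")
con.executemany("insert into S values (?, ?, ?)", [
    ("a", "math", 90), ("a", "physics", 80),   # total 170
    ("b", "math", 85), ("b", "physics", 85),   # total 170
    ("c", "math", 80), ("c", "physics", 80),   # total 160
    ("d", "math", 70), ("d", "physics", 90),   # total 160
    ("e", "math", 50), ("e", "physics", 50),   # total 100
])

TOP_K = """
    select student, total, t_rank
    from (select student, total, rank() over (order by total desc) as t_rank
          from (select student, sum(marks) as total from S group by student))
    where t_rank <= ?
    order by t_rank, student
"""
top3 = con.execute(TOP_K, (3,)).fetchall()
print(top3)
```

Because rank leaves gaps after ties, students c and d share rank 3 and both appear, giving four rows for a top-3 request.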
5.9 Given a relation nyse(year, month, day, shares_traded, dollar_volume) with
trading data from the New York Stock Exchange, list each trading day in order of
number of shares traded, and show each day's rank.
Answer:

select year, month, day, shares_traded,
       rank () over (order by shares_traded desc) as mostshares
from nyse

5.10 Using the relation from Exercise 5.9, write an SQL query to generate a report
showing the number of shares traded, number of trades, and total dollar volume
broken down by year, each month of each year, and each trading day.
Answer:

select year, month, day, sum(shares_traded) as shares,
       sum(num_trades) as trades, sum(dollar_volume) as total_volume
from nyse
group by rollup (year, month, day)

5.11 Show how to express group by cube(a, b, c, d) using rollup; your answer should
have only one group by clause.
Answer:

group by rollup(a), rollup(b), rollup(c), rollup(d)
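Why this works can be checked by enumerating grouping sets: a group by clause with several rollup terms produces the cross product of their groupings, and each single-attribute rollup contributes either its attribute or nothing. The sketch below is plain Python with hypothetical helper names (rollup, cross, cube); it models the grouping sets only and does not run any SQL.

```python
from itertools import combinations, product

def rollup(*attrs):
    """Groupings of rollup(attrs): every prefix of attrs, down to the empty grouping."""
    return [attrs[:i] for i in range(len(attrs), -1, -1)]

def cross(*grouping_lists):
    """Groupings of 'group by G1, G2, ...': one grouping from each list, concatenated."""
    return {frozenset(sum(choice, ())) for choice in product(*grouping_lists)}

def cube(*attrs):
    """Groupings of cube(attrs): every subset of attrs."""
    return {frozenset(c) for n in range(len(attrs) + 1)
            for c in combinations(attrs, n)}

via_rollups = cross(rollup("a"), rollup("b"), rollup("c"), rollup("d"))
single_rollup = cross(rollup("a", "b", "c", "d"))
print(len(via_rollups), len(single_rollup))
```

By contrast, a single rollup(a, b, c, d) yields only the five prefix groupings.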


CHAPTER
6
Database Design using the E-R
Model

Practice Exercises


6.1 Construct an E-R diagram for a car insurance company whose customers own
one or more cars each. Each car has associated with it zero to any number of
recorded accidents. Each insurance policy covers one or more cars and has one
or more premium payments associated with it. Each payment is for a particular
period of time, and has an associated due date, and the date when the payment
was received.
Answer:

One possible E-R diagram is shown in Figure 6.101. Payments are modeled as
weak entities since they are related to a specific policy.
Note that the participation of accident in the relationship participated is not
total, since it is possible that there is an accident report where the participating
car is unknown.
6.2 Consider a database that includes the entity sets student, course, and section
from the university schema and that additionally records the marks that students
receive in different exams of different sections.

a. Construct an E-R diagram that models exams as entities and uses a ternary
relationship as part of the design.
b. Construct an alternative E-R diagram that uses only a binary relationship
between student and section. Make sure that only one relationship exists
between a particular student and section pair, yet you can represent the
marks that a student gets in different exams.

Answer:


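Figure 6.101 E-R diagram for a car insurance company. (Diagram not reproduced.
Entity sets: customer (customer_id, name, address), car (license_no, model),
policy (policy_id), accident (report_id, date, place), and premium_payment
(payment_no, due_date, amount, received_on). Relationship sets: owns, covers,
participated, and payment; the diagram carries the cardinality labels 1..* and 1..1.)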

a. The E-R diagram is shown in Figure 6.102. Note that an alternative is to
model examinations as weak entities related to a section, rather than as
strong entities. The marks relationship would then be a binary relationship
between student and exam, without directly involving section.
b. The E-R diagram is shown in Figure 6.103. Note that here we have not
modeled the name, place, and time of the exam as part of the relationship
attributes. Doing so would result in duplication of the information, once
per student, and we would not be able to record this information without
an associated student. If we wish to represent this information, we need
to retain a separate entity corresponding to each exam.
Figure 6.102 E-R diagram for marks database. (Diagram not reproduced. Entity
sets: student (student_id, name, dept_name, tot_cred), section (sec_id, semester,
year), course (course_id, title, credits), and exam (exam_id, name, place, time).
Relationship sets: exam_marks, with attribute marks, and sec_course.)

Figure 6.103 Another E-R diagram for marks database. (Diagram not reproduced.
Entity sets: student (student_id, name, dept_name, tot_cred), section (sec_id,
semester, year), and course (course_id, title, credits). Relationship sets:
exam_marks, with the multivalued attribute {exam_marks (exam_id, marks)}, and
sec_course.)

6.3 Design an E-R diagram for keeping track of the scoring statistics of your favorite
sports team. You should store the matches played, the scores in each match, the
players in each match, and individual player scoring statistics for each match.
Summary statistics should be modeled as derived attributes with an explanation
as to how they are computed.
Answer:
The diagram is shown in Figure 6.104. The derived attribute season_score is
computed by summing the score values associated with the player entity set via
the played relationship set.
6.4 Consider an E-R diagram in which the same entity set appears several times,
with its attributes repeated in more than one occurrence. Why is allowing this
redundancy a bad practice that one should avoid?
Answer:
The reason an entity set would appear more than once is if one is drawing a
diagram that spans multiple pages.
The different occurrences of an entity set may have different sets of
attributes, leading to an inconsistent diagram. Instead, the attributes of an entity
set should be specified only once. All other occurrences of the entity should
omit attributes. Since it is not possible to have an entity set without any
attributes, an occurrence of an entity set without attributes clearly indicates that
the attributes are specified elsewhere.

Figure 6.104 E-R diagram for favorite team statistics. (Diagram not reproduced.
Entity sets: match (match_id, date, stadium, opponent, own_score, opp_score) and
player (player_id, name, age, season_score()). Relationship set: played, with
attribute score.)


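Figure 6.29 Representation of a ternary relationship using binary relationships.
(Diagram not reproduced. (a) Entity sets A, B, and C joined by the ternary
relationship R. (b) Entity sets A, B, and C each linked to an entity set E through
RA, RB, and RC, respectively. (c) Entity sets A, B, and C linked pairwise by RAB,
RAC, and RBC.)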

6.5 An E-R diagram can be viewed as a graph. What do the following mean in terms
of the structure of an enterprise schema?
a. The graph is disconnected.
b. The graph has a cycle.

Answer:

a. If a pair of entity sets are connected by a path in an E-R diagram, the
entity sets are related, though perhaps indirectly. A disconnected graph
implies that there are pairs of entity sets that are unrelated to each other.
In an enterprise, we can say that the two parts of the enterprise are
completely independent of each other. If we split the graph into connected
components, we have, in effect, a separate database corresponding to each
independent part of the enterprise.
b. As indicated in the answer to the previous part, a path in the graph
between a pair of entity sets indicates a (possibly indirect) relationship
between the two entity sets. If there is a cycle in the graph, then every pair
of entity sets on the cycle are related to each other in at least two distinct
ways. If the E-R diagram is acyclic, then there is a unique path between
every pair of entity sets and thus a unique relationship between every pair
of entity sets.
[Figure 6.105: E-R diagram for Exercise 6.6b. Entity set E is connected to A, B,
and C by the relationships $R_A$, $R_B$, and $R_C$.]

6.6 Consider the representation of the ternary relationship of Figure 6.29a using
the binary relationships illustrated in Figure 6.29b (attributes not shown).
a. Show a simple instance of E, A, B, C, $R_A$, $R_B$, and $R_C$ that cannot
correspond to any instance of A, B, C, and R.
b. Modify the E-R diagram of Figure 6.29b to introduce constraints that will
guarantee that any instance of E, A, B, C, $R_A$, $R_B$, and $R_C$ that
satisfies the constraints will correspond to an instance of A, B, C, and R.
c. Modify the preceding translation to handle total participation constraints
on the ternary relationship.

Answer:

a. Let $E = \{e_1, e_2\}$, $A = \{a_1, a_2\}$, $B = \{b_1\}$, $C = \{c_1\}$,
$R_A = \{(e_1, a_1), (e_2, a_2)\}$, $R_B = \{(e_1, b_1)\}$, and
$R_C = \{(e_1, c_1)\}$. The entity $e_2$ is related to $a_2$ but to no entity in
B or C, so because of the tuple $(e_2, a_2)$, no instance of A, B, C, and R
exists that corresponds to E, $R_A$, $R_B$, and $R_C$.
b. See Figure 6.105. The idea is to introduce total participation constraints
between E and the relationships $R_A$, $R_B$, $R_C$ so that every tuple in E has
a relationship with A, B, and C.
c. Suppose A participates totally in the relationship R; then introduce a total
participation constraint between A and $R_A$, and similarly for B and C.
6.7 A weak entity set can always be made into a strong entity set by adding to its
attributes the primary-key attributes of its identifying entity set. Outline what
sort of redundancy will result if we do so.
Answer:
The primary key of a weak entity set can be inferred from its relationship with
the strong entity set. If we add primary-key attributes to the weak entity set,
they will be present in both the entity set and the relationship set, and they
have to be the same. Hence there will be redundancy.

6.8 Consider a relation such as sec_course, generated from a many-to-one
relationship set sec_course. Do the primary and foreign key constraints created
on the relation enforce the many-to-one cardinality constraint? Explain why.
Answer:
In this example, the primary key of section consists of the attributes
(course_id, sec_id, semester, year), which would also be the primary key of
sec_course, while course_id is a foreign key from sec_course referencing course.
These constraints ensure that a particular section can only correspond to one
course, and thus the many-to-one cardinality constraint is enforced.
However, these constraints cannot enforce a total participation constraint, since
a course or a section may not participate in the sec_course relationship.
6.9 Suppose the advisor relationship set were one-to-one. What extra constraints
are required on the relation advisor to ensure that the one-to-one cardinality
constraint is enforced?
Answer:
In addition to declaring s_ID as primary key for advisor, we declare i_ID as a
superkey for advisor (this can be done in SQL using the unique constraint on
i_ID).
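
As a minimal sketch of this answer, the following Python script uses the
standard-library sqlite3 module to declare s_ID as the primary key of advisor
and to add a unique constraint on i_ID. The simplified advisor table (without
its foreign keys to student and instructor), the helper try_insert, and the
sample ID values are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table advisor (
        s_ID varchar(5) primary key,     -- each student has at most one advisor
        i_ID varchar(5) not null unique  -- each instructor advises at most one student
    )
    """
)
conn.execute("insert into advisor values ('00128', '45565')")


def try_insert(s_id, i_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("insert into advisor values (?, ?)", (s_id, i_id))
        return True
    except sqlite3.IntegrityError:
        return False


# A second student for the same instructor violates unique(i_ID).
second_student_same_instructor = try_insert("12345", "45565")
# A second instructor for the same student violates the primary key.
same_student_second_instructor = try_insert("00128", "10101")
# A fresh student/instructor pair is accepted.
new_pair = try_insert("12345", "10101")
print(second_student_same_instructor, same_student_second_instructor, new_pair)
```

With only the primary key on s_ID, the first rejected insert would have been
accepted, which is exactly the many-to-one behavior the unique constraint rules
out.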

6.10 Consider a many-to-one relationship R between entity sets A and B. Suppose
the relation created from R is combined with the relation created from A. In
SQL, attributes participating in a foreign key constraint can be null. Explain
how a constraint on total participation of A in R can be enforced using not null
constraints in SQL.

Answer:
The foreign-key attribute in the combined relation that references the primary
key of B should be declared not null. Every tuple of A is then forced to carry a
B value, so no entity in A can be left unrelated under R. For example, suppose a
is a tuple in A that has no corresponding entry in R. When R is combined with A,
the tuple for a would need null in the foreign-key attribute referencing B,
which the not null constraint disallows.
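
The effect of the not null constraint can be sketched with the
standard-library sqlite3 module. The table and column names (A, B, a, b), the
helper accepts, and the sample values are assumptions chosen to match the
exercise, and foreign-key checking has to be switched on explicitly in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite leaves foreign keys off by default
conn.execute("create table B (b integer primary key)")
# Relation for A combined with the many-to-one relationship R: the attribute b
# records which B entity each A entity is related to.
conn.execute(
    """
    create table A (
        a integer primary key,
        b integer not null references B(b)
    )
    """
)
conn.execute("insert into B values (1)")


def accepts(a, b):
    """Return True if the row (a, b) can be inserted into A."""
    try:
        conn.execute("insert into A values (?, ?)", (a, b))
        return True
    except sqlite3.IntegrityError:
        return False


related = accepts(10, 1)       # a = 10 participates in R through b = 1
unrelated = accepts(11, None)  # no related B entity: rejected by not null
dangling = accepts(12, 99)     # b = 99 is not in B: rejected by the foreign key
print(related, unrelated, dangling)
```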
6.11 In SQL, foreign key constraints can reference only the primary key attributes
of the referenced relation or other attributes declared to be a superkey using
the unique constraint. As a result, total participation constraints on a
many-to-many relationship set (or on the "one" side of a one-to-many relationship
set) cannot be enforced on the relations created from the relationship set, using
primary key, foreign key, and not null constraints on the relations.

a. Explain why.
b. Explain how to enforce total participation constraints using complex
check constraints or assertions (see Section 4.4.8). (Unfortunately, these
features are not supported on any widely used database currently.)

Answer:

a. For the many-to-many case, the relationship set must be represented as a
separate relation that cannot be combined with either participating entity.
Now, there is no way in SQL to ensure that a primary-key value occurring
in an entity set $E_1$ also occurs in a many-to-many relationship R, since the
corresponding attribute in R is not unique; SQL foreign keys can only
refer to the primary key or some other unique key.
Similarly, for the one-to-many case, there is no way to ensure that an
attribute on the one side appears in the relation corresponding to the many
side, for the same reason.
b. Let the relation R be many-to-one from entity A to entity B with a and b as
their respective primary keys. We can put the following check constraints
on the "one" side relation B:

    constraint total_part check (b in (select b from A));
    set constraints total_part deferred;

Note that the constraint should be set to deferred so that it is only checked
at the end of the transaction; otherwise if we insert a b value in B before
it is inserted in A, the constraint would be violated, and if we insert it in
A before we insert it in B, a foreign-key violation would occur.

6.12 Consider the following lattice structure of generalization and specialization
(attributes not shown).

X Y

A B C

For entity sets A, B, and C, explain how attributes are inherited from the
higher-level entity sets X and Y. Discuss how to handle a case where an attribute
of X has the same name as some attribute of Y.
Answer:

A inherits all the attributes of X, plus it may define its own attributes.
Similarly, C inherits all the attributes of Y plus its own attributes. B inherits
the attributes of both X and Y. If there is some attribute name which belongs to
both X and Y, it may be referred to in B by the qualified name X.name or Y.name.
6.13 An E-R diagram usually models the state of an enterprise at a point in time.
Suppose we wish to track temporal changes, that is, changes to data over time.
For example, Zhang may have been a student between September 2015 and
May 2019, while Shankar may have had instructor Einstein as advisor from May
2018 to December 2018, and again from June 2019 to January 2020. Similarly,
attribute values of an entity or relationship, such as title and credits of
course, salary, or even name of instructor, and tot_cred of student, can change
over time.
One way to model temporal changes is as follows: We define a new data type
called valid_time, which is a time interval, or a set of time intervals. We then
associate a valid_time attribute with each entity and relationship, recording the
time periods during which the entity or relationship is valid. The end time of an
interval can be infinity; for example, if Shankar became a student in September
2018, and is still a student, we can represent the end time of the valid_time
interval as infinity for the Shankar entity. Similarly, we model attributes that
can change over time as a set of values, each with its own valid_time.

a. Draw an E-R diagram with the student and instructor entities, and the
advisor relationship, with the above extensions to track temporal changes.

b. Convert the E-R diagram discussed above into a set of relations.

It should be clear that the set of relations generated is rather complex, leading
to difficulties in tasks such as writing queries in SQL. An alternative approach,
which is used more widely, is to ignore temporal changes when designing the
E-R model (in particular, temporal changes to attribute values), and to modify
the relations generated from the E-R model to track temporal changes.

Answer:

a. The E-R diagram is shown in Figure 6.106.
The primary key attributes student_id and instructor_id are assumed to be
immutable, that is, they are not allowed to change with time. All other
attributes are assumed to potentially change with time.
Note that the diagram uses multivalued composite attributes such as
valid_times or name, with subattributes such as start_time or value. The
value attribute is a subattribute of several attributes such as name, tot_cred
and salary, and refers to the name, total credits or salary during a
particular interval of time.
b. The generated relations are as shown below. Each multivalued attribute
has turned into a relation, with the relation name consisting of the
original relation name concatenated with the name of the multivalued
attribute. The relation corresponding to the entity has only the primary-key
attribute, and this is needed to ensure uniqueness.

student(student_id)
student_valid_times(student_id, start_time, end_time)
student_name(student_id, value, start_time, end_time)
student_dept_name(student_id, value, start_time, end_time)
student_tot_cred(student_id, value, start_time, end_time)
instructor(instructor_id)
instructor_valid_times(instructor_id, start_time, end_time)
instructor_name(instructor_id, value, start_time, end_time)
instructor_dept_name(instructor_id, value, start_time, end_time)
instructor_salary(instructor_id, value, start_time, end_time)
advisor(student_id, instructor_id, start_time, end_time)

The primary keys shown are derived directly from the E-R diagram. If we
add the additional constraint that time intervals cannot overlap (or even
the weaker condition that one start time cannot have two end times), we
can remove the end_time from all the above primary keys.

[Figure 6.106: E-R diagram for Exercise 6.13. Entity set student has attributes
student_id, {valid_times (start_time, end_time)}, {name (value, start_time,
end_time)}, {dept_name (value, start_time, end_time)}, and {tot_cred (value,
start_time, end_time)}. Entity set instructor has attributes instructor_id,
{valid_times (start_time, end_time)}, {name (value, start_time, end_time)},
{dept_name (value, start_time, end_time)}, and {salary (value, start_time,
end_time)}. The relationship set advisor between them has the attribute
{valid_time (start_time, end_time)}.]


CHAPTER
7
Relational Database Design
Practice Exercises
7.1 Suppose that we decompose the schema $R = (A, B, C, D, E)$ into

    $(A, B, C)$
    $(A, D, E)$.

Show that this decomposition is a lossless decomposition if the following set $F$
of functional dependencies holds:

    $A \to BC$
    $CD \to E$
    $B \to D$
    $E \to A$

Answer:

A decomposition $\{R_1, R_2\}$ is a lossless decomposition if
$R_1 \cap R_2 \to R_1$ or $R_1 \cap R_2 \to R_2$. Let $R_1 = (A, B, C)$,
$R_2 = (A, D, E)$, and $R_1 \cap R_2 = A$. Since $A$ is a candidate key (see
Practice Exercise 7.6), $R_1 \cap R_2 \to R_1$.
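
The test used in this answer can be checked mechanically. The sketch below is
an illustrative Python version of the simple attribute-closure loop; the helper
name closure and the encoding of each dependency as a pair of attribute strings
are assumptions, not notation from the text.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs whenever the whole left side is already covered.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


F = [("A", "BC"), ("CD", "E"), ("B", "D"), ("E", "A")]
R1, R2 = set("ABC"), set("ADE")

common = R1 & R2                     # attributes shared by the two schemas
common_closure = closure(common, F)  # everything the shared attributes determine
lossless = R1 <= common_closure or R2 <= common_closure
print(sorted(common), sorted(common_closure), lossless)
```

Here the shared attribute A determines every attribute of R, so both
$R_1 \cap R_2 \to R_1$ and $R_1 \cap R_2 \to R_2$ hold.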

7.2 List all nontrivial functional dependencies satisfied by the relation of Figure
7.18.

    A      B      C
    $a_1$  $b_1$  $c_1$
    $a_1$  $b_1$  $c_2$
    $a_2$  $b_1$  $c_1$
    $a_2$  $b_1$  $c_3$

Figure 7.17 Relation of Exercise 7.2.

Answer:

The nontrivial functional dependencies are: $A \to B$ and $C \to B$, and a
dependency they logically imply: $AC \to B$. $C$ does not functionally determine
$A$ because the first and third tuples have the same $C$ but different $A$
values. The same tuples also show $B$ does not functionally determine $A$.
Likewise, $A$ does not functionally determine $C$ because the first two tuples
have the same $A$ value and different $C$ values. The same tuples also show $B$
does not functionally determine $C$. There are 19 trivial functional
dependencies of the form $\alpha \to \beta$, where $\beta \subseteq \alpha$.
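
The enumeration can be double-checked by brute force. The Python sketch below
is illustrative only: the row encoding and the helper names holds and
nonempty_subsets are assumptions, and the search for nontrivial dependencies is
restricted to those whose two sides are disjoint, which is how the answer above
lists them.

```python
from itertools import combinations

ATTRS = ("A", "B", "C")
ROWS = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b1", "C": "c2"},
    {"A": "a2", "B": "b1", "C": "c1"},
    {"A": "a2", "B": "b1", "C": "c3"},
]


def holds(lhs, rhs, rows):
    """True if every pair of rows agreeing on lhs also agrees on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def nonempty_subsets(attrs):
    """All nonempty attribute combinations, smallest first."""
    return [c for n in range(1, len(attrs) + 1) for c in combinations(attrs, n)]


nontrivial = []
trivial_count = 0
for lhs in nonempty_subsets(ATTRS):
    for rhs in nonempty_subsets(ATTRS):
        if set(rhs) <= set(lhs):
            trivial_count += 1  # right side contained in left side
        elif not set(rhs) & set(lhs) and holds(lhs, rhs, ROWS):
            nontrivial.append(("".join(lhs), "".join(rhs)))

print(nontrivial, trivial_count)
```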
7.3 Explain how functional dependencies can be used to indicate the following:

• A one-to-one relationship set exists between entity sets student and
instructor.

• A many-to-one relationship set exists between entity sets student and
instructor.

Answer:
Let $Pk(r)$ denote the primary key attribute of relation $r$.

• The functional dependencies $Pk(student) \to Pk(instructor)$ and
$Pk(instructor) \to Pk(student)$ indicate a one-to-one relationship because
any two tuples with the same value for student must have the same value for
instructor, and any two tuples agreeing on instructor must have the same
value for student.

• The functional dependency $Pk(student) \to Pk(instructor)$ indicates a
many-to-one relationship since any student value which is repeated will have
the same instructor value, but many student values may have the same
instructor value.

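7.4 Use Armstrong's axioms to prove the soundness of the union rule. (Hint: Use
the augmentation rule to show that, if $\alpha \to \beta$, then
$\alpha \to \alpha\beta$. Apply the augmentation rule again, using
$\alpha \to \gamma$, and then apply the transitivity rule.)
Answer:

To prove that:

    if $\alpha \to \beta$ and $\alpha \to \gamma$ then $\alpha \to \beta\gamma$

Following the hint, we derive:

    $\alpha \to \beta$                given
    $\alpha\alpha \to \alpha\beta$    augmentation rule
    $\alpha \to \alpha\beta$          union of identical sets
    $\alpha \to \gamma$               given
    $\alpha\beta \to \gamma\beta$     augmentation rule
    $\alpha \to \beta\gamma$          transitivity rule and set union commutativity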
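7.5 Use Armstrong's axioms to prove the soundness of the pseudotransitivity rule.
Answer:

Proof using Armstrong's axioms of the pseudotransitivity rule:

    if $\alpha \to \beta$ and $\gamma\beta \to \delta$, then $\alpha\gamma \to \delta$.

    $\alpha \to \beta$               given
    $\alpha\gamma \to \gamma\beta$   augmentation rule and set union commutativity
    $\gamma\beta \to \delta$         given
    $\alpha\gamma \to \delta$        transitivity rule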
7.6 Compute the closure of the following set $F$ of functional dependencies for
relation schema $R = (A, B, C, D, E)$.

    $A \to BC$
    $CD \to E$
    $B \to D$
    $E \to A$

List the candidate keys for $R$.

Answer:
Note: It is not reasonable to expect students to enumerate all of $F^+$. Some
shorthand representation of the result should be acceptable as long as the
nontrivial members of $F^+$ are found.
Starting with $A \to BC$, we can conclude: $A \to B$ and $A \to C$.

    Since $A \to B$ and $B \to D$, $A \to D$            (decomposition, transitive)
    Since $A \to CD$ and $CD \to E$, $A \to E$          (union, decomposition, transitive)
    Since $A \to A$, we have                            (reflexive)
    $A \to ABCDE$ from the above steps                  (union)
    Since $E \to A$, $E \to ABCDE$                      (transitive)
    Since $CD \to E$, $CD \to ABCDE$                    (transitive)
    Since $B \to D$ and $BC \to CD$, $BC \to ABCDE$     (augmentative, transitive)
    Also, $C \to C$, $D \to D$, $BD \to D$, etc.

Therefore, any functional dependency with $A$, $E$, $BC$, or $CD$ on the
left-hand side of the arrow is in $F^+$, no matter which other attributes appear
in the FD. Allow * to represent any set of attributes in $R$, then $F^+$ is
$BD \to B$, $BD \to D$, $C \to C$, $D \to D$, $BD \to BD$, $B \to D$, $B \to B$,
$B \to BD$, and all FDs of the form $A* \to \alpha$, $BC* \to \alpha$,
$CD* \to \alpha$, $E* \to \alpha$ where $\alpha$ is any subset of
$\{A, B, C, D, E\}$. The candidate keys are $A$, $BC$, $CD$, and $E$.
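
The candidate keys can be confirmed by brute force over all attribute subsets.
This Python sketch is illustrative only; the closure helper and the
string-pair encoding of each dependency are assumptions rather than notation
from the text.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


R = "ABCDE"
F = [("A", "BC"), ("CD", "E"), ("B", "D"), ("E", "A")]

candidate_keys = []
# Visiting subsets smallest first means a superkey is minimal exactly when it
# contains none of the candidate keys found so far.
for size in range(1, len(R) + 1):
    for combo in combinations(R, size):
        attrs = set(combo)
        is_superkey = closure(attrs, F) == set(R)
        is_minimal = not any(set(key) < attrs for key in candidate_keys)
        if is_superkey and is_minimal:
            candidate_keys.append("".join(combo))

print(candidate_keys)
print(sorted(closure("BD", F)))  # BD determines only B and D
```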

7.7 Using the functional dependencies of Exercise 7.6, compute the canonical
cover $F_c$.

Answer:
The given set of FDs $F$ is:

    $A \to BC$
    $CD \to E$
    $B \to D$
    $E \to A$

The left side of each FD in $F$ is unique. Also, none of the attributes in the
left side or right side of any of the FDs is extraneous. Therefore the canonical
cover $F_c$ is equal to $F$.

7.8 Consider the algorithm in Figure 7.19 to compute $\alpha^+$. Show that this
algorithm is more efficient than the one presented in Figure 7.8 (Section 7.4.2)
and that it computes $\alpha^+$ correctly.

    result := $\emptyset$;
    /* fdcount is an array whose $i$th element contains the number
       of attributes on the left side of the $i$th FD that are
       not yet known to be in $\alpha^+$ */
    for $i$ := 1 to $|F|$ do
      begin
        let $\beta \to \gamma$ denote the $i$th FD;
        fdcount[$i$] := $|\beta|$;
      end
    /* appears is an array with one entry for each attribute. The
       entry for attribute $A$ is a list of integers. Each integer
       $i$ on the list indicates that $A$ appears on the left side
       of the $i$th FD */
    for each attribute $A$ do
      begin
        appears[$A$] := NIL;
        for $i$ := 1 to $|F|$ do
          begin
            let $\beta \to \gamma$ denote the $i$th FD;
            if $A \in \beta$ then add $i$ to appears[$A$];
          end
      end
    addin($\alpha$);
    return (result);

    procedure addin($\alpha$);
    for each attribute $A$ in $\alpha$ do
      begin
        if $A \notin$ result then
          begin
            result := result $\cup$ $\{A\}$;
            for each element $i$ of appears[$A$] do
              begin
                fdcount[$i$] := fdcount[$i$] $-$ 1;
                if fdcount[$i$] = 0 then
                  begin
                    let $\beta \to \gamma$ denote the $i$th FD;
                    addin($\gamma$);
                  end
              end
          end
      end

Figure 7.18 An algorithm to compute $\alpha^+$.

Answer:

The algorithm is correct because:

• If $A$ is added to result then there is a proof that $\alpha \to A$. To see
this, observe that $\alpha \to \alpha$ trivially, so $\alpha$ is correctly part
of result. If $A \notin \alpha$ is added to result, there must be some FD
$\beta \to \gamma$ such that $A \in \gamma$ and $\beta$ is already a subset of
result. (Otherwise fdcount would be nonzero and the if condition would be
false.) A full proof can be given by induction on the depth of recursion for an
execution of addin, but such a proof can be expected only from students with a
good mathematical background.
• If $A \in \alpha^+$, then $A$ is eventually added to result. We prove this by
induction on the length of the proof of $\alpha \to A$ using Armstrong's axioms.
First observe that if procedure addin is called with some argument $\beta$, all
the attributes in $\beta$ will be added to result. Also if a particular FD's
fdcount becomes 0, all the attributes in its tail will definitely be added to
result. The base case of the proof, $A \in \alpha \Rightarrow A \in \alpha^+$,
is obviously true because the first call to addin has the argument $\alpha$.
The inductive hypothesis is that if $\alpha \to A$ can be proved in $n$ steps or
less, then $A \in$ result. If there is a proof in $n + 1$ steps that
$\alpha \to A$, then the last step was an application of either reflexivity,
augmentation, or transitivity on a fact $\alpha \to \beta$ proved in $n$ or
fewer steps. If reflexivity or augmentation was used in the $(n + 1)$st step,
$A$ must have been in result by the end of the $n$th step itself. Otherwise, by
the inductive hypothesis, $\beta \subseteq$ result. Therefore, the dependency
used in proving $\beta \to \gamma$, $A \in \gamma$, will have fdcount set to 0
by the end of the $n$th step. Hence $A$ will be added to result.

To see that this algorithm is more efficient than the one presented in the
chapter, note that we scan each FD once in the main program. The resulting array
appears has size proportional to the size of the given FDs. The recursive calls
to addin result in processing linear in the size of appears. Hence the algorithm
has time complexity which is linear in the size of the given FDs. On the other
hand, the algorithm given in the text has quadratic time complexity, as it may
perform the loop as many times as the number of FDs, in each loop scanning
all of them once.
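
The counter-based algorithm can be sketched in Python as follows. This is an
illustrative translation, not the textbook's code: the function name
closure_linear and the string-pair encoding of dependencies are assumptions,
and an explicit stack replaces the recursive calls to addin (this changes the
order in which attributes are visited, not the result).

```python
from collections import defaultdict


def closure_linear(alpha, fds):
    """Closure of alpha under fds, a list of (lhs, rhs) attribute strings.

    fdcount[i] is the number of left-side attributes of the ith FD that are
    not yet known to be in the closure.
    """
    fdcount = [len(set(lhs)) for lhs, _ in fds]
    appears = defaultdict(list)  # attribute -> indices of FDs with it on the left
    for i, (lhs, _) in enumerate(fds):
        for attr in set(lhs):
            appears[attr].append(i)

    result = set()
    pending = list(alpha)  # attributes waiting to be added to result
    while pending:
        attr = pending.pop()
        if attr in result:
            continue
        result.add(attr)
        for i in appears[attr]:
            fdcount[i] -= 1
            if fdcount[i] == 0:  # the whole left side is now in result
                pending.extend(fds[i][1])
    return result


F = [("A", "BC"), ("CD", "E"), ("B", "D"), ("E", "A")]
print(sorted(closure_linear("A", F)))
print(sorted(closure_linear("B", F)))
```

Each attribute is added to result at most once, and each entry of appears is
visited at most once, which is where the linear bound comes from.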
7.9 Given the database schema $R(A, B, C)$, and a relation $r$ on the schema $R$,
write an SQL query to test whether the functional dependency $B \to C$ holds on
relation $r$. Also write an SQL assertion that enforces the functional
dependency. Assume that no null values are present. (Although part of the SQL
standard, such assertions are not supported by any database implementation
currently.)
Answer:

a. The query is given below. Its result is non-empty if and only if $B \to C$
does not hold on $r$.

    select B
    from r
    group by B
    having count(distinct C) > 1

b.

    create assertion b_to_c check
        (not exists
            (select B
             from r
             group by B
             having count(distinct C) > 1
            )
        )
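
The query from part a can be tried out with the standard-library sqlite3
module. In this sketch the helper violating_b_values and the sample rows are
assumptions for illustration; the assertion from part b is not run because
SQLite, like other systems, does not support create assertion.

```python
import sqlite3

VIOLATION_QUERY = """
    select B
    from r
    group by B
    having count(distinct C) > 1
"""


def violating_b_values(rows):
    """Return the B values that are associated with more than one C value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table r (A integer, B integer, C integer)")
    conn.executemany("insert into r values (?, ?, ?)", rows)
    return [b for (b,) in conn.execute(VIOLATION_QUERY)]


# B -> C holds: each B value maps to a single C value.
satisfying = violating_b_values([(1, 10, 100), (2, 10, 100), (3, 20, 200)])
# B -> C fails: B = 10 maps to both C = 100 and C = 101.
violating = violating_b_values([(1, 10, 100), (2, 10, 101), (3, 20, 200)])
print(satisfying, violating)
```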

7.10 Our discussion of lossless decomposition implicitly assumed that attributes on
the left-hand side of a functional dependency cannot take on null values. What
could go wrong on decomposition, if this property is violated?
Answer:
The natural join operator is defined in terms of the Cartesian product and the
selection operator. A selection predicate that compares a null value evaluates
to unknown, so the tuple is not selected. Thus, the natural join excludes all
tuples with null values on the common attributes from the final result. Thus,
the decomposition would be lossy (in a manner different from the usual case of
lossy decomposition), if null values occur in the left-hand side of the
functional dependency used to decompose the relation. (Null values in attributes
that occur only in the right-hand side of the functional dependency do not cause
any problems.)
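
A small experiment with the standard-library sqlite3 module illustrates the
loss. The relation r(A, B, C), the assumed dependency A → B used for the
decomposition, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table r (A integer, B integer, C integer)")
conn.executemany(
    "insert into r values (?, ?, ?)",
    [(1, 10, 100), (None, 20, 200)],  # the second tuple has a null on the left side
)

# Decompose r on the dependency A -> B into r1(A, B) and r2(A, C).
conn.execute("create table r1 as select distinct A, B from r")
conn.execute("create table r2 as select distinct A, C from r")

original = conn.execute("select A, B, C from r").fetchall()
rejoined = conn.execute("select A, B, C from r1 natural join r2").fetchall()
print(original)
print(rejoined)
```

The tuple with a null A value is present in both r1 and r2 but does not survive
the join, because the comparison of its A values evaluates to unknown.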
7.11 In the BCNF decomposition algorithm, suppose you use a functional dependency
$\alpha \to \beta$ to decompose a relation schema $r(\alpha, \beta, \gamma)$ into
$r_1(\alpha, \beta)$ and $r_2(\alpha, \gamma)$.

a. What primary and foreign-key constraint do you expect to hold on the
decomposed relations?
b. Give an example of an inconsistency that can arise due to an erroneous
update, if the foreign-key constraint were not enforced on the decomposed
relations above.
c. When a relation schema is decomposed into 3NF using the algorithm in
Section 7.5.2, what primary and foreign-key dependencies would you expect to
hold on the decomposed schema?

Answer:

a. $\alpha$ should be a primary key for $r_1$, and $\alpha$ should be the
foreign key from $r_2$, referencing $r_1$.
b. If the foreign key constraint is not enforced, then a deletion of a tuple from
$r_1$ would not have a corresponding deletion from the referencing tuples in
$r_2$. Instead of deleting a tuple from $r$, this would amount to simply setting
the value of $\beta$ to null in some tuples.
c. For every schema $r_i(\alpha\beta)$ added to the decomposition because of a
functional dependency $\alpha \to \beta$, $\alpha$ should be made the primary
key. Also, a candidate key for the original relation is located in some newly
created relation $r_k$ and is a primary key for that relation.
Foreign-key constraints are created as follows: for each relation $r_i$ created
above, if the primary key attributes of $r_i$ also occur in any other relation
$r_j$, then a foreign-key constraint is created from those attributes in $r_j$,
referencing (the primary key of) $r_i$.



7.12 Let $R_1, R_2, \ldots, R_n$ be a decomposition of schema $U$. Let $u(U)$ be a
relation, and let $r_i = \Pi_{R_i}(u)$. Show that

    $u \subseteq r_1 \bowtie r_2 \bowtie \cdots \bowtie r_n$

Answer:

Consider some tuple $t$ in $u$.
Note that $r_i = \Pi_{R_i}(u)$ implies that $t[R_i] \in r_i$, $1 \le i \le n$.
Thus,

    $t[R_1] \bowtie t[R_2] \bowtie \cdots \bowtie t[R_n] \in r_1 \bowtie r_2 \bowtie \cdots \bowtie r_n$

By the definition of natural join,

    $t[R_1] \bowtie t[R_2] \bowtie \cdots \bowtie t[R_n] = \Pi_{\alpha}(\sigma_{\beta}(t[R_1] \times t[R_2] \times \cdots \times t[R_n]))$

where the condition $\beta$ is satisfied if values of attributes with the same
name in a tuple are equal and where $\alpha = U$. The Cartesian product of single
tuples generates one tuple. The selection process is satisfied because all
attributes with the same name must have the same value since they are projections
from the same tuple. Finally, the projection clause removes duplicate attribute
names.
By the definition of decomposition, $U = R_1 \cup R_2 \cup \cdots \cup R_n$,
which means that all attributes of $t$ are in
$t[R_1] \bowtie t[R_2] \bowtie \cdots \bowtie t[R_n]$. That is, $t$ is equal to
the result of this join.
Since $t$ is any arbitrary tuple in $u$,

    $u \subseteq r_1 \bowtie r_2 \bowtie \cdots \bowtie r_n$
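
The containment can be observed on a small instance. The Python sketch below is
illustrative only: the relation u, the decomposition into (A, B) and (B, C), and
the helpers project and natural_join are assumptions. It also shows that the
containment can be strict when the decomposition is lossy.

```python
def project(rows, attrs):
    """Projection with duplicate elimination; rows are dicts."""
    return {tuple((a, row[a]) for a in attrs) for row in rows}


def natural_join(left, right):
    """Natural join of two sets of tuples of (attribute, value) pairs."""
    joined = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            # Keep the pair only if all attributes with the same name agree.
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                merged = {**ld, **rd}
                joined.add(tuple(sorted(merged.items())))
    return joined


u = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 2, "B": 1, "C": 2},
]
r1 = project(u, ("A", "B"))
r2 = project(u, ("B", "C"))
join = natural_join(r1, r2)
u_tuples = {tuple(sorted(row.items())) for row in u}

print(u_tuples <= join, len(u_tuples), len(join))
```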

7.13 Show that the decomposition in Exercise 7.1 is not a dependency-preserving
decomposition.
Answer:
There are several functional dependencies that are not preserved. We discuss
one example here. The dependency $B \to D$ is not preserved. $F_1$, the
restriction of $F$ to $(A, B, C)$ is $A \to ABC$, $A \to AB$, $A \to AC$,
$A \to BC$, $A \to B$, $A \to C$, $A \to A$, $B \to B$, $C \to C$, $AB \to AC$,
$AB \to ABC$, $AB \to BC$, $AB \to AB$, $AB \to A$, $AB \to B$, $AB \to C$,
$AC$ (same as $AB$), $BC$ (same as $AB$), $ABC$ (same as $AB$). $F_2$, the
restriction of $F$ to $(A, D, E)$ is $A \to ADE$, $A \to AD$, $A \to AE$,
$A \to DE$, $A \to A$, $A \to D$, $A \to E$, $D \to D$, $E$ (same as $A$), $AD$,
$AE$, $DE$, $ADE$ (same as $A$). $(F_1 \cup F_2)^+$ is easily seen not to
contain $B \to D$ since the only FD in $F_1 \cup F_2$ with $B$ as the left side
is $B \to B$, a trivial FD. Thus $B \to D$ is not preserved.
A simpler argument is as follows: $F_1$ contains no dependencies with $D$ on
the right side of the arrow. $F_2$ contains no dependencies with $B$ on the left
side of the arrow. Therefore for $B \to D$ to be preserved there must be a
functional dependency $B \to \alpha$ in $F_1^+$ and $\alpha \to D$ in $F_2^+$
(so $B \to D$ would follow by transitivity). Since the intersection of the two
schemes is $A$, $\alpha = A$. Observe that $B \to A$ is not in $F_1^+$ since
$B^+ = BD$.
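
Dependency preservation can also be tested without enumerating the
restrictions, by repeatedly taking closures within each schema of the
decomposition. The Python sketch below follows that standard test; the helper
names closure and is_preserved and the string encoding of schemas and
dependencies are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_preserved(lhs, rhs, fds, schemas):
    """Test whether lhs -> rhs follows from the restrictions of fds to schemas."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for schema in schemas:
            # Attributes of this schema determined by what we already have in it.
            gained = closure(result & set(schema), fds) & set(schema)
            if not gained <= result:
                result |= gained
                changed = True
    return set(rhs) <= result


F = [("A", "BC"), ("CD", "E"), ("B", "D"), ("E", "A")]
decomposition = ["ABC", "ADE"]
preserved = {f"{l}->{r}": is_preserved(l, r, F, decomposition) for l, r in F}
print(preserved)
```

On this input the test reports that $B \to D$ and $CD \to E$ are not preserved,
which agrees with the statement that several dependencies are lost.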

7.14 Show that there can be more than one canonical cover for a given set of
functional dependencies, using the following set of dependencies:

    $X \to YZ$, $Y \to XZ$, and $Z \to XY$.

Answer:
Consider the first functional dependency. We can verify that $Z$ is extraneous
in $X \to YZ$ and delete it. Subsequently, we can similarly check that $X$ is
extraneous in $Y \to XZ$ and delete it, and that $Y$ is extraneous in
$Z \to XY$ and delete it, resulting in a canonical cover $X \to Y$, $Y \to Z$,
$Z \to X$.
However, we can also verify that $Y$ is extraneous in $X \to YZ$ and delete it.
Subsequently, we can similarly check that $Z$ is extraneous in $Y \to XZ$ and
delete it, and that $X$ is extraneous in $Z \to XY$ and delete it, resulting in
a canonical cover $X \to Z$, $Z \to Y$, $Y \to X$.
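
That both results are covers of the original set can be verified by checking
that each set implies every dependency of the other. The Python sketch below is
illustrative only; the helpers closure, implies, and equivalent are assumed
names.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs is in the closure of fds."""
    return set(rhs) <= closure(lhs, fds)


def equivalent(f, g):
    """Two sets of FDs are equivalent if each implies every FD of the other."""
    return all(implies(g, l, r) for l, r in f) and all(implies(f, l, r) for l, r in g)


F = [("X", "YZ"), ("Y", "XZ"), ("Z", "XY")]
cover_1 = [("X", "Y"), ("Y", "Z"), ("Z", "X")]
cover_2 = [("X", "Z"), ("Z", "Y"), ("Y", "X")]

print(equivalent(F, cover_1), equivalent(F, cover_2), set(cover_1) == set(cover_2))
```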

7.15 The algorithm to generate a canonical cover only removes one extraneous
attribute at a time. Use the functional dependencies from Exercise 7.14 to show
what can go wrong if two attributes inferred to be extraneous are deleted at
once.
Answer:
In $X \to YZ$, one can infer that $Y$ is extraneous, and so is $Z$. But
deleting both will result in a set of dependencies from which $X \to YZ$ can no
longer be inferred. Deleting $Y$ results in $Z$ no longer being extraneous, and
deleting $Z$ results in $Y$ no longer being extraneous. The canonical cover
algorithm only deletes one attribute at a time, avoiding the problem that could
occur if two attributes are deleted at the same time.
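
The failure can be reproduced with a few closure computations. In this
illustrative Python sketch, closure and replace_first are assumed helper names;
an attribute on the right side of $X \to YZ$ is treated as extraneous when it
is still in the closure of $X$ after being removed from that dependency.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


F = [("X", "YZ"), ("Y", "XZ"), ("Z", "XY")]


def replace_first(fds, new_rhs):
    """Return fds with the right side of X -> YZ replaced by new_rhs."""
    return [("X", new_rhs)] + fds[1:]


# Each attribute is extraneous when tested on its own.
y_extraneous = "Y" in closure("X", replace_first(F, "Z"))
z_extraneous = "Z" in closure("X", replace_first(F, "Y"))
# Deleting both leaves X with nothing on the right side.
after_deleting_both = closure("X", replace_first(F, ""))
print(y_extraneous, z_extraneous, sorted(after_deleting_both))
```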
7.16 Show that it is possible to ensure that a dependency-preserving decomposition
into 3NF is a lossless decomposition by guaranteeing that at least one schema
contains a candidate key for the schema being decomposed. (Hint: Show that
the join of all the projections onto the schemas of the decomposition cannot
have more tuples than the original relation.)
Answer:

Let $F$ be a set of functional dependencies that hold on a schema $R$. Let
$\sigma = \{R_1, R_2, \ldots, R_n\}$ be a dependency-preserving 3NF
decomposition of $R$. Let $X$ be a candidate key for $R$.
Consider a legal instance $r$ of $R$. Let
$j = \Pi_X(r) \bowtie \Pi_{R_1}(r) \bowtie \Pi_{R_2}(r) \cdots \bowtie \Pi_{R_n}(r)$.
We want to prove that $r = j$.
We claim that if $t_1$ and $t_2$ are two tuples in $j$ such that
$t_1[X] = t_2[X]$, then $t_1 = t_2$. To prove this claim, we use the following
inductive argument:
Let $F' = F_1 \cup F_2 \cup \cdots \cup F_n$, where each $F_i$ is the
restriction of $F$ to the schema $R_i$ in $\sigma$. Consider the use of the
algorithm given in Figure 7.8 to compute the closure of $X$ under $F'$. We use
induction on the number of times that the for loop in this algorithm is
executed.

• Basis: In the first step of the algorithm, result is assigned to $X$, and
hence given that $t_1[X] = t_2[X]$, we know that $t_1[result] = t_2[result]$ is
true.
• Induction Step: Let $t_1[result] = t_2[result]$ be true at the end of the
$k$th execution of the for loop.
Suppose the functional dependency considered in the $(k + 1)$th execution
of the for loop is $\beta \to \gamma$, and that $\beta \subseteq result$.
$\beta \subseteq result$ implies that $t_1[\beta] = t_2[\beta]$ is true. The
facts that $\beta \to \gamma$ holds for some attribute set $R_i$ in $\sigma$ and
that $t_1[R_i]$ and $t_2[R_i]$ are in $\Pi_{R_i}(r)$ imply that
$t_1[\gamma] = t_2[\gamma]$ is also true. Since $\gamma$ is now added to result
by the algorithm, we know that $t_1[result] = t_2[result]$ is true at the end of
the $(k + 1)$th execution of the for loop.

Since $\sigma$ is dependency-preserving and $X$ is a key for $R$, all attributes
in $R$ are in result when the algorithm terminates. Thus, $t_1[R] = t_2[R]$ is
true, that is, $t_1 = t_2$, as claimed earlier.
Our claim implies that the size of $\Pi_X(j)$ is equal to the size of $j$. Note
also that $\Pi_X(j) = \Pi_X(r)$, which has the same number of tuples as $r$
(since $X$ is a key for $R$). Thus we have proved that the size of $j$ equals
that of $r$. Using the result of Exercise 7.12, we know that $r \subseteq j$.
Hence we conclude that $r = j$.
Note that since $X$ is trivially in 3NF, $\sigma \cup \{X\}$ is a
dependency-preserving lossless decomposition into 3NF.


7.17 Give an example of a relation schema $R'$ and set $F'$ of functional
dependencies such that there are at least three distinct lossless decompositions
of $R'$ into BCNF.
Answer:
Given the relation $R' = (A, B, C, D)$ the set of functional dependencies
$F' = \{A \to B, C \to D, B \to C\}$ allows three distinct BCNF decompositions.

    $R_1 = \{(A, B), (C, D), (B, C)\}$

is in BCNF as is

    $R_2 = \{(A, B), (C, D), (A, C)\}$

    $R_3 = \{(B, C), (A, D), (A, B)\}$

7.18 Let a prime attribute be one that appears in at least one candidate key. Let
$\alpha$ and $\beta$ be sets of attributes such that $\alpha \to \beta$ holds,
but $\beta \to \alpha$ does not hold. Let $A$ be an attribute that is not in
$\alpha$, is not in $\beta$, and for which $\beta \to A$ holds. We say that $A$
is transitively dependent on $\alpha$. We can restate the definition of 3NF as
follows: A relation schema $R$ is in 3NF with respect to a set $F$ of functional
dependencies if there are no nonprime attributes $A$ in $R$ for which $A$ is
transitively dependent on a key for $R$. Show that this new definition is
equivalent to the original one.

Answer:

Suppose $R$ is in 3NF according to the textbook definition. We show that it is
in 3NF according to the definition in the exercise. Let $A$ be a nonprime
attribute in $R$ that is transitively dependent on a key $\alpha$ for $R$. Then
there exists $\beta \subseteq R$ such that $\beta \to A$, $\alpha \to \beta$,
$A \notin \alpha$, $A \notin \beta$, and $\beta \to \alpha$ does not hold. But
then $\beta \to A$ violates the textbook definition of 3NF since

• $A \notin \beta$ implies $\beta \to A$ is nontrivial
• Since $\beta \to \alpha$ does not hold, $\beta$ is not a superkey
• $A$ is not in any candidate key, since $A$ is nonprime

Now we show that if $R$ is in 3NF according to the exercise definition, it is in
3NF according to the textbook definition. Suppose $R$ is not in 3NF according
to the textbook definition. Then there is an FD $\alpha \to \beta$ that fails
all three conditions. Thus

• $\alpha \to \beta$ is nontrivial.
• $\alpha$ is not a superkey for $R$.
• Some $A$ in $\beta - \alpha$ is not in any candidate key.

This implies that $A$ is nonprime and $\alpha \to A$. Let $\gamma$ be a
candidate key for $R$. Then $\gamma \to \alpha$, $\alpha \to \gamma$ does not
hold (since $\alpha$ is not a superkey), $A \notin \alpha$, and
$A \notin \gamma$ (since $A$ is nonprime). Thus $A$ is transitively dependent on
$\gamma$, violating the exercise definition.

7.19 A functional dependency $\alpha \to \beta$ is called a partial dependency if
there is a proper subset $\gamma$ of $\alpha$ such that $\gamma \to \beta$; we
say that $\beta$ is partially dependent on $\alpha$. A relation schema $R$ is in
second normal form (2NF) if each attribute $A$ in $R$ meets one of the
following criteria:

• It appears in a candidate key.
• It is not partially dependent on a candidate key.

Show that every 3NF schema is in 2NF. (Hint: Show that every partial dependency
is a transitive dependency.)
Answer:
Referring to the definitions in Exercise 7.18, a relation schema $R$ is said to
be in 3NF if there is no nonprime attribute $A$ in $R$ for which $A$ is
transitively dependent on a key for $R$.
We can also rewrite the definition of 2NF given here as:
"A relation schema $R$ is in 2NF if no nonprime attribute $A$ is partially
dependent on any candidate key for $R$."
To prove that every 3NF schema is in 2NF, it suffices to show that if a
nonprime attribute $A$ is partially dependent on a candidate key $\alpha$, then
$A$ is also transitively dependent on the key $\alpha$.
Let $A$ be a nonprime attribute in $R$. Let $\alpha$ be a candidate key for
$R$. Suppose $A$ is partially dependent on $\alpha$.

• From the definition of a partial dependency, we know that for some proper
subset $\beta$ of $\alpha$, $\beta \to A$.
• Since $\beta \subset \alpha$, $\alpha \to \beta$. Also, $\beta \to \alpha$
does not hold, since $\alpha$ is a candidate key.
• Finally, since $A$ is nonprime, it cannot be in either $\beta$ or $\alpha$.

Thus we conclude that $\alpha \to A$ is a transitive dependency. Hence we have
proved that every 3NF schema is also in 2NF.


7.20 Give an example of a relation schema $R$ and a set of dependencies such that
$R$ is in BCNF but is not in 4NF.

Answer:
There are, of course, an infinite number of such examples. We show the simplest
one here.
Let $R$ be the schema $(A, B, C)$ with the only nontrivial dependency being
$A \twoheadrightarrow B$.
CHAPTER
8
Complex Data Types
Practice Exercises
8.1 Provide information about the student named Shankar in our sample univer-
sity database, in luding information from the student tuple orresponding to
Shankar, the takes tuples orresponding to Shankar and the ourse tuples or-
responding to these takes tuples, in ea h of the following representations:

a. Using JSON , with an appropriate nested representation.


b. Using XML , with the same nested representation.
. Using RDF triples.
d. As an RDF graph.

Answer:

a. FILL IN
b. FILL IN
c. FILL IN
d. FILL IN

8.2 Consider the RDF representation of information from the university schema as
shown in Figure 8.3. Write the following queries in SPARQL.

a. Find the titles of all courses taken by any student named Zhang.
b. Find titles of all courses such that a student named Zhang takes a section
of the course that is taught by an instructor named Srinivasan.
c. Find the attribute names and values of all attributes of the instructor
named Srinivasan, without enumerating the attribute names in your
query.

Answer:
FILL IN
8.3 A car-rental company maintains a database for all vehicles in its current fleet.
For all vehicles, it includes the vehicle identification number, license number,
manufacturer, model, date of purchase, and color. Special data are included for
certain types of vehicles:
• Trucks: cargo capacity.
• Sports cars: horsepower, renter age requirement.
• Vans: number of passengers.
• Off-road vehicles: ground clearance, drivetrain (four- or two-wheel drive).
Construct an SQL schema definition for this database. Use inheritance where
appropriate.
Answer:
For this problem, we use table inheritance. We assume that MyDate, Color, and
DriveTrainType are predefined types.

create type Vehicle
    (vehicle_id integer,
    license_number char(15),
    manufacturer char(30),
    model char(30),
    purchase_date MyDate,
    color Color)

create table vehicle of type Vehicle

create table truck
    (cargo_capacity integer)
    under vehicle

create table sportsCar
    (horsepower integer,
    renter_age_requirement integer)
    under vehicle

create table van
    (num_passengers integer)
    under vehicle

create table offRoadVehicle
    (ground_clearance real,
    driveTrain DriveTrainType)
    under vehicle

8.4 Consider a database schema with a relation Emp whose attributes are as shown
below, with types specified for multivalued attributes.
Emp = (ename, ChildrenSet multiset(Children), SkillSet multiset(Skills))
Children = (name, birthday)
Skills = (type, ExamSet setof(Exams))
Exams = (year, city)
Define the above schema in SQL, using the SQL Server table type syntax from
Section 8.2.1.1 to declare multiset attributes.
Answer:

a. No answer.
b. Queries in SQL.
i. Program:
select ename
from emp as e, e.ChildrenSet as c
where 'March' in
    (select birthday.month
    from c)
ii. Program:
select e.ename
from emp as e, e.SkillSet as s, s.ExamSet as x
where s.type = 'typing' and x.city = 'Dayton'

iii. Program:
select distinct s.type
from emp as e, e.SkillSet as s
8.5 Consider the E-R diagram in Figure 8.7 showing entity set instructor.
Give an SQL schema definition corresponding to the E-R diagram, treating
phone_number as an array of 10 elements, using Oracle or PostgreSQL syntax.

Answer:
The corresponding SQL:1999 schema definition is given below. Note that the
derived attribute age has been translated into a method.

instructor
    ID
    name
        first_name
        middle_initial
        last_name
    address
        street
            street_number
            street_name
            apt_number
        city
        state
        zip
    {phone_number}
    date_of_birth
    age ( )

Figure 8.7 E-R diagram with composite, multivalued, and derived attributes.

create type Name
    (first_name varchar(15),
    middle_initial char,
    last_name varchar(15))
create type Street
    (street_name varchar(15),
    street_number varchar(4),
    apartment_number varchar(7))
create type Address
    (street Street,
    city varchar(15),
    state varchar(15),
    zip_code char(6))
create table instructor
    (name Name,
    ID varchar(10),
    address Address,
    phones varray(10) of char(7),
    dob date)
    method integer age()
The above array syntax is based on Oracle; in PostgreSQL, phones would be
declared to have type char(7)[].

employee (person_name, street, city)
works (person_name, company_name, salary)
company (company_name, city)
manages (person_name, manager_name)

Figure 8.8 Relational database for Exercise 8.6.

8.6 Consider the relational schema shown in Figure 8.8.
a. Give a schema definition in SQL corresponding to the relational schema
but using references to express foreign-key relationships.
b. Write each of the following queries on the schema, using SQL.
i. Find the company with the most employees.
ii. Find the company with the smallest payroll.
iii. Find those companies whose employees earn a higher salary, on
average, than the average salary at First Bank Corporation.

Answer:

a. The schema definition is given below.

create type Employee
    (person_name varchar(30),
    street varchar(15),
    city varchar(15))
create type Company
    (company_name varchar(15),
    city varchar(15))
create table employee of Employee
create table company of Company
create type Works
    (person ref(Employee) scope employee,
    comp ref(Company) scope company,
    salary int)
create table works of Works
create type Manages
    (person ref(Employee) scope employee,
    manager ref(Employee) scope employee)
create table manages of Manages

b. i. select comp->company_name
      from works
      group by comp
      having count(person) >= all (select count(person)
                                   from works
                                   group by comp)

   ii. select comp->company_name
       from works
       group by comp
       having sum(salary) <= all (select sum(salary)
                                  from works
                                  group by comp)

   iii. select comp->company_name
        from works
        group by comp
        having avg(salary) > (select avg(salary)
                              from works
                              where comp->company_name = 'First Bank Corporation')

8.7 Compute the relevance (using appropriate definitions of term frequency and
inverse document frequency) of each of the Practice Exercises in this chapter
to the query “SQL relation”.

Answer:
We do not consider the questions containing neither of the keywords because
their relevance to the keywords is zero. The number of words in a question
includes stop words. We use the equations given in Section 31.2 to compute
relevance; the log term in the equation is assumed to be to the base 2.

Q#   #words  #“SQL”  #“relation”  “SQL”       “relation”  “SQL”   “relation”  Total
                                  term freq.  term freq.  relv.   relv.       relv.
1    84      1       1            0.0170      0.0170      0.0002  0.0002      0.0004
4    22      0       1            0.0000      0.0641      0.0000  0.0029      0.0029
5    46      1       1            0.0310      0.0310      0.0006  0.0006      0.0013
6    22      1       0            0.0641      0.0000      0.0029  0.0000      0.0029
7    33      1       1            0.0430      0.0430      0.0013  0.0013      0.0026
8    32      1       3            0.0443      0.1292      0.0013  0.0040      0.0054
9    77      0       1            0.0000      0.0186      0.0000  0.0002      0.0002
14   30      1       0            0.0473      0.0000      0.0015  0.0000      0.0015
15   26      1       1            0.0544      0.0544      0.0020  0.0020      0.0041
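
As a rough check on the table, the sketch below recomputes the row for question 8. It assumes the term-frequency form TF = log2(1 + n/N), where n is the number of occurrences of the term and N is the number of words in the question, and it assumes that the "relv." columns are the term frequency divided by N. Both are inferences from the numbers in the table rather than formulas stated in this section, and the function names are hypothetical.

```python
import math

def term_frequency(occurrences: int, total_words: int) -> float:
    """Assumed form: TF = log2(1 + n / N)."""
    return math.log2(1 + occurrences / total_words)

def relevance(occurrences: int, total_words: int) -> float:
    """Assumed form inferred from the table: TF divided by the word count."""
    return term_frequency(occurrences, total_words) / total_words

# Question 8: 32 words, "SQL" occurs once, "relation" occurs three times.
tf_sql = term_frequency(1, 32)
tf_relation = term_frequency(3, 32)
total = relevance(1, 32) + relevance(3, 32)
```

Under these assumptions the computed values agree with the table to within rounding in the fourth decimal place; the table appears to truncate rather than round.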

8.8 Show how to represent the matrices used for computing PageRank as relations.
Then write an SQL query that implements one iterative step of the iterative
technique for finding PageRank; the entire algorithm can then be implemented
as a loop containing the query.
Answer:
FILL
8.9 Suppose the student relation has an attribute named location of type point, and
the classroom relation has an attribute location of type polygon. Write the
following queries in SQL using the PostGIS spatial functions and predicates that
we saw earlier:
a. Find the names of all students whose location is within the classroom
Packard 101.
b. Find all classrooms that are within 100 meters of Packard 101; assume all
distances are represented in units of meters.
c. Find the ID and name of the student who is geographically nearest to the
student with ID 12345.
d. Find the ID and names of all pairs of students whose locations are less
than 200 meters apart.

Answer:
FILL
CHAPTER
9
Application Development
Practice Exercises
9.1 What is the main reason why servlets give better performance than programs
that use the common gateway interface (CGI), even though Java programs
generally run slower than C or C++ programs?
Answer:

The CGI interface starts a new process to service each request, which has a
significant operating system overhead. On the other hand, servlets are run as
threads of an existing process, avoiding this overhead. Further, the process
running threads could be the web server process itself, avoiding interprocess
communication, which can be expensive. Thus, for small to moderate-sized tasks,
the overhead of Java is less than the overhead saved by avoiding process
creation and communication.
For tasks involving a lot of CPU activity, this may not be the case, and using
CGI with a C or C++ program may give better performance.
9.2 List some benefits and drawbacks of connectionless protocols over protocols
that maintain connections.
Answer:

Most computers have limits on the number of simultaneous connections they
can accept. With connectionless protocols, connections are broken as soon as
the request is satisfied, and therefore other clients can open connections. Thus
more clients can be served at the same time. A request can be routed to any one
of a number of different servers to balance load, and if a server crashes, another
can take over without the client noticing any problem.
The drawback of connectionless protocols is that a connection has to be
reestablished every time a request is sent. Also, session information has to be
sent each time in the form of cookies or hidden fields. This makes them slower
than the protocols which maintain connections in case state information is
required.

9.3 Consider a carelessly written web application for an online-shopping site, which
stores the price of each item as a hidden form variable in the web page sent to
the customer; when the customer submits the form, the information from the
hidden form variable is used to compute the bill for the customer. What is the
loophole in this scheme? (There was a real instance where the loophole was
exploited by some customers of an online-shopping site before the problem was
detected and fixed.)
Answer:
A hacker can edit the HTML source code of the web page, replace the value
of the hidden variable price with another value, and then use the modified web
page to place an order. The web application would then use the user-modified
value as the price of the product.
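
One way to close the loophole is to ignore any price sent by the client and look the price up on the server. The following is a minimal sketch of that idea, not the code of any real site; the catalog contents, the form field names, and the function `compute_bill` are all hypothetical.

```python
# Hypothetical server-side catalog; a real application would query the database.
CATALOG = {"book-101": 25.00, "pen-7": 1.50}

def compute_bill(form: dict) -> float:
    """Compute the bill from the item id and quantity only.

    Any 'price' field submitted by the client is ignored, because a hidden
    form variable can be edited before the form is submitted.
    """
    item_id = form["item_id"]
    quantity = int(form["quantity"])
    if item_id not in CATALOG or quantity <= 0:
        raise ValueError("invalid order")
    return CATALOG[item_id] * quantity

# A tampered submission: the client changed the hidden price to 0.01.
tampered = {"item_id": "book-101", "quantity": "2", "price": "0.01"}
bill = compute_bill(tampered)  # computed from the server-side price
```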
9.4 Consider another carelessly written web application which uses a servlet that
checks if there was an active session but does not check if the user is
authorized to access that page, instead depending on the fact that a link to the
page is shown only to authorized users. What is the risk with this scheme? (There
was a real instance where applicants to a college admissions site could, after
logging into the web site, exploit this loophole and view information they were
not authorized to see; the unauthorized access was, however, detected, and those
who accessed the information were punished by being denied admission.)
Answer:

Although the link to the page is shown only to authorized users, an unauthorized
user may somehow come to know of the existence of the link (for example, from
an authorized user, or via web proxy logs). The user may then log in to the
system and access the unauthorized page by entering its URL in the browser. If
the check for user authorization was inadvertently left out from that page, the
user will be able to see the result of the page.
The HTTP referer attribute can be used to block a naive attempt to exploit such
loopholes by ensuring the referer value is from a valid page of the web site.
However, the referer attribute is set by the browser and can be spoofed, so a
malicious user can easily work around the referer check.
9.5 Why is it important to open JDBC connections using the try-with-resources
(try (…) { … }) syntax?

Answer:
This ensures that connections are closed properly, even if an exception is
thrown, so the application does not run out of database connections.
9.6 List three ways in which caching can be used to speed up web server
performance.
Answer:
Caching can be used to improve performance by exploiting the commonalities
between transactions.
a. If the application code for servicing each request needs to open a
connection to the database, which is time consuming, then a pool of open
connections may be created beforehand, and each request uses one from
the pool.
b. The results of a query generated by a request can be cached. If the same
request comes again, or generates the same query, then the cached result
can be used instead of connecting to the database again.
c. The final web page generated in response to a request can be cached. If
the same request comes again, then the cached page can be output.
9.7 The netstat command (available on Linux and on Windows) shows the active
network connections on a computer. Explain how this command can be used to
find out if a particular web page is not closing connections that it opened, or if
connection pooling is used, not returning connections to the connection pool.
You should account for the fact that with connection pooling, the connection
may not get closed immediately.
Answer:
The tester should run netstat to find all connections open to the machine/socket
used by the database. (If the application server is separate from the database
server, the command may be executed at either of the machines.) Then the web
page being tested should be accessed repeatedly (this can be automated by using
tools such as JMeter to generate page accesses). The number of connections to
the database would go from 0 to some value (depending on the number of
connections retained in the pool), but after some time the number of connections
should stop increasing. If the number keeps increasing, the code underlying the
web page is clearly not closing connections or returning the connection to the
pool.
9.8 Testing for SQL-injection vulnerability:
a. Suggest an approach for testing an application to find if it is vulnerable to
SQL injection attacks on text input.
b. Can SQL injection occur with forms of HTML input other than text boxes?
If so, how would you test for vulnerability?

Answer:

a. One approach is to enter a string containing a single quote in each of the
input text boxes of each of the forms provided by the application to see
if the application correctly saves the value. If it does not save the value
correctly and/or gives an error message, it is likely vulnerable to SQL
injection.
b. Yes, SQL injection can even occur with selection inputs such as drop-down
menus, by modifying the value sent back to the server when the
input value is chosen (for example, by editing the page directly, or in the
browser's DOM tree). Most modern browsers provide a way for users to
edit the DOM tree. This feature can be used to modify the values sent to
the application, inserting a single quote into the value.
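
The single-quote test can be illustrated with a small sketch using Python's built-in sqlite3 module. The table, the data, and the two lookup functions are hypothetical; the point is only to contrast a query built by string concatenation with a parameterized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table student (name text)")
conn.execute("insert into student values ('Zhang')")

def find_unsafe(name: str):
    # Vulnerable: the input is spliced directly into the SQL text.
    return conn.execute(
        "select name from student where name = '" + name + "'").fetchall()

def find_safe(name: str):
    # Parameterized: the value is passed separately from the SQL text.
    return conn.execute(
        "select name from student where name = ?", (name,)).fetchall()

probe = "O'Brien"  # test input containing a single quote
try:
    find_unsafe(probe)
    unsafe_failed = False
except sqlite3.OperationalError:
    unsafe_failed = True  # the stray quote broke the statement

safe_rows = find_safe(probe)               # no error, and no matching rows
leaked = find_unsafe("x' or '1'='1")       # the condition is always true
```

Here the concatenated query fails on the probe and returns every row for the crafted input, while the parameterized query treats both inputs as ordinary strings.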
9.9 A database relation may have the values of certain attributes encrypted for
security. Why do database systems not support indexing on encrypted attributes?
Using your answer to this question, explain why database systems do not allow
encryption of primary-key attributes.
Answer:
It is not possible in general to index on an encrypted value, unless all
occurrences of the value encrypt to the same value (and even in this case, only
equality predicates would be supported). However, mapping all occurrences of a
value to the same encrypted value is risky, since statistical analysis can be used
to reveal common values, even without decryption; techniques based on adding
random “salt” bits are used to prevent such analysis, but they make indexing
impossible. One possible workaround is to store the index unencrypted, but then
the index can be used to leak values. Another option is to keep the index
encrypted, but then the database system should know the decryption key, to
decrypt required parts of the index on the fly. Since this requires modifying
large parts of the database system code, databases typically do not support this
option.
The primary-key constraint has to be checked by the database when tuples are
inserted, and if the values are encrypted as above, the database system will not be
able to detect primary-key violations. Therefore, database systems that support
encryption of specified attributes do not allow primary-key attributes, or for that
matter foreign-key attributes, to be encrypted.
9.10 Exercise 9.9 addresses the problem of encryption of certain attributes. However,
some database systems support encryption of entire databases. Explain how the
problems raised in Exercise 9.9 are avoided if the entire database is encrypted.
Answer:

When the entire database is encrypted, it is easy for the database to perform
decryption as data are fetched from disk into memory, so in-memory storage is
unencrypted. With this option, everything in the database, including indices, is
encrypted when on disk, but unencrypted in memory. As a result, only the data
access layer of the database system code needs to be modified to perform
encryption, leaving other layers untouched. Thus, indices can be used unchanged,
and primary-key and foreign-key constraints enforced without any change to the
corresponding layers of the database system code.

9.11 Suppose someone impersonates a company and gets a certificate from a
certificate-issuing authority. What is the effect on things (such as purchase
orders or programs) certified by the impersonated company, and on things
certified by other companies?
Answer:

The key problem with digital certificates (when used offline, without contacting
the certificate issuer) is that there is no way to withdraw them.
For instance (this actually happened, but names of the parties have been
changed), person C claims to be an employee of company X and gets a new
public key certified by the certifying authority A. Suppose the authority A
incorrectly believed that C was acting on behalf of company X, and it gave C a
certificate cert. Now C can communicate with person Y, who checks the
certificate cert presented by C and believes the public key contained in cert
really belongs to X. C can communicate with Y using the public key, and Y trusts
the communication is from company X.
Person Y may now reveal confidential information to C, accept a purchase
order from C, or execute programs certified by C, based on the public key,
thinking he is actually communicating with company X. In each case there is
potential for harm to Y.
Even if A detects the impersonation, as long as Y does not check with A (the
protocol does not require this check), there is no way for Y to find out that the
certificate is forged.
If X was a certification authority itself, further levels of fake certificates could
be created. But certificates that are not part of this chain would not be affected.
9.12 Perhaps the most important data items in any database system are the passwords
that control access to the database. Suggest a scheme for the secure storage
of passwords. Be sure that your scheme allows the system to test passwords
supplied by users who are attempting to log into the system.
Answer:
A scheme for storing passwords would be to encrypt each password (after
adding randomly generated “salt” bits to prevent dictionary attacks), and then
use a hash index on the user-id to store/access the encrypted password. The
password being used in a login attempt is then encrypted (if randomly
generated “salt” bits were used initially, these bits should be stored with the
user-id and used when encrypting the user-supplied password). The encrypted
value is then compared with the stored encrypted value of the correct password.
An advantage of this scheme is that passwords are not stored in clear text, and the
code for decryption need not even exist. Thus, “one-way” encryption functions,
such as secure hashing functions, which do not support decryption, can be used
for this task. The secure hashing algorithm SHA-1 is widely used for such
one-way encryption.
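
The scheme can be sketched with Python's standard library as follows. Note the substitution: the sketch uses PBKDF2 with HMAC-SHA-256 (`hashlib.pbkdf2_hmac`) as the one-way function rather than plain SHA-1. The iteration count, the dictionary standing in for the hash index on user-id, and the function names are illustrative assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation

def make_record(password: str):
    """Return (salt, digest); both are stored alongside the user-id."""
    salt = os.urandom(16)  # randomly generated salt bits
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def check_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Re-apply the one-way function with the stored salt and compare."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(digest, stored)

# Hypothetical in-memory stand-in for the hash index on user-id.
password_table = {"u123": make_record("correct horse")}
salt, stored = password_table["u123"]
```

A login attempt then calls `check_password` with the supplied password and the stored salt and digest; the clear-text password is never stored.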
CHAPTER
10
Big Data
Practice Exercises
10.1 Suppose you need to store a very large number of small files, each of size say 2
kilobytes. If your choice is between a distributed file system and a distributed
key-value store, which would you prefer, and explain why.
Answer:
The key-value store, since the distributed file system is designed to store a
moderate number of large files. With each file block being multiple megabytes,
kilobyte-sized files would result in a lot of wasted space in each block and poor
storage performance.
10.2 Suppose you need to store data for a very large number of students in a
distributed document store such as MongoDB. Suppose also that the data for
each student correspond to the data in the student and the takes relations.
How would you represent the above data about students, ensuring that all the
data for a particular student can be accessed efficiently? Give an example of
the data representation for one student.
Answer:
We would store the student data as a JSON object, with the takes tuples for
the student stored as a JSON array of objects, each object corresponding to a
single takes tuple. Give example ...
10.3 Suppose you wish to store utility bills for a large number of users, where each
bill is identified by a customer ID and a date. How would you store the bills in
a key-value store that supports range queries, if queries request the bills of a
specified customer for a specified date range?
Answer:

Create a key by concatenating the customer ID and date (with date represented
in the form year/month/date, e.g., 2018/02/28) and store the records indexed
on this key. Now the required records can be retrieved by a range query.
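
The key construction can be illustrated with a small sketch in which a sorted Python list stands in for the range-queryable key-value store. The customer IDs, the bill amounts, the "#" separator, and the function names are hypothetical; the sketch also assumes fixed-width customer IDs so that keys of different customers do not interleave.

```python
import bisect

def make_key(customer_id: str, date: str) -> str:
    # date is 'YYYY/MM/DD', so string order matches chronological order
    return customer_id + "#" + date

# Hypothetical store: (key, bill amount) pairs kept sorted by key.
store = sorted([
    (make_key("C001", "2018/01/31"), 40.0),
    (make_key("C001", "2018/02/28"), 42.5),
    (make_key("C001", "2018/03/31"), 39.0),
    (make_key("C002", "2018/02/28"), 10.0),
])
keys = [k for k, _ in store]

def bills_in_range(customer_id: str, start_date: str, end_date: str):
    """Range query: all records with start key <= key <= end key."""
    lo = bisect.bisect_left(keys, make_key(customer_id, start_date))
    hi = bisect.bisect_right(keys, make_key(customer_id, end_date))
    return store[lo:hi]

result = bills_in_range("C001", "2018/02/01", "2018/03/31")
```

Putting the customer ID first keeps all bills of one customer adjacent in key order, so a date range for that customer maps to a single contiguous key range.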


10.4 Give pseudocode for computing a join r ⋈_{r.A=s.A} s using a single MapReduce
step, assuming that the map() function is invoked on each tuple of r and s.
Assume that the map() function can find the name of the relation using
context.relname().
Answer:

With the map function, output records from both the input relations, using the
join attribute value as the reduce key. The reduce function gets records from
both relations with matching join attribute values and outputs all matching
pairs.
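
The description above can be sketched in Python, with a dictionary standing in for the shuffle phase of a MapReduce framework. The sample relations, the output format, and the helper `run_join` are assumptions made for illustration; the relation name is passed to the map function explicitly instead of being read from context.relname().

```python
from collections import defaultdict
from itertools import product

def map_fn(relname, tup):
    # Emit (join-attribute value, tuple tagged with its relation name).
    yield tup["A"], (relname, tup)

def reduce_fn(key, tagged):
    # Separate the tuples by relation, then output every matching pair.
    r_tuples = [t for name, t in tagged if name == "r"]
    s_tuples = [t for name, t in tagged if name == "s"]
    for rt, st in product(r_tuples, s_tuples):
        yield {**{"r." + k: v for k, v in rt.items()},
               **{"s." + k: v for k, v in st.items()}}

def run_join(r, s):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for relname, rel in (("r", r), ("s", s)):
        for tup in rel:
            for key, value in map_fn(relname, tup):
                groups[key].append(value)
    return [out for key, vals in groups.items() for out in reduce_fn(key, vals)]

r = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
s = [{"A": 1, "C": "p"}, {"A": 1, "C": "q"}, {"A": 3, "C": "z"}]
joined = run_join(r, s)
```

Tagging each record with its relation name is what lets the reduce function tell r tuples from s tuples that share a key.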
10.5 What is the conceptual problem with the following snippet of Apache Spark
code meant to work on very large data? Note that the collect() function returns
a Java collection, and Java collections (from Java 8 onwards) support map and
reduce functions.

JavaRDD<String> lines = sc.textFile("logDirectory");
int totalLength = lines.collect().map(s -> s.length())
                       .reduce(0, (a, b) -> a + b);

Answer:
The problem with the code is that the collect() function gathers the RDD data
at a single node, and the map and reduce functions are then executed on that
single node, not in parallel as intended.
10.6 Apache Spark:
a. How does Apache Spark perform computations in parallel?
b. Explain the statement: “Apache Spark performs transformations on
RDDs in a lazy manner.”
c. What are some of the benefits of lazy evaluation of operations in Apache
Spark?

Answer:

a. RDDs are stored partitioned across multiple nodes. Each of the
transformation operations on an RDD is executed in parallel on multiple
nodes.
b. Transformations are not executed immediately but postponed until the
result is required for functions such as collect() or saveAsTextFile().
c. The operations are organized into a tree, and query optimization can
be applied to the tree to speed up computation. Also, answers can be
pipelined from one operation to another, without being written to disk,
to reduce time overheads of disk storage.

10.7 Given a collection of documents, for each word w_i, let n_i denote the number of
times the word occurs in the collection. Let N be the total number of word
occurrences across all documents. Next, consider all pairs of consecutive words
(w_i, w_j) in the document; let n_{i,j} denote the number of occurrences of the
word pair (w_i, w_j) across all documents.
Write an Apache Spark program that, given a collection of documents in a
directory, computes N, all pairs (w_i, n_i), and all pairs ((w_i, w_j), n_{i,j}).
Then output all word pairs such that n_{i,j}/N ≥ 10 × (n_i/N) × (n_j/N). These are
word pairs that occur 10 times or more as frequently as they would be expected
to occur if the two words occurred independently of each other.
You will find the join operation on RDDs useful for the last step, to bring
related counts together. For simplicity, do not bother about word pairs that
cross lines. Also assume for simplicity that words only occur in lowercase and
that there are no punctuation marks.
Answer:
FILL IN ANSWER (available with SS)
10.8 Consider the following query using the tumbling window operator:

select item, System.Timestamp as window_end, sum(amount)
from order timestamp by datetime
group by itemid, tumblingwindow(hour, 1)

Give an equivalent query using normal SQL constructs, without using the
tumbling window operator. You can assume that the timestamp can be converted
to an integer value that represents the number of seconds elapsed since (say)
midnight, January 1, 1970, using the function to_seconds(timestamp). You can
also assume that the usual arithmetic functions are available, along with the
function floor(a) which returns the largest integer ≤ a.
Answer:
Convert the timestamp to seconds, divide by 3600, and take the floor to get the
hour number; group by that value (along with itemid). To output the timestamp
of the window end, add 1 to the hour number and multiply by 3600.
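
The arithmetic behind this answer can be sketched in Python. The sample orders are hypothetical, and the timestamps are taken to be already converted to seconds, which is the role of to_seconds(timestamp) in the exercise.

```python
from collections import defaultdict

def window_end(seconds: int) -> int:
    # hour number = floor(seconds / 3600); window end = (hour number + 1) * 3600
    return (seconds // 3600 + 1) * 3600

# Hypothetical orders: (itemid, seconds since the epoch, amount)
orders = [
    ("pen", 10, 2.0),
    ("pen", 3599, 3.0),
    ("pen", 3600, 5.0),
    ("ink", 7300, 1.0),
]

# Group by (itemid, window end) and sum the amounts, as the SQL query would.
totals = defaultdict(float)
for itemid, ts, amount in orders:
    totals[(itemid, window_end(ts))] += amount
```

The first two pen orders fall in the window ending at 3600 seconds, while the order at exactly 3600 seconds starts the next window.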
10.9 Suppose you wish to model the university schema as a graph. For each of the
following relations, explain whether the relation would be modeled as a node
or as an edge:
(i) student, (ii) instructor, (iii) course, (iv) section, (v) takes, (vi) teaches.
Does the model capture connections between sections and courses?
Answer:

Each relation corresponding to an entity (student, instructor, course, and
section) would be modeled as a node. Takes and teaches would be modeled as
edges. There is a further edge between course and section, which has been
merged into the section relation and cannot be captured with the above schema.
It can be modeled if we create a separate relation that links sections to courses.
CHAPTER
11
Data Analytics
Practice Exercises
11.1 Describe benefits and drawbacks of a source-driven architecture for gathering
of data at a data warehouse, as compared to a destination-driven architecture.
Answer:
In a destination-driven architecture for gathering data, data transfers from the
data sources to the data warehouse are based on demand from the warehouse,
whereas in a source-driven architecture, the transfers are initiated by each
source.
The benefits of a source-driven architecture are:

• Data can be propagated to the destination as soon as they become
available. For a destination-driven architecture to collect data as soon as they
are available, the warehouse would have to probe the sources frequently,
leading to a high overhead.
• The source does not have to keep historical information. As soon as data
are updated, the source can send an update message to the destination
and forget the history of the updates. In contrast, in a destination-driven
architecture, each source has to maintain a history of data which have not
yet been collected by the data warehouse. Thus storage requirements at
the source are lower for a source-driven architecture.

On the other hand, a destination-driven architecture has the following
advantages.

• In a source-driven architecture, the source has to be active and must
handle error conditions such as not being able to contact the warehouse for
some time. It is easier to implement passive sources, and a single active
warehouse. In a destination-driven architecture, each source is required to
provide only a basic functionality of executing queries.
• The warehouse has more control on when to carry out data gathering
activities and when to process user queries; it is not a good idea to perform
both simultaneously, since they may conflict on locks.
11.2 Draw a diagram that shows how the classroom relation of our university
example as shown in Appendix A would be stored under a column-oriented storage
structure.
Answer:
The relation would be stored in three files, one per attribute, as shown below.
We assume that the row number can be inferred implicitly from position, by
using fixed-size space for each attribute. Otherwise, the row number would also
have to be stored explicitly.

building
Packard
Painter
Taylor
Watson
Watson

room_number
101
514
3128
100
120

capacity
500
10
70
30
50
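
The same layout can be sketched in Python, with one list per attribute standing in for one file per attribute. The values are those shown above; the helper `reconstruct` is a hypothetical name for rebuilding a row from its position.

```python
rows = [
    ("Packard", "101", 500),
    ("Painter", "514", 10),
    ("Taylor", "3128", 70),
    ("Watson", "100", 30),
    ("Watson", "120", 50),
]
attributes = ("building", "room_number", "capacity")

# One list ("file") per attribute; position i in every list belongs to row i.
columns = {name: [row[i] for row in rows] for i, name in enumerate(attributes)}

def reconstruct(i: int) -> tuple:
    """Rebuild row i by reading position i from each column."""
    return tuple(columns[name][i] for name in attributes)

third_row = reconstruct(2)
```

A query that needs only capacity reads a single list, which is the main appeal of the column-oriented layout.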

11.3 Consider the takes relation. Write an SQL query that computes a cross-tab
that has a column for each of the years 2017 and 2018, and a column for all,
and one row for each course, as well as a row for all. Each cell in the table
should contain the number of students who took the corresponding course in
the corresponding year, with column all containing the aggregate across all
years, and row all containing the aggregate across all courses.
Answer:

11.4 Consider the data warehouse schema depicted in Figure 11.2. Give an SQL
query to summarize sales numbers and price by store and date, along with the
hierarchies on store and date.
Answer:
query:

select store_id, city, state, country,
       date, month, quarter, year,
       sum(number), sum(price)
from sales, store, date
where sales.store_id = store.store_id and
      sales.date = date.date
group by rollup(country, state, city, store_id),
         rollup(year, quarter, month, date)

11.5 Classification can be done using classification rules, which have a condition, a
class, and a confidence; the confidence is the percentage of the inputs satisfying
the condition that fall in the specified class.
For example, a classification rule for credit ratings may have a condition
that salary is between $30,000 and $50,000, and education level is graduate,
with the credit rating class of good, and a confidence of 80%. A second rule may
have a condition that salary is between $30,000 and $50,000, and education
level is high-school, with the credit rating class of satisfactory, and a confidence
of 80%. A third rule may have a condition that salary is above $50,001, with
the credit rating class of excellent, and confidence of 90%. Show a decision tree
classifier corresponding to the above rules.
Show how the decision tree classifier can be extended to record the
confidence values.
Answer:

FILL IN
11.6 Consider a classification problem where the classifier predicts whether a
person has a particular disease. Suppose that 95% of the people tested do not
suffer from the disease. Let pos denote the fraction of true positives, which is
5% of the test cases, and let neg denote the fraction of true negatives, which is
95% of the test cases. Consider the following classifiers:
• Classifier C1, which always predicts negative (a rather useless classifier, of
course).
• Classifier C2, which predicts positive in 80% of the cases where the person
actually has the disease but also predicts positive in 5% of the cases where
the person does not have the disease.
• Classifier C3, which predicts positive in 95% of the cases where the person
actually has the disease but also predicts positive in 20% of the cases where
the person does not have the disease.
For each classifier, let t_pos denote the true positive fraction, that is, the fraction
of cases where the classifier prediction was positive, and the person actually
had the disease. Let f_pos denote the false positive fraction, that is, the fraction
of cases where the prediction was positive, but the person did not have the
disease. Let t_neg denote true negative and f_neg denote false negative fractions,
which are defined similarly, but for the cases where the classifier prediction
was negative.
a. Compute the following metrics for each classifier:
i. Accuracy, defined as (t_pos + t_neg)/(pos + neg), that is, the fraction of
the time when the classifier gives the correct classification.
ii. Recall (also known as sensitivity), defined as t_pos/pos, that is, how
many of the actual positive cases are classified as positive.
iii. Precision, defined as t_pos/(t_pos + f_pos), that is, how often the positive
prediction is correct.
iv. Specificity, defined as t_neg/neg.
b. If you intend to use the results of classification to perform further
screening for the disease, how would you choose between the classifiers?
c. On the other hand, if you intend to use the result of classification to start
medication, where the medication could have harmful effects if given to
someone who does not have the disease, how would you choose between
the classifiers?

Answer:
FILL
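The metrics in part a follow mechanically from the definitions in the exercise. The sketch below is one way to tabulate them; the function names and the use of None for an undefined precision are illustrative choices, not part of the original solution.

```python
# Confusion-matrix fractions for the three classifiers of Exercise 11.6.
POS, NEG = 0.05, 0.95  # fraction of people with / without the disease


def fractions(hit_rate, false_alarm_rate):
    """Return (t_pos, f_pos, t_neg, f_neg) as fractions of all test cases."""
    t_pos = hit_rate * POS
    f_pos = false_alarm_rate * NEG
    return t_pos, f_pos, NEG - f_pos, POS - t_pos


def metrics(hit_rate, false_alarm_rate):
    t_pos, f_pos, t_neg, f_neg = fractions(hit_rate, false_alarm_rate)
    predicted_pos = t_pos + f_pos
    return {
        "accuracy": (t_pos + t_neg) / (POS + NEG),
        "recall": t_pos / POS,
        # Precision is undefined when the classifier never predicts positive.
        "precision": t_pos / predicted_pos if predicted_pos > 0 else None,
        "specificity": t_neg / NEG,
    }


# (fraction of diseased people flagged, fraction of healthy people flagged)
CLASSIFIERS = {"C1": (0.0, 0.0), "C2": (0.80, 0.05), "C3": (0.95, 0.20)}

if __name__ == "__main__":
    for name, rates in CLASSIFIERS.items():
        print(name, metrics(*rates))
```

Under these definitions C1 has accuracy 0.95, recall 0, specificity 1, and an undefined precision; C2 has accuracy 0.9425, recall 0.80, precision of about 0.457, and specificity 0.95; C3 has accuracy 0.8075, recall 0.95, precision 0.20, and specificity 0.80. One plausible reading for parts b and c is that screening favors high recall (C3), while starting medication favors high precision and specificity (C2).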
CHAPTER
12
Physical Storage Systems
Practice Exercises
12.1 SSDs can be used as a storage layer between memory and magnetic disks, with
some parts of the database (e.g., some relations) stored on SSDs and the rest
on magnetic disks. Alternatively, SSDs can be used as a buffer or cache for
magnetic disks; frequently used blocks would reside on the SSD layer, while
infrequently used blocks would reside on magnetic disk.
a. Which of the two alternatives would you choose if you need to support
real-time queries that must be answered within a guaranteed short period
of time? Explain why.
b. Which of the two alternatives would you choose if you had a very large
customer relation, where only some disk blocks of the relation are
accessed frequently, with other blocks rarely accessed?

Answer:
In the first case, SSD as a storage layer is better since performance is
guaranteed. With SSD as a cache, some requests may have to read from magnetic
disk, causing delays.
In the second case, since we don't know exactly which blocks are frequently
accessed at a higher level, it is not possible to assign part of the relation to SSD.
Since the relation is very large, it is not possible to assign all of the relation to
SSD. The SSD as a cache option will work better in this case.
12.2 Some databases use magnetic disks in a way that only sectors in outer tracks are
used, while sectors in inner tracks are left unused. What might be the benefits
of doing so?
Answer:

The disk's data-transfer rate will be greater on the outer tracks than the inner
tracks. This is because the disk spins at a constant rate, so more sectors pass
underneath the drive head in a given amount of time when the arm is
positioned on an outer track than when on an inner track. Even more importantly,
by using only outer tracks, the disk arm movement is minimized, reducing the
disk access latency. This aspect is important for transaction-processing
systems, where latency affects the transaction-processing rate.
12.3 Flash storage:
a. How is the flash translation table, which is used to map logical page
numbers to physical page numbers, created in memory?
b. Suppose you have a 64-gigabyte flash storage system, with a 4096-byte
page size. How big would the flash translation table be, assuming each
page has a 32-bit address, and the table is stored as an array?
c. Suggest how to reduce the size of the translation table if very often long
ranges of consecutive logical page numbers are mapped to consecutive
physical page numbers.

Answer:

a. It is stored as an array containing physical page numbers, indexed by
logical page numbers. This representation gives an overhead equal to
the size of the page address for each page.
b. It takes 32 bits for every page, that is, for every 4096 bytes of storage.
The 64 gigabytes of flash storage hold 2^24 pages, so the table takes
64 megabytes.
c. If the mapping is such that every p consecutive logical page numbers are
mapped to p consecutive physical pages, we can store the mapping of
the first page for every p pages. This reduces the in-memory structure by
a factor of p. Further, if p is a power of 2, we can omit some of the
least significant bits of the addresses stored.
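The arithmetic behind parts b and c can be checked with a few lines. This is a minimal sketch; the function name and the run_length parameter (the p of part c) are illustrative assumptions.

```python
# Size of a flat flash translation table (Exercise 12.3b and 12.3c).
FLASH_BYTES = 64 * 2**30   # 64 gigabytes of flash
PAGE_BYTES = 4096          # page size in bytes
ENTRY_BYTES = 32 // 8      # one 32-bit physical page address per entry


def table_bytes(flash_bytes, page_bytes, entry_bytes, run_length=1):
    """Bytes needed when one entry is kept per `run_length` consecutive pages."""
    pages = flash_bytes // page_bytes
    entries = -(-pages // run_length)  # ceiling division
    return entries * entry_bytes


if __name__ == "__main__":
    full = table_bytes(FLASH_BYTES, PAGE_BYTES, ENTRY_BYTES)
    print(full // 2**20, "megabytes with one entry per page")
    reduced = table_bytes(FLASH_BYTES, PAGE_BYTES, ENTRY_BYTES, run_length=16)
    print(reduced // 2**20, "megabytes with one entry per 16 pages")
```

With p = 16 (an arbitrary value chosen for illustration) the table shrinks from 64 megabytes to 4 megabytes, matching the factor-of-p reduction described in part c.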
12.4 Consider the following data and parity-block arrangement on four disks:

Disk 1   Disk 2   Disk 3   Disk 4
B1       B2       B3       B4
P1       B5       B6       B7
B8       P2       B9       B10

The B_i's represent data blocks; the P_i's represent parity blocks. Parity block P_i
is the parity block for data blocks B_{4i−3} to B_{4i}. What, if any, problem might this
arrangement present?
Answer:
This arrangement has the problem that P_i and B_{4i−3} are on the same disk. So
if that disk fails, reconstruction of B_{4i−3} is not possible, since data and parity
are both lost.
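The collision can be confirmed by scanning the layout given in the exercise. The sketch below hard-codes the three rows shown in the question; the helper names are illustrative.

```python
# Layout from Exercise 12.4: LAYOUT[row][disk index] is the block stored there.
LAYOUT = [
    ["B1", "B2", "B3", "B4"],
    ["P1", "B5", "B6", "B7"],
    ["B8", "P2", "B9", "B10"],
]


def disk_of(block):
    """Return the 1-based disk number holding `block`, or None if not shown."""
    for row in LAYOUT:
        if block in row:
            return row.index(block) + 1
    return None


def conflicts():
    """List (parity, data, disk) triples where parity shares a disk with data it protects."""
    found = []
    for i in (1, 2):  # parity blocks visible in the excerpt
        parity = f"P{i}"
        for k in range(4 * i - 3, 4 * i + 1):  # P_i covers B_{4i-3} .. B_{4i}
            data = f"B{k}"
            if disk_of(data) is not None and disk_of(data) == disk_of(parity):
                found.append((parity, data, disk_of(parity)))
    return found


if __name__ == "__main__":
    print(conflicts())
```

The scan reports P1 with B1 on disk 1 and P2 with B5 on disk 2, which are exactly the pairs P_i and B_{4i−3} named in the answer.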
12.5 A database administrator can choose how many disks are organized into a
single RAID 5 array. What are the trade-offs between having fewer disks
versus more disks, in terms of cost, reliability, performance during failure, and
performance during rebuild?
Answer:
Having fewer disks per array has a higher cost, since a larger fraction of the
disks is devoted to parity; but with more disks, the chance of two disk failures,
which would lead to data loss, is higher. Further, with more disks, performance
during failure would be poor, since a block read from a failed disk would result in
a large number of block reads from the other disks. Similarly, the overhead for
rebuilding the failed disk would also be higher, since more disks need to be read to
reconstruct the data in the failed disk.
12.6 A power failure that occurs while a disk block is being written could result in
the block being only partially written. Assume that partially written blocks can
be detected. An atomic block write is one where either the disk block is fully
written or nothing is written (i.e., there are no partial writes). Suggest schemes
for getting the effect of atomic block writes with the following RAID schemes.
Your schemes should involve work on recovery from failure.
a. RAID level 1 (mirroring)
b. RAID level 5 (block interleaved, distributed parity)

Answer:

a. To ensure atomicity, a block write operation is carried out as follows:
i. Write the information onto the first physical block.
ii. When the first write completes successfully, write the same
information onto the second physical block.
iii. The output is declared completed only after the second write
completes successfully.
During recovery, each pair of physical blocks is examined. If both are
identical and there is no detectable partial write, then no further actions
are necessary. If one block has been partially written, then we replace
its contents with the contents of the other block. If there has been no
partial write, but they differ in content, then we replace the contents
of the first block with the contents of the second, or vice versa. This
recovery procedure ensures that a write to stable storage either succeeds
completely (that is, updates both copies) or results in no change.
The requirement of comparing every corresponding pair of blocks
during recovery is expensive to meet. We can reduce the cost greatly by
keeping track of block writes that are in progress, using a small amount
of nonvolatile RAM. On recovery, only blocks for which writes were in
progress need to be compared.

b. The idea is similar here. For any block write, the information block is
written first, followed by the corresponding parity block. At the time of
recovery, each set consisting of the nth block of each of the disks is
considered. If none of the blocks in the set have been partially written, and
the parity block contents are consistent with the contents of the
information blocks, then no further action need be taken. If any block has been
partially written, its contents are reconstructed using the other blocks. If
no block has been partially written, but the parity block contents do not
agree with the information block contents, the parity block's contents
are reconstructed.
12.7 Storing all blocks of a large file on consecutive disk blocks would minimize
seeks during sequential file reads. Why is it impractical to do so? What do
operating systems do instead, to minimize the number of seeks during sequential
reads?
Answer:

Reading data sequentially from a large file could be done with only one seek
if the entire file were stored on consecutive disk blocks. Ensuring availability
of large numbers of consecutive free blocks is not easy, since files are created
and deleted, resulting in fragmentation of the free blocks on disks. Operating
systems allocate blocks in large but fixed-size sequential extents instead, and
only one seek is required per extent.
CHAPTER
13
Data Storage Structures
Practice Exercises
13.1 Consider the deletion of record 5 from the file of Figure 13.3. Compare the
relative merits of the following techniques for implementing the deletion:
a. Move record 6 to the space occupied by record 5, and move record 7 to
the space occupied by record 6.
b. Move record 7 to the space occupied by record 5.
c. Mark record 5 as deleted, and move no records.

Answer:

a. Although moving record 6 to the space for 5 and moving record 7 to the
space for 6 is the most straightforward approach, it requires moving the
most records and involves the most accesses.
b. Moving record 7 to the space for 5 moves fewer records but destroys any
ordering in the file.
c. Marking the space for 5 as deleted preserves ordering and moves no
records, but it requires additional overhead to keep track of all of the
free space in the file. This method may lead to too many “holes” in the
file, which, if not compacted from time to time, will affect performance
because of the reduced availability of contiguous free records.
13.2 Show the structure of the file of Figure 13.4 after each of the following steps:
a. Insert (24556, Turnamian, Finance, 98000).
b. Delete record 2.
c. Insert (34556, Thompson, Music, 67000).

Answer:


header      ~4
record 0    10101  Srinivasan  Comp. Sci.  65000
record 1    24556  Turnamian   Finance     98000
record 2    15151  Mozart      Music       40000
record 3    22222  Einstein    Physics     95000
record 4    ~6
record 5    33456  Gold        Physics     87000
record 6
record 7    58583  Califieri   History     62000
record 8    76543  Singh       Finance     80000
record 9    76766  Crick       Biology     72000
record 10   83821  Brandt      Comp. Sci.  92000
record 11   98345  Kim         Elec. Eng.  80000

Figure 13.101 The file after insert (24556, Turnamian, Finance, 98000).

header      ~2
record 0    10101  Srinivasan  Comp. Sci.  65000
record 1    24556  Turnamian   Finance     98000
record 2    ~4
record 3    22222  Einstein    Physics     95000
record 4    ~6
record 5    33456  Gold        Physics     87000
record 6
record 7    58583  Califieri   History     62000
record 8    76543  Singh       Finance     80000
record 9    76766  Crick       Biology     72000
record 10   83821  Brandt      Comp. Sci.  92000
record 11   98345  Kim         Elec. Eng.  80000

Figure 13.102 The file after delete record 2.

We use “~ i” to denote a pointer to re ord “i”.


a. See Figure 13.101.
b. See Figure 13.102. Note that the free record chain could alternatively have
been from the header to 4, from 4 to 2, and finally from 2 to 6.
c. See Figure 13.103.
header      ~4
record 0    10101  Srinivasan  Comp. Sci.  65000
record 1    24556  Turnamian   Finance     98000
record 2    34556  Thompson    Music       67000
record 3    22222  Einstein    Physics     95000
record 4    ~6
record 5    33456  Gold        Physics     87000
record 6
record 7    58583  Califieri   History     62000
record 8    76543  Singh       Finance     80000
record 9    76766  Crick       Biology     72000
record 10   83821  Brandt      Comp. Sci.  92000
record 11   98345  Kim         Elec. Eng.  80000

Figure 13.103 The file after insert (34556, Thompson, Music, 67000).
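The three steps of Exercise 13.2 can be replayed with a small model of a fixed-length-record file whose free slots are chained from the header. This is a sketch under one assumption: the starting state (slots 1, 4, and 6 free, chained in that order) is inferred from Figures 13.101 to 13.103 rather than copied from Figure 13.4. The class and method names are illustrative.

```python
class RecordFile:
    """Fixed-length records with a free list threaded through deleted slots."""

    def __init__(self, records, free_chain):
        self.records = dict(records)  # slot number -> record tuple (live slots only)
        self.next_free = {}           # free slot -> next free slot, or None at the end
        self.header = None            # first free slot, or None if there is none
        for slot in reversed(free_chain):
            self.next_free[slot] = self.header
            self.header = slot
        self.size = len(self.records) + len(free_chain)

    def insert(self, record):
        if self.header is None:       # no free slot: append at the end of the file
            slot = self.size
            self.size += 1
        else:                         # reuse the first slot on the free list
            slot = self.header
            self.header = self.next_free.pop(slot)
        self.records[slot] = record
        return slot

    def delete(self, slot):
        del self.records[slot]
        self.next_free[slot] = self.header  # freed slot becomes the new list head
        self.header = slot


def assumed_figure_13_4():
    """Starting state assumed for Exercise 13.2 (slots 1, 4, 6 free)."""
    live = {
        0: (10101, "Srinivasan", "Comp. Sci.", 65000),
        2: (15151, "Mozart", "Music", 40000),
        3: (22222, "Einstein", "Physics", 95000),
        5: (33456, "Gold", "Physics", 87000),
        7: (58583, "Califieri", "History", 62000),
        8: (76543, "Singh", "Finance", 80000),
        9: (76766, "Crick", "Biology", 72000),
        10: (83821, "Brandt", "Comp. Sci.", 92000),
        11: (98345, "Kim", "Elec. Eng.", 80000),
    }
    return RecordFile(live, free_chain=[1, 4, 6])
```

Inserting Turnamian fills slot 1 and leaves the header pointing to 4; deleting record 2 makes the chain header, 2, 4, 6; inserting Thompson reuses slot 2 and restores the header to 4, as in the three figures.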

13.3 Consider the relations section and takes. Give an example instance of these
two relations, with three sections, each of which has five students. Give a file
structure of these relations that uses multitable clustering.
Answer:

The relation section with three tuples is as follows:

course_id  sec_id  semester  year  building  room
BIO-301    1       Summer    2010  Painter   514
CS-101     1       Fall      2009  Packard   101
CS-347     1       Fall      2009  Taylor    3128

The relation takes with five students for each section is shown in Figure 13.104.
The multitable clustering for the above two instances is shown in Figure 13.105.
13.4 Consider the bitmap representation of the free-space map, where for each
block in the file, two bits are maintained in the bitmap. If the block is between
0 and 30 percent full the bits are 00, between 30 and 60 percent the bits are
01, between 60 and 90 percent the bits are 10, and above 90 percent the bits
are 11. Such bitmaps can be kept in memory even for quite large files.
a. Outline two benefits and one drawback to using two bits for a block,
instead of one byte as described earlier in this chapter.
ID     course_id  sec_id  semester  year  grade
00128  CS-101     1       Fall      2009  A
00128  CS-347     1       Fall      2009  A-
12345  CS-347     1       Fall      2009  A
12345  CS-101     1       Fall      2009  C
17968  BIO-301    1       Summer    2010  null
23856  CS-347     1       Fall      2009  A
45678  CS-101     1       Fall      2009  F
54321  CS-101     1       Fall      2009  A-
54321  CS-347     1       Fall      2009  A
59762  BIO-301    1       Summer    2010  null
76543  CS-101     1       Fall      2009  A
76543  CS-347     1       Fall      2009  A
78546  BIO-301    1       Summer    2010  null
89729  BIO-301    1       Summer    2010  null
98988  BIO-301    1       Summer    2010  null

Figure 13.104 The relation takes with five students for each section.

b. Describe how to keep the bitmap up to date on record insertions and
deletions.
c. Outline the benefit of the bitmap technique over free lists in searching
for free space and in updating free space information.

Answer:

a. The space used is less with 2 bits, and the number of times the
free-space map needs to be updated decreases significantly, since many
inserts/deletes do not result in any change in the free-space map. However,
we have only an approximate idea of the free space available, which could
lead to wasted space and/or to increased search cost for finding free
space for a record.
b. Every time a record is inserted/deleted, check if the usage of the block
has changed levels. In that case, update the corresponding bits. Note
that we don't need to access the bitmaps at all unless the usage crosses
a boundary, so in most of the cases there is no overhead.
c. When free space for a large record or a set of records is sought, then
multiple free list entries may have to be scanned before a proper-sized
one is found, so overheads are much higher. With bitmaps, one page of
bitmap can store free info for many pages, so I/O spent for finding free
space is minimal. Similarly, when a whole block or a large part of it is
deleted, the bitmap technique is more convenient for updating free space
information.

BIO-301  1  Summer  2010  Painter  514
17968    BIO-301  1  Summer  2010  null
59762    BIO-301  1  Summer  2010  null
78546    BIO-301  1  Summer  2010  null
89729    BIO-301  1  Summer  2010  null
98988    BIO-301  1  Summer  2010  null
CS-101   1  Fall    2009  Packard  101
00128    CS-101   1  Fall    2009  A
12345    CS-101   1  Fall    2009  C
45678    CS-101   1  Fall    2009  F
54321    CS-101   1  Fall    2009  A-
76543    CS-101   1  Fall    2009  A
CS-347   1  Fall    2009  Taylor   3128
00128    CS-347   1  Fall    2009  A-
12345    CS-347   1  Fall    2009  A
23856    CS-347   1  Fall    2009  A
54321    CS-347   1  Fall    2009  A
76543    CS-347   1  Fall    2009  A

Figure 13.105 The multitable clustering for the above two instances (Exercise 13.3).
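The update rule from part b of Exercise 13.4 (touch the map only when a block crosses a level boundary) can be sketched as follows. The class name, the packing of four blocks per byte, and the treatment of the exact boundary values (30, 60, and 90 percent fall into the higher level) are assumptions made for illustration.

```python
def level(fill_fraction):
    """Map a block's occupancy to the 2-bit code used in Exercise 13.4."""
    if fill_fraction < 0.30:
        return 0b00
    if fill_fraction < 0.60:
        return 0b01
    if fill_fraction < 0.90:
        return 0b10
    return 0b11


class FreeSpaceMap:
    """Two-bit free-space map, packed four blocks per byte."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 3) // 4)
        self.updates = 0  # how often the map itself had to be modified

    def get(self, block):
        return (self.bits[block // 4] >> (2 * (block % 4))) & 0b11

    def record_fill(self, block, fill_fraction):
        """Call after every insert or delete; writes only on a level change."""
        new = level(fill_fraction)
        if new != self.get(block):
            shift = 2 * (block % 4)
            cleared = self.bits[block // 4] & ~(0b11 << shift) & 0xFF
            self.bits[block // 4] = cleared | (new << shift)
            self.updates += 1

    def find_block(self, max_level):
        """Return the first block whose code is at most max_level, or None."""
        for block in range(self.num_blocks):
            if self.get(block) <= max_level:
                return block
        return None
```

A map for 10 blocks occupies 3 bytes, and a sequence of inserts that keeps a block between 30 and 60 percent full never touches the map after the first level change.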

13.5 It is important to be able to quickly find out if a block is present in the buffer,
and if so where in the buffer it resides. Given that database buffer sizes are
very large, what (in-memory) data structure would you use for this task?
Answer:
A hash table is the common choice for large database buffers. The hash function
locates the appropriate bucket, on which a linear search is then performed.
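A minimal sketch of such a table is shown below, using chaining within each bucket. The class name, the (file id, block number) key, and the frame-number value are assumptions for illustration; a real buffer manager would also handle latching and pinning.

```python
class BufferTable:
    """Chained hash table mapping a block identifier to its buffer frame number."""

    def __init__(self, num_buckets=1024):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, block_id):
        # The hash function selects one bucket; only that bucket is searched.
        return self.buckets[hash(block_id) % len(self.buckets)]

    def lookup(self, block_id):
        for key, frame in self._bucket(block_id):  # short linear search
            if key == block_id:
                return frame
        return None  # block is not in the buffer

    def insert(self, block_id, frame):
        bucket = self._bucket(block_id)
        for i, (key, _) in enumerate(bucket):
            if key == block_id:
                bucket[i] = (block_id, frame)
                return
        bucket.append((block_id, frame))

    def remove(self, block_id):
        bucket = self._bucket(block_id)
        bucket[:] = [(k, f) for k, f in bucket if k != block_id]


if __name__ == "__main__":
    table = BufferTable()
    table.insert(("takes", 42), frame=7)
    print(table.lookup(("takes", 42)), table.lookup(("takes", 43)))
```

With enough buckets relative to the number of buffer frames, each chain stays short, so a lookup costs one hash computation plus a scan of a handful of entries.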
13.6 Suppose your university has a very large number of takes records, accumulated
over many years. Explain how table partitioning can be done on the takes
relation, and what benefits it could offer. Explain also one potential drawback of
the technique.
Answer:
The table can be partitioned on (year, semester). Old takes records that are
no longer accessed frequently can be stored on magnetic disk, while newer
records can be stored on SSD. Queries that specify a year can be answered
without reading records for other years.
A drawback is that queries that fetch records corresponding to multiple years
will have a higher overhead, since the records may be partitioned across
different relations and disk blocks.
13.7 Give an example of a relational-algebra expression and a query-processing
strategy in each of the following situations:
a. MRU is preferable to LRU.
b. LRU is preferable to MRU.

Answer:

a. MRU is preferable to LRU where R1 ⋈ R2 is computed by using a
nested-loop processing strategy where each tuple in R2 must be compared to
each block in R1. After the first tuple of R2 is processed, the next needed
block is the first one in R1. However, since it is the least recently used,
the LRU buffer management strategy would replace that block if a new
block was needed by the system.
b. LRU is preferable to MRU where R1 ⋈ R2 is computed by sorting the
relations by join values and then comparing the values by proceeding
through the relations. Due to duplicate join values, it may be necessary
to “back up” in one of the relations. This “backing up” could cross a
block boundary into the most recently used block, which would have
been replaced by a system using MRU buffer management, if a new block
was needed.
Under MRU, some unused blocks may remain in memory forever. In
practice, MRU can be used only in special situations like that of the
nested-loop strategy discussed in part a of this exercise.
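The nested-loop case can be illustrated with a toy simulation in which the blocks of R1 are scanned once per tuple of R2 and the buffer is slightly too small to hold all of R1. The function and parameter names are illustrative, and the model ignores the buffer frame used for R2.

```python
def misses(policy, num_blocks, buffer_frames, passes):
    """Count buffer misses for repeated sequential scans of blocks 0..num_blocks-1."""
    recency = []  # least recently used first, most recently used last
    count = 0
    for _ in range(passes):
        for block in range(num_blocks):
            if block in recency:
                recency.remove(block)  # hit: re-appended below as most recent
            else:
                count += 1
                if len(recency) == buffer_frames:
                    # LRU evicts the front of the list, MRU the back.
                    recency.pop(0 if policy == "LRU" else -1)
            recency.append(block)
    return count


if __name__ == "__main__":
    # Four blocks of R1, three buffer frames, three scans of R1.
    print("LRU misses:", misses("LRU", 4, 3, 3))
    print("MRU misses:", misses("MRU", 4, 3, 3))
```

In this configuration LRU misses on every access (12 misses), because each block is evicted just before it is needed again, while MRU keeps most of R1 resident and incurs 6 misses.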

13.8 PostgreSQL normally uses a small buffer, leaving it to the operating system
buffer manager to manage the rest of main memory available for file system
buffering. Explain (a) what is the benefit of this approach, and (b) one key
limitation of this approach.
Answer:

The database system does not know what the memory demands from other
processes are. By using a small buffer, PostgreSQL ensures that it does not grab
too much of main memory. But at the same time, even if a block is evicted
from the buffer, if the file system buffer manager has enough memory allocated to
it, the evicted page is likely to still be cached in the file system buffer. Thus, a
database buffer miss is often not very expensive since the block is still in the
file system buffer.
The drawback of this approach is that the database system may not be able to
control the file system buffer replacement policy. Thus, the operating system
may make suboptimal decisions on what to evict from the file system buffer.
CHAPTER
14
Indexing

Practice Exercises


14.1 Indices speed query processing, but it is usually a bad idea to create indices on
every attribute, and every combination of attributes, that are potential search
keys. Explain why.
Answer:
Reasons for not keeping indices on every attribute include:

• Every index requires additional CPU time and disk I/O overhead during
inserts and deletions.
• Indices on non-primary keys might have to be changed on updates,
although an index on the primary key might not (this is because updates
typically do not modify the primary-key attributes).
• Each extra index requires additional storage space.
• For queries which involve conditions on several search keys, efficiency
might not be bad even if only some of the keys have indices on them.
Therefore, database performance is improved less by adding indices when
many indices already exist.

14.2 Is it possible in general to have two clustering indices on the same relation for
different search keys? Explain your answer.
Answer:
In general, it is not possible to have two primary indices on the same relation
for different keys because the tuples in a relation would have to be stored in
different order to have the same values stored together. We could accomplish
this by storing the relation twice and duplicating all values, but for a centralized
system, this is not efficient.
14.3 Construct a B+-tree for the following set of key values:

(2, 3, 5, 7, 11, 17, 19, 23, 29, 31)

Assume that the tree is initially empty and values are added in ascending order.
Construct B+-trees for the cases where the number of pointers that will fit in
one node is as follows:
a. Four
b. Six
c. Eight

Answer:

The following were generated by inserting values into the B+-tree in ascending
order. A node (other than the root) was never allowed to have fewer than ⌈n/2⌉
values/pointers. Internal nodes are shown in square brackets and leaf nodes
in parentheses.
a.
                      [19]
          [5 11]                 [29]
(2 3)  (5 7)  (11 17)     (19 23)  (29 31)

b.
              [7 19]
(2 3 5)  (7 11 17)  (19 23 29 31)

c.
             [11]
(2 3 5 7)  (11 17 19 23 29 31)

14.4 For each B+-tree of Exercise 14.3, show the form of the tree after each of the
following series of operations:
a. Insert 9.
b. Insert 10.
c. Insert 8.
d. Delete 23.
e. Delete 19.

Answer:

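• With the structure of Exercise 14.3.a (internal nodes in square brackets,
leaf nodes in parentheses):

Insert 9:

                        [19]
           [5 11]                  [29]
(2 3)  (5 7 9)  (11 17)     (19 23)  (29 31)

Insert 10:

                           [19]
            [5 9 11]                    [29]
(2 3)  (5 7)  (9 10)  (11 17)    (19 23)  (29 31)

Insert 8:

                            [19]
             [5 9 11]                    [29]
(2 3)  (5 7 8)  (9 10)  (11 17)   (19 23)  (29 31)

Delete 23:

                      [11]
         [5 9]                    [19]
(2 3)  (5 7 8)  (9 10)    (11 17)  (19 29 31)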
Delete 19:

                      [11]
         [5 9]                   [29]
(2 3)  (5 7 8)  (9 10)    (11 17)  (29 31)

• With the structure of Exercise 14.3.b (internal nodes in square brackets,
leaf nodes in parentheses):

Insert 9:

               [7 19]
(2 3 5)  (7 9 11 17)  (19 23 29 31)

Insert 10:

                 [7 19]
(2 3 5)  (7 9 10 11 17)  (19 23 29 31)

Insert 8:

                  [7 10 19]
(2 3 5)  (7 8 9)  (10 11 17)  (19 23 29 31)

Delete 23:

                 [7 10 19]
(2 3 5)  (7 8 9)  (10 11 17)  (19 29 31)

Delete 19:

              [7 10]
(2 3 5)  (7 8 9)  (10 11 17 29 31)
• With the structure of Exercise 14.3.c (internal nodes in square brackets,
leaf nodes in parentheses):

Insert 9:

               [11]
(2 3 5 7 9)  (11 17 19 23 29 31)

Insert 10:

                 [11]
(2 3 5 7 9 10)  (11 17 19 23 29 31)

Insert 8:

                  [11]
(2 3 5 7 8 9 10)  (11 17 19 23 29 31)

Delete 23:

                  [11]
(2 3 5 7 8 9 10)  (11 17 19 29 31)

Delete 19:

                  [11]
(2 3 5 7 8 9 10)  (11 17 29 31)

14.5 Consider the modified redistribution scheme for B+-trees described on page
651. What is the expected height of the tree as a function of n?
Answer:
If there are K search-key values and m − 1 siblings are involved in the
redistribution, the expected height of the tree is $\log_{\lfloor (m-1)n/m \rfloor}(K)$.
14.6 Give pseudocode for a B+-tree function findRangeIterator(), which is like the
function findRange(), except that it returns an iterator object, as described
in Section 14.3.2. Also give pseudocode for the iterator class, including the
variables in the iterator object, and the next() method.
Answer:
FILL IN
14.7 What would the occupancy of each leaf node of a B+-tree be if index entries
were inserted in sorted order? Explain why.
Answer:

If the index entries are inserted in ascending order, the new entries get directed
to the last leaf node. When this leaf node gets filled, it is split into two. Of
the two nodes generated by the split, the left node is left untouched and the
insertions take place on the right node. This makes the occupancy of the leaf
nodes about 50 percent, except for the last leaf.
If keys that are inserted are sorted in descending order, the above situation
would still occur, but symmetrically, with the right node of a split never getting
touched again, and occupancy would again be 50 percent for all nodes other
than the first leaf.
14.8 Suppose you have a relation r with n_r tuples on which a secondary B+-tree is
to be constructed.
a. Give a formula for the cost of building the B+-tree index by inserting one
record at a time. Assume each block will hold an average of f entries and
that all levels of the tree above the leaf are in memory.
b. Assuming a random disk access takes 10 milliseconds, what is the cost
of index construction on a relation with 10 million records?
c. Write pseudocode for bottom-up construction of a B+-tree, which was
outlined in Section 14.4.4. You can assume that a function to efficiently
sort a large file is available.

Answer:

a. The cost to locate the page number of the required leaf page for an
insertion is negligible since the non-leaf nodes are in memory. On the leaf
level it takes one random disk access to read and one random disk
access to update it, along with the cost to write one page. Insertions which
lead to splitting of leaf nodes require an additional page write. Hence to
build a B+-tree with n_r entries it takes a maximum of 2 × n_r random disk
accesses and n_r + 2 × (n_r/f) page writes. The second part of the cost
comes from the fact that in the worst case each leaf is half filled, so the
number of splits that occur is twice n_r/f.
The above formula ignores the cost of writing non-leaf nodes, since
we assume they are in memory, but in reality they would also be written
eventually. This cost is closely approximated by 2 × (n_r/f)/f, which
is the number of internal nodes just above the leaf; we can add further
terms to account for higher levels of nodes, but these are much smaller
than the number of leaves and can be ignored.
b. Substituting the values in the above formula and neglecting the cost for
page writes, it takes about 10,000,000 × 20 milliseconds, or 56 hours,
since each insertion costs 20 milliseconds.
c.
function insert_in_leaf(value K, pointer P)
    if (tree is empty) create an empty leaf node L, which is also the root
    else find the last leaf node L in the leaf-node chain
    if (L has less than n − 1 key values)
        then insert (K, P) at the first available location in L
    else begin
        Create leaf node L1
        Set L.Pn = L1
        Set K1 = last value from page L
        insert_in_parent(1, L, K1, L1)
        insert (K, P) at the first location in L1
    end

function insert_in_parent(level l, pointer P, value K, pointer P1)
    if (level l is empty) then begin
        Create an empty non-leaf node N, which is also the root
        insert (P, K, P1) at the start of node N
        return
    end
    else begin
        Find the rightmost node N at level l
        if (N has less than n pointers)
            then insert (K, P1) at the first available location in N
        else begin
            Create a new non-leaf page N1
            insert (P1) at the start of node N1
            insert_in_parent(l + 1, pointer N, value K, pointer N1)
        end
    end

The insert_in_leaf function is called for each of the (value, pointer) pairs in
ascending order. A similar function can also be built for descending order.
The search for the last leaf or non-leaf node at any level can be avoided
by storing the current last page details in an array.
The last node in each level might be less than half filled. To make this
index structure meet the requirements of a B+-tree, we can redistribute
the keys of the last two pages at each level. Since the last but one node is
always full, redistribution makes sure that both of them are at least half
filled.
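The cost formula of part a and the estimate of part b can be reproduced with a short calculation. This is a sketch only; the function name and the fan-out f = 100 used in the example call are illustrative assumptions, since the exercise does not fix a value of f.

```python
def build_cost(n_r, f, access_ms=10):
    """Worst-case cost of building the index one insertion at a time (Exercise 14.8a)."""
    random_accesses = 2 * n_r           # one leaf read and one leaf write per insertion
    page_writes = n_r + 2 * (n_r / f)   # leaf updates plus extra writes caused by splits
    internal_nodes = 2 * (n_r / f) / f  # nodes just above the leaf level
    hours = random_accesses * access_ms / 1000 / 3600
    return random_accesses, page_writes, internal_nodes, hours


if __name__ == "__main__":
    accesses, writes, internal, hours = build_cost(10_000_000, f=100)
    print(f"{accesses} random accesses, about {hours:.1f} hours")
```

For 10 million records the 20 million random accesses at 10 milliseconds each take 200,000 seconds, or roughly 56 hours, which is why bottom-up construction from sorted input, as in part c, is preferred for bulk loading.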
14.9 The leaf nodes of a B+-tree file organization may lose sequentiality after a
sequence of inserts.
a. Explain why sequentiality may be lost.
b. To minimize the number of seeks in a sequential scan, many databases
allocate leaf pages in extents of n blocks, for some reasonably large n.
When the first leaf of a B+-tree is allocated, only one block of an n-block
unit is used, and the remaining pages are free. If a page splits, and its
n-block unit has a free page, that space is used for the new page. If the
n-block unit is full, another n-block unit is allocated, and the first n/2 leaf
pages are placed in one n-block unit and the remaining ones in the second
n-block unit. For simplicity, assume that there are no delete operations.
i. What is the worst-case occupancy of allocated space, assuming no
delete operations, after the first n-block unit is full?
ii. Is it possible that leaf nodes allocated to an n-node block unit are not
consecutive, that is, is it possible that two leaf nodes are allocated
to one n-node block, but another leaf node in between the two is
allocated to a different n-node block?
iii. Under the reasonable assumption that buffer space is sufficient to
store an n-page block, how many seeks would be required for a
leaf-level scan of the B+-tree, in the worst case? Compare this number
with the worst case if leaf pages are allocated a block at a time.
iv. The technique of redistributing values to siblings to improve space
utilization is likely to be more efficient when used with the preceding
allocation scheme for leaf blocks. Explain why.

Answer:

a. In a B+-tree index or file organization, leaf nodes that are adjacent to
each other in the tree may be located at different places on disk. When
a file organization is newly created on a set of records, it is possible to
allocate blocks that are mostly contiguous on disk to leaf nodes that
are contiguous in the tree. As insertions and deletions occur on the tree,
sequentiality is increasingly lost, and sequential access has to wait for
disk seeks increasingly often.
b. i. In the worst case, each n-block unit and each node of the B+-tree is
half filled. This gives the worst-case occupancy as 25 percent.
ii. No. While splitting the n-block unit, the first n/2 leaf pages are placed
in one n-block unit and the remaining pages in the second n-block
unit. That is, every n-block split maintains the order. Hence, the
nodes in the n-block units are consecutive.
iii. In the regular B+-tree construction, where leaf pages are allocated a
block at a time, the leaf pages might not be sequential, and hence in the
worst case it takes one seek per leaf page. With allocation in n-block
units, each n-node block has at least n/2 leaf nodes in it, and each
n-node block can be read using one seek. Hence the worst-case number
of seeks comes down by a factor of n/2.

iv. Allowing redistribution among the nodes of the same block does not
require additional seeks, whereas in regular B+-trees we require as
many seeks as the number of leaf pages involved in the redistribution.
This makes redistribution for leaf blocks efficient with this scheme.
Also, the worst-case occupancy comes back to nearly 50 percent.
(Splitting of leaf nodes is preferred when the participating leaf nodes
are nearly full; hence nearly 50 percent instead of exactly 50 percent.)

14.10 Suppose you are given a database schema and some queries that are executed
frequently. How would you use the above information to decide what indices
to create?
Answer:
Create indices on any attributes on which there are selection conditions; if there
are only a few distinct values for that attribute, a bitmap index may be created,
otherwise a normal B+-tree index.
Create B+-tree indices on primary-key and foreign-key attributes.
Also create indices on attributes that are involved in join conditions in the queries.
14.11 In write-optimized trees such as the LSM tree or the stepped-merge index,
entries in one level are merged into the next level only when the level is full.
Suggest how this policy can be changed to improve read performance during
periods when there are many reads but no updates.
Answer:

If there have been no updates in a while, but there are a lot of index lookups
on an index, then entries at one level, say i, can be merged into the next level,
even if the level is not full. The benefit is that reads would then not have to
look up indices at level i, reducing the cost of reads.
14.12 What trade-offs do buffer trees pose as compared to LSM trees?
Answer:

The idea of buffer trees can be used with any tree-structured index to reduce the
cost of inserts and updates, including spatial indices. In contrast, LSM trees can
only be used with linearly ordered data that are amenable to merging. On the
other hand, buffer trees require more random I/O to perform insert operations
as compared to (all variants of) LSM trees.
Write-optimized indices can significantly reduce the cost of inserts, and to
a lesser extent, of updates, as compared to B+-trees. On the other hand, the
index lookup cost can be significantly higher for write-optimized indices as
compared to B+-trees.
14.13 Consider the instructor relation shown in Figure 14.1.
a. Construct a bitmap index on the attribute salary, dividing salary values
into four ranges: below 50,000, 50,000 to below 60,000, 60,000 to below
70,000, and 70,000 and above.
b. Consider a query that requests all instructors in the Finance department
with salary of 80,000 or more. Outline the steps in answering the query,
and show the final and intermediate bitmaps constructed to answer the
query.

Answer:
We reproduce the instructor relation below.

ID     name        dept_name   salary
10101  Srinivasan  Comp. Sci.  65000
12121  Wu          Finance     90000
15151  Mozart      Music       40000
22222  Einstein    Physics     95000
32343  El Said     History     60000
33456  Gold        Physics     87000
45565  Katz        Comp. Sci.  75000
58583  Califieri   History     62000
76543  Singh       Finance     80000
76766  Crick       Biology     72000
83821  Brandt      Comp. Sci.  92000
98345  Kim         Elec. Eng.  80000

a. Bitmap for salary, with S1, S2, S3, and S4 representing the given intervals
in the same order:

S1  0 0 1 0 0 0 0 0 0 0 0 0
S2  0 0 0 0 0 0 0 0 0 0 0 0
S3  1 0 0 0 1 0 0 1 0 0 0 0
S4  0 1 0 1 0 1 1 0 1 1 1 1

b. The question is a bit trivial if there is no bitmap on the dept_name
attribute. The bitmap for the dept_name attribute is:
Comp. S i 1 0 0 0 0 0 1 0 0 0 1 0
Finan e 0 1 0 0 0 0 0 0 1 0 0 0
Musi 0 0 1 0 0 0 0 0 0 0 0 0
Physi s 0 0 0 1 0 1 0 0 0 0 0 0
History 0 0 0 0 1 0 0 1 0 0 0 0
Biology 0 0 0 0 0 0 0 0 0 1 0 0
Ele . Eng. 0 0 0 0 0 0 0 0 0 0 0 1

To nd all instru tors in the Finan e department with salary of 80000
or more, we rst nd the interse tion of the Finan e department bitmap
and S4 bitmap of salary and then s an on these re ords for salary of
80000 or more.
Interse tion of Finan e department bitmap and S4 bitmap of salary.
S4 0 1 0 1 0 1 1 0 1 1 1 1
Finan e 0 1 0 0 0 0 0 0 1 0 0 0
S4 ã Finan e 0 1 0 0 0 0 0 0 1 0 0 0

S an on these re ords with salary 80000 or more gives Wu and Singh as


the instru tors who satisfy the given query.
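The steps of this answer can be replayed in a few lines of Python. This is only an illustrative sketch: the helper names (salary_range, build_bitmaps) are hypothetical, and bitmaps are held as plain lists of 0/1 rather than packed bit vectors.

```python
# Rows in the order of the instructor relation above: (name, dept_name, salary).
instructors = [
    ("Srinivasan", "Comp. Sci.", 65000), ("Wu", "Finance", 90000),
    ("Mozart", "Music", 40000), ("Einstein", "Physics", 95000),
    ("El Said", "History", 60000), ("Gold", "Physics", 87000),
    ("Katz", "Comp. Sci.", 75000), ("Califieri", "History", 62000),
    ("Singh", "Finance", 80000), ("Crick", "Biology", 72000),
    ("Brandt", "Comp. Sci.", 92000), ("Kim", "Elec. Eng.", 80000),
]

def salary_range(salary):
    """Map a salary to one of the four ranges S1..S4 from part (a)."""
    if salary < 50000:
        return "S1"
    if salary < 60000:
        return "S2"
    if salary < 70000:
        return "S3"
    return "S4"

def build_bitmaps(rows, key):
    """Return {value: list of 0/1 bits}, with one bit per row."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bits = bitmaps.setdefault(key(row), [0] * len(rows))
        bits[i] = 1
    return bitmaps

salary_bitmaps = build_bitmaps(instructors, lambda r: salary_range(r[2]))
dept_bitmaps = build_bitmaps(instructors, lambda r: r[1])

# Bitwise AND of the Finance bitmap and the S4 bitmap.
candidates = [a & b for a, b in zip(dept_bitmaps["Finance"], salary_bitmaps["S4"])]

# S4 also covers salaries from 70000 up to 80000, so recheck the actual salary.
answer = [instructors[i][0] for i, bit in enumerate(candidates)
          if bit and instructors[i][2] >= 80000]
```

The final recheck is needed because the S4 range is wider than the query condition; the bitmaps only narrow down the candidate records.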
14.14 Suppose you have a relation containing the x, y coordinates and names of
restaurants. Suppose also that the only queries that will be asked are of the
following form: The query specifies a point and asks if there is a restaurant
exactly at that point. Which type of index would be preferable, R-tree or B-tree?
Why?
Answer:
FILL IN
14.15 Suppose you have a spatial database that supports region queries with circular
regions, but not nearest-neighbor queries. Describe an algorithm to find the
nearest neighbor by making use of multiple region queries.
Answer:
Start with a region of very small radius centered at the query point, and retry
with a larger radius if the region does not contain any result. For example, each
time the radius could be increased by a factor of (say) 1.5. Once a region query
returns a nonempty result, the point in that result closest to the query point is
the nearest neighbor, since every point outside the circle is farther away. The
benefit is that since we do not use a very large radius compared to the minimum
radius required, there will (hopefully!) not be too many points in the circular
range query result.
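The expanding-radius idea can be sketched as follows. The function region_query is a hypothetical stand-in for the database's circular region query, implemented here as a linear scan over an in-memory list of points; the initial radius and the growth factor of 1.5 are assumptions taken from the example in the answer.

```python
import math

def region_query(points, center, radius):
    """Stand-in for the database's circular region query (hypothetical)."""
    return [p for p in points if math.dist(p, center) <= radius]

def nearest_neighbor(points, q, initial_radius=1.0, growth=1.5):
    """Grow the radius until the region is nonempty, then pick the closest hit."""
    if not points:
        return None
    radius = initial_radius
    while True:
        hits = region_query(points, q, radius)
        if hits:
            # Points outside the circle are farther than radius from q, so the
            # closest point inside the circle is the overall nearest neighbor.
            return min(hits, key=lambda p: math.dist(p, q))
        radius *= growth
```

For example, nearest_neighbor([(10, 0), (0, 3), (-4, -4)], (0, 0)) issues region queries with radii 1, 1.5, 2.25 and 3.375 before the last one returns a nonempty result.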
CHAPTER
15
Query Processing
Practice Exercises
15.1 Assume (for simplicity in this exercise) that only one tuple fits in a block and
memory holds at most three blocks. Show the runs created on each pass of
the sort-merge algorithm when applied to sort the following tuples on the first
attribute: (kangaroo, 17), (wallaby, 21), (emu, 1), (wombat, 13), (platypus,
3), (lion, 8), (warthog, 4), (zebra, 11), (meerkat, 6), (hyena, 9), (hornbill, 2),
(baboon, 12).
Answer:

We will refer to the tuples (kangaroo, 17) through (baboon, 12) using tuple
numbers $t_1$ through $t_{12}$. We refer to the $j$th run used by the $i$th pass as
$r_{ij}$. The initial sorted runs have three blocks each. They are:

$r_{11} = \{t_3, t_1, t_2\}$
$r_{12} = \{t_6, t_5, t_4\}$
$r_{13} = \{t_9, t_7, t_8\}$
$r_{14} = \{t_{12}, t_{11}, t_{10}\}$

Each pass merges three runs. Therefore the runs after the end of the first pass
are:

$r_{21} = \{t_3, t_1, t_6, t_9, t_5, t_2, t_7, t_4, t_8\}$
$r_{22} = \{t_{12}, t_{11}, t_{10}\}$

At the end of the second pass, the tuples are completely sorted into one run:

$r_{31} = \{t_{12}, t_3, t_{11}, t_{10}, t_1, t_6, t_9, t_5, t_2, t_7, t_4, t_8\}$
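The runs listed above can be reproduced with a short in-memory simulation. This is a sketch under stated assumptions: the function name external_sort_runs and its parameters are hypothetical, and the merge fan-in is a parameter set to 3 to follow the answer. A description of external sort-merge that reserves one buffer block for output would instead merge M - 1 = 2 runs at a time, which can be tried by passing fan_in=2.

```python
import heapq

def external_sort_runs(tuples, run_size, fan_in, key):
    """Return the runs after run creation and after each merge pass."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    runs = [sorted(tuples[i:i + run_size], key=key)
            for i in range(0, len(tuples), run_size)]
    passes = [runs]
    while len(runs) > 1:
        # Merge each group of fan_in consecutive runs into one longer run.
        runs = [list(heapq.merge(*runs[i:i + fan_in], key=key))
                for i in range(0, len(runs), fan_in)]
        passes.append(runs)
    return passes

animals = [("kangaroo", 17), ("wallaby", 21), ("emu", 1), ("wombat", 13),
           ("platypus", 3), ("lion", 8), ("warthog", 4), ("zebra", 11),
           ("meerkat", 6), ("hyena", 9), ("hornbill", 2), ("baboon", 12)]
number = {t: i + 1 for i, t in enumerate(animals)}   # tuple -> its t-number

passes = external_sort_runs(animals, run_size=3, fan_in=3, key=lambda t: t[0])
# Replace every tuple by its t-number so the runs read like the answer above.
labelled = [[[number[t] for t in run] for run in runs] for runs in passes]
```

Here labelled[0] holds the initial runs, labelled[1] the runs after the first pass, and labelled[2] the single final run.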
15.2 Consider the bank database of Figure 15.14, where the primary keys are
underlined, and the following SQL query:

select T.branch_name
from branch T, branch S
where T.assets > S.assets and S.branch_city = “Brooklyn”

Write an efficient relational-algebra expression that is equivalent to this query.
Justify your choice.
Answer:

Query:

$\Pi_{T.\text{branch\_name}}((\Pi_{\text{branch\_name},\,\text{assets}}(\rho_T(\text{branch}))) \bowtie_{T.\text{assets} > S.\text{assets}} (\Pi_{\text{assets}}(\sigma_{(\text{branch\_city} = \text{'Brooklyn'})}(\rho_S(\text{branch})))))$

This expression performs the theta join on the smallest amount of data
possible. It does this by restricting the right-hand side operand of the join to only
those branches in Brooklyn and also eliminating the unneeded attributes from
both the operands.
15.3 Let relations $r_1(A, B, C)$ and $r_2(C, D, E)$ have the following properties: $r_1$ has
20,000 tuples, $r_2$ has 45,000 tuples, 25 tuples of $r_1$ fit on one block, and 30
tuples of $r_2$ fit on one block. Estimate the number of block transfers and seeks
required using each of the following join strategies for $r_1 \bowtie r_2$:

a. Nested-loop join.
b. Block nested-loop join.
c. Merge join.
d. Hash join.

Answer:

$r_1$ needs 800 blocks, and $r_2$ needs 1500 blocks. Let us assume $M$ pages of
memory. If $M > 800$, the join can easily be done in 1500 + 800 disk accesses,
using even plain nested-loop join. So we consider only the case where $M \le 800$
pages.

branch(branch_name, branch_city, assets)
customer(customer_name, customer_street, customer_city)
loan(loan_number, branch_name, amount)
borrower(customer_name, loan_number)
account(account_number, branch_name, balance)
depositor(customer_name, account_number)

Figure 15.14 Bank database.

a. Nested-loop join:
Using $r_1$ as the outer relation, we need $20000 \times 1500 + 800 =
30{,}000{,}800$ disk accesses. If $r_2$ is the outer relation, we need
$45000 \times 800 + 1500 = 36{,}001{,}500$ disk accesses.

b. Block nested-loop join:
If $r_1$ is the outer relation, we need $\lceil 800/(M-1) \rceil \times 1500 + 800$
disk accesses. If $r_2$ is the outer relation, we need
$\lceil 1500/(M-1) \rceil \times 800 + 1500$ disk accesses.

c. Merge join:
Assuming that $r_1$ and $r_2$ are not initially sorted on the join key, the total
sorting cost inclusive of the output is
$B_s = 1500(2\lceil \log_{M-1}(1500/M) \rceil + 2) + 800(2\lceil \log_{M-1}(800/M) \rceil + 2)$
disk accesses. Assuming all tuples with the same value for the join
attributes fit in memory, the total cost is $B_s + 1500 + 800$ disk accesses.

d. Hash join:
We assume no overflow occurs. Since $r_1$ is smaller, we use it as the build
relation and $r_2$ as the probe relation. If $M > 800/M$, i.e., no need for
recursive partitioning, then the cost is $3(1500 + 800) = 6900$ disk
accesses, else the cost is
$2(1500 + 800)\lceil \log_{M-1}(800) - 1 \rceil + 1500 + 800$
disk accesses.
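To see what these formulas give for a concrete memory size, they can be evaluated directly. The sketch below simply transcribes the expressions from this answer; the function names are hypothetical, M = 101 is an arbitrary example value, and M must be at least 3 so that the logarithm base M - 1 is greater than 1.

```python
import math

B_R1, B_R2 = 800, 1500          # blocks of r1 and r2
N_R1, N_R2 = 20000, 45000       # tuples of r1 and r2

def nested_loop(n_outer, b_outer, b_inner):
    """One scan of the inner relation per outer tuple, plus one outer scan."""
    return n_outer * b_inner + b_outer

def block_nested_loop(b_outer, b_inner, m):
    """One scan of the inner relation per chunk of M - 1 outer blocks."""
    return math.ceil(b_outer / (m - 1)) * b_inner + b_outer

def sort_cost(b, m):
    """Sorting cost for b blocks, inclusive of writing the output."""
    return b * (2 * math.ceil(math.log(b / m, m - 1)) + 2)

def merge_join(m):
    return sort_cost(B_R2, m) + sort_cost(B_R1, m) + B_R2 + B_R1

def hash_join(m):
    if m > B_R1 / m:            # no recursive partitioning needed
        return 3 * (B_R2 + B_R1)
    passes = math.ceil(math.log(B_R1, m - 1) - 1)
    return 2 * (B_R2 + B_R1) * passes + B_R2 + B_R1

M = 101                         # example memory size in pages (an assumption)
costs = {
    "nested-loop, r1 outer": nested_loop(N_R1, B_R1, B_R2),
    "nested-loop, r2 outer": nested_loop(N_R2, B_R2, B_R1),
    "block nested-loop, r1 outer": block_nested_loop(B_R1, B_R2, M),
    "block nested-loop, r2 outer": block_nested_loop(B_R2, B_R1, M),
    "merge join": merge_join(M),
    "hash join": hash_join(M),
}
```

The nested-loop figures do not depend on M, which makes the gap to the other three strategies easy to see once the dictionary is printed.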
15.4 The indexed nested-loop join algorithm described in Section 15.5.3 can be
inefficient if the index is a secondary index and there are multiple tuples with
the same value for the join attributes. Why is it inefficient? Describe a way,
using sorting, to reduce the cost of retrieving tuples of the inner relation. Under
what conditions would this algorithm be more efficient than hybrid merge join?
Answer:

If there are multiple tuples in the inner relation with the same value for the
join attributes, we may have to access that many blocks of the inner relation
for each tuple of the outer relation. That is why it is inefficient. To reduce this
cost we can perform a join of the outer relation tuples with just the secondary
index leaf entries, postponing the inner relation tuple retrieval. The result file
obtained is then sorted on the inner relation addresses, allowing an efficient
physical order scan to complete the join.
Hybrid merge join requires the outer relation to be sorted. The above
algorithm does not have this requirement, but for each tuple in the outer relation
it needs to perform an index lookup on the inner relation. If the outer relation
is much larger than the inner relation, this index lookup cost will be less than
the sorting cost, thus this algorithm will be more efficient.
15.5 Let r and s be relations with no indices, and assume that the relations are not
sorted. Assuming infinite memory, what is the lowest-cost way (in terms of I/O
operations) to compute $r \bowtie s$? What is the amount of memory required for
this algorithm?
Answer:

We can store the entire smaller relation in memory, read the larger relation
block by block, and perform nested-loop join using the larger one as the outer
relation. The number of I/O operations is equal to $b_r + b_s$, and the memory
requirement is $\min(b_r, b_s) + 2$ pages.

15.6 Consider the bank database of Figure 15.14, where the primary keys are
underlined. Suppose that a B+-tree index on branch_city is available on relation
branch, and that no other index is available. List different ways to handle the
following selections that involve negation:

a. $\sigma_{\neg(\text{branch\_city} < \text{“Brooklyn”})}(\text{branch})$
b. $\sigma_{\neg(\text{branch\_city} = \text{“Brooklyn”})}(\text{branch})$
c. $\sigma_{\neg(\text{branch\_city} < \text{“Brooklyn”} \,\vee\, \text{assets} < 5000)}(\text{branch})$

Answer:

a. Use the index to locate the first tuple whose branch_city field has value
“Brooklyn”. From this tuple, follow the pointer chains till the end,
retrieving all the tuples.
b. For this query, the index serves no purpose. We can scan the file
sequentially and select all tuples whose branch_city field is anything other than
“Brooklyn”.
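c. This query is equivalent to the query

$\sigma_{(\text{branch\_city} \ge \text{“Brooklyn”} \,\wedge\, \text{assets} \ge 5000)}(\text{branch})$

since, by De Morgan's law, negating a disjunction negates each of its
conditions and turns the "or" into an "and". Using the branch_city index,
we can retrieve all tuples with branch_city value greater than or equal to
“Brooklyn” by following the pointer chains from the first “Brooklyn” tuple.
We also apply the additional criterion of assets ≥ 5000 on every tuple.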

15.7 Write pseudocode for an iterator that implements indexed nested-loop join,
where the outer relation is pipelined. Your pseudocode must define the
standard iterator functions open(), next(), and close(). Show what state information
the iterator must maintain between calls.
Answer:

Let outer be the iterator which returns successive tuples from the pipelined
outer relation. Let inner be the iterator which returns successive tuples of
the inner relation having a given value at the join attributes. The inner
iterator returns these tuples by performing an index lookup. The functions
IndexedNLJoin::open, IndexedNLJoin::close and IndexedNLJoin::next to
implement the indexed nested-loop join iterator are given in Figure 15.101.
The two iterators outer and inner, the value of the last read outer relation
tuple $t_r$ and a flag $done_r$ indicating whether the end of the outer relation
scan has been reached are the state information which need to be remembered
by IndexedNLJoin between calls.

15.8 Design sort-based and hash-based algorithms for computing the relational
division operation (see Practice Exercise 2.9 for a definition of the division
operation).
Answer:

Suppose $r(T \cup S)$ and $s(S)$ are two relations and $r \div s$ has to be computed.
For a sorting-based algorithm, sort relation s on S. Sort relation r on (T, S).
Now, start scanning r and look at the T attribute values of the first tuple. Scan r
as long as the tuples have the same value of T. Also scan s simultaneously and check
whether every tuple of s also occurs as the S attribute of r, in a fashion similar to
merge join. If this is the case, output that value of T and proceed with the next
value of T. Relation s may have to be scanned multiple times, but r will only be
scanned once. Total disk accesses, after sorting both the relations, will be
$|r| + N \times |s|$, where N is the number of distinct values of T in r.
For a hash-based algorithm, we assume that for any value of T, all tuples in r
with that T value fit in memory, and we consider the general case at the end.
Partition the relation r on attributes in T such that each partition fits in memory
(always possible because of our assumption). Consider partitions one at a time.
Build a hash table on the tuples, at the same time collecting all distinct T values
in a separate hash table. Now, for each value $V_T$ of T and each value $s$ of S
occurring in relation s, probe the hash table on $(V_T, s)$. If any of the values is
absent, discard the value $V_T$, else output the value $V_T$.
In the case that not all r tuples with one value for T fit in memory,
partition r and s on the S attributes such that the condition is satisfied, and run
the algorithm on each corresponding pair of partitions $r_i$ and $s_i$. Output the
intersection of the T values generated in each partition.
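The hash-based variant for the simple case, where everything fits in memory and no partitioning is needed, might look like the following sketch. It assumes r is given as a collection of (T value, S value) pairs and s as a collection of S values; the function name divide is hypothetical.

```python
def divide(r, s):
    """Relational division r / s, with r as (t, s) pairs and s as S values."""
    s_values = set(s)
    seen = set(r)                           # hash table on (T, S) pairs
    t_values = {t for (t, _) in seen}       # distinct T values
    # Keep a T value only if it is paired with every value that occurs in s.
    return {t for t in t_values
            if all((t, sv) in seen for sv in s_values)}
```

For instance, dividing {("ann", "db"), ("ann", "os"), ("bob", "db")} by {"db", "os"} leaves only "ann", because "bob" is not paired with "os".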


15.9 What is the effect on the cost of merging runs if the number of buffer blocks
per run is increased while overall memory available for buffering runs remains
fixed?
Answer:

Seek overhead is reduced, but the number of runs that can be merged in a
pass decreases, potentially leading to more passes. A value of $b_b$ that minimizes
overall cost should be chosen.


IndexedNLJoin::open()
begin
    outer.open();
    inner.open();
    done_r := false;
    if (outer.next() ≠ false)
        move tuple from outer's output buffer to t_r;
    else
        done_r := true;
end

IndexedNLJoin::close()
begin
    outer.close();
    inner.close();
end

boolean IndexedNLJoin::next()
begin
    while (¬done_r)
    begin
        if (inner.next(t_r[JoinAttrs]) ≠ false)
        begin
            move tuple from inner's output buffer to t_s;
            compute t_r ⋈ t_s and place it in output buffer;
            return true;
        end
        else
            if (outer.next() ≠ false)
            begin
                move tuple from outer's output buffer to t_r;
                rewind inner to first tuple of s;
            end
            else
                done_r := true;
    end
    return false;
end

Figure 15.101 Answer for Exercise 15.7.
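As an illustration only, the same iterator can be written as a small Python class. The assumptions here are not part of the original answer: the index on the inner relation is modeled as a dictionary from join value to a list of inner tuples, tuples are Python tuples, and next() returns the joined tuple (or None at the end) instead of a boolean plus an output buffer.

```python
class IndexedNLJoin:
    """Iterator-style indexed nested-loop join over a pipelined outer input."""

    def __init__(self, outer, index, outer_key):
        self.outer_source = outer    # any iterable of outer tuples
        self.index = index           # dict: join value -> list of inner tuples
        self.outer_key = outer_key   # function extracting the join value

    def _lookup(self):
        """Index lookup for the current outer tuple t_r."""
        if self.done_r:
            return []
        return self.index.get(self.outer_key(self.t_r), [])

    def open(self):
        self.outer = iter(self.outer_source)
        self.t_r = next(self.outer, None)       # state: last outer tuple read
        self.done_r = self.t_r is None          # state: end of outer reached
        self.matches = iter(self._lookup())     # state: position in inner

    def next(self):
        """Return the next joined tuple, or None when the join is exhausted."""
        while not self.done_r:
            t_s = next(self.matches, None)
            if t_s is not None:
                return self.t_r + t_s
            # Inner matches exhausted: advance the outer and redo the lookup.
            self.t_r = next(self.outer, None)
            self.done_r = self.t_r is None
            self.matches = iter(self._lookup())
        return None

    def close(self):
        self.outer = self.matches = None
```

The three attributes set in open() correspond to the state named in the answer to Exercise 15.7: the current outer tuple, the done flag, and the position of the inner iterator.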


15.10 Consider the following extended relational-algebra operators. Describe how to
implement each operation using sorting and using hashing.
a. Semijoin ($\ltimes_\theta$): The multiset semijoin operator $r \ltimes_\theta s$ is defined as
follows: if a tuple $r_i$ appears n times in r, it appears n times in the result of
$r \ltimes_\theta s$ if there is at least one tuple $s_j$ such that $r_i$ and $s_j$ satisfy
predicate $\theta$; otherwise $r_i$ does not appear in the result.
b. Anti-semijoin ($\overline{\ltimes}_\theta$): The multiset anti-semijoin operator
$r \overline{\ltimes}_\theta s$ is defined as follows: if a tuple $r_i$ appears n times in r, it
appears n times in the result of $r \overline{\ltimes}_\theta s$ if there does not exist any
tuple $s_j$ in s such that $r_i$ and $s_j$ satisfy predicate $\theta$; otherwise $r_i$ does
not appear in the result.

Answer:

Answer:

FILL IN: Check for duplicate preservation

As in the case of join algorithms, semijoin and anti-semijoin can be done
efficiently if the join conditions are equijoin conditions. We describe below how
to efficiently handle the case of equijoin conditions using sorting and hashing.
With arbitrary join conditions, sorting and hashing cannot be used; (block)
nested-loop join needs to be used instead.
a. Semijoin:

• Semijoin using sorting: Sort both r and s on the join attributes in
$\theta$. Perform a scan of both r and s similar to the merge algorithm
and add tuples of r to the result whenever the join attributes of the
current tuples of r and s match.
• Semijoin using hashing: Create a hash index in s on the join
attributes in $\theta$. Iterate over r, and for each distinct value of the join
attributes, perform a hash lookup in s. If the hash lookup returns a
value, add the current tuple of r to the result.
Note that if r and s are large, they can be partitioned on the join
attributes first and the above procedure applied on each partition.
If r is small but s is large, a hash index can be built on r and probed
using s; and if an s tuple matches an r tuple, the r tuple can be output
and deleted from the hash index.
b. Anti-semijoin:

• Anti-semijoin using sorting: Sort both r and s on the join attributes
in $\theta$. Perform a scan of both r and s similar to the merge algorithm
and add tuples of r to the result if no tuple of s satisfies the join
predicate for the corresponding tuple of r.
• Anti-semijoin using hashing: Create a hash index in s on the join
attributes in $\theta$. Iterate over r, and for each distinct value of the join
attributes, perform a hash lookup in s. If the hash lookup returns a
null value, add the current tuple of r to the result.
As for semijoin, partitioning can be used if r and s are large. An
index on r can be used instead of an index on s, but then when an s
tuple matches an r tuple, the r tuple is deleted from the index. After
processing all s tuples, all remaining r tuples in the index are output
as the result of the anti-semijoin operation.
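For equijoin conditions, the hashing variants can be sketched in a few lines. This is a minimal in-memory illustration: the function names and the key-extractor parameters r_key and s_key are hypothetical, and a Python set plays the role of the hash index on s.

```python
def semijoin(r, s, r_key, s_key):
    """Multiset semijoin: keep each r tuple, with duplicates, that matches s."""
    s_keys = {s_key(t) for t in s}          # hash index on s's join attributes
    return [t for t in r if r_key(t) in s_keys]

def anti_semijoin(r, s, r_key, s_key):
    """Multiset anti-semijoin: keep each r tuple that has no match in s."""
    s_keys = {s_key(t) for t in s}
    return [t for t in r if r_key(t) not in s_keys]
```

Because every r tuple is tested on its own, a tuple that appears n times in r appears n times in the output regardless of how many s tuples it matches, which is the duplicate behavior the definitions in the exercise ask for.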
15.11 Suppose a query retrieves only the first K results of an operation and
terminates after that. Which choice of demand-driven or producer-driven pipelining
(with buffering) would be a good choice for such a query? Explain your
answer.
Answer:

Demand driven is better, since it will only generate the top K results. Producer
driven may generate a lot more answers, many of which would not get used.
15.12 Current generation CPUs include an instruction cache, which caches recently
used instructions. A function call then has a significant overhead because the
set of instructions being executed changes, resulting in cache misses on the
instruction cache.
a. Explain why producer-driven pipelining with buffering is likely to result
in a better instruction cache hit rate, as compared to demand-driven
pipelining.
b. Explain why modifying demand-driven pipelining by generating multiple
results on one call to next(), and returning them together, can improve
the instruction cache hit rate.

Answer:

Producer-driven pipelining executes the same set of instructions to generate
multiple tuples by consuming already generated tuples from the inputs. Thus
there will be more instruction cache hits. In comparison, demand-driven pipelining
switches from the instructions of one function to another for each tuple,
resulting in more misses.
By generating multiple results at one go, a next() function would receive
multiple tuples in its inputs and have a loop that generates multiple tuples for
its output without switching execution to another function. Thus, the
instruction cache hit rate can be expected to improve.
15.13 Suppose you want to find documents that contain at least k of a given set of n
keywords. Suppose also you have a keyword index that gives you a (sorted) list
of identifiers of documents that contain a specified keyword. Give an efficient
algorithm to find the desired set of documents.
Answer:

Let S be a set of n keywords. An algorithm to find all documents that contain
at least k of these keywords is given in Figure 15.102.
initialize the list L to the empty list;
for (each keyword c in S) do
begin
    D := the list of document identifiers corresponding to c;
    for (each document identifier d in D) do
        if (a record R with document identifier as d is on list L) then
            R.reference_count := R.reference_count + 1;
        else begin
            make a new record R;
            R.document_id := d;
            R.reference_count := 1;
            add R to L;
        end
end;
for (each record R in L) do
    if (R.reference_count >= k) then
        output R;

Figure 15.102 Answer for Exercise 15.13.

This algorithm calculates a reference count for each document identifier.
A reference count of i for a document identifier d means that at least i of the
keywords in S occur in the document identified by d. The algorithm maintains
a list of records, each having two fields: a document identifier, and the
reference count for this identifier. This list is maintained sorted on the document
identifier field.
Note that execution of the second for statement causes the list D to “merge”
with the list L. Since the lists L and D are sorted, the time taken for this merge
is proportional to the sum of the lengths of the two lists. Thus the algorithm
runs in time (at most) proportional to n times the sum total of the number of
document identifiers corresponding to each keyword in S.
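A runnable version of the same merge-based counting is sketched below. It assumes the keyword index has already been consulted, so the input is one sorted, duplicate-free list of document identifiers per keyword; the function name docs_with_at_least_k is hypothetical.

```python
def docs_with_at_least_k(postings, k):
    """postings: one sorted list of document ids per keyword in S."""
    merged = []                     # list L: sorted [doc_id, reference_count]
    for doc_ids in postings:        # list D for the current keyword
        result, i, j = [], 0, 0
        while i < len(merged) or j < len(doc_ids):
            if j == len(doc_ids) or (i < len(merged)
                                     and merged[i][0] < doc_ids[j]):
                result.append(merged[i])            # id only on L so far
                i += 1
            elif i == len(merged) or doc_ids[j] < merged[i][0]:
                result.append([doc_ids[j], 1])      # new id, first reference
                j += 1
            else:                                   # id on both lists
                result.append([merged[i][0], merged[i][1] + 1])
                i += 1
                j += 1
        merged = result
    return [doc_id for doc_id, count in merged if count >= k]
```

Each pass over a keyword's list is a linear merge with the list built so far, which is where the time bound stated above comes from.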
15.14 Suggest how a document containing a word (such as “leopard”) can be
indexed such that it is efficiently retrieved by queries using a more general
concept (such as “carnivore” or “mammal”). You can assume that the concept
hierarchy is not very deep, so each concept has only a few generalizations (a
concept can, however, have a large number of specializations). You can also
assume that you are provided with a function that returns the concept for each
word in a document. Also suggest how a query using a specialized concept can
retrieve documents using a more general concept.
Answer:

Add doc to index lists for more general concepts also.


15.15 Explain why the nested-loops join algorithm (see Section 15.5.1) would work
poorly on a database stored in a column-oriented manner. Describe an
alternative algorithm that would work better, and explain why your solution is better.
Answer:

If the nested-loops join algorithm is used as is, it would require tuples for each
of the relations to be assembled before they are joined. Assembling tuples can
be expensive in a column store, since each attribute may come from a separate
area of the disk; the overhead of assembly would be particularly wasteful if
many tuples do not satisfy the join condition and would be discarded. In such
a situation it would be better to first find which tuples match by accessing only
the join columns of the relations. Sort-merge join, hash join, or indexed nested
loops join can be used for this task. After the join is performed, only tuples that
get output by the join need to be assembled; assembly can be done by sorting
the join result on the record identifier of one of the relations and accessing
the corresponding attributes, then resorting on record identifiers of the other
relation to access its attributes.
15.16 Consider the following queries. For each query, indicate if column-oriented
storage is likely to be beneficial or not, and explain why.
a. Fetch ID, name and dept_name of the student with ID 12345.
b. Group the takes relation by year and course_id, and find the total number
of students for each (year, course_id) combination.

Answer:

FILL IN AND recheck question


CHAPTER
16
Query Optimization
Practice Exercises
16.1 Download the university database schema and the large university dataset from
dbbook.com. Create the university schema on your favorite database, and load
the large university dataset. Use the explain feature described in Note 16.1 on
page 746 to view the plan chosen by the database, in different cases as detailed
below.

a. Write a query with an equality condition on student.name (which does
not have an index), and view the plan chosen.
b. Create an index on the attribute student.name, and view the plan chosen
for the above query.
c. Create simple queries joining two relations, or three relations, and view
the plans chosen.
d. Create a query that computes an aggregate with grouping, and view the
plan chosen.
e. Create an SQL query whose chosen plan uses a semijoin operation.
f. Create an SQL query that uses a not in clause, with a subquery using
aggregation. Observe what plan is chosen.
g. Create a query for which the chosen plan uses correlated evaluation (the
way correlated evaluation is represented varies by database, but most
databases would show a filter or a project operator with a subplan or
subquery).
h. Create an SQL update query that updates a single row in a relation. View
the plan chosen for the update query.
i. Create an SQL update query that updates a large number of rows in a
relation, using a subquery to compute the new value. View the plan chosen
for the update query.

Answer:
The answer depends on the database.
FILL IN Suggested queries for each exercise as verified on some database
16.2 Show that the following equivalences hold. Explain how you can apply them
to improve the efficiency of certain queries:
a. $E_1 \bowtie_\theta (E_2 - E_3) \equiv (E_1 \bowtie_\theta E_2 - E_1 \bowtie_\theta E_3)$.
b. $\sigma_\theta({}_A\gamma_F(E)) \equiv {}_A\gamma_F(\sigma_\theta(E))$, where $\theta$ uses only attributes from $A$.
c. $\sigma_\theta(E_1 ⟕ E_2) \equiv \sigma_\theta(E_1) ⟕ E_2$, where $\theta$ uses only attributes from $E_1$.

Answer:

a. $E_1 \bowtie_\theta (E_2 - E_3) = (E_1 \bowtie_\theta E_2 - E_1 \bowtie_\theta E_3)$.

Let us rename $(E_1 \bowtie_\theta (E_2 - E_3))$ as $R_1$, $(E_1 \bowtie_\theta E_2)$ as $R_2$ and
$(E_1 \bowtie_\theta E_3)$ as $R_3$. It is clear that if a tuple t belongs to $R_1$, it will also
belong to $R_2$. If a tuple t belongs to $R_3$, t[$E_3$'s attributes] will belong to
$E_3$, hence t cannot belong to $R_1$. From these two we can say that

$\forall t,\ t \in R_1 \Rightarrow t \in (R_2 - R_3)$

It is clear that if a tuple t belongs to $R_2 - R_3$, then
t[$R_2$'s attributes] $\in E_2$ and t[$R_2$'s attributes] $\notin E_3$. Therefore:

$\forall t,\ t \in (R_2 - R_3) \Rightarrow t \in R_1$

The above two equations imply the given equivalence.
This equivalence is helpful because evaluation of the right-hand side
join will produce many tuples which will finally be removed from the
result. The left-hand side expression can be evaluated more efficiently.
b. $\sigma_\theta({}_A\gamma_F(E)) = {}_A\gamma_F(\sigma_\theta(E))$, where $\theta$ uses only attributes from $A$.

$\theta$ uses only attributes from $A$. Therefore if any tuple t in the output of
${}_A\gamma_F(E)$ is filtered out by the selection of the left-hand side, all the tuples
in E whose value in A is equal to t[A] are filtered out by the selection of
the right-hand side. Therefore:

$\forall t,\ t \notin \sigma_\theta({}_A\gamma_F(E)) \Rightarrow t \notin {}_A\gamma_F(\sigma_\theta(E))$

Using similar reasoning, we can also conclude that

$\forall t,\ t \notin {}_A\gamma_F(\sigma_\theta(E)) \Rightarrow t \notin \sigma_\theta({}_A\gamma_F(E))$

The above two equations imply the given equivalence.
This equivalence is helpful because evaluation of the right-hand side
avoids performing the aggregation on groups which are going to be
removed from the result. Thus the right-hand side expression can be
evaluated more efficiently than the left-hand side expression.
c. $\sigma_\theta(E_1 ⟕ E_2) = \sigma_\theta(E_1) ⟕ E_2$, where $\theta$ uses only attributes from $E_1$.

$\theta$ uses only attributes from $E_1$. Therefore if any tuple t in the output of
$(E_1 ⟕ E_2)$ is filtered out by the selection of the left-hand side, all the
tuples in $E_1$ whose value is equal to t[$E_1$] are filtered out by the selection
of the right-hand side. Therefore:

$\forall t,\ t \notin \sigma_\theta(E_1 ⟕ E_2) \Rightarrow t \notin \sigma_\theta(E_1) ⟕ E_2$

Using similar reasoning, we can also conclude that

$\forall t,\ t \notin \sigma_\theta(E_1) ⟕ E_2 \Rightarrow t \notin \sigma_\theta(E_1 ⟕ E_2)$

The above two equations imply the given equivalence.
This equivalence is helpful because evaluation of the right-hand side
avoids producing many output tuples which are going to be removed
from the result. Thus the right-hand side expression can be evaluated
more efficiently than the left-hand side expression.
16.3 For each of the following pairs of expressions, give instances of relations that
show the expressions are not equivalent.
a. $\Pi_A(r - s)$ and $\Pi_A(r) - \Pi_A(s)$.
b. $\sigma_{B<4}({}_A\gamma_{\max(B) \text{ as } B}(r))$ and ${}_A\gamma_{\max(B) \text{ as } B}(\sigma_{B<4}(r))$.
c. In the preceding expressions, if both occurrences of max were replaced
by min, would the expressions be equivalent?
d. $(r ⟖ s) ⟖ t$ and $r ⟖ (s ⟖ t)$.
In other words, the natural right outer join is not associative.
e. $\sigma_\theta(E_1 ⟕ E_2)$ and $E_1 ⟕ \sigma_\theta(E_2)$, where $\theta$ uses only attributes from $E_2$.

Answer:

a. $R = \{(1, 2)\}$, $S = \{(1, 3)\}$

The result of the left-hand side expression is $\{(1)\}$, whereas the result of
the right-hand side expression is empty.
b. $R = \{(1, 2), (1, 5)\}$

The left-hand side expression has an empty result, whereas the right-hand
side one has the result $\{(1, 2)\}$.
c. Yes, on replacing the max by the min, the expressions will become
equivalent. Any tuple that the selection in the rhs eliminates would not pass
the selection on the lhs if it were the minimum value and would be
eliminated anyway if it were not the minimum value.
d. $R = \{(1, 2)\}$, $S = \{(2, 3)\}$, $T = \{(1, 4)\}$. The left-hand
expression gives $\{(1, 2, null, 4)\}$ whereas the right-hand expression gives
$\{(1, 2, 3, null)\}$.
e. Let R be of the schema (A, B) and S of (A, C). Let $R = \{(1, 2)\}$, $S =
\{(2, 3)\}$ and let $\theta$ be the expression C = 1. The left side expression's
result is empty, whereas the right side expression results in $\{(1, 2, null)\}$.
16.4 SQL allows relations with duplicates (Chapter 3), and the multiset version of
the relational algebra is defined in Note 3.1 on page 80, Note 3.2 on page 97,
and Note 3.3 on page 108. Check which of the equivalence rules 1 through 7.b
hold for the multiset version of the relational algebra.
Answer:
All the equivalence rules 1 through 7.b of Section 16.2.1 hold for the
multiset version of the relational algebra defined in Chapter 2.
There exist equivalence rules that hold for the ordinary relational algebra but
do not hold for the multiset version. For example, consider the rule:

$A \cap B = A \cup B - (A - B) - (B - A)$

This is clearly valid in plain relational algebra. Consider a multiset instance
in which a tuple t occurs 4 times in A and 3 times in B. t will occur 3 times
in the output of the left-hand side expression, but 6 times in the output of the
right-hand side expression. The reason for this rule to not hold in the multiset
version is the asymmetry in the semantics of multiset union and intersection.
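The counts in this counterexample can be checked with Python's collections.Counter, used here as a stand-in for a multiset relation. The sketch assumes that multiset union adds the multiplicities of the two inputs, which is the reading under which the right-hand side yields 6 copies of t.

```python
from collections import Counter

A = Counter({"t": 4})      # tuple t occurs 4 times in A
B = Counter({"t": 3})      # tuple t occurs 3 times in B

# Multiset intersection keeps the minimum of the two counts.
lhs = A & B

# Multiset union adds counts; multiset difference subtracts and drops
# counts that would fall to zero or below.
rhs = (A + B) - (A - B) - (B - A)

lhs_count, rhs_count = lhs["t"], rhs["t"]
```

Here A - B leaves one copy of t and B - A leaves none, so only one of the seven copies in the union is removed.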
16.5 Consider the relations $r_1(A, B, C)$, $r_2(C, D, E)$, and $r_3(E, F)$, with primary keys
A, C, and E, respectively. Assume that $r_1$ has 1000 tuples, $r_2$ has 1500 tuples,
and $r_3$ has 750 tuples. Estimate the size of $r_1 \bowtie r_2 \bowtie r_3$, and give an efficient
strategy for computing the join.


Answer:

• The relation resulting from the join of $r_1$, $r_2$, and $r_3$ will be the same no
matter which way we join them, due to the associative and commutative
properties of joins. So we will consider the size based on the strategy of
$((r_1 \bowtie r_2) \bowtie r_3)$. Joining $r_1$ with $r_2$ will yield a relation of at most 1000
tuples, since C is a key for $r_2$. Likewise, joining that result with $r_3$ will yield
a relation of at most 1000 tuples because E is a key for $r_3$. Therefore, the
final relation will have at most 1000 tuples.
• An efficient strategy for computing this join would be to create an index
on attribute C for relation $r_2$ and on E for $r_3$. Then for each tuple in $r_1$, we
do the following:
a. Use the index for $r_2$ to look up at most one tuple which matches the
C value of $r_1$.
b. Use the created index on E to look up in $r_3$ at most one tuple which
matches the unique value for E in $r_2$.


16.6 Consider the relations $r_1(A, B, C)$, $r_2(C, D, E)$, and $r_3(E, F)$ of Practice
Exercise 16.5. Assume that there are no primary keys, except the entire schema.
Let $V(C, r_1)$ be 900, $V(C, r_2)$ be 1100, $V(E, r_2)$ be 50, and $V(E, r_3)$ be 100.
Assume that $r_1$ has 1000 tuples, $r_2$ has 1500 tuples, and $r_3$ has 750 tuples.
Estimate the size of $r_1 \bowtie r_2 \bowtie r_3$ and give an efficient strategy for computing
the join.
Answer:
The estimated size of the relation can be determined by calculating the average
number of tuples which would be joined with each tuple of the second relation.
In this case, for each tuple in $r_1$, $1500/V(C, r_2) = 15/11$ tuples (on the average)
of $r_2$ would join with it. The intermediate relation would have 15000/11 tuples.
This relation is joined with $r_3$ to yield a result of approximately 10,227 tuples
($15000/11 \times 750/100 = 10227$). A good strategy should join $r_1$ and $r_2$ first,
since the intermediate relation is about the same size as $r_1$ or $r_2$. Then $r_3$ is
joined to this result.
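The arithmetic can be checked with exact fractions. The sketch below assumes the usual join-size estimate of n_r * n_s divided by the larger of the two distinct-value counts; for the join on C that larger count is V(C, r2) = 1100, which matches the divisor used in the answer, and for the join on E the answer divides by V(E, r3) = 100. The variable names are illustrative.

```python
from fractions import Fraction

n_r1, n_r2, n_r3 = 1000, 1500, 750
V_C_r1, V_C_r2 = 900, 1100
V_E_r3 = 100

# |r1 join r2| is estimated as n_r1 * n_r2 / max(V(C, r1), V(C, r2)).
r1_r2 = Fraction(n_r1 * n_r2, max(V_C_r1, V_C_r2))

# Each intermediate tuple joins with n_r3 / V(E, r3) tuples of r3 on average.
r1_r2_r3 = r1_r2 * Fraction(n_r3, V_E_r3)

estimate = int(r1_r2_r3)     # truncated to whole tuples
```

Using Fraction keeps the intermediate value 15000/11 exact, so the final figure is not affected by floating-point rounding.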


16.7 Suppose that a B+-tree index on building is available on relation department
and that no other index is available. What would be the best way to handle the
following selections that involve negation?
a. $\sigma_{\neg(\text{building} < \text{“Watson”})}(\text{department})$
b. $\sigma_{\neg(\text{building} = \text{“Watson”})}(\text{department})$
c. $\sigma_{\neg(\text{building} < \text{“Watson”} \,\vee\, \text{budget} < 50000)}(\text{department})$

Answer:

a. Use the index to locate the first tuple whose building field has value
“Watson”. From this tuple, follow the pointer chains till the end, retrieving all
the tuples.
b. For this query, the index serves no purpose. We can scan the file
sequentially and select all tuples whose building field is anything other than
“Watson”.
c. This query is equivalent to the query:

$\sigma_{(\text{building} \ge \text{'Watson'} \,\wedge\, \text{budget} \ge 50000)}(\text{department})$

since, by De Morgan's law, negating a disjunction negates each of its
conditions and turns the "or" into an "and". Using the building index, we
can retrieve all tuples with building value greater than or equal to “Watson”
by following the pointer chains from the first “Watson” tuple. We also apply
the additional criterion of budget ≥ 50000 on every tuple.
16.8 Consider the query:

select *
from r, s
where upper(r.A) = upper(s.A);

where “upper” is a function that returns its input argument with all lowercase
letters replaced by the corresponding uppercase letters.
a. Find out what plan is generated for this query on the database system
you use.
b. Some database systems would use a (block) nested-loop join for this
query, which can be very inefficient. Briefly explain how hash-join or
merge-join can be used for this query.

Answer:

a. First create relations r and s, and add some tuples to the two relations,
before finding the plan chosen; or use existing relations in place of r and
s. Compare the chosen plan with the plan chosen for a query directly
equating r.A = s.B. Check the estimated statistics, too. Some databases
may give the same plan, but with vastly different statistics.
(On PostgreSQL, we found that the optimizer used the merge join
plan described in the answer to the next part of this question.)
b. To use hash join, hashing should be done after applying the upper()
function to r.A and s.A. Similarly, for merge join, the relations should
be sorted on the result of applying the upper() function on r.A and s.A.
The hash or merge join algorithms can then be used unchanged.
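Part (b) can be illustrated with a small in-memory hash join in which both the build and the probe side hash on the upper-cased attribute value. The row representation (Python dictionaries) and the function name hash_join_upper are assumptions made for this sketch.

```python
from collections import defaultdict

def hash_join_upper(r, s, r_attr, s_attr):
    """Join dict rows of r and s on upper(r[r_attr]) = upper(s[s_attr])."""
    buckets = defaultdict(list)
    for row in s:                                    # build phase on s
        buckets[row[s_attr].upper()].append(row)
    return [(row_r, row_s)                           # probe phase with r
            for row_r in r
            for row_s in buckets.get(row_r[r_attr].upper(), [])]
```

The join logic itself is an ordinary hash join; only the value fed to the hash table changes, which is the point made in the answer.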
16.9 Give conditions under which the following expressions are equivalent:

${}_{A,B}\gamma_{agg(C)}(E_1 \bowtie E_2)$ and $({}_A\gamma_{agg(C)}(E_1)) \bowtie E_2$

where agg denotes any aggregation operation. How can the above conditions
be relaxed if agg is one of min or max?
Answer:

The above expressions are equivalent provided $E_2$ contains only attributes A
and B, with A as the primary key (so there are no duplicates). It is OK if $E_2$
does not contain some A values that exist in the result of $E_1$, since such values
will get filtered out in either expression. However, if there are duplicate values
in $E_2.A$, the aggregate results in the two cases would be different.

If the aggregate function is min or max, duplicate A values do not have any
effect. However, there should be no duplicates on (A, B); the first expression
removes such duplicates, while the second does not.
16.10 Consider the issue of interesting orders in optimization. Suppose you are given
a query that computes the natural join of a set of relations S. Given a subset
S1 of S, what are the interesting orders of S1?

Answer:

The interesting orders are all orders on subsets of attributes that can potentially
participate in join conditions in further joins. Thus, let T be the set of all
attributes of S1 that also occur in any relation in S − S1. Then every ordering
of every subset of T is an interesting order.
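
As a small illustration of this answer, the sketch below computes T and enumerates the orderings of its nonempty subsets. The relation names r1, r2, r3 and their attribute sets are hypothetical, and representing a schema as a dict of attribute sets is an assumption made only for the example.

```python
from itertools import permutations

def join_attributes(s1, s):
    """Attributes of relations in s1 that also occur in some relation of s - s1."""
    inside = set().union(*s1.values())
    outside = set().union(*(attrs for name, attrs in s.items() if name not in s1))
    return inside & outside

def interesting_orders(t):
    """Every ordering of every nonempty subset of the attribute set t."""
    attrs = sorted(t)
    return [p for k in range(1, len(attrs) + 1) for p in permutations(attrs, k)]

# Hypothetical schema: S = {r1, r2, r3}, S1 = {r1, r2}.
s = {"r1": {"A", "B"}, "r2": {"B", "C"}, "r3": {"A", "C", "D"}}
s1 = {name: s[name] for name in ("r1", "r2")}
t = join_attributes(s1, s)
orders = interesting_orders(t)
```

Here B is joined entirely inside S1, so it does not appear in T; only A and C can matter for later joins.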
16.11 Modify the FindBestPlan(S) function to create a function FindBestPlan(S, O),
where O is a desired sort order for S, and which considers interesting sort
orders. A null order indicates that the order is not relevant. Hints: An algorithm
A may give the desired order O; if not, a sort operation may need to be added
to get the desired order. If A is a merge-join, FindBestPlan must be invoked on
the two inputs with the desired orders for the inputs.
Answer:
FILL IN
16.12 Show that, with n relations, there are $(2(n-1))!/(n-1)!$ different join orders.
Hint: A complete binary tree is one where every internal node has exactly two
children. Use the fact that the number of different complete binary trees with
n leaf nodes is:

$\frac{1}{n}\binom{2(n-1)}{n-1}$

If you wish, you can derive the formula for the number of complete binary trees
with n nodes from the formula for the number of binary trees with n nodes.
The number of binary trees with n nodes is:

$\frac{1}{n+1}\binom{2n}{n}$

This number is known as the Catalan number, and its derivation can be found
in any standard textbook on data structures or algorithms.
Answer:

Each join order is a complete binary tree (every non-leaf node has exactly two
children) with the relations as the leaves. The number of different complete
binary trees with n leaf nodes is $\frac{1}{n}\binom{2(n-1)}{n-1}$. This is because
there is a bijection between complete binary trees with n leaves and
binary trees with n − 1 nodes. Any complete binary tree with n leaves has n − 1
internal nodes. Removing all the leaf nodes, we get a binary tree with n − 1
nodes. Conversely, given any binary tree with n − 1 nodes, it can be converted
to a complete binary tree by adding n leaves in a unique way. The number
of binary trees with n − 1 nodes is given by $\frac{1}{n}\binom{2(n-1)}{n-1}$, known as
the Catalan number. Multiplying this by n! for the number of permutations of
the n leaves, we get the desired result.
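
The counting argument can be checked numerically. The sketch below is only an illustration: the helper names catalan, join_orders, and closed_form are introduced here, and math.comb requires Python 3.8 or later.

```python
from math import comb, factorial

def catalan(m):
    """Number of binary trees with m nodes: C(2m, m) / (m + 1)."""
    return comb(2 * m, m) // (m + 1)

def join_orders(n):
    """Tree shapes with n leaves (n - 1 internal nodes) times leaf permutations."""
    return catalan(n - 1) * factorial(n)

def closed_form(n):
    """The formula to be shown: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

# The two counts agree for every n tried here.
agree = all(join_orders(n) == closed_form(n) for n in range(1, 12))
```

For example, with n = 3 there are 2 tree shapes and 3! = 6 leaf permutations, giving 12 = 4!/2! join orders.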


16.13 Show that the lowest-cost join order can be computed in time $O(3^n)$. Assume
that you can store and look up information about a set of relations (such as
the optimal join order for the set, and the cost of that join order) in constant
time. (If you find this exercise difficult, at least show the looser time bound of
$O(2^{2n})$.)

Answer:
Consider the dynamic programming algorithm given in Section 16.4. For each
subset having k + 1 relations, the optimal join order can be computed in time
$2^{k+1}$. That is because for one particular pair of subsets A and B, we need
constant time, and there are at most $2^{k+1} - 2$ different subsets that A can
denote. Thus, over all the $\binom{n}{k+1}$ subsets of size k + 1, this cost is
$\binom{n}{k+1} 2^{k+1}$. Summing over all k from 1 to n − 1 gives the binomial
expansion of $(1 + x)^n$ with x = 2, minus its first two terms. Thus the total
cost is less than $3^n$.
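
The bound can be sanity-checked by counting how many (subset, split) pairs the dynamic program examines. This is a sketch under the assumption that each split costs one unit of work; the function name dp_split_count is introduced here for illustration.

```python
from math import comb

def dp_split_count(n):
    """Count pairs (S', A) with S' a subset of size >= 2 and A a nonempty proper subset of S'."""
    total = 0
    for size in range(2, n + 1):
        # comb(n, size) subsets of this size, each with 2**size - 2 possible choices of A.
        total += comb(n, size) * (2 ** size - 2)
    return total

# The count stays below 3**n for every n tried here.
bounded = all(dp_split_count(n) < 3 ** n for n in range(2, 15))
```

Working the sum out by hand gives $3^n - 2^{n+1} + 1$ splits in total, consistent with the $O(3^n)$ bound.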

16.14 Show that, if only left-deep join trees are considered, as in the System R
optimizer, the time taken to find the most efficient join order is around $n2^n$.
Assume that there is only one interesting sort order.
Answer:
The derivation of time taken is similar to the general case, except that instead
of considering $2^{k+1} - 2$ subsets of size less than or equal to k for A, we only
need to consider k + 1 subsets of size exactly equal to k. That is because the
right-hand operand of the topmost join has to be a single relation. Therefore
the total cost for finding the best join order for all subsets of size k + 1 is
$\binom{n}{k+1}(k + 1)$, which is equal to $n\binom{n-1}{k}$. Summing over all k
from 1 to n − 1 using the binomial expansion of $(1 + x)^{n-1}$ with x = 1 gives
a total cost of less than $n2^{n-1}$.
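
The two algebraic steps used in this answer can be written out explicitly; the following is a worked version of the same derivation, with no new assumptions beyond the definition of the binomial coefficient.

```latex
\binom{n}{k+1}(k+1)
  = \frac{n!}{(k+1)!\,(n-k-1)!}\,(k+1)
  = n \cdot \frac{(n-1)!}{k!\,(n-1-k)!}
  = n\binom{n-1}{k},
\qquad
\sum_{k=1}^{n-1} n\binom{n-1}{k}
  = n\left(2^{\,n-1} - 1\right)
  < n\,2^{\,n-1}.
```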

16.15 Consider the bank database of Figure 16.9, where the primary keys are
underlined. Construct the following SQL queries for this relational database.
a. Write a nested query on the relation account to find, for each branch
with name starting with B, all accounts with the maximum balance at
the branch.
b. Rewrite the preceding query without using a nested subquery; in other
words, decorrelate the query, but in SQL.
c. Give a relational algebra expression using semijoin equivalent to the
query.
d. Give a procedure (similar to that described in Section 16.4.4) for
decorrelating such queries.

Answer:

a. The nested query is as follows:

select S.account_number
from account S
where S.branch_name like 'B%' and
      S.balance =
          (select max(T.balance)
           from account T
           where T.branch_name = S.branch_name)

b. The decorrelated query is as follows:

create table t1 as
    select branch_name, max(balance) as balance
    from account
    group by branch_name;

select account_number
from account, t1
where account.branch_name like 'B%' and
      account.branch_name = t1.branch_name and
      account.balance = t1.balance

c. FILL IN
d. In general, consider the queries of the form:

branch (branch_name, branch_city, assets)
customer (customer_name, customer_street, customer_city)
loan (loan_number, branch_name, amount)
borrower (customer_name, loan_number)
account (account_number, branch_name, balance)
depositor (customer_name, account_number)

Figure 16.9 Banking database.


select ...
from $L_1$
where $P_1$ and
      $A_1$ op
          (select f($A_2$)
           from $L_2$
           where $P_2$)

where f is some aggregate function on attributes $A_2$ and op is some
boolean binary operator. It can be rewritten as

***** FILL IN **** GIVE Relational algebra version *****

create table $t_1$ as
    select f($A_2$), V
    from $L_2$
    where $P_2^1$
    group by V

select ...
from $L_1$, $t_1$
where $P_1$ and $P_2^2$ and
      $A_1$ op $t_1.A_2$

where $P_2^1$ contains predicates in $P_2$ without selections involving
correlation variables, and $P_2^2$ introduces the selections involving the
correlation variables. V contains all the attributes that are used in the
selections involving correlation variables in the nested query.
CHAPTER
17
Transactions
Practice Exercises
17.1 Suppose that there is a database system that never fails. Is a recovery manager
required for this system?
Answer:
Even in this case the recovery manager is needed to perform rollback of aborted
transactions for cases where the transaction itself fails.
17.2 Consider a file system such as the one on your favorite operating system.

a. What are the steps involved in the creation and deletion of files and in
writing data to a file?
b. Explain how the issues of atomicity and durability are relevant to the
creation and deletion of files and to writing data to files.

Answer:

There are several steps in the creation of a file. A storage area is assigned to the
file in the file system. (In UNIX, a unique i-number is given to the file and an
i-node entry is inserted into the i-list.) Deletion of a file involves exactly the
opposite steps.
For the file system user, durability is important for obvious reasons, but
atomicity is generally not relevant, as the file system doesn't support
transactions. To the file system implementor, though, many of the internal file
system actions need to have transaction semantics. All steps involved in
creation/deletion of the file must be atomic; otherwise there will be
unreferenceable files or unusable areas in the file system.
17.3 Database-system implementers have paid much more attention to the ACID
properties than have file-system implementers. Why might this be the case?
Answer:

Database systems usually perform crucial tasks whose effects need to be atomic
and durable, and whose outcome affects the real world in a permanent manner.
Examples of such tasks are monetary transactions, seat bookings, etc. Hence
the ACID properties have to be ensured. In contrast, most users of file systems
would not be willing to pay the price (monetary, disk space, time) of supporting
ACID properties.
17.4 What class or classes of storage can be used to ensure durability? Why?
Answer:
Only stable storage ensures true durability. Even nonvolatile storage is
susceptible to data loss, albeit less so than volatile storage. Stable storage is
only an abstraction. It is approximated by redundant use of nonvolatile storage
in which data are not only replicated but distributed physically to reduce to
near zero the chance of a single event causing data loss.
17.5 Since every conflict-serializable schedule is view serializable, why do we
emphasize conflict serializability rather than view serializability?
Answer:

Most of the concurrency control protocols (protocols for ensuring that only
serializable schedules are generated) used in practice are based on conflict
serializability; they actually permit only a subset of conflict serializable
schedules. The general form of view serializability is very expensive to test,
and only a very restricted form of it is used for concurrency control.
17.6 Consider the precedence graph of Figure 17.16. Is the corresponding schedule
conflict serializable? Explain your answer.
Answer:

Figure 17.16 Precedence graph for Practice Exercise 17.6 (nodes T1, T2, T3, T4, T5).

There is a serializable schedule corresponding to the precedence graph since
the graph is acyclic. A possible schedule is obtained by doing a topological
sort, that is, T1, T2, T3, T4, T5.
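
The acyclicity test and the topological sort can be sketched as follows. The edge list below is hypothetical: it is chosen only to be consistent with the serial order given in the answer, and the function name serial_order is introduced here for illustration.

```python
from collections import deque

def serial_order(nodes, edges):
    """Return a topological order of the precedence graph, or None if it has a cycle."""
    indegree = {t: 0 for t in nodes}
    successors = {t: [] for t in nodes}
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1
    ready = deque(t for t in nodes if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for nxt in successors[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # If some node was never emitted, it lies on or behind a cycle.
    return order if len(order) == len(nodes) else None

nodes = ["T1", "T2", "T3", "T4", "T5"]
# Hypothetical edges, assumed for illustration only.
edges = [("T1", "T2"), ("T1", "T3"), ("T2", "T4"), ("T3", "T4"), ("T4", "T5")]
order = serial_order(nodes, edges)
cyclic = serial_order(nodes, edges + [("T5", "T1")])
```

A schedule is conflict serializable exactly when this function returns an order rather than None.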
17.7 What is a cascadeless schedule? Why is cascadelessness of schedules
desirable? Are there any circumstances under which it would be desirable to allow
noncascadeless schedules? Explain your answer.
Answer:

A cascadeless schedule is one where, for each pair of transactions $T_i$ and $T_j$
such that $T_j$ reads data items previously written by $T_i$, the commit operation of
$T_i$ appears before the read operation of $T_j$. Cascadeless schedules are desirable
because the failure of a transaction does not lead to the aborting of any other
transaction. Of course this comes at the cost of less concurrency. If failures
occur rarely, so that we can pay the price of cascading aborts for the increased
concurrency, noncascadeless schedules might be desirable.
17.8 The lost update anomaly is said to occur if a transaction $T_j$ reads a data item,
then another transaction $T_k$ writes the data item (possibly based on a previous
read), after which $T_j$ writes the data item. The update performed by $T_k$ has
been lost, since the update done by $T_j$ ignored the value written by $T_k$.

a. Give an example of a schedule showing the lost update anomaly.
b. Give an example schedule to show that the lost update anomaly is
possible with the read committed isolation level.
c. Explain why the lost update anomaly is not possible with the repeatable
read isolation level.

Answer:

a. A schedule showing the lost update anomaly:

T1              T2
read(A)
                read(A)
                write(A)
write(A)

In the above schedule, the value written by the transaction T2 is lost
because of the write of the transaction T1.
b. Lost update anomaly in read-committed isolation level:

T1              T2
lock-S(A)
read(A)
unlock(A)
                lock-X(A)
                read(A)
                write(A)
                unlock(A)
                commit
lock-X(A)
write(A)
unlock(A)
commit

The locking in the above schedule ensures the read-committed isolation
level. The value written by transaction T2 is lost due to T1's write.
c. Lost update anomaly is not possible in repeatable read isolation level.
In repeatable read isolation level, a transaction T1 reading a data item
X holds a shared lock on X till the end. This makes it impossible for a
newer transaction T2 to write the value of X (which requires an X-lock) until
T1 finishes. This forces the serialization order T1, T2, and thus the value
written by T2 is not lost.

17.9 Consider a database for a bank where the database system uses snapshot
isolation. Describe a particular scenario in which a nonserializable execution
occurs that would present a problem for the bank.
Answer:
Suppose that the bank enforces the integrity constraint that the sum of the
balances in the checking and the savings account of a customer must not be
negative. Suppose the checking and savings balances for a customer are $100
and $200 respectively.
Suppose that transaction T1 withdraws $200 from the checking account
after verifying the integrity constraint by reading both the balances. Suppose
that concurrent transaction T2 withdraws $200 from the savings account
after verifying the integrity constraint by reading both the balances.
Since each of the transactions checks the integrity constraint on its own
snapshot, if they run concurrently, each will believe that the sum of the
balances after the withdrawal is $100, and therefore its withdrawal does not
violate the integrity constraint. Since the two transactions update different data
items, they do not have any update conflict, and under snapshot isolation both
of them can commit. This is a nonserializable execution which results in a
serious problem: the combined balance ends up negative, violating the
integrity constraint.
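
The scenario can be replayed with a toy model of snapshot isolation. Everything below is a simplified sketch: the SnapshotDB class, its first-committer-wins check on written items only, and the withdraw helper are assumptions made for illustration, with one withdrawal from checking and one from savings so that the two transactions update different data items.

```python
class SnapshotDB:
    """Toy snapshot isolation: writes conflict only if they touch the same item."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {k: 0 for k in data}   # bumped on every committed write

    def begin(self):
        return {"snapshot": dict(self.data),
                "start": dict(self.version),
                "writes": {}}

    def commit(self, txn):
        # First-committer-wins: abort if an item we wrote changed after our snapshot.
        for item in txn["writes"]:
            if self.version[item] != txn["start"][item]:
                return False
        for item, value in txn["writes"].items():
            self.data[item] = value
            self.version[item] += 1
        return True

def withdraw(txn, account, amount):
    snap = txn["snapshot"]
    # The integrity check is evaluated against the transaction's own snapshot.
    if snap["checking"] + snap["savings"] - amount >= 0:
        txn["writes"][account] = snap[account] - amount

db = SnapshotDB({"checking": 100, "savings": 200})
t1 = db.begin()
t2 = db.begin()
withdraw(t1, "checking", 200)
withdraw(t2, "savings", 200)
ok1 = db.commit(t1)
ok2 = db.commit(t2)
total = db.data["checking"] + db.data["savings"]
```

Both commits succeed because the write sets are disjoint, yet the final combined balance is negative, which no serial execution of the two withdrawals could produce.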
17.10 Consider a database for an airline where the database system uses snapshot
isolation. Describe a particular scenario in which a nonserializable execution
occurs, but the airline may be willing to accept it in order to gain better overall
performance.
Answer:

Consider a web-based airline reservation system. There could be many
concurrent requests to see the list of available flights and available seats in each
flight and to book tickets. Suppose there are two users A and B concurrently
accessing this web application, and only one seat is left on a flight.
Suppose that both user A and user B execute transactions to book a seat on
the flight and suppose that each transaction checks the total number of seats
booked on the flight, and inserts a new booking record if there are enough seats
left. Let T3 and T4 be their respective booking transactions, which run
concurrently. Now T3 and T4 will see from their snapshots that one ticket is
available and will insert new booking records. Since the two transactions do not
update any common data item (tuple), snapshot isolation allows both
transactions to commit. This results in an extra booking, beyond the number of
seats available on the flight.
However, this situation is usually not very serious since cancellations
often resolve the conflict; even if the conflict is present at the time the flight
is to leave, the airline can arrange a different flight for one of the passengers
on the flight, giving incentives to accept the change. Using snapshot isolation
improves the overall performance in this case since the booking transactions
read the data from their snapshots only and do not block other concurrent
transactions.
17.11 The definition of a schedule assumes that operations can be totally ordered
by time. Consider a database system that runs on a system with multiple
processors, where it is not always possible to establish an exact ordering between
operations that executed on different processors. However, operations on a
data item can be totally ordered.
Does this situation cause any problem for the definition of conflict
serializability? Explain your answer.
Answer:

The given situation will not cause any problem for the definition of conflict
serializability since the ordering of operations on each data item is necessary
for conflict serializability, whereas the ordering of operations on different data
items is not important.

T1 T2
read(A)
read(B)
write(B)

For the above schedule to be conflict serializable, the only ordering
requirement is read(B) -> write(B). read(A) and read(B) can be in any order.
Therefore, as long as the operations on a data item can be totally ordered,
the definition of conflict serializability should hold on the given multiprocessor
system.
CHAPTER
18
Concurrency Control
Practice Exercises
18.1 Show that the two-phase locking protocol ensures conflict serializability and
that transactions can be serialized according to their lock points.
Answer:

Suppose two-phase locking does not ensure serializability. Then there exists a
set of transactions $T_0, T_1, \ldots, T_{n-1}$ which obey 2PL and which produce a
nonserializable schedule. A nonserializable schedule implies a cycle in the
precedence graph, and we shall show that 2PL cannot produce such cycles.
Without loss of generality, assume the following cycle exists in the precedence
graph: $T_0 \rightarrow T_1 \rightarrow T_2 \rightarrow \cdots \rightarrow T_{n-1} \rightarrow T_0$.
Let $\alpha_i$ be the time at which $T_i$ obtains its last lock (i.e., $T_i$'s lock point).
Then for all transactions such that $T_i \rightarrow T_j$, $\alpha_i < \alpha_j$.
Then for the cycle we have

$\alpha_0 < \alpha_1 < \alpha_2 < \cdots < \alpha_{n-1} < \alpha_0$

Since $\alpha_0 < \alpha_0$ is a contradiction, no such cycle can exist. Hence 2PL
cannot produce nonserializable schedules. Because of the property that for all
transactions such that $T_i \rightarrow T_j$, $\alpha_i < \alpha_j$, the lock point ordering
of the transactions is also a topological sort ordering of the precedence graph.
Thus transactions can be serialized according to their lock points.
18.2 Consider the following two transactions:

T34: read(A);
     read(B);
     if A = 0 then B := B + 1;
     write(B).

T35: read(B);
     read(A);
     if B = 0 then A := A + 1;
     write(A).

Add lock and unlock instructions to transactions T34 and T35 so that they
observe the two-phase locking protocol. Can the execution of these transactions
result in a deadlock?
Answer:

a. Lock and unlock instructions:

T34: lock-S(A)
     read(A)
     lock-X(B)
     read(B)
     if A = 0
     then B := B + 1
     write(B)
     unlock(A)
     unlock(B)

T35: lock-S(B)
     read(B)
     lock-X(A)
     read(A)
     if B = 0
     then A := A + 1
     write(A)
     unlock(B)
     unlock(A)

b. Execution of these transactions can result in deadlock. For example,
consider the following partial schedule:

T34             T35
lock-S(A)
                lock-S(B)
                read(B)
read(A)
lock-X(B)
                lock-X(A)

The transactions are now deadlocked.
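
The deadlock can be confirmed by replaying the lock requests of the partial schedule and looking for a cycle in the wait-for graph. This is a simplified sketch: the two transactions of this exercise are named T34 and T35 here, read operations are omitted since they do not affect locking, and find_deadlock is a hypothetical helper that assumes a blocked transaction issues no further requests.

```python
def find_deadlock(schedule):
    """Replay (txn, mode, item) lock requests; return the set of deadlocked transactions."""
    holders = {}     # item -> {txn: mode} for granted locks
    waits_for = {}   # txn -> set of transactions it is waiting for
    for txn, mode, item in schedule:
        held = holders.setdefault(item, {})
        # S is compatible only with S; anything involving X conflicts.
        blockers = {other for other, m in held.items()
                    if other != txn and (mode == "X" or m == "X")}
        if blockers:
            waits_for.setdefault(txn, set()).update(blockers)
        else:
            held[txn] = mode

    def reaches(start, target, seen):
        # Depth-first search along wait-for edges.
        for nxt in waits_for.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen and reaches(nxt, target, seen | {nxt}):
                return True
        return False

    return {t for t in waits_for if reaches(t, t, {t})}

schedule = [("T34", "S", "A"), ("T35", "S", "B"),
            ("T34", "X", "B"), ("T35", "X", "A")]
deadlocked = find_deadlock(schedule)
```

Each transaction waits for an exclusive lock on an item the other holds in shared mode, so the wait-for graph contains the cycle T34 -> T35 -> T34.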

18.3 What benefit does rigorous two-phase locking provide? How does it compare
with other forms of two-phase locking?
Answer:

Rigorous two-phase locking has the advantages of strict 2PL. In addition it has
the property that for two conflicting transactions, their commit order is their
serializability order. In some systems users might expect this behavior.
18.4 Consider a database organized in the form of a rooted tree. Suppose that we
insert a dummy vertex between each pair of vertices. Show that, if we follow
the tree protocol on the new tree, we get better concurrency than if we follow
the tree protocol on the original tree.
Answer:

Consider two nodes A and B, where A is a parent of B. Let dummy vertex D
be added between A and B. Consider a case where transaction T2 has a lock
on B, and T1, which has a lock on A, wishes to lock B, and T3 wishes to lock
A. With the original tree, T1 cannot release the lock on A until it gets the lock
on B. With the modified tree, T1 can get a lock on D and release the lock on
A, which allows T3 to proceed while T1 waits for T2. Thus, the protocol allows
locks on vertices to be released earlier to other transactions, instead of holding
them when waiting for a lock on a child.
A generalization of the idea based on edge locks is described in Buckley
and Silberschatz, “Concurrency Control in Graph Protocols by Using Edge
Locks,” Proc. ACM SIGACT-SIGMOD Symposium on the Principles of Database
Systems, 1984.

18.5 Show by example that there are schedules possible under the tree protocol that
are not possible under the two-phase locking protocol, and vice versa.
Answer:

Consider the tree-structured database graph given below.

A
|
B
|
C

Schedule possible under tree protocol but not under 2PL:

T1              T2
lock(A)
lock(B)
unlock(A)
                lock(A)
lock(C)
unlock(B)
                lock(B)
                unlock(A)
                unlock(B)
unlock(C)

Schedule possible under 2PL but not under tree protocol:

T1              T2
lock(A)
                lock(B)
lock(C)
                unlock(B)
unlock(A)
unlock(C)
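
Both claims can be checked mechanically. The sketch below assumes the tree is the chain A, B, C (A the root, C a child of B) and assumes a particular assignment of operations to T1 and T2, encoded explicitly in the two lists; the checker functions are hypothetical helpers that verify only the per-transaction rules of each protocol, not lock conflicts between transactions.

```python
def is_two_phase(ops):
    """True if no transaction requests a lock after its first unlock."""
    unlocked = set()
    for txn, action, _ in ops:
        if action == "unlock":
            unlocked.add(txn)
        elif txn in unlocked:
            return False
    return True

def follows_tree_protocol(ops, parent):
    """parent maps each node to its parent; the root maps to None."""
    held = {}          # txn -> items currently locked
    ever_locked = {}   # txn -> items locked at any point
    for txn, action, item in ops:
        mine = held.setdefault(txn, set())
        seen = ever_locked.setdefault(txn, set())
        if action == "lock":
            if item in seen:                        # an item may not be relocked
                return False
            if seen and parent[item] not in mine:   # later locks need the parent held
                return False
            mine.add(item)
            seen.add(item)
        else:
            mine.discard(item)
    return True

parent = {"A": None, "B": "A", "C": "B"}   # assumed chain A - B - C

tree_only = [("T1", "lock", "A"), ("T1", "lock", "B"), ("T1", "unlock", "A"),
             ("T2", "lock", "A"), ("T1", "lock", "C"), ("T1", "unlock", "B"),
             ("T2", "lock", "B"), ("T2", "unlock", "A"), ("T2", "unlock", "B"),
             ("T1", "unlock", "C")]

two_phase_only = [("T1", "lock", "A"), ("T2", "lock", "B"), ("T1", "lock", "C"),
                  ("T2", "unlock", "B"), ("T1", "unlock", "A"), ("T1", "unlock", "C")]
```

In the first schedule T1 locks C after unlocking A, which 2PL forbids; in the second T1 locks C without holding its parent B, which the tree protocol forbids.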

18.6 Locking is not done explicitly in persistent programming languages. Rather,
objects (or the corresponding pages) must be locked when the objects are
accessed. Most modern operating systems allow the user to set access protections
(no access, read, write) on pages, and memory accesses that violate the access
protections result in a protection violation (see the Unix mprotect system call,
for example). Describe how the access-protection mechanism can be used for
page-level locking in a persistent programming language.
Answer:

The access protection mechanism can be used to implement page-level
locking. Consider reads first. A process is allowed to read a page only after it
read-locks the page. This is implemented by using mprotect to initially turn off
read permissions to all pages, for the process. When the process tries to access an
address in a page, a protection violation occurs. The handler associated with
protection violation then requests a read lock on the page, and after the lock
is acquired, it uses mprotect to allow read access to the page by the process,
and finally allows the process to continue. Write access is handled similarly.
18.7 Consider a database system that includes an atomic increment operation, in
addition to the read and write operations. Let V be the value of data item X.
The operation

increment(X) by C

sets the value of X to V + C in an atomic step. The value of X is not available
to the transaction unless the latter executes a read(X).
Assume that increment operations lock the item in increment mode using the
compatibility matrix in Figure 18.25.
a. Show that, if all transactions lock the data that they access in the
corresponding mode, then two-phase locking ensures serializability.
b. Show that the inclusion of increment mode locks allows for increased
concurrency.

Answer:

a. Serializability can be shown by observing that if two transactions have an
I mode lock on the same item, the increment operations can be swapped,
just like read operations. However, any pair of conflicting operations
must be serialized in the order of the lock points of the corresponding
transactions, as shown in Exercise 18.1.
b. The increment lock mode being compatible with itself allows multiple
incrementing transactions to take the lock simultaneously, thereby
improving the concurrency of the protocol. In the absence of this mode, an
exclusive mode will have to be taken on a data item by each transaction
that wants to increment the value of this data item. An exclusive lock
being incompatible with itself adds to the lock waiting time and obstructs
the overall progress of the concurrent schedule.
In general, increasing the true entries in the compatibility matrix
increases the concurrency and improves the throughput.

The proof is in Korth, “Locking Primitives in a Database System,” Journal of
the ACM Volume 30, (1983).
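
The effect of the extra lock mode can be illustrated with a small lock-grant check. The matrix below is an assumption: the answer states only that increment mode is compatible with itself, and the remaining entries follow the usual shared/exclusive rules with I incompatible with both S and X. The names COMPATIBLE and can_grant are introduced here for illustration.

```python
# Assumed compatibility matrix: (held mode, requested mode) -> compatible?
COMPATIBLE = {
    ("S", "S"): True,  ("S", "X"): False, ("S", "I"): False,
    ("X", "S"): False, ("X", "X"): False, ("X", "I"): False,
    ("I", "S"): False, ("I", "X"): False, ("I", "I"): True,
}

def can_grant(requested, held_modes):
    """A lock is granted only if it is compatible with every lock already held."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

# Two transactions already hold increment locks on X; a third asks to increment.
with_increment_mode = can_grant("I", ["I", "I"])
# Without increment mode, each incrementer would need an exclusive lock instead.
without_increment_mode = can_grant("X", ["X"])
```

The third incrementer proceeds immediately under I mode but would have to wait if increments were implemented with exclusive locks.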

18.8 In timestamp ordering, W-timestamp(Q) denotes the largest timestamp of any
transaction that executed write(Q) successfully. Suppose that, instead, we
defined it to be the timestamp of the most recent transaction to execute write(Q)
successfully. Would this change in wording make any difference? Explain your
answer.
Answer:

It would make no difference. The write protocol is such that the most recent
transaction to write an item is also the one with the largest timestamp to have
done so.
18.9 Use of multiple-granularity locking may require more or fewer locks than an
equivalent system with a single lock granularity. Provide examples of both
situations, and compare the relative amount of concurrency allowed.
Answer:

If a transaction needs to access a large set of items, multiple granularity
locking requires fewer locks, whereas if only one item needs to be accessed, the
single lock granularity system allows this with just one lock. Because all the
desired data items are locked and unlocked together in the multiple granularity
scheme, the locking overhead is low, but concurrency is also reduced.
18.10 For each of the following protocols, describe aspects of practical applications
that would lead you to suggest using the protocol, and aspects that would
suggest not using the protocol:
• Two-phase locking
• Two-phase locking with multiple-granularity locking
• The tree protocol
• Timestamp ordering
• Validation
• Multiversion timestamp ordering
• Multiversion two-phase locking

Answer:

• Two-phase locking: Use for simple applications where a single granularity
is acceptable. If there are large read-only transactions, multiversion
protocols would do better. Also, if deadlocks must be avoided at all costs, the
tree protocol would be preferable.
• Two-phase locking with multiple granularity locking: Use for an
application mix where some applications access individual records and others
access whole relations or substantial parts thereof. The drawbacks of 2PL
mentioned above also apply to this one.
• The tree protocol: Use if all applications tend to access data items in an
order consistent with a particular partial order. This protocol is free of
deadlocks, but transactions will often have to lock unwanted nodes in
order to access the desired nodes.
• Timestamp ordering: Use if the application demands a concurrent
execution that is equivalent to a particular serial ordering (say, the order of
arrival), rather than any serial ordering. But conflicts are handled by
rollback of transactions rather than waiting, and schedules are not
recoverable. To make them recoverable, additional overheads and increased
response time have to be tolerated. Not suitable if there are long read-only
transactions, since they will starve. Deadlocks are absent.
• Validation: If the probability that two concurrently executing transactions
conflict is low, this protocol can be used advantageously to get better
concurrency and good response times with low overheads. Not suitable under
high contention, when a lot of wasted work will be done.
• Multiversion timestamp ordering: Use if timestamp ordering is
appropriate but it is desirable for read requests to never wait. Shares the other
disadvantages of the timestamp ordering protocol.
• Multiversion two-phase locking: This protocol allows read-only
transactions to always commit without ever waiting. Update transactions follow
2PL, thus allowing recoverable schedules with conflicts solved by waiting
rather than rollback. But the problem of deadlocks comes back, though
read-only transactions cannot get involved in them. Keeping multiple
versions adds space and time overheads, though; therefore plain 2PL may be
preferable in low-conflict situations.

18.11 Explain why the following technique for transaction execution may provide
better performance than just using strict two-phase locking: First execute the
transaction without acquiring any locks and without performing any writes
to the database as in the validation-based techniques, but unlike the validation
techniques do not perform either validation or writes on the database. Instead,
rerun the transaction using strict two-phase locking. (Hint: Consider waits for
disk I/O.)
Answer:

A transaction waits on (a) disk I/O and (b) lock acquisition. Transactions
generally wait on disk reads and not on disk writes, as disk writes are handled
by the buffering mechanism in asynchronous fashion and transactions update
only the in-memory copy of the disk blocks.
The technique proposed essentially separates the waiting times into two
phases. The first phase, where the transaction is executed without acquiring any
locks and without performing any writes to the database, accounts for almost
all the waiting time on disk I/O, as it reads all the data blocks it needs from
disk if they are not already in memory. The second phase, the transaction
reexecution with strict two-phase locking, accounts for all the waiting time on
acquiring locks. The second phase may, though rarely, involve a small waiting
time on disk I/O if a disk block that the transaction needs is evicted from memory
(by the buffer manager) before the second phase starts.
The technique may increase concurrency, as transactions spend almost no
time on disk I/O with locks held and hence locks are held for a shorter time.
In the first phase, the transaction reads all the data items required (and not
already in memory) from disk. The locks are acquired in the second phase
and the transaction does almost no disk I/O in this phase. Thus the transaction
avoids spending time in disk I/O with locks held.
The technique may even increase disk throughput, as the disk I/O is not
stalled for want of a lock. Consider the following scenario with the strict
two-phase locking protocol: A transaction is waiting for a lock, the disk is idle,
and there are some items to be read from disk. In such a situation, disk bandwidth
is wasted. But in the proposed technique, the transaction will read all the required
items from the disk without acquiring any lock, and the disk bandwidth may
be properly utilized.
Note that the proposed technique is most useful if the computation involved
in the transactions is small and most of the time is spent in disk I/O and waiting
on locks, as is usually the case in disk-resident databases. If the transaction is
computation intensive, there may be wasted work. An optimization is to save
the updates of transactions in a temporary buffer, and instead of reexecuting
the transaction, to compare the data values of items when they are locked with
the values used earlier. If the two values are the same for all items, then the
buffered updates of the transaction are executed, instead of reexecuting the
entire transaction.
18.12 Consider the timestamp-ordering protocol, and two transactions, one that
writes two data items p and q, and another that reads the same two data items.
Give a schedule whereby the timestamp test for a write operation fails and
causes the first transaction to be restarted, in turn causing a cascading abort
of the other transaction. Show how this could result in starvation of both
transactions. (Such a situation, where two or more processes carry out actions, but
are unable to complete their task because of interaction with the other
processes, is called a livelock.)
Answer:

Consider two transactions T1 and T2 shown below.

T1              T2
write(p)
                read(p)
                read(q)
write(q)

Let TS(T1) < TS(T2), and let the timestamp test at each operation except
write(q) be successful. When transaction T1 does the timestamp test for
write(q), it finds that TS(T1) < R-timestamp(q), since TS(T1) < TS(T2) and
R-timestamp(q) = TS(T2). Hence the write operation fails, and transaction T1
rolls back. The cascading results in transaction T2 also being rolled back, as it
uses the value for item p that is written by transaction T1.
If this scenario is exactly repeated every time the transactions are restarted,
this could result in starvation of both transactions.
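
The failing timestamp test can be reproduced with a minimal model of the basic timestamp-ordering rules. The TimestampOrdering class and Rollback exception below are hypothetical names for a sketch that tracks only R-timestamp and W-timestamp per item; it assumes TS(T1) = 1 and TS(T2) = 2, and models neither the cascading abort nor the restart.

```python
class Rollback(Exception):
    """Raised when an operation fails its timestamp test."""

class TimestampOrdering:
    def __init__(self):
        self.r_ts = {}   # item -> largest timestamp of a successful read
        self.w_ts = {}   # item -> largest timestamp of a successful write

    def read(self, ts, item):
        # A read is rejected if a younger transaction already wrote the item.
        if ts < self.w_ts.get(item, 0):
            raise Rollback()
        self.r_ts[item] = max(self.r_ts.get(item, 0), ts)

    def write(self, ts, item):
        # A write is rejected if a younger transaction already read or wrote the item.
        if ts < self.r_ts.get(item, 0) or ts < self.w_ts.get(item, 0):
            raise Rollback()
        self.w_ts[item] = ts

# The schedule of the answer, with TS(T1) = 1 and TS(T2) = 2 (assumed values).
schedule = [(1, "write", "p"), (2, "read", "p"), (2, "read", "q"), (1, "write", "q")]
scheduler = TimestampOrdering()
failed = None
for ts, op, item in schedule:
    try:
        getattr(scheduler, op)(ts, item)
    except Rollback:
        failed = (ts, op, item)
        break
```

The first three operations pass their tests, and the final write(q) by T1 is rejected because R-timestamp(q) was already set to TS(T2) by T2's read.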
18.13 Devise a timestamp-based protocol that avoids the phantom phenomenon.
Answer:

In the text, we considered two approaches to dealing with the phantom
phenomenon by means of locking. The coarser granularity approach obviously
works for timestamps as well. The B+-tree index-based approach can be
adapted to timestamping by treating index buckets as data items with
timestamps associated with them, and requiring that all read accesses use an index.
We now show that this simple method works. Suppose a transaction $T_i$ wants
to access all tuples with a particular range of search key values, using a
B+-tree index on that search key. $T_i$ will need to read all the buckets in that index
which have key values in that range. It can be seen that any delete or insert of
a tuple with a key value in the same range will need to write one of the index
buckets read by $T_i$. Thus the logical conflict is converted to a conflict on an
index bucket, and the phantom phenomenon is avoided.


18.14 Suppose that we use the tree protocol of Section 18.1.5 to manage concurrent
access to a B+-tree. Since a split may occur on an insert that affects the root, it
appears that an insert operation cannot release any locks until it has completed
the entire operation. Under what circumstances is it possible to release a lock
earlier?
Answer:

Note: The tree protocol of Section 18.1.5, which is referred to in this
question, is different from the multigranularity protocol of Section 18.3 and
the B+-tree concurrency protocol of Section 18.10.2.
One strategy for early lock releasing is given here. Going down the tree from
the root, if the currently visited node's child is not full, release locks held on
all nodes except the current node, then request an X-lock on the child node.
After getting it, release the lock on the current node, and then descend to the
child. On the other hand, if the child is full, retain all locks held, request an
X-lock on the child, and descend to it after getting the lock. On reaching the
leaf node, start the insertion procedure. This strategy results in holding locks
only on the full index tree nodes from the leaf upward, until and including the
first non-full node.
An optimization to the above strategy is possible. Even if the current node's
child is full, we can still release the locks on all nodes but the current one. But
after getting the X-lock on the child node, we split it right away. Releasing the
lock on the current node and retaining just the lock on the appropriate split
child, we descend into it, making it the current node. With this optimization,
at any time at most two locks are held, on a parent and a child node.
18.15 The snapshot isolation protocol uses a validation step which, before
performing a write of a data item by transaction T, checks if a transaction concurrent
with T has already written the data item.
a. A straightforward implementation uses a start timestamp and a commit
timestamp for each transaction, in addition to an update set, that is, the
set of data items updated by the transaction. Explain how to perform
validation for the first-committer-wins scheme by using the transaction
timestamps along with the update sets. You may assume that validation
and other commit processing steps are executed serially, that is, for one
transaction at a time.
b. Explain how the validation step can be implemented as part of commit
processing for the first-committer-wins scheme, using a modification of
the above scheme, where instead of using update sets, each data item
has a write timestamp associated with it. Again, you may assume that
validation and other commit processing steps are executed serially.
c. The first-updater-wins scheme can be implemented using timestamps as
described above, except that validation is done immediately after
acquiring an exclusive lock, instead of being done at commit time.
i. Explain how to assign write timestamps to data items to implement
the first-updater-wins scheme.
ii. Show that as a result of locking, if the validation is repeated at
commit time the result would not change.
iii. Explain why there is no need to perform validation and other commit
processing steps serially in this case.

Answer:

a. Validation test for first-committer-wins scheme: Let StartTS(Ti) and
CommitTS(Ti) be the timestamps associated with a transaction Ti,
and let the update set for Ti be update_set(Ti). Then for all transactions Tk
with CommitTS(Tk) < CommitTS(Ti), one of the following two
conditions must hold:
• CommitTS(Tk) < StartTS(Ti): Tk completes its execution before
Ti started, so serializability is maintained.
• StartTS(Ti) < CommitTS(Tk) < CommitTS(Ti), and update_set(Ti)
and update_set(Tk) do not intersect.
b. Validation test for first-committer-wins scheme with W-timestamps for
data items: If a transaction Ti writes a data item Q, then
W-timestamp(Q) is set to CommitTS(Ti). For the validation test of a
transaction Ti to pass, the following condition must hold:
• For each data item Q written by Ti, W-timestamp(Q) < StartTS(Ti).
c. First-updater-wins scheme:
i. For a data item Q written by Ti, the W-timestamp is assigned the
timestamp when the write occurred in Ti.
ii. Since the validation is done after acquiring the exclusive locks and
the exclusive locks are held till the end of the transaction, the data
item cannot be modified in between the lock acquisition and commit
time. So, the result of the validation test for a transaction would be
the same at the commit time as that at the update time.
iii. Because of the exclusive locking, at most one transaction can acquire
the lock on a data item at a time and do the validation testing. Thus,
two or more transactions cannot do validation testing for the same
data item simultaneously.
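
The two first-committer-wins tests in parts (a) and (b) can be sketched as follows. This is an illustration under stated assumptions, not an implementation from the text: transactions are plain dictionaries, the function names are hypothetical, timestamps are integers, and validation is assumed to run serially so that every transaction in the committed list has a smaller commit timestamp than the one being validated.

```python
def validate_with_update_sets(txn, committed):
    """Part (a): compare update sets against concurrent committed transactions.

    txn is {"start": int, "updates": set}; each entry of committed is
    {"commit": int, "updates": set}.
    """
    for other in committed:
        if other["commit"] < txn["start"]:
            continue  # finished before txn started, so not concurrent
        if txn["updates"] & other["updates"]:
            return False  # a concurrent transaction committed a write first
    return True


def validate_with_write_timestamps(txn, w_timestamp):
    """Part (b): every item txn wrote must have W-timestamp < StartTS(txn)."""
    return all(w_timestamp.get(q, 0) < txn["start"] for q in txn["updates"])


def commit(txn, commit_ts, w_timestamp):
    """After a successful validation, stamp each written item (part b)."""
    for q in txn["updates"]:
        w_timestamp[q] = commit_ts
```

For example, with committed transactions that wrote x at time 5 and y at time 12, a transaction that started at time 10 passes both tests if it wrote only x and fails both if it wrote y.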
18.16 Consider functions insert_latchfree() and delete_latchfree(), shown in Figure
18.23.
a. Explain how the ABA problem can occur if a deleted node is reinserted.
b. Suppose that adjacent to head we store a counter cnt. Also suppose that
DCAS((head, cnt), (oldhead, oldcnt), (newhead, newcnt)) atomically
performs a compare-and-swap on the 128-bit value (head, cnt). Modify
insert_latchfree() and delete_latchfree() to use the DCAS operation to avoid
the ABA problem.
c. Since most processors use only 48 bits of a 64-bit address to actually
address memory, explain how the other 16 bits can be used to implement
a counter, in case the DCAS operation is not supported.

Answer:

a. Let the head of the list be pointer n1, and the next two elements be n2
and n3. Suppose process P1, which is performing a delete, reads pointer
n1 as head and n2 as newhead, but before it executes CAS(head, n1, n2),
process P2 deletes n1, then deletes n2, and then inserts n1 back at the
head.
The CAS would replace n1 by a pointer to n2, since the head is still
n1. However, node n2 has meanwhile been deleted and is garbage. Thus,
the list is now inconsistent.


b. The modified code is as follows.

    atomic_read(head, cnt) {
        repeat
            oldhead = head
            oldcnt = cnt
            result = DCAS((head, cnt), (oldhead, oldcnt), (oldhead, oldcnt))
        until (result == success)
        return (oldhead, oldcnt)
    }

    insert_latchfree(head, value) {
        node = new node
        node->value = value
        repeat
            (oldhead, oldcnt) = atomic_read(head, cnt)
            node->next = oldhead
            newcnt = oldcnt + 1
            result = DCAS((head, cnt), (oldhead, oldcnt), (node, newcnt))
        until (result == success)
    }

    delete_latchfree(head) {
        /* This function is not quite safe; see explanation in text. */
        repeat
            (oldhead, oldcnt) = atomic_read(head, cnt)
            newhead = oldhead->next
            newcnt = oldcnt + 1
            result = DCAS((head, cnt), (oldhead, oldcnt), (newhead, newcnt))
        until (result == success)
    }


The atomic_read function ensures that the 128-bit (address, counter) pair is
read atomically, by using the DCAS instruction to ensure that the values
are still the same (the DCAS instruction stores the same values back if it
succeeds, so there is no change in the value). If the DCAS fails, we may
have read an old pointer and a new value, or vice versa, requiring the
values to be read again.
The ABA problem would be avoided by the modified code for
insert_latchfree() and delete_latchfree(), since although the reinsert of
n1 by P2 would result in the head having the same pointer n1 as earlier,
counter cnt would be different from oldcnt, resulting in the CAS
operation of P1 failing.
c. Most processors use only the last 48 bits of a 64-bit address to access
memory (which can support 256 terabytes of memory). The first 16 bits
of a 64-bit value can then be used as a counter, and the last 48 bits as
the address, with the counter and the address extracted using bit-and
operations before being used, and using bit-and and bit-or operations to
reconstruct the 64-bit value from a pointer and a counter. If a hardware
implementation does not support DCAS, this could be used as an
alternative to a DCAS, although it still runs a small risk of the counter
wrapping around if there are exactly 64K other operations on the list
between the read of the head and the CAS operation.
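
The packing described in part (c) can be illustrated with plain integer arithmetic. This is a sketch only: the names pack and unpack and the constants are hypothetical, Python integers stand in for 64-bit machine words, and shifts are used alongside bit-and and bit-or to move the counter into the upper 16 bits.

```python
ADDR_BITS = 48
ADDR_MASK = (1 << ADDR_BITS) - 1  # low 48 bits hold the address
CNT_MASK = (1 << 16) - 1          # 16-bit counter, wraps at 65536


def pack(addr, cnt):
    """Build one 64-bit word: counter in the top 16 bits, address below."""
    return ((cnt & CNT_MASK) << ADDR_BITS) | (addr & ADDR_MASK)


def unpack(word):
    """Recover (address, counter) from a packed word."""
    return word & ADDR_MASK, (word >> ADDR_BITS) & CNT_MASK


# ABA scenario from part (a): P1 reads the head when the counter is 0.
n1 = 0x7FFF12345678
seen_by_p1 = pack(n1, 0)
# P2 deletes n1, deletes n2, reinserts n1: three successful updates.
current_head = pack(n1, 3)
# A single-word CAS on the packed value now fails, as desired.
cas_would_succeed = (current_head == seen_by_p1)
```

The wraparound risk mentioned above is visible here as well: pack(n1, 65536) equals pack(n1, 0), so exactly 64K intervening updates would make the stale word match again.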
CHAPTER
19
Recovery System
Practice Exercises
19.1 Explain why log records for transactions on the undo-list must be processed in
reverse order, whereas redo is performed in a forward direction.
Answer:
Within a single transaction in undo-list, suppose a data item is updated more
than once, say from 1 to 2, and then from 2 to 3. If the undo log records are
processed in forward order, the final value of the data item will be incorrectly
set to 2, whereas by processing them in reverse order, the value is set to 1. The
same logic also holds for data items updated by more than one transaction on
undo-list.
Using the same example as above, but assuming the transaction committed,
it is easy to see that if redo processing processes the records in forward order,
the final value is set correctly to 3, but if done in reverse order, the final value
is set incorrectly to 2.
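
The example above can be checked directly. The following is a minimal sketch in which a log record is a hypothetical (item, old value, new value) tuple and the database is a dictionary; it is not the recovery algorithm of the text, only the ordering argument.

```python
# Two updates to X by one transaction: 1 -> 2, then 2 -> 3.
log = [("X", 1, 2), ("X", 2, 3)]


def undo(db, records, reverse=True):
    """Restore old values; the correct order is newest record first."""
    order = reversed(records) if reverse else records
    for item, old, _new in order:
        db[item] = old
    return db


def redo(db, records, forward=True):
    """Reapply new values; the correct order is oldest record first."""
    order = records if forward else reversed(records)
    for item, _old, new in order:
        db[item] = new
    return db
```

Undoing in reverse order leaves X at 1, while undoing in forward order leaves it at 2. Redoing in forward order leaves X at 3, while redoing in reverse order leaves it at 2.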
19.2 Explain the purpose of the checkpoint mechanism. How often should
checkpoints be performed? How does the frequency of checkpoints affect:
• System performance when no failure occurs?
• The time it takes to recover from a system crash?
• The time it takes to recover from a media (disk) failure?

Answer:

Checkpointing is done with log-based recovery schemes to reduce the time
required for recovery after a crash. If there is no checkpointing, then the entire
log must be searched after a crash, and all transactions must be undone/redone
from the log. If checkpointing is performed, then most of the log records prior
to the checkpoint can be ignored at the time of recovery.
Another reason to perform checkpoints is to clear log records from stable
storage as it gets full.
Since checkpoints cause some loss in performance while they are being
taken, their frequency should be reduced if fast recovery is not critical. If we
need fast recovery, checkpointing frequency should be increased. If the amount
of stable storage available is small, frequent checkpointing is unavoidable.
Checkpoints have no effect on recovery from a disk crash; archival dumps
are the equivalent of checkpoints for recovery from disk crashes.
19.3 Some database systems allow the administrator to choose between two forms
of logging: normal logging, used to recover from system crashes, and archival
logging, used to recover from media (disk) failure. When can a log record be
deleted, in each of these cases, using the recovery algorithm of Section 19.4?
Answer:
Normal logging: The following log records cannot be deleted, since they may
be required for recovery:

a. Any log record corresponding to a transaction which was active during
the most recent checkpoint (i.e., which is part of the <checkpoint L>
entry)
b. Any log record corresponding to transactions started after the most recent
checkpoint

All other log records can be deleted. After each checkpoint, more records
become candidates for deletion as per the above rule.
Deleting a log record while retaining an earlier log record would result in
gaps in the log and would require more complex log processing. Therefore in
practice, systems find a point in the log where all earlier log records can be
deleted, and they delete that part of the log. Often, the log is broken up into
multiple files, and a file is deleted when all log records in the file can be deleted.

Archival logging: Archival logging retains log records that may be needed for
recovery from media failure (such as disk crashes). Archival dumps are the
equivalent of checkpoints for recovery from media failure. The preceding
rules for deletion can be used for archival logs, but based on the last archival
dump instead of the last checkpoint. The frequency of archival dumps would
be less than that of checkpointing, since a lot of data have to be written. Thus more
log records would need to be retained with archival logging.
19.4 Describe how to modify the recovery algorithm of Section 19.4 to implement
savepoints and to perform rollback to a savepoint. (Savepoints are described
in Section 19.9.3.)
Answer:
A savepoint can be performed as follows:

a. Output onto stable storage all log records for that transaction which are
currently in main memory.
b. Output onto stable storage a log record of the form <savepoint Ti>, where
Ti is the transaction identifier.

To roll back a currently executing transaction partially to a particular
savepoint, execute undo processing for that transaction until the savepoint is
reached. Redo log records are generated as usual during the undo phase above.
It is possible to perform repeated undo to a single savepoint by writing a fresh
savepoint record after rolling back to that savepoint. The above algorithm can
be extended to support multiple savepoints for a single transaction by giving
each savepoint a name. However, once undo has rolled back past a savepoint,
it is no longer possible to undo up to that savepoint.
19.5 Suppose the deferred modification technique is used in a database.

a. Is the old value part of an update log record required any more? Why or
why not?
b. If old values are not stored in update log records, transaction undo is
clearly not feasible. How would the redo phase of recovery have to be
modified as a result?
c. Deferred modification can be implemented by keeping updated data
items in local memory of transactions and reading data items that have
not been updated directly from the database buffer. Suggest how to
efficiently implement a data item read, ensuring that a transaction sees its
own updates.
d. What problem would arise with the above technique if transactions
perform a large number of updates?

Answer:

a. The old-value part of an update log record is not required. If the
transaction has committed, then the old value is no longer necessary as there
would be no need to undo the transaction. And if the transaction was
active when the system crashed, the old values are still safe in the stable
storage because they haven't been modified yet.
b. During the redo phase, the undo list need not be maintained any more,
since the stable storage does not reflect updates due to any uncommitted
transaction.
c. A data item read will first issue a read request on the local memory of
the transaction. If it is found there, it is returned. Otherwise, the item is
loaded from the database buffer into the local memory of the transaction
and then returned.
d. If a single transaction performs a large number of updates, there is a
possibility of the transaction running out of memory to store the local
copies of the data items.
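
The read path of part (c) and the memory concern of part (d) can be sketched as follows. This is an illustrative sketch, not the book's design: the class DeferredTxn and its methods are hypothetical names, a dictionary stands in for the database buffer, and a hash table is assumed for the local memory so that each lookup takes expected constant time.

```python
class DeferredTxn:
    """A transaction whose updates stay private until commit."""

    def __init__(self, buffer):
        self.buffer = buffer  # stand-in for the shared database buffer
        self.local = {}       # private copies: written items and cached reads
        self.dirty = set()    # keys this transaction has actually updated

    def read(self, key):
        # Check local memory first so the transaction sees its own updates.
        if key not in self.local:
            self.local[key] = self.buffer[key]  # load and keep a local copy
        return self.local[key]

    def write(self, key, value):
        self.local[key] = value  # the buffer is untouched until commit
        self.dirty.add(key)

    def commit(self):
        for key in self.dirty:
            self.buffer[key] = self.local[key]

    def local_copies(self):
        # Grows with every distinct item touched: the concern in part (d).
        return len(self.local)
```

A transaction that writes A and then reads it back gets its own value, while the buffer keeps the old value until commit.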
19.6 The shadow-paging scheme requires the page table to be copied. Suppose the
page table is represented as a B+-tree.
a. Suggest how to share as many nodes as possible between the new copy
and the shadow copy of the B+-tree, assuming that updates are made
only to leaf entries, with no insertions or deletions.
b. Even with the above optimization, logging is much cheaper than a
shadow copy scheme, for transactions that perform small updates.
Explain why.

Answer:

a. To begin with, we start with the copy of just the root node pointing to
the shadow copy. As modifications are made, the leaf entry where the
modification is made and all the nodes in the path from that leaf node
to the root are copied and updated. All other nodes are shared.
b. For transactions that perform small updates, the shadow-paging scheme
would copy multiple pages for a single update, even with the above
optimization. Logging, on the other hand, just requires small records to
be created for every update; the log records are physically together in
one page or a few pages, and thus only a few log page I/O operations
are required to commit a transaction. Furthermore, the log pages
written out across subsequent transaction commits are likely to be adjacent
physically on disk, minimizing disk arm movement.
19.7 Suppose we (incorrectly) modify the recovery algorithm of Section 19.4 to
not log actions taken during transaction rollback. When recovering from a
system crash, transactions that were rolled back earlier would then be included
in undo-list and rolled back again. Give an example to show how actions taken
during the undo phase of recovery could result in an incorrect database state.
(Hint: Consider a data item updated by an aborted transaction and then
updated by a transaction that commits.)
Answer:

Consider the following log records generated with the (incorrectly) modified
recovery algorithm:

1. <T1 start>
2. <T1, A, 1000, 900>
3. <T2 start>
4. <T2, A, 1000, 2000>
5. <T2 commit>

A rollback actually happened between steps 2 and 3, but there are no log
records reflecting the same. Now, this log data is processed by the recovery
algorithm. At the end of the redo phase, T1 would get added to the undo-list,
and the value of A would be 2000. During the undo phase, since T1 is present
in the undo-list, the recovery algorithm does an undo of statement 2, and A
takes the value 1000. The update made by T2, though committed, is lost.
The correct sequence of logs is as follows:

1. <T1 start>
2. <T1, A, 1000, 900>
3. <T1, A, 1000>
4. <T1 abort>
5. <T2 start>
6. <T2, A, 1000, 2000>
7. <T2 commit>

This would make sure that T1 would not get added to the undo-list after the
redo phase.
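
Both logs can be run through a toy version of the redo and undo phases. This is a simplified sketch with several assumptions that are not in the text: log records are hypothetical tuples, there are no checkpoints, the redo pass starts from the beginning of the log, and the undo pass writes no further log records.

```python
def recover(log):
    """Replay the whole log (redo), then undo transactions left on undo_list."""
    db = {}
    undo_list = set()
    for rec in log:  # redo phase: forward scan
        kind = rec[0]
        if kind == "start":
            undo_list.add(rec[1])
        elif kind == "update":        # ("update", txn, item, old, new)
            db[rec[2]] = rec[4]
        elif kind == "redo_only":     # ("redo_only", txn, item, value)
            db[rec[2]] = rec[3]
        elif kind in ("commit", "abort"):
            undo_list.discard(rec[1])
    for rec in reversed(log):  # undo phase: backward scan
        if rec[0] == "update" and rec[1] in undo_list:
            db[rec[2]] = rec[3]       # restore the old value
    return db


# Log written without rollback records (the incorrect modification).
bad_log = [
    ("start", "T1"),
    ("update", "T1", "A", 1000, 900),
    ("start", "T2"),
    ("update", "T2", "A", 1000, 2000),
    ("commit", "T2"),
]

# Log with the redo-only record and the abort record for T1.
good_log = [
    ("start", "T1"),
    ("update", "T1", "A", 1000, 900),
    ("redo_only", "T1", "A", 1000),
    ("abort", "T1"),
    ("start", "T2"),
    ("update", "T2", "A", 1000, 2000),
    ("commit", "T2"),
]
```

With bad_log the undo phase overwrites the committed update of T2 and A ends at 1000; with good_log the undo list is empty after the redo phase and A ends at 2000.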
19.8 Disk space allocated to a file as a result of a transaction should not be released
even if the transaction is rolled back. Explain why, and explain how ARIES
ensures that such actions are not rolled back.
Answer:
If a transaction allocates a page to a relation, even if the transaction is rolled
back, the page allocation should not be undone because other transactions
may have stored records in the same page. Such operations that should not
be undone are called nested top actions in ARIES. They can be modeled as
operations whose undo action does nothing. In ARIES such operations are
implemented by creating a dummy CLR whose UndoNextLSN is set such that
the transaction rollback skips the log records generated by the operation.
19.9 Suppose a transaction deletes a record, and the free space generated thus is
allocated to a record inserted by another transaction, even before the first
transaction commits.
a. What problem can occur if the first transaction needs to be rolled back?
b. Would this problem be an issue if page-level locking is used instead of
tuple-level locking?
c. Suggest how to solve this problem while supporting tuple-level locking,
by logging post-commit actions in special log records, and executing
them after commit. Make sure your scheme ensures that such actions
are performed exactly once.

Answer:

a. If the first transaction needs to be rolled back, the tuple deleted by that
transaction will have to be restored. If undo is performed in the usual
physical manner using the old values of data items, the space allocated to
the new tuple would get overwritten by the transaction undo, damaging
the new tuples and associated data structures on the disk block. This
means that a logical undo operation has to be performed, i.e., an insert
has to be performed to undo the delete, which complicates recovery.
On a related note, if the second transaction inserts with the same key,
integrity constraints might be violated on rollback.
b. If page-level locking is used, the free space generated by the first
transaction is not allocated to another transaction till the first one commits.
So this problem will not be an issue if page-level locking is used.
c. The problem can be solved by deferring freeing of space until after the
transaction commits. To ensure that space will be freed even if there is
a system crash immediately after commit, the commit log record can be
modified to contain information about freeing of space (and other
similar operations) which must be performed after commit. The execution
of these operations can be performed as a transaction and log records
generated, followed by a post-commit log record which indicates that
post-commit processing has been completed for the transaction.
During recovery, if a commit log record is found with post-commit
actions, but no post-commit log record is found, the effects of any partial
execution of post-commit operations are rolled back during recovery,
and the post-commit operations are reexecuted at the end of recovery.
If the post-commit log record is found, the post-commit actions are not
reexecuted. Thus, the actions are guaranteed to be executed exactly once.
The problem of clashes on primary key values can be solved by
holding key-level locks so that no other transaction can use the key until the
first transaction completes.

19.10 Explain the reasons why recovery of interactive transactions is more difficult
to deal with than is recovery of batch transactions. Is there a simple way to deal
with this difficulty? (Hint: Consider an automatic teller machine transaction
in which cash is withdrawn.)
Answer:
Interactive transactions are more difficult to recover from than batch
transactions because some actions may be irrevocable. For example, an output (write)
statement may have fired a missile or caused a bank machine to give money to
a customer. The best way to deal with this is to try to do all output statements
at the end of the transaction. That way if the transaction aborts in the middle,
no harm will have been done.
Output operations should ideally be done atomically; for example, ATM
machines often count out notes and deliver all the notes together instead of
delivering notes one at a time. If output operations cannot be done atomically,
a physical log of output operations, such as a disk log of events, or even a video
log of what happened in the physical world, can be maintained to allow
recovery to be performed manually later, for example, by crediting cash back
to a customer's account.
19.11 Sometimes a transaction has to be undone after it has committed because it
was erroneously executed, for example, because of erroneous input by a bank
teller.

a. Give an example to show that using the normal transaction undo
mechanism to undo such a transaction could lead to an inconsistent state.
b. One way to handle this situation is to bring the whole database to a state
prior to the commit of the erroneous transaction (called point-in-time
recovery). Transactions that committed later have their effects rolled back
with this scheme.
Suggest a modification to the recovery algorithm of Section 19.4 to
implement point-in-time recovery using database dumps.
c. Later nonerroneous transactions can be reexecuted logically, if the
updates are available in the form of SQL, but cannot be reexecuted using
their log records. Why?

Answer:

a. Consider a bank account A with balance $100. Consider two
transactions T1 and T2, each depositing $10 in the account. Thus the
balance would be $120 after both these transactions are executed. Let the
transactions execute in sequence: T1 first and then T2. The log records
corresponding to the updates of A by transactions T1 and T2 would be
<T1, A, 100, 110> and <T2, A, 110, 120> respectively.
Say we wish to undo transaction T1. The normal transaction undo
mechanism will replace the value in question (A in this example) with
the old-value field in the log record. Thus if we undo transaction T1 using
the normal transaction undo mechanism, the resulting balance will be
$100 and we will, in effect, undo both transactions, whereas we intend
to undo only transaction T1.

b. Let the erroneous transaction be Te.
• Identify the latest archival dump, say D, before the log record
<Te, START>. Restore the database using the dump.
• Redo all log records starting from the dump D to the log record
<Te, COMMIT>. Some transactions, apart from transaction Te,
would be active at the commit time of transaction Te. Let S1 be the
set of such transactions.
• Roll back Te and the transactions in the set S1. This completes
point-in-time recovery.
In case logical redo is possible, later transactions can be
reexecuted logically, assuming log records containing logical redo
information were written for every transaction. To perform logical
redo of later transactions, scan the log further starting from the log
record <Te, COMMIT> to the end of the log. Note the transactions
that were started after the commit point of Te. Let the set of such
transactions be S2. Reexecute the transactions in sets S1 and S2
logically.

c. Consider again the example from the first item. Let us assume that both
transactions are undone and the balance is reverted back to the original
value $100.
Now we wish to redo transaction T2. If we redo the log record
<T2, A, 110, 120> corresponding to transaction T2, the balance will become
$120 and we will, in effect, redo both transactions, whereas we intend to
redo only transaction T2.

19.12 The recovery techniques that we described assume that blocks are written
atomically to disk. However, a block may be partially written when power fails,
with some sectors written, and others not yet written.
a. What problems can partial block writes cause?
b. Partial block writes can be detected using techniques similar to those
used to validate sector reads. Explain how.
c. Explain how RAID 1 can be used to recover from a partially written
block, restoring the block to either its old value or to its new value.
Answer:
FILL IN

19.13 The Oracle database system uses undo log records to provide a snapshot view
of the database under snapshot isolation. The snapshot view seen by
transaction Ti reflects updates of all transactions that had committed when Ti started
and the updates of Ti; updates of all other transactions are not visible to Ti.
Describe a scheme for buffer handling whereby transactions are given a
snapshot view of pages in the buffer. Include details of how to use the log to
generate the snapshot view. You can assume that operations as well as their
undo actions affect only one page.
Answer:

First, determine if a transaction is currently modifying the buffer. If not, then
return the current contents of the buffer. Otherwise, examine the records in
the undo log pertaining to this buffer. Make a copy of the buffer, then for
each relevant operation in the undo log, apply the operation to the buffer copy
starting with the most recent operation and working backwards until the point
at which the modifying transaction began. Finally, return the buffer copy as
the snapshot buffer.
CHAPTER
20
Database-System Architectures

Practice Exercises


20.1 Is a multiuser system necessarily a parallel system? Why or why not?
Answer:
No. A single processor with only one core can run multiple processes to
manage multiple users. Most modern systems are parallel, however.
20.2 Atomic instructions such as compare-and-swap and test-and-set also execute a
memory fence as part of the instruction on many architectures. Explain the
motivation for executing the memory fence, from the viewpoint of data
in shared memory that is protected by a mutex implemented by the atomic
instruction. Also explain what a process should do before releasing a mutex.
Answer:
FILL IN MORE
The memory fence ensures that the process that gets the mutex will see all
updates that happened before the instruction, as long as processes execute
a fence before releasing the mutex. Thus, even if the data was updated on a
different core, the process that acquires the mutex is guaranteed to see the
latest value of the data.
20.3 Instead of storing shared structures in shared memory, an alternative
architecture would be to store them in the local memory of a special process and
access the shared data by interprocess communication with the process. What
would be the drawback of such an architecture?
Answer:

The drawback would be that two interprocess messages would be required
to acquire a lock, one for the request and one to confirm the grant. Interprocess
communication is much more expensive than memory access, so the cost of
locking would increase. The process storing the shared structures could also
become a bottleneck.
The benefit of this alternative is that the lock table is protected better from
erroneous updates since only one process can access it.
20.4 Explain the distinction between a latch and a lock as used for transactional
concurrency control.
Answer:
Latches are short-duration locks that manage access to internal system data
structures. Locks taken by transactions are taken on database data items and
are often held for a substantial fraction of the duration of the transaction.
Latch acquisition and release are not covered by the two-phase locking
protocol.
20.5 Suppose a transaction is written in C with embedded SQL, and about 80
percent of the time is spent in the SQL code, with the remaining 20 percent spent
in C code. How much speedup can one hope to attain if parallelism is used
only for the SQL code? Explain.
Answer:
Since the part which cannot be parallelized takes 20% of the total running time,
the best speedup we can hope for is 5. In Amdahl's law, the speedup is
1 / ((1 - p) + (p/n)), where p = 4/5 and n is arbitrarily large. So 1 - p = 1/5
and p/n approaches zero, which gives a speedup approaching 1 / (1/5) = 5.
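
The bound can be checked numerically. The function below is a direct transcription of Amdahl's law as used in this answer; the name amdahl_speedup is illustrative, p is the parallelizable fraction, and n is the number of processors applied to that fraction.

```python
def amdahl_speedup(p, n):
    """Speedup when a fraction p of the work is spread over n processors."""
    return 1.0 / ((1.0 - p) + p / n)


# With 80 percent of the time in SQL code (p = 0.8):
four_way = amdahl_speedup(0.8, 4)       # 1 / (0.2 + 0.2), about 2.5
huge = amdahl_speedup(0.8, 10**9)       # approaches, but stays below, 5
```

Even with a billion processors the speedup stays just under 5, because the 20 percent spent in C code is never parallelized.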
20.6 Consider a pair of processes in a shared-memory system such that process
A updates a data structure, and then sets a flag to indicate that the update is
completed. Process B monitors the flag, and starts processing the data
structure only after it finds the flag is set.
Explain the problems that could arise in a memory architecture where
writes may be reordered, and explain how the sfence and lfence instructions
can be used to ensure the problem does not occur.
Answer:
The goal here is that the consumer process B should see the data structure state
after all updates have been completed. But out-of-order writes to main memory
can result in the consumer process seeing some but not all the updates to the
data structure, even after the flag has been set.
To avoid this problem, the producer process A should issue an sfence
after the updates, but before setting the flag. It can optionally issue an sfence
after setting the flag, to push the update to memory with minimum delay. The
consumer process B should correspondingly issue an lfence after the flag has
been found to be set, before accessing the data structure.
20.7 In a shared-memory architecture, why might the time to access a memory
location vary depending on the memory location being accessed?
Answer:
In a NUMA architecture, a processor can access its own memory faster than it
can access shared memory associated with another processor due to the time
taken to transfer data between processors.
20.8 Most operating systems for parallel machines (i) allocate memory in a local
memory area when a process requests memory, and (ii) avoid moving a
process from one core to another. Why are these optimizations important with a
NUMA architecture?
Answer:
In a NUMA architecture, a processor can access its own memory faster than it
can access shared memory associated with another processor due to the time
taken to transfer data between processors. Thus, if the data of a process resides
in local memory, the process execution would be faster than if the memory is
non-local.
Further, if a process moves from one core to another, it may lose the
benefits of local allocation of memory, and be forced to carry out many memory
accesses from other cores. To avoid this problem, most operating systems avoid
moving a process from one core to another wherever possible.
20.9 Some database operations such as joins can see a significant difference in
speed when data (e.g., one of the relations involved in a join) fits in memory
as compared to the situation where the data do not fit in memory. Show
how this fact can explain the phenomenon of superlinear speedup, where an
application sees a speedup greater than the amount of resources allocated to
it.
Answer:
We illustrate this by an example. Suppose we double the amount of main memory
and that as a result, one of the relations now fits entirely in main memory.
We can now use a nested-loop join with the inner-loop relation entirely in main
memory and incur disk accesses for reading the input relations only one time.
With the original amount of main memory, the best join strategy may have had
to read a relation in from disk more than once.
20.10 What is the key distinction between homogeneous and federated distributed
database systems?
Answer:

The key difference is the degree of cooperation among the systems and the
degree of centralized control. Homogeneous systems share a global schema,
run the same database-system software and actively cooperate on query
processing. Federated systems may have distinct schemas and software, and may
cooperate in only a limited manner.
20.11 Why might a client choose to subscribe only to the basic infrastructure-as-a-service
model rather than to the services offered by other cloud service
models?
Answer:
A client may wish to control its own applications and thus may not wish to
subscribe to a software-as-a-service model; or the client might wish further to
be able to choose and manage its own database system and thus not wish to
subscribe to a platform-as-a-service model.
20.12 Why do cloud-computing services support traditional database systems best by
using a virtual machine, instead of running directly on the service provider's
actual machine, assuming that data is on external storage?
Answer:
By using a virtual machine, if a physical machine fails, virtual machines
running on that physical machine can be restarted quickly on one or more other
physical machines, improving availability. (This assumes, of course, that data
remains accessible, either by storing multiple copies of data, or by storing data
in a highly available external storage system.)
CHAPTER
21
Parallel and Distributed Storage
Practice Exercises
21.1 In a range selection on a range-partitioned attribute, it is possible that only
one disk may need to be accessed. Describe the benefits and drawbacks of this
property.
Answer:
If there are few tuples in the queried range, then each query can be processed
quickly on a single disk. This allows parallel execution of queries with reduced
overhead of initiating queries on multiple disks.
On the other hand, if there are many tuples in the queried range, each query
takes a long time to execute as there is no parallelism within its execution. Also,
some of the disks can become hot spots, further increasing response time.
Hybrid range partitioning, in which small ranges (a few blocks each) are
partitioned in a round-robin fashion, provides the benefits of range partitioning
without its drawbacks.
21.2 Recall that histograms are used for constructing load-balanced range
partitions.

a. Suppose you have a histogram where values are between 1 and 100, and
are partitioned into 10 ranges, 1–10, 11–20, ..., 91–100, with frequencies
15, 5, 20, 10, 10, 5, 5, 20, 5, and 5, respectively. Give a load-balanced
range partitioning function to divide the values into five partitions.
b. Write an algorithm for computing a balanced range partition with $p$
partitions, given a histogram of frequency distributions containing $n$ ranges.

Answer:

a. A partitioning vector which gives 5 partitions with 20 tuples in each
partition is: [21, 31, 51, 76]. The 5 partitions obtained are 1–20, 21–30,
31–50, 51–75, and 76–100. The assumption made in arriving at this
partitioning vector is that within a histogram range, each value is equally
likely.
b. Let the histogram ranges be called $h_1, h_2, \ldots, h_h$, and the partitions
$p_1, p_2, \ldots, p_p$. Let the frequencies of the histogram ranges be
$n_1, n_2, \ldots, n_h$. Each partition should contain $N/p$ tuples, where
$N = \sum_{i=1}^{h} n_i$.
To construct the load-balanced partitioning vector, we need to determine
the value of the $k_1$th tuple, the value of the $k_2$th tuple, and so on,
where $k_1 = N/p$, $k_2 = 2N/p$, etc., until $k_{p-1}$. The partitioning vector will
then be $[k_1, k_2, \ldots, k_{p-1}]$. The value of the $k_i$th tuple is determined as
follows: First determine the histogram range $h_j$ in which it falls. Assuming
all values in a range are equally likely, the $k_i$th value will be

$$s_j + (e_j - s_j) \cdot \frac{k_{ij}}{n_j}$$

where
$s_j$: first value in $h_j$
$e_j$: last value in $h_j$
$k_{ij}$: $k_i - \sum_{l=1}^{j-1} n_l$
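The procedure in part (b) can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: the function name `partition_vector` and the representation of the histogram as a list of inclusive `(start, end)` ranges plus a parallel list of frequencies are choices made here. Values are treated as continuous, so the formula returns an estimate of the last value in each partition (for the histogram of part (a): 20, 30, 50, and 75.5), which corresponds to the integer boundaries 1–20, 21–30, 31–50, 51–75 given in part (a).

```python
def partition_vector(ranges, freqs, p):
    """Estimate the values of the (i*N/p)-th tuples, i = 1..p-1.

    ranges: list of inclusive (s_j, e_j) histogram ranges, in sorted order
    freqs:  list of frequencies n_j, one per range
    p:      number of partitions wanted
    """
    total = sum(freqs)                       # N
    cuts = []
    for i in range(1, p):
        k = i * total / p                    # k_i = i * N / p
        seen = 0                             # sum of n_l for l < j
        for (s, e), n in zip(ranges, freqs):
            if n > 0 and seen + n >= k:      # k_i-th tuple falls in this range
                k_ij = k - seen
                cuts.append(s + (e - s) * k_ij / n)
                break
            seen += n
    return cuts

ranges = [(10 * j + 1, 10 * j + 10) for j in range(10)]   # 1-10, 11-20, ..., 91-100
freqs = [15, 5, 20, 10, 10, 5, 5, 20, 5, 5]
print(partition_vector(ranges, freqs, 5))
```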

21.3 Histograms are traditionally constructed on the values of a specific attribute
(or set of attributes) of a relation. Such histograms are good for avoiding data
distribution skew but are not very useful for avoiding execution skew. Explain
why.
Now suppose you have a workload of queries that perform point lookups.
Explain how you can use the queries in the workload to come up with a
partitioning scheme that avoids execution skew.
Answer:

FILL
21.4 Replication:
a. Give two reasons for replicating data across geographically distributed
data centers.
b. Centralized databases support replication using log records. How is
the replication in centralized databases different from that in
parallel/distributed databases?

Answer:

a. By replicating across data centers, even if a data center fails, for example
due to a power outage or a natural disaster, the data would still be
available from another data center. By keeping the data centers
geographically separated, the chances of a single natural disaster such as an
earthquake or a storm affecting both the data centers at the same time are
minimized.
b. Centralized databases typically support only full database replication using
log records (although some support logical replication, allowing
replication to be restricted to some relations). However, they do not support
partitioning, or the ability to replicate different parts of the database at
different nodes; the latter helps minimize the load increase at a replica
when a node fails by spreading the load across multiple nodes.
21.5 Parallel indices:
a. Secondary indices in a centralized database store the record identifier.
A global secondary index too could potentially store a partition number
holding the record, and a record identifier within the partition. Why
would this be a bad idea?
b. Global secondary indices are implemented in a way similar to local
secondary indices that are used when records are stored in a B+-tree file
organization. Explain the similarities between the two scenarios that
result in a similar implementation of the secondary indices.

Answer:

a. Any update such as splitting or moving a partition, which is required
to balance load, would require a large number of updates to secondary
indices.
b. In both cases records may move (across nodes, or to a different location
within the node), which would require a large number of updates to
secondary indices if they stored direct pointers. The indirection through the
clustering index key / partitioning key allows record movement without
any updates to the secondary index.
21.6 Parallel database systems store replicas of each data item (or partition) on
more than one node.
a. Why is it a good idea to distribute the copies of the data items allocated
to a node across multiple other nodes, instead of storing all the copies
in the same node (or set of nodes)?
b. What are the benefits and drawbacks of using RAID storage instead of
storing an extra copy of each data item?

Answer:
a. The copies of the data items at a node should be partitioned across
multiple other nodes, rather than stored in a single node, for the following
reasons:
• To better distribute the work which should have been done by the
failed node, among the remaining nodes.
• Even when there is no failure, this technique can to some extent deal
with hot spots created by read-only transactions.
b. RAID level 1 itself stores an extra copy of each data item (mirroring).
Thus this is similar to mirroring performed by the database itself, except
that the database system does not have to bother about the details of
performing the mirroring. It just issues the write to the RAID system,
which automatically performs the mirroring.
RAID level 5 is less expensive than mirroring in terms of disk space
requirement, but writes are more expensive, and rebuilding a crashed
disk is more expensive.
21.7 Partitioning and replication.
a. Explain why range partitioning gives better control on tablet sizes than
hash partitioning. List an analogy between this case and the case of
B+-tree indices versus hash indices.
b. Some systems first perform hashing on the key, and then use range
partitioning on the hash values. What could be a motivation for this choice,
and what are its drawbacks as compared to performing range partitioning
directly on the key?
c. It is possible to horizontally partition data, and then perform vertical
partitioning locally at each node. It is also possible to do the converse,
where vertical partitioning is done first, and each partition is then
horizontally partitioned independently. What are the benefits of the
first option over the second one?

Answer:

a. Hash partitioning does not permit any control on individual tablet sizes,
unlike range partitioning, which allows overfull partitions to be split quite
easily. B+-tree indices use range partitioning, allowing a leaf node to be
split if it is overfull. In contrast, it is not easy to split a hash bucket in a
hash index if the bucket is overfull.
Some approaches similar to those used for dynamic hashing (such as
linear hashing or extendable hashing) have been proposed to allow overfull
hash buckets to be split while leaving other hash buckets untouched,
but range partitioning provides a simpler solution.
b. Hashing allows keys of various types to be mapped to a single data type,
simplifying the job of partitioning the data. The drawback is that range
queries cannot be supported using hashing (without performing a full
table scan), whereas direct range partitioning allows efficient support for
range queries.
c. The first option allows reconstruction of records at a single node if a
query only accesses records at that node. With the second option, the
vertical fragments corresponding to one record may potentially be residing
on different nodes, requiring extra communication to get the vertical
fragments together.
21.8 In order to send a request to the master replica of a data item, a node must
know which replica is the master for that data item.
a. Suppose that between the time the node identifies which node is the
master replica for a data item, and the time the request reaches the
identified node, the mastership has changed, and a different node is now the
master. How can such a situation be dealt with?
b. While the master replica could be chosen on a per-partition basis, some
systems support a per-record master replica, where the records of a
partition (or tablet) are replicated at some set of nodes, but each record's
master replica can be on any of the nodes from within this set of nodes,
independent of the master replica of other records. List two benefits of
keeping track of the master on a per-record basis.
c. Suggest how to keep track of the master replica for each record, when
there are a large number of records.

Answer:

a. If a node receives a request for a data item when it is not the master, it can
send an error reply with the reason for the error to the requesting node.
The requesting node can then find the current master and resend the
request to the current master. Alternatively, the old master can forward
the message to the new master, which can reply to the requesting node.
b. Tracking mastership on a per-record basis allows the master to be located
in a geographical region where most requests for the data item occur, for
example the region where the user resides. Reads can then be satisfied
without any communication with other regions, which is generally much
slower due to speed-of-light delays. Further, writes can also be done
locally, and replicated asynchronously to the other replicas.
c. Each record can have an extra hidden field that stores the master replica
of that record. In case the information is outdated, all the replicas of the
data item can be accessed to find the nodes listed as masters for that data
item; those nodes can be contacted to find the current master.
CHAPTER
22
Parallel and Distributed Query
Processing
Practice Exercises
22.1 What form of parallelism (interquery, interoperation, or intraoperation) is
likely to be the most important for each of the following tasks?
a. Increasing the throughput of a system with many small queries
b. Increasing the throughput of a system with a few large queries when the
number of disks and processors is large

Answer:

a. When there are many small queries, interquery parallelism gives good
throughput. Parallelizing each of these small queries would increase the
initiation overhead, without any significant reduction in response time.
b. With a few large queries, intraquery parallelism is essential to get fast
response times. Given that there are large numbers of processors and
disks, only intraoperation parallelism can take advantage of the parallel
hardware, for queries typically have few operations, but each one needs
to process a large number of tuples.
22.2 Describe how partial aggregation can be implemented for the count and avg
aggregate functions to reduce data transfer.
Answer:

FILL
22.3 With pipelined parallelism, it is often a good idea to perform several operations
in a pipeline on a single processor, even when many processors are available.
a. Explain why.
b. Would the arguments you advanced in part a hold if the machine has a
shared-memory architecture? Explain why or why not.
c. Would the arguments in part a hold with independent parallelism? (That
is, are there cases where, even if the operations are not pipelined and
there are many processors available, it is still a good idea to perform
several operations on the same processor?)

Answer:

a. The speedup obtained by parallelizing the operations would be offset by
the data transfer overhead, as each tuple produced by an operator would
have to be transferred to its consumer, which is running on a different
processor.
b. In a shared-memory architecture, transferring the tuples is very efficient.
So the above argument does not hold to any significant degree.
c. Even if two operations are independent, it may be that they both supply
their outputs to a common third operator. In that case, running all three
on the same processor may be better than transferring tuples across
processors.

22.4 Consider join processing using symmetric fragment and replicate with range
partitioning. How can you optimize the evaluation if the join condition is of
the form $|r.A - s.B| \le k$, where $k$ is a small constant? Here, $|x|$ denotes
the absolute value of $x$. A join with such a join condition is called a band join.
Answer:
Relation $r$ is partitioned into $n$ partitions, $r_0, r_1, \ldots, r_{n-1}$, and $s$ is also
partitioned into $n$ partitions, $s_0, s_1, \ldots, s_{n-1}$. The partitions are replicated and
assigned to processors as shown in the accompanying figure.
Each fragment is replicated on three processors only, unlike in the general
case where it is replicated on $n$ processors. The number of processors required
is now approximately $3n$, instead of $n^2$ in the general case. Therefore, given the
same number of processors, we can partition the relations into more fragments
with this optimization, thus making each local join faster.
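The processor count can be illustrated with a small sketch. It assumes that the band width $k$ is no larger than the width of any range partition, so that tuples of $r_i$ can only match tuples in $s_{i-1}$, $s_i$, or $s_{i+1}$; the function name `band_join_pairs` is hypothetical.

```python
def band_join_pairs(n: int) -> list[tuple[int, int]]:
    """Fragment pairs (i, j) that need a processor P_{i,j}, i.e. |i - j| <= 1."""
    return [(i, j)
            for i in range(n)
            for j in range(max(0, i - 1), min(n, i + 2))]

n = 6
pairs = band_join_pairs(n)
# Roughly 3n pairs are needed, compared with n * n for general
# symmetric fragment-and-replicate.
print(len(pairs), "processors instead of", n * n)
```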
22.5 Suppose relation $r$ is stored partitioned and indexed on $A$, and $s$ is stored
partitioned and indexed on $B$. Consider the query:
$$_{r.C}\gamma_{\mathrm{count}(s.D)}\big(\sigma_{A>5}(r) \bowtie_{r.B=s.B} s\big)$$

a. Give a parallel query plan using the exchange operator, for computing
the subtree of the query involving only the select and join operators.
b. Now extend the above to compute the aggregate. Make sure to use
preaggregation to minimize the data transfer.
[Figure 22.101: grid with rows $r_0, \ldots, r_{n-1}$ and columns $s_0, \ldots, s_{n-1}$; processors appear only in the band around the diagonal ($P_{0,0}$, $P_{0,1}$; $P_{1,0}$, $P_{1,1}$, $P_{1,2}$; $P_{2,1}$, $P_{2,2}$, $P_{2,3}$; ...; $P_{n-1,n-1}$). Caption as printed: "The three levels of data abstraction."]

c. Skew during aggregation is a serious problem. Explain how
preaggregation as above can also significantly reduce the effect of skew
during aggregation.

Answer:

a. This is a small variant of an example from the chapter.
b. This one is very straightforward, since it is already the example in the
chapter.
c. Pre-aggregation can greatly reduce the size of the data sent to the final
aggregation step. So even if there is skew, the absolute data sizes are
smaller, resulting in a significant reduction in the impact of the skew.
22.6 Suppose relation $r$ is stored partitioned and indexed on $A$, and $s$ is stored
partitioned and indexed on $B$. Consider the join $r \bowtie_{r.B=s.B} s$. Suppose $s$ is
relatively small, but not small enough to make asymmetric fragment-and-replicate
join the best choice, and $r$ is large, with most $r$ tuples not matching any $s$ tuple. A
hash join can be performed but with a semijoin filter used to reduce the data
transfer. Explain how semijoin filtering using Bloom filters would work in this
parallel join setting.
Answer:
Since $s$ is small, it makes sense to send a Bloom filter on $s.B$ to all partitions of $r$.
Then we use the Bloom filter to find $r$ tuples that may match some $s$ tuple, and
repartition the matching $r$ tuples on $r.B$, sending them to the nodes containing
$s$ (which is already partitioned on $s.B$). Then the join can be performed at each
site storing $s$ tuples. The Bloom filter can significantly reduce the number of $r$
tuples transferred.
Note that repartitioning $s$ does not make sense since it is already partitioned
on the join attribute, unlike $r$.
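The filtering step can be sketched as follows. This is a minimal illustration under stated assumptions: the `BloomFilter` class, its size parameters, the use of SHA-256 for hashing, and the tuple layouts (`r` tuples as `(a, b)`, `s` tuples as `(b, d)`) are all choices made here, and the per-node partitions are simulated as in-memory lists.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key) -> bool:
        # May return True for absent keys (false positive), never False for present ones.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

def semijoin_filter(r_partitions, s_partitions):
    """Keep only r tuples whose B value may match some s.B value."""
    bf = BloomFilter()
    for part in s_partitions:              # build the filter over all s.B values
        for b, _d in part:
            bf.add(b)
    # Each r node applies the filter locally before repartitioning on r.B.
    return [[t for t in part if bf.might_contain(t[1])] for part in r_partitions]

r_parts = [[(1, 10), (2, 99)], [(3, 20), (4, 98)]]
s_parts = [[(10, "x")], [(20, "y")]]
print(semijoin_filter(r_parts, s_parts))
```

Because Bloom filters can give false positives, a few non-matching `r` tuples may still be shipped; the join at the `s` nodes discards them.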
22.7 Suppose you want to compute the left outer join $r ⟕_{r.A=s.A} s$.

a. Suppose $s$ is a small relation, while $r$ is stored partitioned on $r.B$. Give
an efficient parallel algorithm for computing the left outer join.
b. Now suppose that $r$ is a small relation, and $s$ is a large relation, stored
partitioned on attribute $s.B$. Give an efficient parallel algorithm for
computing the above left outer join.

Answer:

a. Replicating $s$ to all nodes, and computing the left outer join independently
at each node, would be a good option in this case.
b. The best technique in this case is to replicate $r$ to all nodes, and compute
$r \bowtie s_i$ at each node $i$. Then, we send the list of $r$ tuples that had
matches at site $i$ back to a single node, which takes the union of the
returned $r$ tuples from each node $i$. Tuples in $r$ that are absent in this
union are then padded with nulls and added to the output.
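The approach of part (b) can be sketched as below. The sketch simulates the nodes with a loop, assumes the tuples of $r$ are distinct, and uses hypothetical tuple layouts (`r` tuples as `(a, x)`, `s` tuples as `(a, y)`) and a hypothetical function name.

```python
def parallel_left_outer_join(r, s_partitions):
    """Left outer join of small relation r with partitioned relation s on attribute a."""
    output = []
    matched = set()                 # union of matched r tuples, gathered at one node
    for s_i in s_partitions:        # each iteration stands in for one node i
        for ra, rx in r:            # r is replicated to every node
            for sa, sy in s_i:
                if ra == sa:
                    output.append((ra, rx, sy))
                    matched.add((ra, rx))
    # Single collecting node: r tuples with no match anywhere are null-padded.
    output.extend((ra, rx, None) for ra, rx in r if (ra, rx) not in matched)
    return output

r = [(1, "a"), (2, "b")]
s_parts = [[(1, "p")], [(1, "q"), (3, "z")]]
print(parallel_left_outer_join(r, s_parts))
```

Padding cannot be done locally at node $i$, since an $r$ tuple with no match in $s_i$ may still have a match at another node.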

22.8 Suppose you want to compute $_{A,B}\gamma_{\mathrm{sum}(C)}$ on a relation $s$ which is stored
partitioned on $s.B$. Explain how you would do it efficiently, minimizing/avoiding
repartitioning, if the number of distinct $s.B$ values is large, and the distribution
of the number of tuples with each $s.B$ value is relatively uniform.
Answer:

The aggregate can be computed locally at each node, with no repartitioning
at all, since partitioning on $s.B$ implies partitioning on $(s.A, s.B)$. To understand
why, note that partitioning on $(A, B)$ requires that tuples with the same value
for $(A, B)$ must be in the same partition. Partitioning on just $B$, ignoring $A$,
also satisfies this requirement.
Of course, not partitioning at all also satisfies the requirement, but that
defeats the purpose of parallel query processing. As long as the number of
distinct $s.B$ values is large enough and the number of tuples with each $s.B$ value
is relatively uniform and not highly skewed, using the existing partitioning on
$s.B$ will give good performance.
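A small sketch makes the argument concrete. It assumes tuples of the form `(a, b, c)` and simulates the nodes as in-memory lists already partitioned on `b`; the function names are hypothetical. Because every `b` value lives on exactly one node, the per-node results have disjoint group keys and can simply be concatenated.

```python
from collections import defaultdict

def local_sum_by_ab(partition):
    """sum(C) grouped by (A, B) over the tuples stored at one node."""
    sums = defaultdict(int)
    for a, b, c in partition:
        sums[(a, b)] += c
    return dict(sums)

def parallel_sum(partitions):
    result = {}
    for part in partitions:
        local = local_sum_by_ab(part)
        # No group can appear at two nodes when the data is partitioned on B.
        assert result.keys().isdisjoint(local.keys())
        result.update(local)        # plain concatenation, no merging of partial sums
    return result

nodes = [
    [(1, 2, 10), (1, 2, 5), (2, 2, 1)],   # node holding B = 2
    [(1, 3, 7), (1, 3, 3)],               # node holding B = 3
]
print(parallel_sum(nodes))
```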
22.9 MapReduce implementations provide fault tolerance, where you can reexecute
only failed mappers or reducers. By default, a partitioned parallel join
execution would have to be rerun completely in case of even one node failure. It is
possible to modify a parallel partitioned join execution to add fault tolerance
in a manner similar to MapReduce, so failure of a node does not require full
reexecution of the query, but only actions related to that node. Explain what
needs to be done at the time of partitioning at the sending node and receiving
node to do this.
Answer: This is an application of ideas from MapReduce to join processing.
There are two steps: first the data is repartitioned, and then the join is performed,
corresponding to the map and reduce steps.
A failure during the repartition can be handled by reexecuting the work
of the failed node. However, the destination must ensure that tuples are not
processed twice. To do so, it can store all received tuples on local disk, and
start processing only after all tuples have been received. If the sender fails
meanwhile, and a new node takes over, the receivers can discard all tuples
received from the failed sender, and receive them again. This part is not too
expensive.
Failures during the final join computation can be handled similarly to reducer
failure, by getting the data again from the partitioners. However, in the
MapReduce paradigm tuples to be sent to reducers are stored on disk at the
mappers, so they can be resent if required. This can also be done with parallel
joins, but there is now a significant extra cost of writing the tuples to disk.
Another option is to find the tuples to be sent to the failed join node by
rescanning the input. But now, all partitioners have to reread their entire input,
which makes the process very expensive, similar in cost to rerunning the join.
As a result this is not viewed as useful.
22.10 If a parallel data-store is used to store two relations $r$ and $s$ and we need to join
$r$ and $s$, it may be useful to maintain the join as a materialized view. What are
the benefits and overheads in terms of overall throughput, use of space, and
response time to user queries?
Answer:
Performing a join on a cloud data-storage system can be very expensive if
either of the relations to be joined is partitioned on attributes other than the
join attributes, since a very large amount of data would need to be transferred
to perform the join. However, if $r \bowtie s$ is maintained as a materialized view,
it can be updated at a relatively low cost each time either $r$ or $s$ is
updated, instead of incurring a very large cost when the query is executed.
Thus, queries benefit at some cost to updates.
With the materialized view, overall throughput will be much better if the
join query is executed reasonably often relative to updates, but may be worse
if the join is rarely used and updates are frequent.
The materialized view will certainly require extra space, but given that disk
capacities are very high relative to I/O (seek) operations and transfer rates, the
extra space is not likely to be a major overhead.
The materialized view will obviously be very useful for evaluating join queries,
reducing time greatly by reducing data transfer across machines.
22.11 Explain how each of the following join algorithms can be implemented using
the MapReduce framework:
a. Broadcast join (also known as asymmetric fragment-and-replicate join).
b. Indexed nested loop join, where the inner relation is stored in a parallel
data-store.
c. Partitioned join.

Answer:
FILL
CHAPTER
23
Parallel and Distributed
Transaction Processing
Practice Exercises
23.1 What are the key differences between a local-area network and a wide-area
network that affect the design of a distributed database?
Answer:
Data transfer is much faster, and communication latency is much lower, on
a local-area network (LAN) than on a wide-area network (WAN). Protocols
that require multiple rounds of communication may be acceptable in a local-area
network, but distributed databases designed for wide-area networks try to
minimize the number of such rounds of communication.
Replication to a local node for reducing latency is quite important in a wide-area
network, but less so in a local-area network.
Network link failure and network partition are also more likely in a wide-area
network than in a local-area network, where systems can be designed with
more redundancy to deal with failures. Protocols designed for wide-area
networks should handle such failures without creating any inconsistencies in the
database.
23.2 To build a highly available distributed system, you must know what kinds of
failures can occur.
a. List possible types of failure in a distributed system.
b. Which items in your list from part a are also applicable to a centralized
system?

Answer:

a. The types of failure that can occur in a distributed system include
i. Site failure.
ii. Disk failure.
iii. Communication failure, leading to disconnection of one or more
sites from the network.
b. The first two failure types can also occur on centralized systems.
23.3 Consider a failure that occurs during 2PC for a transaction. For each possible
failure that you listed in Exercise 23.2a, explain how 2PC ensures transaction
atomicity despite the failure.
Answer:
A proof that 2PC guarantees atomic commits/aborts in spite of site and link
failures follows. The main idea is that after all sites reply with a <ready T>
message, only the coordinator of a transaction can make a commit or abort
decision. Any subsequent commit or abort by a site can happen only after it
ascertains the coordinator's decision, either directly from the coordinator or
indirectly from some other site. Let us enumerate the cases for a site aborting,
and then for a site committing.

a. A site can abort a transaction T (by writing an <abort T> log record)
only under the following circumstances:

i. It has not yet written a <ready T> log record. In this case, the
coordinator could not have got, and will not get, a <ready T> or <commit T>
message from this site. Therefore, only an abort decision can be
made by the coordinator.
ii. It has written the <ready T> log record, but on inquiry it found out
that some other site has an <abort T> log record. In this case it is
correct for it to abort, because that other site would have ascertained
the coordinator's decision (either directly or indirectly) before
actually aborting.
iii. It is itself the coordinator. In this case also no site could have
committed, or will commit in the future, because commit decisions can
be made only by the coordinator.

b. A site can commit a transaction T (by writing a <commit T> log record)
only under the following circumstances:

i. It has written the <ready T> log record, and on inquiry it found out
that some other site has a <commit T> log record. In this case it
is correct for it to commit, because that other site would have
ascertained the coordinator's decision (either directly or indirectly) before
actually committing.
ii. It is itself the coordinator. In this case no other participating site can
abort or would have aborted, because abort decisions are made only
by the coordinator.

23.4 Consider a distributed system with two sites, A and B. Can site A distinguish
among the following?

• B goes down.
• The link between A and B goes down.
• B is extremely overloaded and response time is 100 times longer than
normal.

What implications does your answer have for recovery in distributed systems?
Answer:

Site A cannot distinguish between the three cases until communication has
resumed with site B. The action which it performs while B is inaccessible must
be correct irrespective of which of these situations has actually occurred, and
it must be such that B can re-integrate consistently into the distributed system
once communication is restored.


23.5 The persistent messaging scheme described in this chapter depends on
timestamps. A drawback is that they can discard received messages only if they are
too old, and may need to keep track of a large number of received messages.
Suggest an alternative scheme based on sequence numbers instead of
timestamps, that can discard messages more rapidly.
Answer:
We can have a scheme based on sequence numbers similar to the scheme based
on timestamps. We tag each message with a sequence number that is unique
for the (sending site, receiving site) pair. The number is increased by 1 for each
new message sent from the sending site to the receiving site.
The receiving site stores and acknowledges a received message only if it has
received all lower-numbered messages also; the message is stored in the
received-messages relation.
The sending site retransmits a message until it has received an ack from the
receiving site containing the sequence number of the transmitted message or a
higher sequence number. Once the acknowledgment is received, it can delete
the message from its send queue.
The receiving site discards all messages it receives that have a lower sequence
number than the latest stored message from the sending site. The receiving
site discards from received-messages all but the (number of the) most recent
message from each sending site (a message can be discarded only after being
processed locally).
Note that this scheme requires a fixed (and small) overhead at the receiving
site for each sending site, regardless of the number of messages received. In
contrast, the timestamp scheme requires extra space for every message. The
timestamp scheme would have lower storage overhead if the number of
messages received within the timeout interval is small compared to the number of
sites, whereas the sequence number scheme would have lower overhead
otherwise.
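The receiving side of such a scheme might look like the sketch below. The class and method names are hypothetical, and two details are assumptions made here rather than stated in the answer: sequence numbers start at 1, and a duplicate is answered with an acknowledgment of the highest number stored so far.

```python
class Receiver:
    def __init__(self):
        self.last_seq = {}     # sender -> highest sequence number stored so far
        self.delivered = []    # stands in for local processing of messages

    def receive(self, sender, seq, payload):
        """Return the sequence number to acknowledge, or None to stay silent."""
        last = self.last_seq.get(sender, 0)
        if seq <= last:
            return last        # old or duplicate message: discard, re-acknowledge
        if seq == last + 1:
            self.delivered.append((sender, payload))
            self.last_seq[sender] = seq   # only one number is kept per sender
            return seq
        return None            # gap: a lower-numbered message is still missing

rcv = Receiver()
print(rcv.receive("A", 1, "x"))   # 1
print(rcv.receive("A", 3, "z"))   # None, message 2 has not arrived yet
print(rcv.receive("A", 2, "y"))   # 2
print(rcv.receive("A", 1, "x"))   # 2, the retransmitted duplicate is discarded
```

The per-sender state is a single integer, which is the fixed overhead the answer refers to.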
23.6 Explain the difference between data replication in a distributed system and the
maintenance of a remote backup site.
Answer:

In remote backup systems, all transactions are performed at the primary site
and the entire database is replicated at the remote backup site. The remote
backup site is kept synchronized with the updates at the primary site by sending
all log records. Whenever the primary site fails, the remote backup site
takes over processing.
Distributed systems offer greater availability by having multiple copies of
the data at different sites, whereas remote backup systems offer lesser
availability at lower cost and execution overhead. In a distributed system,
different data items may be replicated at different nodes.
In a distributed system, transaction code can run at all the sites, whereas in a
remote backup system it runs only at the primary site. Distributed system
transactions need to follow two-phase commit or other consensus protocols
to keep the data in a consistent state, whereas a remote backup system does not
follow two-phase commit and avoids the related overhead.
23.7 Give an example where lazy repli ation an lead to an in onsistent database
state even when updates get an ex lusive lo k on the primary (master) opy if
data were read from a node other than the master.
Answer:

Consider the balance in an account, replicated at N sites. Let the current
balance be $100, consistent across all sites. Consider two transactions T1 and
T2, each depositing $10 in the account. Thus the balance would be $120 after
both these transactions are executed. Let the transactions execute in sequence:
T1 first and then T2. Suppose the copy of the balance at one of the sites, say
s, is not consistent (due to the lazy replication strategy) with the primary
copy after transaction T1 is executed, and let transaction T2 read this copy of
the balance. One can see that the balance at the primary site would be $110 at
the end.
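The lost update can be replayed with a few lines of Python. This is a minimal
sketch, not a model of a real replication protocol: the dictionaries primary
and replica_s and the helper deposit are hypothetical names, and the exclusive
lock on the primary is assumed rather than modeled.

```python
def run_lazy_replication_example():
    primary = {"balance": 100}      # master copy
    replica_s = {"balance": 100}    # copy at site s, updated lazily

    def deposit(amount, read_from):
        # The write goes to the primary (under an exclusive lock, not modeled),
        # but the read may be served by any copy.
        old = read_from["balance"]
        primary["balance"] = old + amount

    deposit(10, read_from=primary)      # T1: reads 100 at the primary, writes 110
    # T1's update has not yet been propagated to site s.
    deposit(10, read_from=replica_s)    # T2: reads stale 100 at s, writes 110
    replica_s["balance"] = primary["balance"]   # lazy propagation arrives later
    return primary["balance"]


final_balance = run_lazy_replication_example()
```

Here final_balance is 110 rather than the expected 120, because T2 computed its
new value from the stale copy at site s.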
23.8 Consider the following deadlock-detection algorithm. When transaction Ti, at
site S1, requests a resource from Tj, at site S3, a request message with
timestamp n is sent. The edge (Ti, Tj, n) is inserted in the local wait-for
graph of S1. The edge (Ti, Tj, n) is inserted in the local wait-for graph of
S3 only if Tj has received the request message and cannot immediately grant
the requested resource. A request from Ti to Tj in the same site is handled in
the usual manner; no timestamps are associated with the edge (Ti, Tj). A
central coordinator invokes the detection algorithm by sending an initiating
message to each site in the system.
On receiving this message, a site sends its local wait-for graph to the
coordinator. Note that such a graph contains all the local information that the
site has about the state of the real graph. The wait-for graph reflects an
instantaneous state of the site, but it is not synchronized with respect to any
other site.
When the controller has received a reply from each site, it constructs a
graph as follows:
• The graph contains a vertex for every transaction in the system.
• The graph has an edge (Ti, Tj) if and only if:
° There is an edge (Ti, Tj) in one of the wait-for graphs.
° An edge (Ti, Tj, n) (for some n) appears in more than one wait-for
graph.
Show that, if there is a cycle in the constructed graph, then the system is in a
deadlock state, and that, if there is no cycle in the constructed graph, then the
system was not in a deadlock state when the execution of the algorithm began.

Answer:
Let us say a cycle Ti → Tj → ... → Tm → Ti exists in the graph built by
the controller. The edges in the graph will either be local edges of the form
(Tk, Tl) or distributed edges of the form (Tk, Tl, n). Each local edge
(Tk, Tl) definitely implies that Tk is waiting for Tl. Since a distributed
edge (Tk, Tl, n) is inserted into the graph only if Tk's request has reached
Tl and Tl cannot immediately release the lock, Tk is indeed waiting for Tl.
Therefore every edge in the cycle indeed represents a transaction waiting for
another. For a detailed proof that this implies a deadlock, refer to Stuart et
al. [1984].
We now prove the converse implication. As soon as it is discovered that Tk is
waiting for Tl:
a. A local edge (Tk, Tl) is added if both are on the same site.
b. The edge (Tk, Tl, n) is added in both the sites, if Tk and Tl are on
different sites.
Therefore, if the algorithm were able to collect all the local wait-for graphs at
the same instant, it would definitely discover a cycle in the constructed graph,
in case there is a circular wait at that instant. If there is a circular wait at the
instant when the algorithm began execution, none of the edges participating in
that cycle can disappear until the algorithm finishes. Therefore, even though
the algorithm cannot collect all the local graphs at the same instant, any cycle
which existed just before it started will be detected.
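The coordinator's graph construction and cycle test can be sketched in Python
as follows. The function names build_global_graph and has_cycle, and the
representation of edges as 2-tuples (local) and 3-tuples (cross-site, with
timestamp n), are assumptions made only for this illustration.

```python
def build_global_graph(local_graphs):
    """local_graphs: one list of edges per site.
    A local edge is (Ti, Tj); a cross-site edge is (Ti, Tj, n)."""
    edges = set()
    seen_at = {}  # cross-site edge -> number of sites that reported it
    for graph in local_graphs:
        for edge in set(graph):
            if len(edge) == 2:
                edges.add(edge)
            else:
                seen_at[edge] = seen_at.get(edge, 0) + 1
    for (ti, tj, _n), count in seen_at.items():
        if count > 1:  # the edge appears in more than one wait-for graph
            edges.add((ti, tj))
    return edges


def has_cycle(edges):
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
    state = {}  # node -> "active" (on the DFS stack) or "done"

    def visit(u):
        state[u] = "active"
        for v in successors.get(u, []):
            if state.get(v) == "active":
                return True
            if v not in state and visit(v):
                return True
        state[u] = "done"
        return False

    return any(visit(u) for u in list(successors) if u not in state)


# T1 is at site S1; T2 and T3 are at site S3.
site_1 = [("T1", "T2", 5)]
site_3 = [("T1", "T2", 5), ("T2", "T3"), ("T3", "T1", 7)]
pending = has_cycle(build_global_graph([site_1, site_3]))

# Later, T3's request reaches S1 and T1 cannot grant it immediately.
site_1_later = site_1 + [("T3", "T1", 7)]
deadlocked = has_cycle(build_global_graph([site_1_later, site_3]))
```

In the first snapshot the edge (T3, T1, 7) is reported by only one site, so it
is left out and no cycle is found; once both sites report it, the cycle
T1 → T2 → T3 → T1 appears.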
23.9 Consider the chain-replication protocol, described in Section 23.4.3.2, which
is a variant of the primary-copy protocol.
a. If locking is used for concurrency control, what is the earliest point when
a process can release an exclusive lock after updating a data item?
b. While each data item could have its own chain, give two reasons it would
be preferable to have a chain defined at a higher level, such as for each
partition or tablet.
c. How can consensus protocols be used to ensure that the chain is
uniquely determined at any point in time?

Answer:

a. The lock can be released only after the update has been recorded at the
tail of the chain, since further reads will read the tail. Two-phase locking
may also have to be respected.
b. The overhead of recording chains per data item would be high. Even
more so, in case of failures, chains have to be updated, which would
have an even greater overhead if done per item.
c. All nodes in the chain have to agree on the chain membership and
order. Consensus can be used to ensure that updates to the chain are done
in a fault-tolerant manner. A fault-tolerant coordination service such as
ZooKeeper or Chubby could be used to ensure this consensus, by updating
metadata that is replicated using consensus; the coordination service
hides the details of consensus, and allows storage and update of (a
limited amount of) metadata.
23.10 If the primary copy scheme is used for replication, and the primary gets
disconnected from the rest of the system, a new node may get elected as primary.
But the old primary may not realize it has got disconnected, and may get
reconnected subsequently without realizing that there is a new primary.
a. What problems can arise if the old primary does not realize that a new
one has taken over?
b. How can leases be used to avoid these problems?
c. Would such a situation, where a participant node gets disconnected and
then reconnected without realizing it was disconnected, cause any
problem with the majority or quorum protocols?

Answer:

a. The old primary may receive read requests and reply to them, serving
old data that is missing subsequent updates.
b. Leases can be used so that at the end of the lease, the primary knows
that if it did not successfully renew the lease, it should stop serving
requests. If it is disconnected, it would be unable to renew the lease.
c. This situation would not cause a problem with the majority protocol
since the write set (or write quorum) and the read set (read quorum)
must have at least one node in common, which would serve the latest
value.
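The lease check described in part b can be sketched as follows. This is an
illustration only: the class LeasedPrimary and its methods are hypothetical, and
time is passed in explicitly as a number of seconds so that the example is
deterministic.

```python
class LeasedPrimary:
    """A primary that serves requests only while it holds an unexpired lease."""

    def __init__(self, lease_seconds, now):
        self.lease_seconds = lease_seconds
        self.lease_expiry = now + lease_seconds

    def renew(self, now, connected):
        # Renewal has to reach the rest of the system; a disconnected
        # primary cannot renew, so its lease simply runs out.
        if connected and now < self.lease_expiry:
            self.lease_expiry = now + self.lease_seconds
            return True
        return False

    def can_serve(self, now):
        return now < self.lease_expiry


old_primary = LeasedPrimary(lease_seconds=10, now=0)
renewed_at_5 = old_primary.renew(now=5, connected=True)     # lease runs until t=15
renewed_at_12 = old_primary.renew(now=12, connected=False)  # disconnected: fails
serves_at_14 = old_primary.can_serve(now=14)
serves_at_16 = old_primary.can_serve(now=16)
```

The old primary still answers at t=14 but refuses at t=16. The sketch assumes
that a new primary is not allowed to take over before the old lease has
expired; otherwise two primaries could serve requests at the same time.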

23.11 Consider a federated database system in which it is guaranteed that at most
one global transaction is active at any time, and every local site ensures local
serializability.

a. Suggest ways in which the federated database system can ensure that
there is at most one active global transaction at any time.
b. Show by example that it is possible for a nonserializable global schedule
to result despite the assumptions.

Answer:

a. We can have a special data item at some site on which a lock will have
to be obtained before starting a global transaction. The lock should be
released after the transaction completes. This ensures the single active
global transaction requirement. To reduce dependency on that particular
site being up, we can generalize the solution by having an election
scheme to choose one of the currently up sites to be the coordinator and
requiring that the lock be requested on the data item which resides on
the currently elected coordinator.
b. The following schedule involves two sites and four transactions. T1 and
T2 are local transactions, running at site 1 and site 2 respectively. TG1
and TG2 are global transactions running at both sites. X1, Y1 are data
items at site 1, and X2, Y2 are at site 2.

T1          T2          TG1         TG2
write(Y1)
                        read(Y1)
                        write(X2)
            read(X2)
            write(Y2)
                                    read(Y2)
                                    write(X1)
read(X1)

In this schedule, TG2 starts only after TG1 finishes. Within each site, there
is local serializability. In site 1, TG2 → T1 → TG1 is a serializability
order. In site 2, TG1 → T2 → TG2 is a serializability order. Yet the global
schedule is nonserializable.
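The claim can be checked mechanically by building the precedence graph for each
site and for the whole schedule. The following Python sketch assumes that each
operation belongs to the transaction implied by the serializability orders
stated above (for example, write(Y1) by T1 and read(Y1) by TG1); the function
names are hypothetical.

```python
schedule = [
    ("T1", "write", "Y1"), ("TG1", "read", "Y1"), ("TG1", "write", "X2"),
    ("T2", "read", "X2"), ("T2", "write", "Y2"), ("TG2", "read", "Y2"),
    ("TG2", "write", "X1"), ("T1", "read", "X1"),
]


def precedence_edges(ops):
    """Edge (A, B) if an operation of A conflicts with a later operation of B."""
    edges = set()
    for i, (t1, a1, x1) in enumerate(ops):
        for t2, a2, x2 in ops[i + 1:]:
            if t1 != t2 and x1 == x2 and "write" in (a1, a2):
                edges.add((t1, t2))
    return edges


def is_serializable(ops):
    """Conflict serializable iff the precedence graph is acyclic."""
    edges = precedence_edges(ops)
    nodes = {t for t, _, _ in ops}
    while nodes:
        # Nodes with no incoming edge from a remaining node can go first.
        free = {n for n in nodes
                if not any(v == n and u in nodes for u, v in edges)}
        if not free:
            return False  # every remaining node lies on or behind a cycle
        nodes -= free
    return True


site_1 = [op for op in schedule if op[2] in ("X1", "Y1")]
site_2 = [op for op in schedule if op[2] in ("X2", "Y2")]
local_ok = is_serializable(site_1) and is_serializable(site_2)
global_ok = is_serializable(schedule)
```

Each site's projection is acyclic, while the global graph contains the cycle
TG1 → T2 → TG2 → T1 → TG1.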
23.12 Consider a federated database system in which every local site ensures local
serializability, and all global transactions are read only.
a. Show by example that nonserializable executions may result in such a
system.
b. Show how you could use a ticket scheme to ensure global serializability.

Answer:

a. The same system as in the answer to Exercise 23.11 is assumed, except
that now both the global transactions are read-only. Consider the following
schedule:

T1          T2          TG1         TG2
                        read(X1)
write(X1)
                                    read(X1)
                                    read(X2)
            write(X2)
                        read(X2)

Though there is local serializability in both sites, the global schedule is
not serializable.

b. Since local serializability is guaranteed, any cycle in the systemwide
precedence graph must involve at least two different sites and two
different global transactions. The ticket scheme ensures that whenever two
global transactions access data at a site, they conflict on a data item (the
ticket) at that site. The global transaction manager controls ticket access
in such a manner that the global transactions execute with the same
serializability order in all the sites. Thus the chance of their participating
in a cycle in the systemwide precedence graph is eliminated.
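One way the global transaction manager could validate ticket orders is sketched
below. This is an assumption-laden illustration, not the book's algorithm: it
assumes each global transaction records the ticket value it obtained at each
site, and the function name ticket_order_is_serializable is hypothetical.

```python
def ticket_order_is_serializable(tickets):
    """tickets: {site: {global_txn: ticket value taken at that site}}.
    Returns True if all sites agree on one order of the global transactions."""
    edges = set()
    for per_site in tickets.values():
        ordered = sorted(per_site, key=per_site.get)
        # An edge between consecutive ticket holders is enough: the other
        # orderings at this site follow by transitivity.
        edges.update(zip(ordered, ordered[1:]))
    nodes = {t for per_site in tickets.values() for t in per_site}
    while nodes:
        free = {n for n in nodes
                if not any(v == n and u in nodes for u, v in edges)}
        if not free:
            return False  # a cycle: the sites disagree on the order
        nodes -= free
    return True


# Ticket orders matching a nonserializable execution: TG1 before TG2 at
# site 1, but TG2 before TG1 at site 2.
conflicting = {"site1": {"TG1": 1, "TG2": 2}, "site2": {"TG2": 1, "TG1": 2}}
agreeing = {"site1": {"TG1": 1, "TG2": 2}, "site2": {"TG1": 1, "TG2": 2}}
rejected = ticket_order_is_serializable(conflicting)
accepted = ticket_order_is_serializable(agreeing)
```

A transaction manager built along these lines would abort (or delay) a global
transaction whose tickets make the check fail.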
23.13 Suppose you have a large relation r(A, B, C) and a materialized view
v = _A γ_{sum(B)} (r), that is, sum(B) grouped by A. View maintenance can be
performed as part of each transaction that updates r, on a parallel/distributed
storage system that supports transactions across multiple nodes. Suppose the
system uses two-phase commit along with a consensus protocol such as Paxos,
across geographically distributed data centers.
a. Explain why it is not a good idea to perform view maintenance as part of
the update transaction, if some values of attribute A are “hot” at certain
points in time, that is, many updates pertain to those values of A.
b. Explain how operation locking (if supported) could solve this problem.
c. Explain the tradeoffs of using asynchronous view maintenance in this
context.

Answer:

This is a very bad idea from the viewpoint of throughput. Most transactions
would now update a few aggregate records, and updates would get serialized
on the lock. The problem is that, due to Paxos delays plus 2PC delays, commit
processing will take a long time (hundreds of milliseconds) and there would
be very high contention on the lock. Transaction throughput would decrease
to tens of transactions per second, even if transactions do not conflict on any
other items.
If the storage system supported operation locking, that could be an
alternative to improve concurrency, since view maintenance can be done using
operation locks that do not conflict with each other. Transaction throughput
would be greatly increased.
Asynchronous view maintenance would avoid the bottleneck and lead to
much better throughput, but at the risk of reads of the view seeing stale data.
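The throughput limit is simple arithmetic: if every transaction on a hot value
must hold the same exclusive lock until commit, at most one such transaction
can complete per lock-hold interval. The sketch below uses hypothetical
latencies (100 ms for a geo-distributed commit, 1 ms for a local one) chosen
only to show the order of magnitude.

```python
def max_serialized_throughput(lock_hold_ms):
    """Upper bound on transactions per second when every transaction must hold
    the same exclusive lock for lock_hold_ms milliseconds."""
    return 1000.0 / lock_hold_ms


hot_key_tps = max_serialized_throughput(100)  # geo-distributed commit
local_tps = max_serialized_throughput(1)      # commit within one node
```

With a 100 ms commit the bound is 10 transactions per second on the hot value,
against 1000 per second for a 1 ms commit.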
A2: Individual Assignment 2
Summary
Assignment goals
This assignment tests your knowledge in the following aspects:

• Developing web pages with valid HTML & correcting any HTML errors
• Organizing content in HTML and creating accessible web pages in HTML
• Styling web pages using CSS
• Ensuring separation of concerns between content, presentation, and behaviour layers in web
pages.

By completing this assignment, you stand to gain the following skills:

• Correcting HTML errors in web pages and developing valid and accessible web pages
• Styling web pages using CSS
• Validating HTML web pages and CSS stylesheets

Assignment overview
A2 requires you to correct errors in a set of web pages given to you and style these web pages to
make them appear as specified in this document.

What does "correcting errors" mean? The HTML code given to you has errors. Your work in this
assignment is to

• validate this HTML,


• correct all errors and warnings,
• make sure that the pages are working as specified and demonstrated in the reference, and
• style the web pages to make sure that they appear as specified in the reference PDFs.
Instructions
Step 1: Download the starter code
Download the starter code from Brightspace. This gives you all the files and folders, organized as
required into sub-folders, etc., that you are required to use to develop a solution for this assignment.

Step 2: Fix all HTML errors (and warnings)


Open the web pages in your browser and observe what you see.

Now, validate these web pages using the W3C Validator (use the "Validate by direct input" method to
copy/paste the code and validate).

Fix all HTML errors and warnings.

These errors and warnings indicate that the web pages given to you are not accessible, are not
optimized for searching on the web, and therefore, are not usable for the users.

After you have fixed the errors, you will see these web pages appear as shown in the reference images
folder.

Step 3: Foundations and styling


1. Create a folder named css (all lower-case letters) inside the folder named A2.
a. In this folder, create a file named main.css
b. This file (main.css) must include ALL style definitions for this assignment.
You MUST NOT use multiple CSS files -- just one.
2. Link the style file -- main.css -- in all your HTML files correctly.
3. First, style the fonts to appear the way you see in the reference PDFs in the final html folder:
a. Remember to use the selectors appropriately to set the fonts to all elements displayed
on the viewport.
4. Style the links as shown in the reference PDF.
a. Note that some links are styled differently than others. Be mindful of this aspect.

Step 4: Backgrounds, borders and spacing


1. All these styles must be applied using appropriate selectors in the external CSS file (main.css).
a. DO NOT use style attribute or the style element to style content in this assignment. This
will lead to a penalty as explained in the rubric.
2. Set the background colours and text/link colours to appear like what you see in the reference
PDF.
a. Make a reasonable assumption about the actual colour values.
b. The colour does not have to be the exact value -- if the colours are like the reference, it
is okay.
3. Set appropriate borders:
a. Some of the elements on the page have bottom borders. Based on our discussions in the
class and information you see in zyBooks, try to implement borders for these elements
as specified.
b. Make a reasonable assumption about the thickness of the border.
4. Space the elements appropriately:
a. Use margins and padding appropriately to space elements.
b. As you can see in the reference image, elements are spaced out from each other. This
spacing makes the content look neatly organized.
c. Assume specific spacing values between 1em and 3em and space the elements. You will
not need more than 3em of spacing between elements.
d. Remember to also use height and min-height properties appropriately to define the
heights of elements and containers.

Step 5: Floating and layout organization


All these styles must be applied using appropriate selectors in the external CSS file (main.css).

1. DO NOT use style attribute or the style element to style content in this assignment. This will lead
to a penalty as explained in the rubric.

As you can see in the reference PDFs, you are required to implement a simple layout structure for this
website.

1. The layout is different for the homepage, and for the article pages.
2. In the homepage,
a. Make the <aside> element with the people information below the header have an
approximate width and set it to appear in the centre.
b. Then, style the list inside the <aside> so that the list items appear next to each other,
with no "bullet" icons.
c. Style the "highlights" section to have about 70% of the width of its parent container and
the "covid-info" section to have the remaining width on the same line. These two
sections must appear next to each other.
d. Adjust the spacing and widths of the four "featured articles" in a way that they appear
next to each other, and they are properly spaced -- as in the reference.
e. Make sure that the "about" part of the page is properly spaced from the other content
parts on the page.
3. In the article pages,
a. Make sure that the header and footer are consistently styled as in the homepage.
b. Make sure that the articles are also consistently styled between all article pages.
c. If any images are used, style them so that each image is on the left-hand side and the
content is on the right side of the image. You MUST NOT use any additional elements like
<div> to create this appearance. Use the HTML given in the article pages (after you
fix errors) to style it correctly.
d. For the file named human-cyborg-comm.html, find an appropriate image and include it
with a reference. Style this to appear like the image you see in the file named
main-reactor.html.
4. FAQs: "Am I able to use CSS properties like Flexbox or Grids for this assignment?"
• No. You must only use the CSS properties we covered in lectures 8 and 9 (zybook chapters 4.1,
4.2, 4.3, 4.4, 4.6, 4.7, 4.11) -- this assignment is to help you understand the foundations of
styling and controlling element placement using CSS basics. We will use Flex and Grids in
future assignments.
• If you use anything other than the content covered in lectures 8 and 9 in this assignment,
you will lose 50% of your grade

Step 6: Validate your code


1. Validate ALL your HTML files - fix ALL errors and warnings.
a. Validate them via: https://validator.w3.org/#validate_by_input
b. Save the HTML validation files as PDF in the folder named files, and with the following
names:
i. W3C-validation-index.pdf
ii. W3C-validation-bad-motivator.pdf
iii. W3C-validation-human-cyborg-comm.pdf
iv. W3C-validation-main-reactor.pdf
v. W3C-validation-wookies.pdf
2. Validate your CSS file - fix ALL errors and warnings:
a. Validate it via: https://jigsaw.w3.org/css-validator/#validate_by_input
b. Save the CSS validation file as PDF in the folder named files, with the following name:
i. W3C-validation-main-CSS.pdf
3. To save these validation web pages, i.e., the output HTML pages from the Validator (the
validator results), as PDF, you can use the "print" option in your browser and save the page
as a PDF.

Step 7: No additional layout-related HTML - no <div>


This is not an option or a suggestion: DO NOT USE <div> elements in your code.

• You do not need to.


• If you do, you will lose 50% of your grade

Step 8: Organize your code correctly and use citations


1. Make sure that you organize your code well.
a. Your code must be readable and properly indented.
b. Use comments appropriately.
2. Do not use comments for each line of the code.
a. Use comments to describe parts of your web page and what each part does.
3. Citations:
a. If you have learned any concept from content other than those provided in the course,
include a citation in both the comments in the code and in the README.md file that
is in your submission.

Step 9: Organize your code into folders as specified


If you have created folders, etc., as specified above, your folder structure must look like one of the
following:

Folder structure for this website:

A2
|-- css
| |---- main.css
|-- img
| |---- <the two images used, with citations>
|-- articles
| |---- <four article HTML files>
|-- files
| |---- <W3C validation PDFs - for both HTML and CSS>
|
|-- README.md
|-- index.html

Step 10: Save the folder in .ZIP format


1. Compress the folder named A2 into ZIP format, i.e., to get A2.zip
2. Submit A2.zip on Brightspace, in the assignment submission dropbox.
Marking Rubric
Levels: Meets expectations (10 points), Moderately meets expectations (5 points),
Somewhat meets expectations (3 points), Does not meet expectations yet (0 points).

Criterion: Organize web pages as specified (Criterion Score: /10)
• Meets expectations: Demonstrates excellent understanding of web page layout styling
• Moderately meets expectations: Demonstrates good understanding of web page layout styling
• Somewhat meets expectations: Demonstrates some understanding of web page layout styling
• Does not meet expectations yet: Work does not meet expectations yet

Criterion: Implement layout styling using CSS Floats (Criterion Score: /10)
• Meets expectations: CSS floats, overflow and clear properties used appropriately to
create layout appearance as specified.
• Moderately meets expectations: Organizes most content aspects as specified, other layout
styling may be used (e.g., position) in addition to float
• Somewhat meets expectations: Organizes some content aspects as specified, uses many
other layouts styling (e.g., position, display - flex or grid)
• Does not meet expectations yet: Work does not meet expectations yet

Criterion: Style header, navigation and aside as in the reference images (Criterion Score: /10)
• Meets expectations: The three elements at the top part of the page appear and work as
specified in the sample.
• Moderately meets expectations: The three elements at the top part of the page appear and
work mostly as specified, some appearance may break/not work as specified.
• Somewhat meets expectations: The three elements at the top part of the page appear and
work somewhat as specified, many appearances may break/not work as specified.
• Does not meet expectations yet: Work does not meet expectations yet

Criterion: Use fonts, spacing and borders appropriately (Criterion Score: /10)
• Meets expectations: Fonts, spacing and borders used appropriately to display content
• Moderately meets expectations: Fonts, spacing and borders used mostly appropriately. In
some cases, spacing and/or borders may not be as in the reference.
• Somewhat meets expectations: Fonts, spacing and borders used somewhat appropriately. In
many cases, spacing and/or borders may not be as in the reference.
• Does not meet expectations yet: Work does not meet expectations yet

Criterion: Make code readable & include comments appropriately (Criterion Score: /10)
• Meets expectations: Code is readable, indented and comments are used as specified
• Moderately meets expectations: Code is mostly readable, indented and comments are used
as specified
• Somewhat meets expectations: Code is somewhat readable. Comments may or may not be
consistently used.
• Does not meet expectations yet: Work does not meet expectations yet

Criterion: Fix all HTML errors and warnings in the starter code (There are 23 in total)
(Criterion Score: /10)
• Meets expectations: All HTML errors and warnings in the starter code are fixed
• Moderately meets expectations: Most HTML errors and warnings are fixed (at least 15
errors are fixed)
• Somewhat meets expectations: Some HTML errors and warnings are fixed (at least 10
errors are fixed)
• Does not meet expectations yet: Work does not meet expectations yet
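
Levels: Meets expectations (5 points), Does not meet expectations yet (0 points).

Criterion: Use semantic elements and containers only for content sections; no generic
container (i.e., div) elements (Criterion Score: /5)
• Meets expectations: Included in the submission as expected
• Does not meet expectations yet: Generic containers used

Criterion: Make image(s) accessible (Criterion Score: /5)
• Meets expectations: Images used are accessible.
• Does not meet expectations yet: Images used are not accessible.

Criterion: Make content containers accessible (Criterion Score: /5)
• Meets expectations: Semantic elements/containers are accessible.
• Does not meet expectations yet: Semantic containers are not accessible.

Criterion: Style image in CSS (Criterion Score: /5)
• Meets expectations: Image is styled using CSS, not in HTML
• Does not meet expectations yet: Image is styled in HTML, not in CSS.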

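Levels: Meets expectations (5 points), Does not meet expectations yet (0 points).

Criterion: Only one CSS style file used to style pages. No use of style attribute or
style element. (Criterion Score: /5)
• Meets expectations: Included in the submission as expected
• Does not meet expectations yet: Multiple style files used, or style attributes used.

Criterion: Use valid HTML (Criterion Score: /5)
• Meets expectations: HTML is valid
• Does not meet expectations yet: HTML is not valid

Criterion: Use valid CSS (Criterion Score: /5)
• Meets expectations: CSS is valid
• Does not meet expectations yet: CSS is not valid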

Levels: Meets expectations (10 points), Does not meet expectations yet (0 points).

Criterion: Citations and notes (Criterion Score: /5)
• Meets expectations: Includes citations for any images used.
If no images used, then includes a note in the readme file saying as such.
If images are used and are generated or created by the student, includes a note
in the readme file that says that the images were created or developed by
the student and are the original work of the student.
• Does not meet expectations yet: Citations or note not included.

Total / 100
