9 Design Theory

The document discusses foundational concepts in database design theory including functional dependencies, normal forms, and decompositions. It covers topics such as the goals of normalization including removing redundancy and expressing constraints, rules for functional dependencies including splitting/combining and transitivity, identifying keys and prime attributes, and using closure tests to determine if a functional dependency is implied by a set of given dependencies. The overall document provides an introduction to key theoretical concepts for systematically improving database schemas through normalization.


Design Theory for Relational DBs:

Functional Dependencies,
Decompositions, Normal Forms
Introduction to Databases
Manos Papagelis

Thanks to Ryan Johnson, John Mylopoulos, Arnold Rosenbloom and Renee Miller for material in these slides

Database Design Theory


• Guides systematic improvements to database schemas
• General idea:
– Express constraints on the data
– Use these to decompose the relations
• Ultimately, get a schema that is in a “normal form”
– guarantees certain desirable properties
– “normal” in the sense of conforming to a standard
• The process of converting a schema to a normal form is called
normalization


Goal #1: remove redundancy


• Consider this schema
Student Name | Student Email  | Course | Instructor
Xiao         | xiao@gmail     | CSC333 | Smith
Xiao         | xiao@gmail     | CSC444 | Brown
Jaspreet     | jaspreet@gmail | CSC333 | Smith

• What if…
– Xiao changes email addresses? (update anomaly)
– Xiao drops CSC444? (deletion anomaly)
– Need to create a new course, CSC222 (insertion anomaly)

Multiple relations => exponentially worse



Goal #2: expressing constraints


• Consider the following sets of schemas:
Students(utorid, name, email)
vs.
Students(utorid, name)
Emails(utorid, address)
• Consider also:
House(street, city, value, owner, propertyTax)
vs.
House(street, city, value, owner)
TaxRates(city, value, propertyTax)

Dependencies, constraints are domain-dependent


Overview
• Part I: Functional Dependencies
• Part II: Decompositions
• Part III: Normal Forms

PART I:
FUNCTIONAL DEPENDENCIES

Functional dependencies
• Let X, Y be sets of attributes from relation R
• X -> Y is an assertion about tuples in R
– Any tuples in R which agree in all attributes of X must also agree in all
attributes of Y
• “X functionally determines Y”
– Or, “The values of attributes Y are a function of those in X”
– Not necessarily an easy function to compute, mind you
=> Consider X -> h, where h is the hash of attributes in X
• Notational conventions
– “a”, “b”, “c” – specific attributes
– “A”, “B”, “C” – sets of (unnamed) attributes
– abc -> def – same as {a,b,c} -> {d,e,f}

Most common to see singletons (X -> y or abc -> d)
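
The definition above can be checked mechanically against one relation instance. The following is a minimal sketch, assuming tuples are stored as Python dicts; the helper name fd_holds and the sample rows are hypothetical and not part of the slides. A passing check only shows that the instance does not violate the FD.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores val for a new key, otherwise returns the stored value
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical instance of a relation R(x, y, z)
rows = [
    {"x": 1, "y": 2, "z": 3},
    {"x": 1, "y": 2, "z": 3},
    {"x": 1, "y": 5, "z": 6},
]
xy_determines_z = fd_holds(rows, ["x", "y"], ["z"])
x_determines_z = fd_holds(rows, ["x"], ["z"])
```

Here xy -> z is not violated, while x -> z is violated by the first and third rows.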



Rules and principles about FDs


• Rules
– The splitting/combining rule
– Trivial FDs
– The transitive rule
• Algorithms related to FDs
– the closure of a set of attributes of a relation
– a minimal basis of a relation

The Splitting/Combining rule of FDs


• Attributes on right independent of each other
– Consider a,b,c -> d,e,f
– “Attributes a, b, and c functionally determine d, e, and f”
=> No mention of d relating to e or f directly
• Splitting rule (Useful to split up right side of FD)
– abc -> def becomes abc -> d, abc -> e and abc -> f
• No safe way to split left side
– abc -> def is NOT the same as ab -> def and c -> def!
• Combining rule (Useful to combine right sides):
– if abc -> d, abc -> e, abc -> f holds, then abc -> def holds
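
The two rules above can be sketched as a pair of small functions. This is an illustration only, assuming single-letter attribute names so that a string such as "abc" stands for the set {a,b,c}; the names split_fd and combine_fds are hypothetical.

```python
def split_fd(lhs, rhs):
    """Splitting rule: one FD per attribute on the right side."""
    return [(frozenset(lhs), a) for a in sorted(rhs)]

def combine_fds(singletons):
    """Combining rule: merge FDs that share the same left side."""
    combined = {}
    for lhs, a in singletons:
        combined.setdefault(lhs, set()).add(a)
    return combined

singletons = split_fd("abc", "def")   # abc -> d, abc -> e, abc -> f
combined = combine_fds(singletons)    # back to abc -> def
```

Note that neither function ever touches the left side, which matches the warning that the left side cannot be split safely.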

Splitting FDs – example


• Consider the relation and FD
– EmailAddress(user, domain, firstName, lastName)
– user, domain -> firstName, lastName
• The following hold
– user, domain -> firstName
– user, domain -> lastName
• The following do NOT hold!
– user -> firstName, lastName
– domain -> firstName, lastName

Gotcha: “doesn’t hold” = “not all tuples” != “all tuples not”



Trivial FDs
• Not all functional dependencies are useful
– A -> A always holds
– abc -> a also always holds (right side is subset of left side)
• FD with an attribute on both sides is “trivial”
– Simplify by removing L ∩ R from R
abc -> ad becomes abc -> d
– Or, in singleton form, delete trivial FDs
abc -> a and abc -> d becomes just abc -> d
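
The simplification above amounts to a set difference. A minimal sketch, assuming single-letter attributes and taking "trivial" in the usual sense that the whole right side is contained in the left side; the function names are hypothetical.

```python
def simplify(lhs, rhs):
    """Remove L ∩ R from R, leaving only the non-trivial part of the FD."""
    lhs, rhs = set(lhs), set(rhs)
    return lhs, rhs - lhs

def is_trivial(lhs, rhs):
    """An FD whose right side is a subset of its left side always holds."""
    return set(rhs) <= set(lhs)

simplified = simplify("abc", "ad")   # abc -> ad becomes abc -> d
```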

Transitive rule
• The transitive rule holds for FDs
– Consider the FDs: a -> b and b -> c; then a->c holds
– Consider the FDs: ad -> b and b -> cd; then ad->cd holds or
just ad->c (because of the trivial dependency rule)

Identifying functional dependencies


• FDs are domain knowledge
– Intrinsic features of the data you’re dealing with
– Something you know (or assume) about the data
• Database engine cannot identify FDs for you
– Designer must specify them as part of schema
– DBMS can only enforce FDs when told to
• DBMS cannot safely “optimize” FDs either
– It has only a finite sample of the data
– An FD constrains the entire domain

Coincidence or FD?
ID   | Email             | City     | Country | Surname
1983 | [email protected] | Toronto  | Canada  | Fairgrieve
8624 | [email protected] | London   | Canada  | Samways
9141 | [email protected] | Winnipeg | Canada  | Samways
1204 | [email protected] | Aachen   | Germany | Lakemeyer

• What if we try to infer FDs from the data?


– ID -> email, city, country, surname
– email -> city, country, surname
– city -> country
– surname -> country

Domain knowledge required to validate FDs
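
To see why a sample cannot validate an FD, the check can be run on the table above and then again after one more row arrives. This is a sketch under stated assumptions: the Email column is left out, the helper fd_holds is hypothetical, and the extra row is invented purely for illustration.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

sample = [
    {"ID": 1983, "City": "Toronto", "Country": "Canada", "Surname": "Fairgrieve"},
    {"ID": 8624, "City": "London", "Country": "Canada", "Surname": "Samways"},
    {"ID": 9141, "City": "Winnipeg", "Country": "Canada", "Surname": "Samways"},
    {"ID": 1204, "City": "Aachen", "Country": "Germany", "Surname": "Lakemeyer"},
]
# Hypothetical new tuple: another London, another Samways
extra = {"ID": 7001, "City": "London", "Country": "United Kingdom", "Surname": "Samways"}

surname_before = fd_holds(sample, ["Surname"], ["Country"])
surname_after = fd_holds(sample + [extra], ["Surname"], ["Country"])
city_before = fd_holds(sample, ["City"], ["Country"])
city_after = fd_holds(sample + [extra], ["City"], ["Country"])
```

Both surname -> country and city -> country are consistent with the four sample rows, and both break as soon as the invented row is added.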



Keys and FDs


• Consider relation R with attributes A
• Superkey
– Any S ⊆ A s.t. S -> A
=> Any subset of A which determines all remaining attributes in A
• Candidate key (or key)
– C ⊆ A s.t. C -> A and X -> A does not hold for any X ⊂ C
=> A superkey which contains no other superkeys
=> Remove any attribute and you no longer have a key
• Primary key
– The candidate key we use to identify the relation
=> Always exists, only one allowed, doesn’t matter which C we use
• Prime attribute
– Attribute x s.t. ∃ candidate key C with x ∈ C (attribute that participates in at least one key)

FD: relaxes the concept of a “key”


• Functional dependency: X -> Y
• Superkey: X -> R
• A superkey must include all remaining attributes
of the relation on the RHS (Right-Hand-Side)
• An FD can involve just a subset of them
• Example:
Houses(street, city, value, owner, tax)
– street,city -> value,owner,tax (both FD and key)
– city,value -> tax (FD only)
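
The superkey and candidate key definitions can be tested by brute force on the Houses example. The sketch below assumes the two FDs listed above are the only ones; the functions closure and candidate_keys are hypothetical helpers, and the search is exponential in the number of attributes.

```python
from itertools import combinations

def closure(attrs, fds):
    """All attributes determined by attrs under fds (list of (lhs, rhs) sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(attributes, fds):
    """Superkeys that contain no smaller superkey, found smallest first."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            s = set(combo)
            if any(k <= s for k in keys):
                continue  # contains a key already: a superkey, but not minimal
            if closure(s, fds) == attributes:
                keys.append(s)
    return keys

houses = {"street", "city", "value", "owner", "tax"}
fds = [
    ({"street", "city"}, {"value", "owner", "tax"}),
    ({"city", "value"}, {"tax"}),
]
keys = candidate_keys(houses, fds)
```

Under these assumptions the only candidate key is {street, city}, so street and city are the prime attributes, while city,value determines only tax.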

Cyclic functional dependencies?


• Attributes on right side of one FD may appear
on left side of another!
– Simple example: assume relation (A, B) & FDs: A -> B, B -> A
– What does this say about A and B?
• Example
– studentID -> email, email -> studentID

Geometric view of FDs


• Let D be the domain of tuples in R
– Every possible tuple is a point in D
• FD X on R restricts tuples in R to a subset of D
– Points in D which violate X cannot be in R
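• Example: D(x,y,z)
– xy -> z
=> z = abs(x) + abs(y)
– z -> x,y
=> x=y=abs(z)/2
[Figure: sample points in D. Satisfy both functions: (1, 1, 2), (2, 2, 4), (0, 0, 0). Satisfy only z = abs(x) + abs(y): (-1, -1, 2), (1, 2, 3). Satisfy only x=y=abs(z)/2: (1, 1, -2), (2, 2, -4). Satisfy neither: (1,1,0), (0,0,1), (1,-1,-2), (3,2,1).]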

Inferring functional dependencies


• Problem
– Given FDs X1 -> a1, X2 -> a2, etc.
– Does some FD Y -> B (not given) also hold?
• Consider the dependencies
A -> B, B -> C
Does A -> C hold?

Intuitively, A -> C also holds


The given FDs entail (imply) it (transitivity rule)

How to prove it in the general case?



Closure test for FDs


• Given attribute set A and FD set F
– Denote AF+ as the closure of A relative to F
=> AF+ = set of all attributes functionally determined by A under F
• Computing the [transitive] closure of A
– Start: AF+ = A, F’ = F
– While ∃ X ∈ F’ s.t. LHS(X) ⊆ AF+ :
AF+ = AF+ ∪ RHS(X)
F’ = F’ - X
– At end: A -> B ⇔ B ⊆ AF+
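
The closure loop above translates almost line for line into code. This is a minimal sketch, assuming single-letter attribute names so that a string stands for an attribute set; the function name closure and the tuple encoding of FDs are choices made here, not part of the slides.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) strings."""
    result = set(attrs)      # Start: closure = A
    remaining = list(fds)    # F' = F
    while True:
        # FDs in F' whose left side is already inside the closure
        usable = [fd for fd in remaining if set(fd[0]) <= result]
        if not usable:
            return result
        lhs, rhs = usable[0]
        result |= set(rhs)           # closure = closure ∪ RHS(X)
        remaining.remove(usable[0])  # F' = F' - X

fds = [("A", "B"), ("B", "C")]
a_plus = closure("A", fds)
b_plus = closure("B", fds)
```

Since C is in the closure of A, the FD A -> C is entailed; since A is not in the closure of B, B -> A is not.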

Closure test – example


• Consider R(a,b,c,d,e,f)
with FDs ab -> c, ac -> d, c -> e, ade -> f
• Find A+ if A = ab or find {a,b}+

– Start: {a,b}
– ab -> c adds c: {a,b,c}
– ac -> d adds d: {a,b,c,d}
– c -> e adds e: {a,b,c,d,e}
– ade -> f adds f: {a,b,c,d,e,f}

{a,b}+={a,b,c,d,e,f} or ab -> cdef -- ab is a candidate key!



Example : Closure Test


R(A, B, C, D, E)
F: AB -> C, A -> D, D -> E, AC -> B

X  | XF+
A  | {A, D, E}
AB | {A, B, C, D, E}
AC | {A, C, B, D, E}
B  | {B}
D  | {D, E}

Is AB -> E entailed by F? Yes
Is D -> C entailed by F? No

Result: XF+ allows us to determine all FDs of the form X -> Y entailed by F
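
The closure table and the two entailment questions above can be reproduced with a few lines of code. A sketch only, assuming single-letter attributes written as strings; closure and entails are hypothetical helper names.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def entails(fds, lhs, rhs):
    """lhs -> rhs follows from fds exactly when rhs is inside the closure of lhs."""
    return set(rhs) <= closure(lhs, fds)

F = [("AB", "C"), ("A", "D"), ("D", "E"), ("AC", "B")]
table = {x: closure(x, F) for x in ["A", "AB", "AC", "B", "D"]}
ab_e = entails(F, "AB", "E")
d_c = entails(F, "D", "C")
```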

Discarding redundant FDs


• Minimal basis: opposite extreme from closure
• Given a set of FDs F, want to minimize F’ s.t.
– F’ ⊆ F
– F’ entails X ∀ X ∈ F
• Properties of a minimal basis F’
– RHS is always singleton
– If any FD is removed from F’, F’ is no longer a minimal basis
– If for any FD in F’ we remove one or more attributes from
the LHS of X ∈ F’, the result is no longer a minimal basis

Constructing a minimal basis


• Straightforward but time-consuming
1. Split all RHS into singletons
2. ∀ X ∈ F’, test whether J = (F’-X)+ is still equivalent to F+

=> Might make F’ too small


3. ∀ i ∈ LHS(X), ∀ X ∈ F’, let LHS(X’)=LHS(X)-i
Test whether (F’-X+X’)+ is still equivalent to F+
=> Might make F’ too big
4. Repeat (2) and (3) until neither makes progress

Minimal Basis: Example


• Relation R: R(A, B, C, D)
• Defined FDs:
– F = {A->AC, B->ABC, D->ABC}

Find the minimal Basis M of F



Minimal Basis: Example (cont.)


1st Step
– H = {A->A, A->C, B->A, B->B, B->C, D->A, D->B, D->C}
2nd Step
– A->A, B->B: can be removed as trivial
– A->C: can’t be removed, as there is no other LHS with A
– B->A: can’t be removed, because for J=H-{B->A} is B+=BC
– B->C: can be removed, because for J=H-{B->C} is B+=ABC
– D->A: can be removed, because for J=H-{D->A} is D+=DBA
– D->B: can’t be removed, because for J=H-{D->B} is D+=DC
– D->C: can be removed, because for J=H-{D->C} is D+=DBAC
Step outcome => H = {A->C, B->A, D->B}

Minimal Basis: Example (cont.)


3rd Step
– H doesn’t change as all LHS in H are single attributes
4th Step
– H doesn’t change

Minimal Basis: M = H = {A->C, B->A, D->B}
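
The four steps of this example can be sketched in code. This is one possible implementation, assuming single-letter attributes and FDs given as (lhs, rhs) strings; minimal_basis and closure are hypothetical names. The result depends on the order in which FDs are tried, so other inputs may yield a different but equally valid minimal basis.

```python
def closure(attrs, fds):
    """Closure of attrs under singleton FDs given as (frozenset lhs, attribute)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, a in fds:
            if lhs <= result and a not in result:
                result.add(a)
                changed = True
    return result

def minimal_basis(fds):
    # Step 1: split right sides into singletons, dropping trivial FDs
    basis = [(frozenset(l), a) for l, r in fds for a in r if a not in l]
    basis = list(dict.fromkeys(basis))  # remove duplicates, keep order
    changed = True
    while changed:  # Step 4: repeat until neither step makes progress
        changed = False
        # Step 2: drop an FD if the remaining FDs still imply it
        for fd in list(basis):
            rest = [f for f in basis if f != fd]
            if fd[1] in closure(fd[0], rest):
                basis, changed = rest, True
        # Step 3: drop a left-side attribute if the FD still follows without it
        for fd in list(basis):
            lhs, a = fd
            for i in sorted(lhs):
                smaller = lhs - {i}
                if smaller and a in closure(smaller, basis):
                    basis[basis.index(fd)] = (smaller, a)
                    changed = True
                    break
        basis = list(dict.fromkeys(basis))
    return basis

F = [("A", "AC"), ("B", "ABC"), ("D", "ABC")]
M = minimal_basis(F)
```

On this input the sketch removes B->C, D->A and D->C in step 2 and leaves {A->C, B->A, D->B}, the same M as above.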



Minimal Basis: Example 2


• Relation R: R(A, B, C)
• Defined FDs:
– A->B, A->C, B->C, B->A, C->A, C->B
– AB->C, AC->B, BC->A
– A->BC
– A->A
• Possible Minimal Bases:
– {A->B, B->A, B->C, C->B} or
– {A->B, B->C, C->A}
– …

PART II:
SCHEMA DECOMPOSITION

FDs and redundancy


• Given relation R and FDs F
– R often exhibits anomalies due to redundancy
– F identifies many (not all) of the underlying problems
• Idea
– Use F to identify “good” ways to split relations
– Split R into 2+ smaller relations having less redundancy
– Split up F into subsets which apply to the new relations
(compute the projection of functional dependencies)

Schema decomposition
• Given relation R and FDs F
– Split R into Ri s.t. ∀i Ri ⊆ R (no new attributes)
– Split F into Fi s.t. ∀i F entails Fi (no new FDs)
– Fi involves only attributes in Ri
• Caveat: entirely possible to lose information
– F+ may entail FD X which is not in (Ui Fi)+
=> Decomposition lost some FDs
– Possible to have R ≠ ⋈i Ri
=> Decomposition lost some relationships
• Goal: minimize anomalies without losing info
We’ll revisit information loss later

Splitting relations – example


• Consider the following relation:
Student Name | Student Email  | Course | Instructor
Xiao         | xiao@gmail     | CSC333 | Smith
Xiao         | xiao@gmail     | CSC444 | Brown
Jaspreet     | jaspreet@gmail | CSC333 | Smith
• One possible decomposition
– Students(email, name)
Taking(studentEmail, courseName)
Courses(name, instructor)

Gotcha: lossy join decomposition


• Consider a relation with one more tuple
Student Name | Student Email  | Course | Instructor
Xiao         | xiao@gmail     | CSC333 | Smith
Xiao         | xiao@gmail     | CSC444 | Brown
Jaspreet     | jaspreet@gmail | CSC333 | Smith
Mary         | mary@gmail     | CSC444 | Rosenburg

• Students ⋈ Taking ⋈ Courses has bogus tuples!


– Mary is not taking Brown’s section of CSC444
– Xiao is not in Rosenburg’s section of CSC444
Why did this happen? How to prevent it?
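
The bogus tuples can be reproduced by projecting the four-tuple relation onto the three schemas and joining the pieces back together. A minimal sketch using Python sets of tuples; the variable names are illustrative.

```python
# Original relation: (name, email, course, instructor)
rows = {
    ("Xiao", "xiao@gmail", "CSC333", "Smith"),
    ("Xiao", "xiao@gmail", "CSC444", "Brown"),
    ("Jaspreet", "jaspreet@gmail", "CSC333", "Smith"),
    ("Mary", "mary@gmail", "CSC444", "Rosenburg"),
}

# Project onto Students(email, name), Taking(studentEmail, courseName),
# Courses(name, instructor)
students = {(email, name) for name, email, _, _ in rows}
taking = {(email, course) for _, email, course, _ in rows}
courses = {(course, instr) for _, _, course, instr in rows}

# Natural join of the three projections
joined = {
    (name, email, course, instr)
    for email, name in students
    for t_email, course in taking if t_email == email
    for c_course, instr in courses if c_course == course
}
bogus = joined - rows   # tuples the join invents
```

The join contains every original tuple plus two extra ones, because course alone does not determine the instructor once CSC444 has two sections.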

Information loss with decomposition


• Decompose R into S and T
– Consider FD a->b, with a only in S and b only in T
• FD loss
– Attributes a and b no longer in same relation
=> Must join T and S to enforce a->b (expensive)
• Join loss
– LHS and RHS no longer in same relation, no other connection
– Neither (S ∩ T) -> S nor (S ∩ T) -> T in F+
=> Joining T and S produces bogus tuples (irreparable)
• In our example:
– ({email,course} ∩ {course,instructor}) = {course}
– course -/-> instructor and course -/-> email
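
The join-loss condition above gives a simple test for a two-way split: compute the closure of the shared attributes and see whether it covers S or T. A sketch only; is_lossless and closure are hypothetical helpers, and the FD course -> instructor in the second call is an assumption added for contrast, not something the example relation satisfies.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_lossless(s, t, fds):
    """True if (S ∩ T) -> S or (S ∩ T) -> T follows from fds."""
    reach = closure(set(s) & set(t), fds)
    return set(s) <= reach or set(t) <= reach

S = {"email", "course"}
T = {"course", "instructor"}

as_in_example = is_lossless(S, T, [])
# Hypothetical: if each course had exactly one instructor
with_course_fd = is_lossless(S, T, [({"course"}, {"instructor"})])
```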

Projecting FDs
• Once we’ve split a relation we have to refactor
our FDs to match
– Each FD must only mention attributes from one relation
• Similar to geometric projection
– Many possible projections (depends on how we slice it)
– Keep only the ones we need (minimal basis)

FD projection algorithm
• Start with Fi = Ø
• For each subset X of Ri
– Compute X+
– For each attribute a in X+
• If a is in Ri
– add X -> a to Fi

• Compute the minimal basis of Fi


• Projection is expensive
– Suppose R1 has n attributes
– How many subsets of R1 are there?
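
The projection algorithm above can be sketched as follows. Assumptions: single-letter attributes written as strings, a hypothetical function name project, and no final minimal-basis step. The loop visits on the order of 2^n subsets for n attributes, which is why projection is expensive.

```python
from itertools import combinations

def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def project(fds, attrs):
    """FDs X -> a with X and a inside attrs, as (frozenset X, a) pairs."""
    attrs = set(attrs)
    projected = []
    # Skip the empty set and the full attribute set: neither yields anything new
    for size in range(1, len(attrs)):
        for combo in combinations(sorted(attrs), size):
            x = set(combo)
            # Keep only attributes of this relation, and skip trivial FDs
            for a in sorted((closure(x, fds) & attrs) - x):
                projected.append((frozenset(x), a))
    return projected

fds = [("A", "B"), ("B", "C")]   # FDs on R(A, B, C)
onto_ac = project(fds, "AC")
onto_bc = project(fds, "BC")
```

Projecting onto {A, C} keeps A -> C even though it is not among the given FDs, because it follows from them by transitivity.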

Making projection more efficient


• Ignore trivial dependencies
– No need to add X -> A if A is in X itself
• Ignore trivial subsets
– The empty set or the set of all attributes (both are subsets of Ri)
• Ignore supersets of X if X + = R
– They can only give us “weaker” FDs (with more on the LHS)


Example: Projecting FD’s


• ABC with FD’s A->B and B->C
– A +=ABC ; yields A->B, A->C
• We ignore A->A as trivial
• We ignore the supersets of A, AB + and AC +, because they can only give us
“weaker” FDs (with more on the LHS)
– B +=BC ; yields B->C
– C +=C ; yields nothing.
– BC +=BC ; yields nothing.


Example -- Continued
• Resulting FD’s: A->B, A->C, and B->C
• Projection onto AC : A->C
– Only FD that involves a subset of {A,C}
• Projection on BC: B->C
– Only FD that involves subset of {B, C}


PART III:
NORMAL FORMS

Motivation for normal forms


• Identify a “good” schema
– For some definition of “good”
– Avoid anomalies, redundancy, etc.
• Many normal forms
– 1st
– 2nd
– 3rd
– Boyce-Codd
– ... and several more we won’t discuss…

BCNF ⊆ 3NF ⊆ 2NF ⊆ 1NF (focus on 3NF/BCNF)



1st normal form (1NF)


• No multi-valued attributes allowed
– Imagine storing a list/set of things in an attribute
=> Not really even expressible in RA
• Counterexample
– Course(name, instructor, [student,email]*)
– Redundancy in non-list attributes

Name   | Instructor | Student Name | Student Email
CSCC43 | Johnson    | Xiao         | xiao@gmail
       |            | Jaspreet     | jaspreet@utsc
       |            | Mary         | mary@utsc
CSCD08 | Rosenburg  | Jaspreet     | jaspreet@utsc

2nd normal form (2NF)


• Non-prime attributes depend on candidate keys
– Consider non-prime (i.e., not part of a key) attribute ‘a’
– Then ∃ FD X s.t. X -> a and X is a candidate key
• Counterexample
– Movies(title, year, star, studio, studioAddress, salary)
– FD: title, year -> studio; studio -> studioAddress; star->salary
Title         | Year | Star   | Studio    | StudioAddr  | Salary
Star Wars     | 1977 | Hamill | Lucasfilm | 1 Lucas Way | $100,000
Star Wars     | 1977 | Ford   | Lucasfilm | 1 Lucas Way | $100,000
Star Wars     | 1977 | Fisher | Lucasfilm | 1 Lucas Way | $100,000
Patriot Games | 1992 | Ford   | Paramount | Cloud 9     | $2,000,000
Last Crusade  | 1989 | Ford   | Lucasfilm | 1 Lucas Way | $1,000,000

3rd normal form (3NF)


• Non-prime attr. depend only on candidate keys
– Consider FD X -> a
– Either a ∈ X OR X is a superkey OR a is prime (part of a key)
=> No transitive dependencies allowed
• Counterexample:
– studio -> studioAddr
(studioAddr depends on studio which is not a candidate key)

Title         | Year | Studio    | StudioAddr
Star Wars     | 1977 | Lucasfilm | 1 Lucas Way
Patriot Games | 1992 | Paramount | Cloud 9
Last Crusade  | 1989 | Lucasfilm | 1 Lucas Way
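
The three-way 3NF condition can be checked against a list of FDs. The sketch below assumes the relation is Movies(title, year, studio, studioAddr) with FDs title,year -> studio and studio -> studioAddr and the single candidate key {title, year}; the function names are hypothetical, and only the listed FDs are examined, not every FD they imply.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def third_nf_violations(attributes, fds, candidate_keys):
    """Pairs (X, a) where a is not in X, X is no superkey and a is not prime."""
    prime = set().union(*candidate_keys)
    violations = []
    for lhs, rhs in fds:
        if closure(lhs, fds) == attributes:
            continue  # X is a superkey
        for a in sorted(rhs - lhs):  # rhs - lhs skips the trivial part
            if a not in prime:
                violations.append((lhs, a))
    return violations

movies = {"title", "year", "studio", "studioAddr"}
fds = [
    ({"title", "year"}, {"studio"}),
    ({"studio"}, {"studioAddr"}),
]
keys = [{"title", "year"}]
violations = third_nf_violations(movies, fds, keys)
```

The only violation reported is studio -> studioAddr, the transitive dependency named in the counterexample.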

3NF, dependencies, and join loss


• Theorem: always possible to convert a schema to join-
lossless, dependency-preserving 3NF
• Caveat: always possible to create schemas in 3NF for
which these properties do not hold
• Example 1 (FD loss):
– MovieInfo(title, year, studioName)
– StudioAddress(title, year, studioAddress)
=> Cannot enforce studioName -> studioAddress
• Example 2 (join loss):
– Movies(title, year, star)
– StarSalary(star, salary)
=> Movies ⋈ StarSalary yields bogus tuples (irreparable)

Boyce-Codd normal form (BCNF)


• One additional restriction over 3NF
– All non-trivial FDs have superkey LHS
• Counterexample
– CanadianAddress(street, city, province, postalCode)
– Candidate keys: {street, postalCode}, {street, city, province}
– FD: postalCode -> city, province

– Satisfies 3NF: city, province both prime (each is part of a candidate key)


– Violates BCNF: postalCode is not a superkey
=> Possible anomalies involving postalCode
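
The difference between the two normal forms on this counterexample can be checked directly. A sketch under stated assumptions: the FD street,city,province -> postalCode is added because it follows from the second candidate key, the candidate keys are taken as given above, and all helper names are hypothetical.

```python
def closure(attrs, fds):
    """Closure of attrs under fds, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

address = {"street", "city", "province", "postalCode"}
fds = [
    ({"postalCode"}, {"city", "province"}),
    # Assumed: implied by the candidate key {street, city, province}
    ({"street", "city", "province"}, {"postalCode"}),
]
keys = [{"street", "postalCode"}, {"street", "city", "province"}]
prime = set().union(*keys)

# BCNF: every non-trivial FD needs a superkey on the left side
bcnf_bad = [
    (lhs, rhs) for lhs, rhs in fds
    if not rhs <= lhs and closure(lhs, fds) != address
]
# 3NF additionally tolerates FDs whose right-side attributes are all prime
third_nf_bad = [(lhs, rhs) for lhs, rhs in bcnf_bad if not (rhs - lhs) <= prime]
```

The FD postalCode -> city, province fails the BCNF test but passes the 3NF test, because city and province each belong to a candidate key.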

Do we care? How often do postal codes change?



Limits of decomposition
• Pick two…
– Lossless join
– Dependency preservation
– Anomaly-free
• 3NF
– Always allows join lossless and dependency preserving
– May allow some anomalies
• BCNF
– Always excludes anomalies
– May give up one of join lossless or dependency preserving

Use domain knowledge to choose 3NF vs. BCNF
