Functional dependencies, decompositions, normal forms
Introduction to databases
CSCC43 Winter 2011
Ryan Johnson
Thanks to Arnold Rosenbloom and Renee Miller for material in these slides

Database Design Theory
• Guides systematic improvements to database schemas
• General idea:
– Express constraints on the data
– Use these to decompose the relations
• Ultimately, get a schema that is in a “normal form” that guarantees certain desirable properties
• “Normal” in the sense of conforming to a standard
• The process of converting a schema to a normal form is called normalization

Goal #1: redundancy, redundancy
• Consider this schema
Student Name  Student Email  Course  Instructor
Xiao          xiao@gmail     CSCC43  Johnson
Xiao          xiao@gmail     CSCD08  Bretscher
Jaspreet      jaspreet@utsc  CSCC43  Johnson
• What if…
– Xiao changes email addresses? (update anomaly)
– Xiao drops CSCD08? (deletion anomaly)
– UTSC creates a new course, CSCC44 (insertion anomaly)
Multiple relations => exponentially worse

Goal #2: expressing constraints
• Consider the following sets of schemas:
Students(utorid, name, email)
vs.
Students(utorid, name)
Emails(utorid, address)
• Consider also:
House(street, city, value, owner, propertyTax)
vs.
House(street, city, value, owner)
TaxRates(city, value, propertyTax)
Dependencies, constraints are domain-dependent

Part I: Functional dependencies

Functional dependencies
• Let X, Y be sets of attributes from relation R
• X -> Y is an assertion about tuples in R
– Any tuples which agree in all attributes of X must also agree in all attributes of Y
• “X functionally determines Y”
– Or, “The values of attributes Y are a function of those in X”
– Not necessarily an easy function to compute, mind you
=> Consider X -> h, where h is the hash of attributes in X
• Notational conventions
– “a”, “b”, “c” – specific attributes
– “A”, “B”, “C” – sets of (unnamed) attributes
– abc -> def – same as {a,b,c} -> {d,e,f}
Most common to see singletons (X -> y or abc -> d)
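
The definition of X -> Y can be checked mechanically against one instance of a relation. The following is a minimal Python sketch, not part of the original slides: the helper name `fd_holds` and the dict-per-tuple representation are assumptions made for illustration. It only tells us whether an FD holds on the tuples we have, not whether it holds for the whole domain.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if all rows that agree on lhs also agree on rhs.

    rows: list of dicts (one dict per tuple); lhs, rhs: lists of attribute names.
    Hypothetical helper for illustration only.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first row with this lhs value fixes the expected rhs value.
        if seen.setdefault(key, val) != val:
            return False
    return True


# The enrolment table from the "Goal #1" slide.
enrolment = [
    {"name": "Xiao", "email": "xiao@gmail", "course": "CSCC43", "instructor": "Johnson"},
    {"name": "Xiao", "email": "xiao@gmail", "course": "CSCD08", "instructor": "Bretscher"},
    {"name": "Jaspreet", "email": "jaspreet@utsc", "course": "CSCC43", "instructor": "Johnson"},
]

email_determines_name = fd_holds(enrolment, ["email"], ["name"])
email_determines_course = fd_holds(enrolment, ["email"], ["course"])
```

Here email -> name holds on the sample, while email -> course does not, because Xiao's email appears with two different courses.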

Splitting FDs
• Attributes on right independent of each other
– Consider a,b,c -> d,e,f
– “Attributes a, b, and c functionally determine d, e, and f”
=> No mention of d relating to e or f directly
• Useful to split up right side of FD
– abc -> def becomes abc -> d, abc -> e and abc -> f
• No safe way to split left side
– abc -> def is NOT the same as ab -> def and c -> def!

Splitting FDs – example
• Consider the relation
– EmailAddress(user, domain, firstName, lastName)
– user,domain -> firstName, lastName
• The following hold
– user,domain -> firstName
– user,domain -> lastName
• The following do NOT hold!
– user -> firstName,lastName
– domain -> firstName,lastName
Gotcha: “doesn’t hold” = “not all tuples” != “all tuples not”

Trivial FDs
• Not all functional dependencies are useful
– A -> A always holds
– abc -> a also always holds
• FD with an attribute on both sides is “trivial”
– Simplify by removing L ∩ R from R
abc -> ad becomes abc -> d
– Or, in singleton form, delete trivial FDs
abc -> a and abc -> d becomes just abc -> d

Identifying functional dependencies
• FDs are domain knowledge
– Intrinsic features of the data you’re dealing with
– Something you know (or assume) about the data
• Database engine cannot identify FDs for you
– Designer must specify them as part of schema
– DBMS can only enforce FDs when told to
• DBMS cannot safely “optimize” FDs either
– It has only a finite sample of the data
– An FD constrains the entire domain

Coincidence or FD?
ID    Email              City      Country  Surname
1983  [email protected]  Toronto   Canada   Fairgrieve
8624  [email protected]  London    Canada   Samways
9141  [email protected]  Winnipeg  Canada   Samways
1204  [email protected]  Aachen    Germany  Lakemeyer
• What if we try to infer FDs from the data?
– ID -> email, city, country, surname
– email -> city, country, surname
– city -> country
– surname -> country
Domain knowledge required to validate FDs

Keys and FDs
• Consider relation R with attributes A
• Superkey
– Any S ⊆ A s.t. S -> A
=> Any subset of A which determines all remaining attributes in A
• Candidate key
– C ⊆ A s.t. C -> A and X -> A does not hold for any X ⊂ C
=> A superkey which contains no other superkeys
=> Remove any attribute and you no longer have a key
• Primary key
– The candidate key we use to identify the relation
=> Always exists, only one allowed, doesn’t matter which C we use
• Prime attribute
– Attribute x is prime if ∃ candidate key C s.t. x ∈ C
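
To make the “Coincidence or FD?” point concrete, the sketch below checks two of the inferred FDs against the sample table, then adds one more tuple. This is an illustrative Python sketch: `fd_holds` is a hypothetical helper, the Email column is left out because its values did not survive in the table, and the extra tuple (ID 7777) is invented purely to show how a coincidental FD breaks.

```python
def fd_holds(rows, lhs, rhs):
    """True if rows agreeing on lhs also agree on rhs (hypothetical helper)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


# Sample table from the slide (Email column omitted).
people = [
    {"id": 1983, "city": "Toronto", "country": "Canada", "surname": "Fairgrieve"},
    {"id": 8624, "city": "London", "country": "Canada", "surname": "Samways"},
    {"id": 9141, "city": "Winnipeg", "country": "Canada", "surname": "Samways"},
    {"id": 1204, "city": "Aachen", "country": "Germany", "surname": "Lakemeyer"},
]

before = {
    "surname -> country": fd_holds(people, ["surname"], ["country"]),
    "city -> country": fd_holds(people, ["city"], ["country"]),
}

# Invented tuple, not from the slides: a Samways living in London, UK.
people.append({"id": 7777, "city": "London", "country": "UK", "surname": "Samways"})

after = {
    "surname -> country": fd_holds(people, ["surname"], ["country"]),
    "city -> country": fd_holds(people, ["city"], ["country"]),
}
```

Both FDs hold on the four sample tuples and both fail after the extra tuple is added, which is why a finite sample cannot validate an FD.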

Candidate keys vs. superkeys
• Consider these relations
Students(ID, surname, name, email, address, major)
Houses(street, city, value, owner, tax)
• What are the candidate keys?
– Students: ID, what else?
– Houses: ?
• What other superkeys exist?
– Students: ID,surname ID,name ID,name,surname …
– Houses: ?
• Prime attributes?
– Students: ?
– Houses: ?

FD: relaxes the concept of a “key”
• Functional dependency: X -> Y
• Superkey: X -> R
• A superkey must include all remaining attributes of the relation on the RHS
• An FD can involve just a subset of them
• Example:
– Houses(street, city, value, owner, tax)
– street,city -> value,owner,tax (both FD and key)
– city,value -> tax (FD only)
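
Given a set of FDs, candidate keys and prime attributes can be found by brute force: try attribute subsets from smallest to largest and keep those whose closure is the whole relation. The Python sketch below does this for the Houses example. The function names and the set-based FD representation are assumptions for illustration, and the search is exponential in the number of attributes, so it only suits small schemas.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes determined by attrs under fds (list of (lhs_set, rhs_set))."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(relation, fds):
    """Minimal attribute sets whose closure covers the whole relation."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            subset = set(combo)
            if any(key <= subset for key in keys):
                continue  # contains a smaller key: a superkey, not a candidate key
            if closure(subset, fds) >= set(relation):
                keys.append(subset)
    return keys


houses = {"street", "city", "value", "owner", "tax"}
house_fds = [
    ({"street", "city"}, {"value", "owner", "tax"}),
    ({"city", "value"}, {"tax"}),
]

keys = candidate_keys(houses, house_fds)
prime_attributes = set().union(*keys)
```

Under these two FDs the only candidate key is {street, city}, so street and city are the prime attributes; city,value -> tax is an FD but not a key.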

Cyclic functional dependencies?
• Attributes on right side of one FD may appear on left side of another!
– Simplest example: A -> B B -> A
– What does this say about A and B?
• Example
– street,city -> value city,value -> tax
– studentID -> email email -> studentID

Geometric view of FDs
• Let D be the domain of tuples in R
– Every possible tuple is a point in D
• FD X on R restricts tuples in R to a subset of D
– Points in D which violate X cannot be in R
• Example: D(x,y,z)
– xy -> z
=> z = abs(x) + abs(y)
– z -> x,y
=> x=y=abs(z)/2
[Figure: sample points in D: (-1,-1,2), (1,1,0), (0,0,1), (1,1,2), (1,1,-2), (2,2,-4), (2,2,4), (0,0,0), (1,-1,-2), (3,2,1), (1,2,3)]

Inferring functional dependencies
• Problem
– Given FDs X1 -> a1, X2 -> a2, etc.
– Does some FD Y -> B (not given) also hold?
• Consider the dependencies
A -> B B -> C
Intuitively, A -> C also holds
The given FDs entail (imply) it
How to prove it in the general case?

Closure test for FDs
• Given attribute set A and FD set F
– Denote A_F+ as the closure of A relative to F
=> A_F+ = set of all attributes determined by A under F
• Computing the [transitive] closure of A
– Start: A_F+ = A, F’ = F
– While ∃X ∈ F’ s.t. LHS(X) ⊆ A_F+ :
A_F+ = A_F+ ∪ RHS(X)
F’ = F’ - X
– At end: A -> B ∀B ∈ A_F+
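
The closure loop translates almost line for line into code. The following Python sketch is illustrative only: the names `closure` and `entails` are assumptions, and attributes are written as single characters so that an FD such as ab -> c can be given as the pair ("ab", "c").

```python
def closure(attrs, fds):
    """Closure of attrs under fds; each FD is a (lhs, rhs) pair of strings."""
    result = set(attrs)        # start: closure = A
    remaining = list(fds)      # F' = F
    progress = True
    while progress:
        progress = False
        for fd in remaining:
            lhs, rhs = fd
            if set(lhs) <= result:       # LHS(X) is inside the closure so far
                result |= set(rhs)       # closure = closure ∪ RHS(X)
                remaining.remove(fd)     # F' = F' - X
                progress = True
                break                    # restart the scan over the shrunken F'
    return result


def entails(fds, lhs, rhs):
    """fds entail lhs -> rhs exactly when rhs lies inside the closure of lhs."""
    return set(rhs) <= closure(lhs, fds)


# A -> B and B -> C entail A -> C, but not C -> A.
chain = [("A", "B"), ("B", "C")]
a_to_c = entails(chain, "A", "C")
c_to_a = entails(chain, "C", "A")

# R(a,b,c,d,e,f) with ab -> c, ac -> d, c -> e, ade -> f.
r_fds = [("ab", "c"), ("ac", "d"), ("c", "e"), ("ade", "f")]
ab_closure = closure("ab", r_fds)
```

Each FD is used at most once, so the loop terminates after at most |F| additions.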

Closure test – example
• Consider R(a,b,c,d,e,f) with FDs ab -> c, ac -> d, c -> e, ade -> f
• Find A+ if A = ab
[Figure: four snapshots of attributes a b c d e f as the closure grows from {a,b} to {a,b,c,d,e,f}]
ab -> cdef -- ab is a candidate key!

Discarding redundant FDs
• Minimal basis: opposite extreme from closure
• Given a set of FDs F, want to minimize F’ s.t.
– F’ ⊆ F
– F’ entails X ∀X∈F
• Properties of a minimal basis
– RHS is always singleton
– Removing any FD from F’ loses information
– Removing any attribute from the LHS of any X∈F’ loses information

Constructing a minimal basis
• Straightforward but time-consuming
1. Split all RHS into singletons
2. ∀X ∈ F’, test whether (F’-X)+ is still equivalent to F+
=> Might make F’ too small (if so, keep X)
3. ∀i ∈ LHS(X) ∀X ∈ F’, let LHS(X’)=LHS(X)-i
Test whether (F’-X+X’)+ is still equivalent to F+
=> Might make F’ too big (if so, keep i in the LHS)
4. Repeat (2) and (3) until neither makes progress

Part II: Schema decomposition
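
The four steps can be sketched in Python as follows. This is one possible rendering under stated assumptions, not the course's reference implementation: the names are invented, FDs are (lhs, rhs) pairs of attribute collections, and equivalence with F+ is tested via closures (dropping X is safe if the rest still entails X; shortening the LHS is safe if the current set already entails the shorter FD). A minimal basis is generally not unique, so a different processing order may give a different but equivalent result.

```python
def closure(attrs, fds):
    """Attributes determined by attrs under fds (pairs of frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def entails(fds, lhs, rhs):
    return rhs <= closure(lhs, fds)


def minimal_basis(fds):
    # Step 1: split every right-hand side into singletons.
    basis = [(frozenset(lhs), frozenset([a])) for lhs, rhs in fds for a in rhs]
    changed = True
    while changed:                           # Step 4: repeat until no progress.
        changed = False
        basis = list(dict.fromkeys(basis))   # drop exact duplicates, keep order
        # Step 2: drop any FD that the remaining FDs already entail.
        for fd in list(basis):
            rest = [g for g in basis if g != fd]
            if entails(rest, *fd):
                basis, changed = rest, True
        # Step 3: drop any LHS attribute that is not needed.
        for i, (lhs, rhs) in enumerate(basis):
            for attr in sorted(lhs):
                smaller = lhs - {attr}
                if entails(basis, smaller, rhs):
                    lhs = smaller
                    basis[i] = (lhs, rhs)
                    changed = True
    return basis


# A -> B, B -> C, A -> C, AB -> C: the last two are redundant.
basis = minimal_basis([("A", "B"), ("B", "C"), ("A", "C"), ("AB", "C")])
```

For this input the result is A -> B and B -> C; both A -> C and AB -> C follow from those two.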

FDs and redundancy
• Given relation R and FDs F
– R often exhibits anomalies due to redundancy
– F identifies many (not all) of the underlying problems
• Idea
– Use F to identify “good” ways to split relations
– Split R into 2+ smaller relations having less redundancy
– Split up F into subsets which apply to the new relations

Schema decomposition
• Given relation R and FDs F
– Split R into Ri s.t. ∀i Ri ⊂ R (no new attributes)
– Split F into Fi s.t. ∀i F entails Fi (no new FDs)
– Fi involves only attributes in Ri
• Caveat: entirely possible to lose information
– F+ may entail FD X which is not in (∪i Fi)+
=> Decomposition lost some FDs
– Possible to have R ⊂ ⋈i Ri
=> Decomposition lost some relationships
• Goal: minimize anomalies without losing info
We’ll revisit information loss in a moment

Splitting relations – example
• Consider the following relation:
Student Name  Student Email  Course  Instructor
Xiao          xiao@gmail     CSCC43  Johnson
Xiao          xiao@gmail     CSCD08  Bretscher
Jaspreet      jaspreet@utsc  CSCC43  Johnson
• One possible decomposition
– Students(email, name)
Courses(name, instructor)
Taking(studentEmail, courseName)

Gotcha: lossy join decomposition
• Consider a relation with one more tuple
Student Name  Student Email  Course  Instructor
Xiao          xiao@gmail     CSCC43  Johnson
Xiao          xiao@gmail     CSCD08  Bretscher
Jaspreet      jaspreet@utsc  CSCC43  Johnson
Mary          mary@utsc      CSCD08  Rosenburg
• Students ⋈ Taking ⋈ Courses has bogus tuples!
– Mary is not taking Bretscher’s section of D08
– Xiao is not in Rosenburg’s section of D08
Why did this happen? How to prevent it?
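
The bogus tuples can be reproduced by projecting the four-tuple relation onto the three smaller schemas and joining them back. The Python sketch below is illustrative: `project` and `natural_join` are hypothetical helpers, tuples are modelled as frozensets of (attribute, value) pairs, and the attributes are renamed to email, name, course and instructor so that the natural join lines up (the slide writes them as studentEmail and courseName).

```python
def project(rows, attrs):
    """Project a set of tuples onto attrs; duplicates vanish automatically."""
    return {frozenset((a, dict(t)[a]) for a in attrs) for t in rows}


def natural_join(left, right):
    """Join two relations on whatever attribute names they share."""
    out = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                out.add(frozenset({**ld, **rd}.items()))
    return out


raw = [
    ("Xiao", "xiao@gmail", "CSCC43", "Johnson"),
    ("Xiao", "xiao@gmail", "CSCD08", "Bretscher"),
    ("Jaspreet", "jaspreet@utsc", "CSCC43", "Johnson"),
    ("Mary", "mary@utsc", "CSCD08", "Rosenburg"),
]
columns = ("name", "email", "course", "instructor")
original = {frozenset(zip(columns, row)) for row in raw}

students = project(original, ["email", "name"])
taking = project(original, ["email", "course"])
courses = project(original, ["course", "instructor"])

joined = natural_join(natural_join(students, taking), courses)
bogus = joined - original
bogus_pairs = {(dict(t)["name"], dict(t)["instructor"]) for t in bogus}
```

The join returns six tuples for a relation that had four; the two extra ones pair Mary with Bretscher and Xiao with Rosenburg, because CSCD08 alone no longer says which instructor a student has.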

Ensuring lossless joins
• If we decompose R into S and T
• Either (S ∩ T) -> S or (S ∩ T) -> T must be in F+
• In our example:
– ({email,course} ∩ {course,instructor}) = {course}
– course -/-> instructor (one-many relationship)

Projecting FDs
• Once we’ve split a relation we have to refactor our FDs to match
– Each FD must only mention attributes from one relation
• Similar to geometric projection
– Many possible projections (depends on how we slice it)
– Keep only the ones we need (minimal basis)
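
The lossless-join condition for a two-way split reduces to one closure computation on the shared attributes. A minimal Python sketch, with invented function names and FDs given as (lhs, rhs) pairs of sets:

```python
def closure(attrs, fds):
    """Attributes determined by attrs under fds (list of (lhs_set, rhs_set))."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def lossless_split(s, t, fds):
    """True if (S ∩ T) -> S or (S ∩ T) -> T follows from fds."""
    common_closure = closure(set(s) & set(t), fds)
    return set(s) <= common_closure or set(t) <= common_closure


taking = {"email", "course"}
courses = {"course", "instructor"}

# As in the example: a course can have several instructors, so no FD helps.
without_fd = lossless_split(taking, courses, [])

# Assumption for contrast only: if course -> instructor did hold.
with_fd = lossless_split(taking, courses, [({"course"}, {"instructor"})])
```

With no FD from course the split is lossy; it would become lossless only if course -> instructor held, which the Mary/Rosenburg tuple rules out.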

FD projection algorithm
• Start with Fi = Ø
• For each subset X of Ri
– Compute X+
– For each attribute a in X+
• If a is in Ri
– add X -> a to Fi
• Compute the minimal basis of Fi

Making projection more efficient
• Ignore trivial dependencies
– No need to add X -> A if A is in X itself
• Ignore trivial subsets
– The empty set or the set of all attributes (both are subsets of Ri)
• Ignore supersets of X if X+ = R
– They can only give us “weaker” FDs (with more on the LHS)
• Projection is expensive
– Suppose R1 has n attributes
– How many subsets of R1 are there?
– How many times do we consider each attribute?

Example: Projecting FDs
• ABC with FDs A -> B and B -> C. Project onto AC.
– A+ = ABC; yields A -> B, A -> C.
• We do not need to compute AB+ or AC+.
– B+ = BC; yields B -> C.
– C+ = C; yields nothing.
– BC+ = BC; yields nothing.

Example -- Continued
• Resulting FDs: A -> B, A -> C, and B -> C.
• Projection onto AC: A -> C.
– Only FD that involves a subset of {A,C}.
• Projection onto BC: B -> C
– Only FD that involves a subset of {B,C}.
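
The projection algorithm, with the “ignore trivial dependencies” and “ignore trivial subsets” shortcuts, can be sketched in Python as below. The names are assumptions, attributes are single characters, and the final minimal-basis step is left out, so the output may still contain redundant FDs on larger inputs.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes determined by attrs under fds; FDs are (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def project_fds(fds, sub):
    """FDs X -> a with X and a inside sub, found by closing each subset X."""
    sub = set(sub)
    attrs = sorted(sub)
    projected = []
    # Sizes 1 .. n-1: skip the empty set and the full attribute set.
    for size in range(1, len(attrs)):
        for combo in combinations(attrs, size):
            x = set(combo)
            # Keep attributes of sub that X determines, minus the trivial ones.
            for a in sorted((closure(x, fds) & sub) - x):
                projected.append(("".join(combo), a))
    return projected


abc_fds = [("A", "B"), ("B", "C")]
onto_ac = project_fds(abc_fds, "AC")
onto_bc = project_fds(abc_fds, "BC")
```

For the ABC example this yields A -> C on AC and B -> C on BC. The loop visits up to 2^n subsets of an n-attribute relation, which is the cost the “Projection is expensive” bullet asks about.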

Part III: Normal forms

Motivation for normal forms
• Identify a “good” schema
– For some definition of “good”
– Avoid anomalies, redundancy, etc.
• Several known normal forms
– 1st
– 2nd
– 3rd
– Boyce-Codd
– ... and several more we won’t discuss…
BCNF ⊆ 3NF ⊆ 2NF ⊆ 1NF (focus on 3NF/BCNF)

1st normal form (1NF)
• No multi-valued attributes allowed
– Imagine storing a list/set of things in an attribute
=> Not really even expressible in RA
• Counterexample
– Course(name, instructor, [student,email]*)
– Redundancy in non-list attributes
Name    Instructor  Student Name  Student Email
CSCC43  Johnson     Xiao          xiao@gmail
                    Jaspreet      jaspreet@utsc
                    Mary          mary@utsc
CSCD08  Rosenburg   Jaspreet      jaspreet@utsc

2nd normal form (2NF)
• Non-prime attributes depend on candidate keys
– Consider non-prime attribute ‘a’
– Then ∃FD X s.t. X -> a and X is a candidate key
=> ‘a’ must not depend on a proper subset of a candidate key (no partial dependencies)
• Counterexample
– Movies(title, year, star, studioName, studioAddress, salary)
– FD: title, year -> studioName, studioAddress
=> studioName, studioAddress depend on only part of the key (title, year, star)
Title          Year  Star    StudioName  StudioAddr   Salary
Star Wars      1977  Hamill  Lucasfilm   1 Lucas Way  $100,000
Star Wars      1977  Ford    Lucasfilm   1 Lucas Way  $100,000
Star Wars      1977  Fisher  Lucasfilm   1 Lucas Way  $100,000
Patriot Games  1992  Ford    Paramount   Cloud 9      $2,000,000
Last Crusade   1989  Ford    Lucasfilm   1 Lucas Way  $1,000,000

3rd normal form (3NF)
• Non-prime attr. depend only on candidate keys
– Consider FD X -> a
– Either a ∈ X OR X is a superkey OR a is prime
=> No transitive dependencies allowed
• Counterexample:
– studioName -> studioAddr
Title          Year  StudioName  StudioAddr
Star Wars      1977  Lucasfilm   1 Lucas Way
Patriot Games  1992  Paramount   Cloud 9
Last Crusade   1989  Lucasfilm   1 Lucas Way

3NF, dependencies, and join loss
• Theorem: always possible to convert a schema to join-lossless, dependency-preserving 3NF
• Caveat: still possible to create schemas in 3NF for which these properties do not hold
• Lost dependencies
– MovieInfo(title, year, studioName)
– StudioAddress(title, year, studioAddress)
=> Unable to enforce studioName -> studioAddress
• Lossy joins
– Movies(title, year, star)
– StarSalary(star, salary)
=> Movies ⋈ StarSalary yields bogus tuples

Boyce-Codd normal form (BCNF)
• One additional restriction over 3NF
– All non-trivial FDs have superkey LHS
• Counterexample
– CanadianAddress(street, city, province, postalCode)
– Candidate keys: {street, postalCode}, {street, city, province}
– FD: postalCode -> city, province
– Satisfies 3NF: city, province both prime
– Violates BCNF: postalCode is not a superkey
=> Possible anomalies involving postalCode
Do we care? How often do postal codes change?

Limits of decomposition
• Pick two…
– Lossless join
– Dependency preservation
– Anomaly-free
• 3NF
– Always allows a lossless-join, dependency-preserving decomposition
– May allow some anomalies
• BCNF
– Always excludes anomalies
– May have to give up lossless join or dependency preservation
Use domain knowledge to choose 3NF vs. BCNF
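
The 3NF and BCNF conditions differ only in the “a is prime” escape clause, which is easy to see in code. The Python sketch below checks each given FD against both conditions. The names are assumptions, keys are found by brute force, and the second FD for CanadianAddress (street, city, province -> postalCode) is inferred from the listed candidate keys rather than stated on the slide.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes determined by attrs under fds (list of (lhs_set, rhs_set))."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(relation, fds):
    """Minimal attribute sets whose closure is the whole relation."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            subset = set(combo)
            if not any(k <= subset for k in keys) and closure(subset, fds) >= set(relation):
                keys.append(subset)
    return keys


def violations(relation, fds):
    """Return (bcnf_violations, third_nf_violations) as lists of (lhs, attr)."""
    relation = set(relation)
    prime = set().union(*candidate_keys(relation, fds))
    bcnf, third = [], []
    for lhs, rhs in fds:
        if closure(lhs, fds) >= relation:
            continue                      # superkey LHS: fine for both forms
        for a in sorted(set(rhs) - set(lhs)):   # skip the trivial part
            bcnf.append((sorted(lhs), a))
            if a not in prime:            # 3NF tolerates the FD when a is prime
                third.append((sorted(lhs), a))
    return bcnf, third


address = {"street", "city", "province", "postalCode"}
address_fds = [
    ({"postalCode"}, {"city", "province"}),
    ({"street", "city", "province"}, {"postalCode"}),   # inferred from the keys
]
address_bcnf, address_3nf = violations(address, address_fds)

movie = {"title", "year", "studioName", "studioAddr"}
movie_fds = [
    ({"title", "year"}, {"studioName"}),
    ({"studioName"}, {"studioAddr"}),
]
movie_bcnf, movie_3nf = violations(movie, movie_fds)
```

CanadianAddress has BCNF violations but no 3NF violations, because city and province are prime. The movie relation violates both, since studioAddr is non-prime and studioName is not a superkey.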