chp06 07 RDB Design - Compressed
Entity-Relationship Modelling
Contents
Chapter 6. Entity-Relationship Modelling
Objectives
Introduction
Entities, attributes and values
Entities
Attributes
Values
Entity representation
Attribute Types
Primary key data elements
Key
Candidate keys
Foreign keys
Entity-Relationship Modelling
Relationships between two entities
Recursive and ternary relationships
Relationship participation condition (membership class)
Mandatory and optional relationships
Weak and strong entities
Design issues
Specialization/generalization
Converting entity relationship models into relation schemes
Converting one-to-one relationships into relations
Converting one-to-many relationships into relations
Converting many-to-many relationships into relations
Mapping weak entities to relations
Mapping specialization/generalization to relations
Objectives
At the end of this chapter you should be able to:
• Identify the relationships between entities, and develop the model further by identifying the attributes.
• Map an entity-relationship model into tables suitable for Relational database implementation.
Introduction
Entity-Relationship Modelling is widely used in database design. The first step is identifying the entities
involved. The approach progresses from there to a detailed model of the entities, their attributes and
relationships. The approach can be supplemented by methods which are more formal in their approach, covered
in the next chapter. There are many different diagram notations for ER (entity-relationship) models; in this
course we will use the most popular notation only.
Why do we need ER models? A database is one of the most important assets of any organisation, so a good database
that can be used effectively for many years is vital. Organisations are complex, database users seldom know
what their database needs are, and misunderstandings between database users and database designers are common.
ER models provide a simple, clear notation that is easily created by designers and understood by users. This
enables good communication and helps manage complexity, while also documenting requirements, suggestions and
decisions effectively and facilitating the identification of gaps and errors.
Organisations such as businesses, government departments, supermarkets, universities and hospitals require
information to carry out their tasks. This could be categorised in a number of ways, e.g.
People
Each of these can be regarded as an entity. An entity instance is a specific example of an entity. For example,
John Smith is an entity instance of an employee entity.
Attributes
An entity has attributes associated with it that represent characteristics of the entity. The following are
typical attributes:
Entity: House
Entity: Employee
Values
Using the House attributes shown above, the following is an example of two sets of values, for two different
House instances. Every occurrence of an entity will have its own set of values for attributes it possesses.
Entity representation
In an entity-relationship diagram, each entity is represented by a rectangle, and each attribute by an ellipse
(oval) as shown in figure 1.
Attribute Types
The ER notation allows attribute details to be captured as well. The different types of attribute are:
• Single-valued: each entity has only one such value (e.g. a person has only one surname)
Figure 1: An example Customer entity with its attributes
• Simple/Atomic (one value) or composite (divided into sub-parts) e.g. Name = surname + first_name. Components
of a composite attribute are linked to the oval representing the composite attribute, rather than to the
rectangle representing the entity.
• Multi-valued: a set of values is possible for 1 entity (e.g. a person can have several telephone-numbers).
Multivalued attributes are shown using a double-oval.
• Optional attributes: unknown, not applicable or missing values are possible for this attribute. This is
depicted using a dotted line from the oval, rather than a solid line.
• Derived: the value of the attribute can be derived from other attributes and does not have to be stored;
it can be computed when required ( e.g. nett pay is computed as gross – tax, rather than stored). A dotted
oval is used to indicate an attribute is derived.
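The attribute types listed above can also be illustrated outside the diagram notation. The following is a minimal Python sketch, not part of the ER notation itself; the class names, field names and sample values are hypothetical and chosen only to mirror the examples in the bullets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Name:
    # Composite attribute: Name = surname + first_name
    surname: str
    first_name: str


@dataclass
class Person:
    id_no: str                                  # single-valued, simple (atomic)
    name: Name                                  # composite
    telephone_numbers: List[str] = field(default_factory=list)  # multi-valued
    department: Optional[str] = None            # optional: may be unknown or not applicable
    gross_pay: float = 0.0
    tax: float = 0.0

    @property
    def nett_pay(self) -> float:
        # Derived attribute: computed when required, never stored
        return self.gross_pay - self.tax


# Hypothetical instance used only for illustration
person = Person("11057", Name("Clark", "Linda"), ["555-0101", "555-0102"],
                gross_pay=1000.0, tax=250.0)
```

Here `nett_pay` cannot become inconsistent with `gross_pay` and `tax`, because it is recomputed each time it is read.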
In the example of figure 2 we see a Book can have many authors, a Person may or may not be allocated to a
department, name is a composite attribute, and charge is an attribute that can be calculated from other data.
IDno and ISBN are underlined - this means they are primary keys for Person and Book respectively. Keys are
explained further in the next section.
Note that the relationship “borrows” has attributes deposit and day - these are not attributes of the Book, nor
attributes of the Person borrowing it, but attributes of the relationship, i.e. of that instance when that
particular person borrowed that particular book. Thus those attributes are associated with the relationship, and
not with the entities. In contrast, IDno and ISBN are not attributes of “borrows”. The fact that a particular
person borrowed a specific book is the reason we have this relationship in our model, so attributes of Person
(like IDno) and attributes of Book (like ISBN) are not needed as attributes of “borrows” and must not be shown
as relationship attributes.
There are usually specific attributes (or perhaps just one attribute) of a particular entity that, when known,
enables us to discover the value of other attributes of that entity. The attribute(s) which possess this quality
are known as keys, because they are able to ‘unlock’ the values of the other attributes of that instance. Why
do we need a key? Suppose we had two members of staff with the same (or similar) names, such as Linda Clark
and Lydia Clark. It would be a simple mistake to record something in the file of Linda Clark that should be
kept in the file for Lydia Clark. It would be even more difficult to tell them apart if the name was just an
initial and surname. Some names may be spelt differently, but sound similar (e.g. Clark and Clarke), and pose
a further risk of identifying the wrong person.
Key
The addition of a staff number as the primary key would enable us to be sure that when we refer to any staff
member, we identify the correct person. In this way 11057 Clark can be distinguished from 28076 Clark.
Figure 2: Keys of participating entities are not attributes of the relationship, only of those entities
• The payroll number (primary key) of a member of staff enables us to find out the name, job title and
address for that individual.
• The account number (primary key) enables us to find out whether the balance of that account is overdrawn.
• The item code (primary key) in a stationery catalogue enables us to order a particular item in a particular
size and colour.
Candidate keys
Where there is more than one set of attributes (or more than one attribute) which could be chosen as the primary
key for an entity, each of these is known as a candidate key. E.g. a company might choose either an employee’s
staff number or an employee’s national identity number as the primary key, as each will uniquely identify one
individual. Staff number and national identity number are candidate keys until one is selected as the primary
key. At times we may refer to a collection of attributes that includes the primary key (for example, staff
number together with staff name); this group of attributes is known as a superkey. When we need to connect
together different items of data (for example, customers and items, in order to produce orders and invoices),
we can do this by including the primary key of one entity as a data item in another entity; for example, we
would include the primary key of Customer in the Order entity to link customers to the Orders they have placed.
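The distinction between candidate keys and superkeys can be explored on sample data. The sketch below is illustrative only: the staff rows and attribute names are made up, and uniqueness in a sample is merely evidence for a key, since a key is a statement about every instance the entity could ever have.

```python
from itertools import combinations

# Hypothetical sample data
staff = [
    {"staff_no": 11057, "national_id": "A100", "name": "Clark"},
    {"staff_no": 28076, "national_id": "B200", "name": "Clark"},
    {"staff_no": 30412, "national_id": "C300", "name": "Clarke"},
]


def is_unique(rows, attrs):
    """True if no two rows share the same combination of values for attrs."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)


def minimal_unique_sets(rows):
    """Attribute sets that are unique in rows and contain no smaller unique set."""
    attrs = sorted(rows[0])
    found = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            has_smaller = any(set(f) < set(combo) for f in found)
            if is_unique(rows, combo) and not has_smaller:
                found.append(combo)
    return found


# Candidate keys suggested by the sample: staff_no alone, national_id alone
candidates = minimal_unique_sets(staff)

# staff_no together with name is unique too, but not minimal: a superkey
superkey_ok = is_unique(staff, ("staff_no", "name"))
```

The name attribute on its own is not unique in this sample, which matches the Clark example given earlier.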
Foreign keys
When a copy of the primary key for one entity is included among the attributes of another entity, the copy of
the primary key held in the second entity is known as a foreign key. A foreign key enables a link to be made
between different entities.
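As a rough illustration of how a foreign key links two entities once they are implemented as tables, the sketch below uses Python's built-in sqlite3 module. The table and column names (Customer, CustomerOrder, customer_id and so on) are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to

conn.execute("""
    CREATE TABLE Customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE CustomerOrder (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),  -- foreign key
        order_date  TEXT
    )""")

conn.execute("INSERT INTO Customer VALUES (1, 'Linda Clark')")
conn.execute("INSERT INTO CustomerOrder VALUES (100, 1, '2024-01-15')")

# An order that refers to a customer who does not exist is rejected.
try:
    conn.execute("INSERT INTO CustomerOrder VALUES (101, 99, '2024-01-16')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The foreign key lets us follow the link from an order to its customer.
rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM CustomerOrder AS o
    JOIN Customer AS c ON c.customer_id = o.customer_id
""").fetchall()
```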
Entity-Relationship Modelling
Relationships between two entities
The relationships that exist between two entities can be categorised as:
• one-to-one
• one-to-many
• many-to-many
For example, if a theatre has only one manager, and the manager manages only that one theatre, then the
relationship between the theatre and its manager is one-to-one. The relationship between a seat and a theatre
is one-to-many, as each seat is in exactly one theatre, and each theatre has many seats. The relationship
between an audience member and an actor is many-to-many, as each audience member sees many actors on stage,
and each actor is seen by many audience members. When a relationship is “to-one”, i.e. it involves only one
instance of an entity, the line from the diamond to the rectangle has an arrowhead where it touches that entity.
When there is no arrowhead, the association is “to-many”. (Aside: the reason for doing it this way is that
“to-one” relationships are far rarer, so one ends up drawing far fewer arrowheads!) In figure 3, reading along the arrow, we see
that a Road starts-at one Town and a Road ends-at one Town; while a Town can have many Roads that start-at that
town, and many Roads that end-at that town.
Figure 3: There can be several different relationships between the same kinds of entity
The relationships we have seen so far have all been between two entities; this does not have to be the case.
E.g. an entity Staff could have a relationship with itself, as one staff member can supervise other staff. This
is known as a recursive relationship, and represented in an entity-relationship diagram as shown in figure 4.
Figure 4: Relationships can associate entities of the same type, just as in the real world
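Looking ahead to implementation, a recursive relationship such as "supervises" might be sketched as a table whose foreign key refers back to the same table. This is only one possible rendering; the column names and sample staff are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE Staff (
        staff_no      INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        supervisor_no INTEGER REFERENCES Staff(staff_no)  -- NULL means no supervisor
    )""")
conn.executemany("INSERT INTO Staff VALUES (?, ?, ?)", [
    (1, "Ndlovu", None),
    (2, "Clark", 1),
    (3, "Smith", 1),
])

# Join Staff to itself: s is the supervised member, b is their supervisor.
supervised = conn.execute("""
    SELECT s.name, b.name
    FROM Staff AS s
    JOIN Staff AS b ON s.supervisor_no = b.staff_no
    ORDER BY s.staff_no
""").fetchall()
```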
It is also possible to have three (or even more!) entities participating in a single relationship, although
this tends to be more complex and thus less often used. In figure 5, hiring someone involves 3 entities:
Figure 5: Any number of entity types can participate in a single relationship
Organisations often have a relationship between Department entities and Staff entities. If participation
of both entities is mandatory, then every department must have staff and every staff member must be assigned
to a department. If participation of both is optional, some departments may have no staff (e.g. during a
reorganisation) and some staff may have no department (e.g. very senior staff). Usually participation by
department is mandatory (why have a department if nobody works there?) but participation by staff is optional
(there is at least one staff member with oversight of all departments rather than being assigned to one specific
department). A double line is used for mandatory participation (again, because this is rarer), and a single
line for optional participation. In the examples in figure 6, a passport must belong to a person, but a person
may not have a passport. A school must have pupils, and a pupil must be enrolled at a school. A house must
have owners (note, to-many is possible, e.g. owned by two friends/spouses), but a person may not be the owner
of any house.
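When a model like the passport example is eventually implemented, mandatory participation of the entity holding the foreign key is commonly expressed by forbidding null values. The sketch below assumes hypothetical table and column names and uses SQLite through Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE Passport (
        passport_no TEXT PRIMARY KEY,
        -- NOT NULL: every passport must belong to a person (mandatory participation)
        person_id   INTEGER NOT NULL REFERENCES Person(person_id)
    )""")
conn.execute("INSERT INTO Person VALUES (1, 'John Smith')")
conn.execute("INSERT INTO Person VALUES (2, 'Lydia Clark')")
conn.execute("INSERT INTO Passport VALUES ('P001', 1)")

# A passport with no owner violates the mandatory participation.
try:
    conn.execute("INSERT INTO Passport VALUES ('P002', NULL)")
    ownerless_allowed = True
except sqlite3.IntegrityError:
    ownerless_allowed = False

# A person without a passport is perfectly legal (optional participation).
people_without_passport = conn.execute("""
    SELECT name FROM Person
    WHERE person_id NOT IN (SELECT person_id FROM Passport)
""").fetchall()
```

Mandatory participation on the other side (e.g. every school must have at least one pupil) cannot be stated with a simple column constraint like this and usually needs additional checks.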
In figure 7, Child is a weak entity; its existence is dependent on the associated worker entity. The double-line
on the entity and the relationship linking it to worker indicates this.
• The organisation would not keep data on children unless they were associated with one of its workers.
• Different workers may have a child with the same name (e.g. Joe). So, each child is only unique in the
context of the worker they are a dependant of.
Figure 6: Participating entities must be shown as optional/mandatory and single-/multi-valued
Design issues
It is important to remember that ER models are a notation for describing the real world as clearly and thoroughly
as possible; their role in database design is secondary to this. ER models are only a starting point for database
design, and any problems that might arise from poor ER models can be fixed later using normal form theory. The
decision to use an attribute rather than an entity for some items such as Location can be confusing. Generally,
entity is the better choice only if the item will have attributes of its own and/or participate in relationships
of its own. The fact that verbs tend to be modelled as relationships is useful as a starting point, but a few
(not many) may well be worth replacing with an entity instead, particularly if the relationship is many-to-many.
To decide whether a relationship (e.g. Person Attends Performance) should be represented as an entity instead,
consider whether such an entity would have attributes and relationships of its own. For example, a Ticket entity
could be used to represent the Person-Performance association; this entity would then be related (i) to the
Person who bought it and (ii) to the Performance involved. Ticket would have its own attributes such as price
and seat-number, and might reasonably be related to other entities in the future such as Payment. Finally, we
note that many designers avoid using relationships involving more than two types of entity, as cardinality and
membership class are complex. Such relationships can be replaced by multiple binary relationships. E.g.
a ternary relationship between Person, Ticket and Performance can be replaced by separate binary relationships.
Specialization/generalization
Some entities have relationships that form a hierarchy. E.g. a shipping company can have different types
of ships. The relationship that exists between the concept of Ship and the specific types of ships forms a
hierarchy. Ship is called the superclass. The various types of ships are called subclasses.
A subclass is said to inherit from a superclass. A subclass inherits all the relationships and attributes of
all higher classes in the hierarchy. In addition to inherited attributes and relationships, a subclass can have
its own attributes/relationships. The process of creating a superclass for a group of subclasses is called
generalization. The process of creating subclasses from a general concept is called specialization.
Specialization: A means of identifying sub-groups within an entity which have attributes that are not shared
by all the entities (top-down).
Generalization: Multiple entities are synthesized into a higher-level entity, based on common features (bottom-
up).
To demonstrate generalization, consider African cultural items. An Artefact is one type of cultural item, and
an Artist is another. Cultural Item is therefore a superclass of both Artefact and Artist. This generalization
relationship can be represented in the ER diagram in figure 9.
Figure 8: Point, Province, Country and Sub-region are particular kinds of Location
Figure 9: Artist & Artefact are types of Cultural Item; the general concept of a Cultural Item is also important
• Derived attributes are not required; they must be computed instead, otherwise they represent a duplicate
copy of information stored elsewhere in the database which is unsafe, as inconsistency can occur
• Composite attributes do not become attributes, only their component attributes are required
• For a one-to-one relationship : include the primary key of one of the entities in the relation for the
other entity
• For any other relationship : create a relation to represent the relationship, comprising the primary key
of each participating entity along with any attributes of that relationship
• For each ISA (generalization / specialisation) hierarchy, create a relation for the ‘superclass’ entity
and one for each ‘subclass’ entity - unless the specialisation is disjoint and complete, in which case
there is no need for a separate relation for the ‘superclass’ entity.
Suppose a company keeps data on the parts it buys and the suppliers it buys them from.
If this is one-to-one and mandatory for both entities, a single relation can represent the information of both
entities as well as the relationship that exists between them. This is because every supplier must have exactly
one part associated with it, and vice versa. However a one-to-one relationship that is mandatory for both
entities is very rare indeed. If either (or both) of the entities has optional participation in the one-to-one
relationship, each entity will require its own relation. Thus in our example there will be a Suppliers relation
and a Parts relation. The relationship can then be represented in either one of these relations. (Aside: we’ll
see in a later chapter that it is important not to represent the relationship in both tables, because it
is unsafe to store a fact in two different places.) If we choose to store the relationship in the Parts table,
a foreign key attribute is added to the Parts table, which references the one Supplier from whom we buy this
Part (by means of the primary key for Supplier). Alternatively, if we store the relationship in the Suppliers
table, a foreign key attribute is added to the Suppliers table, which references the one Part we buy from that
Supplier (by means of the primary key for Part). An example is shown in figure 10.
Figure 10: Representing a one-to-one relationship in a relational database
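The option of storing a one-to-one relationship in the Parts table might look roughly as follows in SQLite. This is a sketch under assumed table and column names, not a reproduction of figure 10; the UNIQUE constraint is what stops a Supplier from being linked to more than one Part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Suppliers (supplier_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE Parts (
        part_id     INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        -- UNIQUE keeps the relationship one-to-one;
        -- NULL is allowed, so a Part's participation is optional
        supplier_id INTEGER UNIQUE REFERENCES Suppliers(supplier_id)
    )""")
conn.execute("INSERT INTO Suppliers VALUES (1, 'Acme')")
conn.execute("INSERT INTO Parts VALUES (10, 'Bolt', 1)")
conn.execute("INSERT INTO Parts VALUES (11, 'Nut', NULL)")

# A second Part for the same Supplier would break the one-to-one rule.
try:
    conn.execute("INSERT INTO Parts VALUES (12, 'Washer', 1)")
    second_part_allowed = True
except sqlite3.IntegrityError:
    second_part_allowed = False
```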
Converting one-to-many relationships into relations
Consider instead the scenario where a Part can have at most one Supplier, but a Supplier can supply many
Parts. Since it is bad practice to store the same fact in more than one place, it is desirable to represent
the Part-Supplier relationship in only one of the relations. Since a Supplier can sell us many Parts, and an
attribute must be single-valued, the relationship cannot be represented in the Suppliers relation. Can it be
represented in the Parts relation? Yes, this is possible and is the most commonly-used approach. The ID of
the one Supplier of each Part is stored in that Parts tuple, using a foreign key value to identify who is its
Supplier. An alternative is to introduce a third relation to represent the relationship itself, as in figure
11. This relation will have two attributes, one containing Supplier foreign key, and the other containing Part
foreign key. In choosing whether or not to create a separate table for a one-to-many relationship such as this,
a number of factors should be considered:
• is it likely that real life will change and at some point in the future the one-to-many relationship will
become many-to-many? This occurs surprisingly often!
• is it acceptable to show this relationship by giving the 2 foreign keys involved (e.g. Part ID and Supplier
ID)? If not, is it acceptable to need joins to show Part name and Supplier name instead?
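The alternative of a separate relation for a one-to-many relationship can be sketched as below. The names are hypothetical. Note the assumption about the key: because each Part has at most one Supplier, the Part ID alone is taken as the primary key of the relationship table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Suppliers (supplier_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Parts     (part_id     INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- One row per Part that has a Supplier; part_id is the whole primary key
    CREATE TABLE Supplies (
        part_id     INTEGER PRIMARY KEY REFERENCES Parts(part_id),
        supplier_id INTEGER NOT NULL REFERENCES Suppliers(supplier_id)
    );

    INSERT INTO Suppliers VALUES (1, 'Acme'), (2, 'Bolts R Us');
    INSERT INTO Parts VALUES (10, 'Bolt'), (11, 'Nut');
    INSERT INTO Supplies VALUES (10, 1), (11, 1);
""")

# One Supplier can appear many times...
parts_per_supplier = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM Supplies AS x
    JOIN Suppliers AS s ON s.supplier_id = x.supplier_id
    GROUP BY s.supplier_id
""").fetchall()

# ...but a Part cannot be given a second Supplier.
try:
    conn.execute("INSERT INTO Supplies VALUES (10, 2)")
    second_supplier_allowed = True
except sqlite3.IntegrityError:
    second_supplier_allowed = False
```

If the relationship later becomes many-to-many, only the primary key of the relationship table has to change; the Parts and Suppliers tables stay as they are.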
Suppose instead there is a many-to-many relationship between Parts and Suppliers: each Supplier sells many Parts,
and each Part has many Suppliers. Many-to-many relationships are always represented by a separate relation
containing the foreign keys of the participating entities. Thus, in addition to a relation representing
Part entities and another relation representing Supplier entities, a third relation stores the Part-Supplier
associations (see figure 12). Since the relationship is many-to-many, any Part ID value can occur multiple
times in this relation, once for each Supplier of that Part. Similarly, any Supplier ID can occur multiple
times in this relation, once for each Part that Supplier sells. Thus the primary key for this two-attribute
relation is Part ID together with Supplier ID, i.e. the primary key comprises both its attributes!
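A many-to-many relationship table with a two-attribute primary key could be sketched like this. Table and column names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Suppliers (supplier_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Parts     (part_id     INTEGER PRIMARY KEY, name TEXT NOT NULL);

    CREATE TABLE PartSupplier (
        part_id     INTEGER NOT NULL REFERENCES Parts(part_id),
        supplier_id INTEGER NOT NULL REFERENCES Suppliers(supplier_id),
        PRIMARY KEY (part_id, supplier_id)   -- both attributes together
    );

    INSERT INTO Suppliers VALUES (1, 'Acme'), (2, 'Bolts R Us');
    INSERT INTO Parts VALUES (10, 'Bolt'), (11, 'Nut');
    INSERT INTO PartSupplier VALUES (10, 1), (10, 2), (11, 1);
""")

# Part 10 occurs once for each of its Suppliers.
suppliers_of_part_10 = [row[0] for row in conn.execute(
    "SELECT supplier_id FROM PartSupplier WHERE part_id = 10 ORDER BY supplier_id")]

# The same (Part, Supplier) pair cannot be recorded twice.
try:
    conn.execute("INSERT INTO PartSupplier VALUES (10, 1)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
```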
Each weak entity is represented by its own relation, just like any other entity. The only difference is that
it must include an extra attribute to indicate the determining entity it is associated with. This foreign key
attribute then also forms part of the primary key. For our earlier example, the Child relation would have 3
attributes: Name, DoB and WorkID. These give the child’s name, date of birth, and the ID of the worker whose
child they are. The primary key would comprise the 2 attributes name and WorkID (to identify a particular child,
we need to know their name as well as exactly whose child it is). In figure 13, Payment is a weak entity, so
paytime alone cannot be a unique identifier (primary key) for the Payments relation. However, once we know
which Loan that payment was paying off, the Payment entity is then uniquely identified.
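The Child relation described above might be declared as follows. This is a sketch with assumed names; the ON DELETE CASCADE clause is an addition for illustration, reflecting the idea that a weak entity does not exist without its determining entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Worker (work_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    CREATE TABLE Child (
        name    TEXT NOT NULL,
        dob     TEXT,
        work_id INTEGER NOT NULL REFERENCES Worker(work_id) ON DELETE CASCADE,
        PRIMARY KEY (name, work_id)   -- a child is unique only within one worker
    );

    INSERT INTO Worker VALUES (1, 'Clark'), (2, 'Smith');
    -- Two different children called Joe, distinguished by their parent
    INSERT INTO Child VALUES ('Joe', '2015-03-01', 1), ('Joe', '2017-08-12', 2);
""")

children_before = conn.execute("SELECT COUNT(*) FROM Child").fetchone()[0]

# Removing a worker removes the dependent Child rows as well.
conn.execute("DELETE FROM Worker WHERE work_id = 1")
children_after = conn.execute("SELECT COUNT(*) FROM Child").fetchone()[0]
```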
In figure 14, we return to the Suppliers and Parts scenario, and show how attributes are handled. Deposit is
not stored since it can be derived, and lastOrder is not stored since it is a composite attribute, so only its
constituent parts are needed. Finally, since colour is multivalued, it requires a separate relation.
Specialization/generalization can be mapped to relations in three ways. To choose the appropriate method,
consider the disjoint/overlapping and the total/partial nature of the ISA relationship. If a superclass has
more than one subclass, then this specialisation is:
Figure 13: Designing relations for a weak entity
Figure 14: Designing relations when there are derived, composite and multivalued attributes
• Disjoint if an instance can be a member of only one of the subclasses. Example: postgrads or undergrads –
a student cannot be both.
• Overlapping if an instance may be a member of more than one subclass. Example: student and staff – some
people are both.
• Total if each superclass (higher-level) instance must belong to one of these subclasses (lower-level
entities). Example: a student must be either a postgrad or an undergrad.
• Partial if some superclass instances may not belong to any of the subclasses (lower-level entities).
Example: some people at UCT, such as honorary professors, are neither student nor staff.
Consider designing a database for the student, postgraduate and undergraduate relationship in figure 15. A
student in the university has a registration number and a name. Only postgraduate students have supervisors.
Undergraduates accumulate points through their coursework.
Method 1
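All the entities in the hierarchy are mapped to tables: one for the superclass and one for each subclass.
This method is preferred when inheritance is partial or overlapping. This might occur if some students are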
taking courses for non-degree purposes and hence are neither PostGrad nor Undergrad, or if some students are
allowed to do PostGrad and Undergrad degrees at the same time. Unless either of these two rather unlikely
situations applies, it is not worthwhile having the Student relation, so another method below should be used.
Method 2
Only subclasses are mapped to tables, and there is no table representing the superclass. The attributes in the
superclass are instead placed in the tables representing all the subclasses.
This method is preferred when inheritance is disjoint and complete, e.g. every student is either PostGrad or
UnderGrad and nobody is both.
Method 3
Only the superclass is mapped to a table, there are no tables representing subclasses. The attributes in the
subclasses are placed in the superclass table instead.
This method will introduce null values. When we insert an undergraduate record in the table, the supervisor
column value will be null. In the same way, when we insert a postgraduate record in the table, the points value
will be null. This is seldom the best design.
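The three methods can be compared side by side as table definitions. The sketch below is one plausible rendering in SQLite: the table names carrying the suffixes 2 and 3 exist only to keep the three designs apart in one database, and for Method 1 it assumes that each subclass table reuses the superclass primary key as its own key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Method 1: superclass and subclasses each get a table
    CREATE TABLE Student   (reg_no INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE PostGrad  (reg_no INTEGER PRIMARY KEY REFERENCES Student(reg_no),
                            supervisor TEXT);
    CREATE TABLE UnderGrad (reg_no INTEGER PRIMARY KEY REFERENCES Student(reg_no),
                            points INTEGER);

    -- Method 2: subclasses only, each repeating the superclass attributes
    CREATE TABLE PostGrad2  (reg_no INTEGER PRIMARY KEY, name TEXT NOT NULL,
                             supervisor TEXT);
    CREATE TABLE UnderGrad2 (reg_no INTEGER PRIMARY KEY, name TEXT NOT NULL,
                             points INTEGER);

    -- Method 3: superclass only, absorbing the subclass attributes
    CREATE TABLE Student3 (reg_no INTEGER PRIMARY KEY, name TEXT NOT NULL,
                           supervisor TEXT, points INTEGER);
""")

# Method 1: a postgraduate is a row in Student plus a row in PostGrad.
conn.execute("INSERT INTO Student VALUES (1, 'Thandi')")
conn.execute("INSERT INTO PostGrad VALUES (1, 'Prof. Clark')")
method1_rows = conn.execute("""
    SELECT s.reg_no, s.name, p.supervisor
    FROM Student AS s JOIN PostGrad AS p ON p.reg_no = s.reg_no
""").fetchall()

# Method 3: every row leaves one of the subclass columns null.
conn.execute("INSERT INTO Student3 VALUES (1, 'Thandi', 'Prof. Clark', NULL)")
conn.execute("INSERT INTO Student3 VALUES (2, 'Joe', NULL, 40)")
method3_rows_with_null = conn.execute("""
    SELECT COUNT(*) FROM Student3
    WHERE supervisor IS NULL OR points IS NULL
""").fetchone()[0]
```

Method 1 needs a join to see all the data about a postgraduate, while Method 3 avoids the join at the price of the null values discussed above.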
Review questions
1. Explain the difference between entities and attributes. Give 2 examples of each in the context of a
hospital database.
2. Distinguish between the terms ‘primary key’ and ‘candidate key’, using an example relation that might
exist in a hospital database to illustrate your answer.
3. At a conference, each delegate (attendee) is given a copy of the proceedings, which is a booklet containing
all the papers being presented at the conference. Draw an entity-relationship diagram showing the
relationship between a delegate and a copy of the proceedings (you may omit attributes).
4. Many papers may be presented at a conference. Each paper will be presented once only by one delegate.
Many delegates may attend the presentation of a paper. Papers may be grouped into sessions (two sessions
in the morning and three in the afternoon). What do you think is the relationship between (a) a presenter
and a paper? (b) a paper and a session?
5. A conference session will be attended by a number of delegates. Each delegate may choose which sessions
to attend. Draw an entity-relationship diagram showing the relationship between conference delegates and
sessions (you may omit attributes).
6. Design an entity-relationship model for the following scenario: “Authors are responsible for writing plays
that are performed in theatres. Every time a play is performed, the author will be paid a royalty (a sum
of money for each performance). Plays are performed in a number of theatres; each theatre has maximum
auditorium size, and many people attend each performance of a play. Many of the theatres have afternoon
and evening performances. Actors are booked to perform roles in the plays; agents make these bookings and
take a percentage of the fee paid to the actor as commission. The roles in the plays can be classified
as leading or minor roles, speaking or non-speaking, and male or female.” Then translate your data model
into relations, underlining attributes in the primary keys.
Chapter 7. Data Normalisation
Contents
Chapter 7. Data Normalisation
Objectives
Introduction
Context
Determinacy diagrams
Composite determinants and partial dependencies
Multiple determinants
Finding keys using functional dependency
Normalisation
Un-normalised data
First normal form
Second normal form
Third normal form
Chapter 7. Data Normalisation
Objectives
At the end of this chapter you should be able to:
• Convert un-normalised data into first normal form relations, so that data items contain only single, simple
values.
Introduction
Un-normalised data often contains undesirable redundancy, with associated costs in terms of storage and time, and
the risk of an inconsistent database. Normalisation reduces this redundancy and can also guarantee that certain
create, update and delete anomalies are avoided.
Context
Normalisation complements the entity-relationship design technique. The two methods can be used to
cross-check the extent to which a design satisfies requirements.
Determinacy diagrams
Determinant
When the value of one attribute allows us to identify the value of another attribute, this first attribute is
called a determinant. This is true for groups of attributes as well, so if A is the determinant of B, A and B
may either be single attributes, or more than one attribute. In figure 1, it can be seen that the performer-id
is associated with only one performer name (this is a to-one relationship). We can say that performer-id
functionally determines performer-name, and this is shown by the arrow. In addition, the type and location of
any particular performer are also uniquely determined by performer-id.
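Whether one attribute determines another can be tested against sample data. The function below is a minimal sketch; the performer rows are invented for illustration. A sample can only disprove a functional dependency, since a dependency is a rule about all possible data and not just the rows at hand.

```python
def holds(rows, lhs, rhs):
    """True if, in rows, equal values for lhs always come with equal values for rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Remember the first rhs value seen for this lhs value; any later
        # row with the same lhs but a different rhs breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample data
performers = [
    {"performer_id": 101, "name": "Baron", "type": "dancer", "location": "York"},
    {"performer_id": 102, "name": "Steed", "type": "actor", "location": "Berlin"},
    {"performer_id": 103, "name": "Baron", "type": "actor", "location": "York"},
]

# performer_id determines name, type and location in this sample
id_is_determinant = holds(performers, ["performer_id"], ["name", "type", "location"])

# name does not determine type: the two performers called Baron differ
name_determines_type = holds(performers, ["name"], ["type"])
```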
There are several possibilities for considering how the fee to a performer for a booking at a venue might be
calculated:
• fee negotiated with performer
The method by which the fee is calculated will affect how the data is modelled. The determinacy diagrams may be
different depending on the particular method of calculating the fee. For example, suppose the fee depends on the
performer-type i.e. on whether the performer is an actor, dancer, singer etc. The different types of performer
need to be identified, and a fee specified in each case. The value of the fee does not depend directly on the
performer-id, but is linked to the type of performer. This is called a transitive dependency and is depicted as
in figure 2, because it is an indirect association through the intermediate Performer-type. This is in contrast
to direct dependencies from Performer-id to Performer-name, to Performer-type and to Performer-location, as in
figure 1.
Suppose instead that the fee depends on both the performer and the agent. If more than one value is required
to determine the value of another attribute, the combination of values is known as a composite determinant (see
figure 3). Where every attribute in a composite determinant is required in order to uniquely determine the
value of an attribute, that attribute is said to be fully functionally dependent on that composite determinant.
Attributes that depend only on Performer-id (name, type and location of each performer) are linked directly to
Performer-id. Similarly, Name and Location of each agent are directly linked to Agent-id. In contrast, fee is
linked to the composite determinant. Performer and agent names and locations, and Performer-type, are partially
dependent on the composite determinant. If the value of an attribute does not depend on an entire composite
determinant, but only part of it, that relationship is known as a partial dependency.
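The difference between full and partial dependency can be made concrete by comparing each determinant with the composite determinant. The sketch below encodes the dependencies described above; the attribute spellings are assumptions, as the original diagram is not reproduced here.

```python
# Composite determinant from the example: performer and agent together
composite = frozenset({"performer_id", "agent_id"})

# Each entry is (determinant, attributes it determines)
dependencies = [
    (frozenset({"performer_id"}),
     {"performer_name", "performer_type", "performer_location"}),
    (frozenset({"agent_id"}),
     {"agent_name", "agent_location"}),
    (composite, {"fee"}),
]

# Partial dependency: the determinant is a proper subset of the composite determinant
partially_dependent = sorted(
    attr for lhs, rhs in dependencies if lhs < composite for attr in rhs)

# Full dependency: the whole composite determinant is required
fully_dependent = sorted(
    attr for lhs, rhs in dependencies if lhs == composite for attr in rhs)
```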
Multiple determinants
An event may have a unique identification number, and also a unique name. The relationship between event-id
and event-name is one-to-one. Dependencies between attributes event-id, event-name and event-type are shown in
figure 4.
Figure 3: Here Performer-id and Agent-id together are the composite determinant of Fee
Figure 4: Event-id functionally determines all the attributes, and Event-name does so too
Attribute closure
The closure of X, written X+, is all the attributes functionally determined by X. That is, X+ is all the values
that follow uniquely from X. Attribute closure is used to find keys, and is also used to see if a functional
dependency is true or false.
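To compute the closure of X:
• Start with Closure = X
• If the left-hand side of any functional dependency is contained in Closure, add its right-hand side to Closure
• Repeat the above step until no more changes to Closure are possible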
Can we answer the following two questions for relation R and its FDs below?
R(S, C, P, M, G, L, T)
• We must find if the closure of SL is the set of all attributes in R: if yes it’s a key, otherwise it is
not a key
• Using 2nd FD, SL functionally determines C, so we add C to the Closure, so Closure = {SLC}
• Using 1st FD, SC functionally determines PMG, so we add PMG to the Closure, so Closure = {SLCPMG}
• No more attributes can be added because no subset of the Closure functionally determines other attributes,
so (SL)+ is SLCPMG
Is SL a key for R? No, because the closure of SL is not equal to all the attributes in R (T is not in (SL)+)
Does SL functionally determine PG? Yes, because PG is in (SL)+
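The closure computation can be sketched in a few lines of Python. The two functional dependencies used here, SC → PMG and SL → C, are an assumption: they are only the dependencies named in the worked steps above, not necessarily the full list given for R. The function and variable names are illustrative.

```python
def closure(attrs, fds):
    """Return the set of attributes functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:                      # repeat until no more changes are possible
        changed = False
        for lhs, rhs in fds:
            # if the whole left-hand side is already in the closure, add the right-hand side
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


R = set("SCPMGLT")
# Assumed FDs, taken from the worked steps: 1st FD SC -> PMG, 2nd FD SL -> C
fds = [("SC", "PMG"), ("SL", "C")]

sl_closure = closure("SL", fds)
is_key = sl_closure == R                 # a key must determine every attribute of R
determines_pg = set("PG") <= sl_closure  # SL -> PG holds if PG is inside (SL)+

print(sorted(sl_closure), is_key, determines_pg)
```

Under these assumed dependencies the closure of SL contains S, L, C, P, M and G but not T, which matches the two answers above.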
Normalisation
Normalisation is the process which is used to ensure that data is structured in a logical and robust format.
Normalised data avoids anomalies that may occur in insertion, updates or deletion, and makes better use of
storage space. The most common transformations are from un-normalised data, through first and second to third
normal form. More advanced forms, e.g. Boyce-Codd, fourth and fifth normal forms, are not covered in this
course.
Un-normalised data
The table in figure 5 has details of performers, their agents, performance venues and booking dates. In this
example, the fee paid to a performer depends on the performer-type (e.g. the fee to all actors is 85). Headings
are shortened as shown below:
• P-id: performer-id
• Perf-name: performer-name
• Perf-type: performer-type
• Perf-Loc’n: performer-location
• A-id: agent-id
• Agent-Loc’n: agent-location
• V-id: venue-id
• Venue-Loc’n: venue-location
• E-id: event-id
Problems with un-normalised data
Some performers have more than one booking, at multiple venues, and have more than one agent. Multi-valued
attributes are not allowed in relational databases.
First normal form
The initial step in normalisation converts un-normalised data into first normal form. This means that we must
extract every tuple that has more than one value for an attribute, and replace it with several separate tuples
where each attribute has at most one value associated with it. A relation is in first normal form if every
attribute value in every tuple is atomic. This means it is in first normal form if there is only one value at
the intersection of each row and column. We can convert our table into first normal form (1NF) as shown in
figure 6. This is sometimes known as ‘flattening’ the table. Multi-valued attributes have been replaced, so
that each line in the table has the same format, with only one value in each column of each row (i.e. one value
for each attribute in each tuple). Where more than one booking was made for a performer, each booking is now a
separate entry.
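Flattening can be illustrated with a short Python sketch. The records, column names and values below are hypothetical and are not the contents of figure 5 or figure 6; the sketch assumes that the multi-valued part of each record is a list of bookings.

```python
# Hypothetical un-normalised records: each performer carries a list of bookings.
unnormalised = [
    {"performer_id": 101, "performer_name": "Ann",
     "bookings": [{"venue_id": 59, "event_id": 1},
                  {"venue_id": 62, "event_id": 2}]},
    {"performer_id": 102, "performer_name": "Ben",
     "bookings": [{"venue_id": 59, "event_id": 1}]},
]


def to_1nf(records, repeating="bookings"):
    """Replace each record by one row per entry in its repeating group."""
    rows = []
    for record in records:
        # the single-valued attributes are copied into every resulting row
        single = {k: v for k, v in record.items() if k != repeating}
        for group in record[repeating]:
            rows.append({**single, **group})
    return rows


flat = to_1nf(unnormalised)
for row in flat:
    print(row)
```

Every value in the resulting rows is atomic, at the cost of repeating the performer details once per booking. That repetition is the source of the problems discussed next.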
The relation in first normal form can exhibit some problems when we try to insert new tuples, update existing
values or delete tuples. If we wish to insert details for a new performer, agent, venue or event we cannot do
so, because the 1NF relation requires information about a performer, an agent, a venue, an event and a booking
before we are able to insert a tuple, since primary key attributes cannot be null. There is also a problem
updating a 1NF table. For example, if there is more than one entry in the relation for the same performer
(e.g. a performer who has several bookings), any change to that performer’s details must be done in all such
entries. Suppose a performer moved to another location. In first normal form, the full details for a performer
are repeated every time a booking is made, so each such entry must be updated to reflect the change in that
person’s location. Otherwise, the data in the relation will become inconsistent. A similar situation arises
with changes to agent, venue or event details. Deletion can also cause problems in a 1NF relation. Suppose
a tuple is deleted because an event is cancelled. If this event were the only one in the relation involving
a particular performer, and/or the only one involving a particular agent, and/or the only one involving a
particular venue, then the details of that performer, agent and/or venue would be lost! These problems show
that we need to store information about performers, agents, venues and events independently of each other, so
that we do not risk losing data, or being unable to insert data, or cause inconsistencies by failing to update
all copies of the same fact. The solution is to convert the relation in first normal form into a number of
separate relations in higher normal form.
Second normal form
Converting a relation from first normal form into second normal form (2NF) starts with finding its candidate
keys. For a relation to be in second normal form, all attributes must be fully functionally dependent on the
primary key. Data items which are only partially dependent on the primary key need to be extracted and placed
in new relations.
1. Performer-id uniquely determines the performer’s name, type and location: i.e. Performer-id → Performer-name,
Performer-type, Performer-location
2. Agent-id uniquely determines the agent’s name and location: i.e. Agent-id → Agent-name, Agent-location
3. Venue-id uniquely determines the venue’s name and location: i.e. Venue-id → Venue-name, Venue-location
4. Event-id uniquely determines the event’s name and type: i.e. Event-id → Event-name, Event-type
5. Event-name uniquely determines the event’s id and type: i.e. Event-name → Event-id, Event-type
6. The composite determinant comprising the four attributes Performer-id, Agent-id, Venue-id and Event-id
uniquely determines the Booking-date: i.e. Performer-id, Agent-id, Venue-id, Event-id → Booking-date
7. The Booking-date uniquely determines the Fee, i.e. all fees are paid according to the date on which the
booking was made (all bookings made on the same date will involve exactly the same fee): Booking-date →
Fee.
To find the primary key for our table we can find the closure of some attributes and attribute sets. Is
Performer-id a key for the table? Using (1) above, its closure is {Performer-id, Performer-name,
Performer-type, Performer-location}. This closure does not give all attributes in the table, so Performer-id is not a
candidate key. Similarly, the closure of Agent-id is only {Agent-id, Agent-name, Agent-location}; the closure of
Venue-id is only {Venue-id, Venue-name, Venue-location}; the closure of Booking-date is only {Booking-date,
Fee}; and the closure of both Event-id and Event-name is only {Event-id, Event-name, Event-type}. Consider next
whether {Performer-id, Agent-id, Venue-id, Event-id} is perhaps a candidate key for this table:
• The closure of {Performer-id, Agent-id, Venue-id, Event-id} starts off as the set of those 4 attributes
themselves.
• Using (1) above we can add Performer-name, Performer-type and Performer-location.
• Similarly, using (2) above we can add Agent-name and Agent-location;
• using (3) above we can add Venue-name and Venue-location;
• using (4) above we can add Event-name;
• using (5) above we can add Event-type;
• using (6) above we can add Booking-date;
• and using (7) above we can add Fee to the closure of {Performer-id, Agent-id, Venue-id, Event-id} as well.
• This means every attribute of the relation is now in the closure of {Performer-id, Agent-id, Venue-id,
Event-id}, so that 4-attribute composite determinant is a candidate key.
• As an exercise, see for yourself that the only other candidate key is {Performer-id, Agent-id, Venue-id,
Event-name}.
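The search for candidate keys can also be done by brute force: try attribute sets in order of increasing size and keep each one whose closure is the whole relation, skipping any set that already contains a smaller key. The sketch below is one possible way to do this in Python using the seven dependencies listed above. The function names are illustrative, and the exhaustive search is only practical for small relations such as this one.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# Dependencies (1) to (7) from the text, as (determinant, dependents) pairs.
FDS = [
    ({"Performer-id"}, {"Performer-name", "Performer-type", "Performer-location"}),
    ({"Agent-id"}, {"Agent-name", "Agent-location"}),
    ({"Venue-id"}, {"Venue-name", "Venue-location"}),
    ({"Event-id"}, {"Event-name", "Event-type"}),
    ({"Event-name"}, {"Event-id", "Event-type"}),
    ({"Performer-id", "Agent-id", "Venue-id", "Event-id"}, {"Booking-date"}),
    ({"Booking-date"}, {"Fee"}),
]
ATTRIBUTES = set().union(*(lhs | rhs for lhs, rhs in FDS))


def candidate_keys(attributes, fds):
    """Return every minimal attribute set whose closure is the whole relation."""
    keys = []
    ordered = sorted(attributes)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == attributes:
                keys.append(candidate)
    return keys


keys = candidate_keys(ATTRIBUTES, FDS)
for key in keys:
    print(sorted(key))
```

Running the sketch finds exactly the two candidate keys named above.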
An attribute is a prime attribute if it is part of some candidate key for a relation. An attribute that is
not part of any candidate key for a relation is called a non-prime attribute. In our example, the only prime
attributes are Performer-id, Agent-id, Venue-id, Event-id and Event-name. All other attributes are non-prime.
A relation is in 2nd normal form (2NF) if and only if all its non-prime attributes are fully functionally
dependent on the key. More formally: A relation R is in 2NF if and only if no functional dependency X → Y holds
in R, such that X is a proper subset of some candidate key for R and Y is a non-prime attribute. To transform
a 1NF relation into 2NF, we take every dependency X → Y that violates 2NF and do the following:
Figure 7: Functional dependencies in our working example
• remove Y from the relation
• create a new relation comprising X and Y, in which X is the key
We therefore replace our relation with a relation from which all attributes partially dependent on the key have
been removed, and create additional new relations as described above. Effectively 2NF identifies situations
where a table representing a relationship contains attributes (details) of the participating entities. Clearly,
these are not attributes of the relationship, so they do not belong in that table. Converting to 2NF ensures
that separate tables exist for all such entities, and that their attributes are correctly placed there where
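they belong! In our example, the resulting tables are:
• Performers (Performer-id, Performer-name, Performer-type, Performer-location)
• Agents (Agent-id, Agent-name, Agent-location)
• Venues (Venue-id, Venue-name, Venue-location)
• Events (Event-id, Event-name, Event-type)
• Bookings (Performer-id, Agent-id, Venue-id, Event-id, Booking-date, Fee)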
For our performer case study, the single relation in first normal form (1NF) is thus transformed into five
relations in second normal form: performers, agents, venues, events and bookings. The creation of an independent
new relation for performers has the following benefits, which resolve the problems encountered with the single
relation in first normal form:
• A single amendment will be sufficient to update performer location even if several bookings are involved.
• The deletion of a performer record will not result in the loss of details concerning agents, venues or
events, as performers, agents, venues and events are now stored independently of each other.
This is true not only of Performers, but similarly also for Agents, Venues and Events.
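The partial dependencies that violate 2NF can be found mechanically: take every proper subset of the key, compute its closure, and see which non-prime attributes it determines. The following Python sketch does this for the working example, using the chosen primary key and the prime attributes identified earlier. The function names are illustrative, and the sketch reports the violations only; it does not build the new relations.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attributes functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


FDS = [
    ({"Performer-id"}, {"Performer-name", "Performer-type", "Performer-location"}),
    ({"Agent-id"}, {"Agent-name", "Agent-location"}),
    ({"Venue-id"}, {"Venue-name", "Venue-location"}),
    ({"Event-id"}, {"Event-name", "Event-type"}),
    ({"Event-name"}, {"Event-id", "Event-type"}),
    ({"Performer-id", "Agent-id", "Venue-id", "Event-id"}, {"Booking-date"}),
    ({"Booking-date"}, {"Fee"}),
]
KEY = {"Performer-id", "Agent-id", "Venue-id", "Event-id"}
PRIME = KEY | {"Event-name"}  # attributes that belong to some candidate key


def partial_dependencies(key, prime, fds):
    """Map each proper subset of `key` to the non-prime attributes it determines."""
    found = {}
    explained = set()  # attributes already attributed to a smaller subset
    for size in range(1, len(key)):
        newly_explained = set()
        for subset in combinations(sorted(key), size):
            dependents = closure(subset, fds) - prime - explained
            if dependents:
                found[subset] = dependents
                newly_explained |= dependents
        explained |= newly_explained
    return found


violations = partial_dependencies(KEY, PRIME, FDS)
for determinant, dependents in violations.items():
    print(", ".join(determinant), "->", ", ".join(sorted(dependents)))
```

Booking-date and Fee are not reported because they depend on the whole key, and Event-name is not reported because it is a prime attribute.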
There are still insertion, update and deletion problems with our design however. We cannot enter a fee for any
date (e.g. if fees must increase from next month) until we have some booking on that date. This is because
the primary key attributes of a relation can never be null. If the fee for any date needs to be updated, there
may be many tuples with this booking-date, so inconsistencies will arise if some are updated but not all of
them. If there is inconsistency in a database - i.e. the same fact is stored in two separate places but has
two different values - it is impossible to tell which value is correct. Also, if we delete the only booking
currently made for next month, we lose the information as to what fee applies to that date. All these anomalies
are caused by the fee being dependent on the booking-date, and not directly dependent on the primary key of
the relation. This indirect, or transitive, dependency can be resolved by transforming the relations from 2NF
into third normal form. This is done by extracting the attributes involved in the indirect dependency into a
separate new relation.
Third normal form
We convert a table from 2NF into third normal form to ensure data depends directly on the primary key, and not
through some other relationship with another attribute (known as an indirect, or transitive, dependency). For
a relation to be in third normal form (3NF), all non-prime attributes must be directly dependent on the primary
key. Formally, a relation R is in 3NF if and only if, for every non-trivial dependency X → Y that holds on R,
either X contains a candidate key for R, or Y is a prime attribute. To transform a relation into 3NF, we take
every dependency X → Y that violates 3NF and do the following:
• remove Y from the relation
• create a new relation comprising X and Y, in which X is the key
Effectively, a transitive/indirect dependency is a separate relationship (between X and Y) that should therefore
have a separate table to represent it. How do we know it is a separate relationship? Precisely because X is not
a key of the original table and Y is not part of any key for that table! In our example, the only dependencies
that apply to the Performers table are those in which the key functionally determines another attribute, so
that table is in 3NF. This is also the case for the Agents, Venues and Events tables. As regards the Events
relation, note that although we have
Event-name → Event-id, Event-type
this does not violate 3NF because Event-name is a candidate key for the Events relation (we just didn’t choose
it as the primary key). The remaining relation, Bookings, does violate 3NF however, because of the dependency
Booking-date → Fee, as Booking-date is not a candidate key for the relation and Fee is not a prime attribute.
Stating this same fact another way to show the transitive dependency, we note:
Performer-id, Agent-id, Venue-id, Event-id → Booking-date
Booking-date → Fee
Therefore Fee must be removed from the Bookings relation, and a new relation must be created. Thus we now have:
• Bookings (Performer-id, Agent-id, Venue-id, Event-id, Booking-date)
• Fees (Booking-date, Fee)
Effectively, there is a separate relationship between booking-dates and fees that has nothing to do with the
performer/agent/venue/event involved. So we create a separate relationship table for this. The fee for a new
booking-date can easily be added to the database, as this only requires inserting a new tuple in Fees. Similarly,
if the fee for any Booking-date changes, there is only one tuple to update, the one in the Fees relation, so no
inconsistencies can arise. And if we delete the only booking for a specific date from the Bookings table, the
fee associated with that date is still stored in the Fees table.
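The 3NF test on the Bookings relation can be sketched in Python as follows. The sketch is a simplification: it checks only the dependencies listed explicitly for Bookings, (6) and (7), rather than every dependency implied by them. The function and variable names are illustrative.

```python
def closure(attrs, fds):
    """Attributes functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


KEY = {"Performer-id", "Agent-id", "Venue-id", "Event-id"}
BOOKINGS = KEY | {"Booking-date", "Fee"}
BOOKING_FDS = [
    (KEY, {"Booking-date"}),        # dependency (6)
    ({"Booking-date"}, {"Fee"}),    # dependency (7)
]
PRIME = set(KEY)  # within Bookings, only the key attributes are prime


def violations_3nf(attributes, fds, prime):
    """List the dependencies X -> Y where X contains no key and Y has non-prime attributes."""
    bad = []
    for lhs, rhs in fds:
        contains_key = closure(lhs, fds) >= attributes
        non_prime_rhs = (rhs - lhs) - prime
        if not contains_key and non_prime_rhs:
            bad.append((lhs, non_prime_rhs))
    return bad


violations = violations_3nf(BOOKINGS, BOOKING_FDS, PRIME)

# Decompose on the violating dependency: remove Y, and create a relation for X and Y.
determinant, dependents = violations[0]
bookings_3nf = BOOKINGS - dependents
fees = determinant | dependents

print(sorted(bookings_3nf))
print(sorted(fees))
```

The only violation reported is Booking-date → Fee, and removing it yields the Bookings and Fees relations described above.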
We have seen how the original set of data items has been transformed through the initial process of identifying
dependencies between data items, and the formulation of successively higher normal-form relations. The steps
used to derive each successive normal form are summarised below:
• First obtain 1NF: replace any tuple that has multi-valued attributes by separate tuples, each containing
one of those values.
• Determine what functional dependencies exist, i.e. identify data items which are the determinants of other
data items.
• Obtain second normal form by removing any non-prime attributes that are not fully functionally dependent
on the primary key of their relation. Create a separate relation for these, comprising the determinant
and the attribute(s) it uniquely determines.
• Obtain third normal form by removing any non-prime attributes that are transitively dependent on the
primary key of their relation. Create a separate relation for these, comprising the determinant and the
attribute(s) it uniquely determines.
Third normal form (3NF) is the point at which normalisation of most database designs is considered to be
complete.
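To see the effect of the final design on the insertion, update and deletion problems, the Bookings and Fees relations can be tried out in a small in-memory SQLite database from Python. This is a sketch only: the column names, data types, dates and fee values are hypothetical, and the Performers, Agents, Venues and Events tables are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE fees (
    booking_date TEXT PRIMARY KEY,
    fee          INTEGER NOT NULL
);
CREATE TABLE bookings (
    performer_id INTEGER NOT NULL,
    agent_id     INTEGER NOT NULL,
    venue_id     INTEGER NOT NULL,
    event_id     INTEGER NOT NULL,
    booking_date TEXT NOT NULL REFERENCES fees(booking_date),
    PRIMARY KEY (performer_id, agent_id, venue_id, event_id)
);
""")

# Insertion: a fee can be recorded for a date before any booking exists on that date.
conn.execute("INSERT INTO fees VALUES ('2030-01-15', 85)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?, ?, ?)",
    [(101, 1, 59, 1, "2030-01-15"), (102, 1, 59, 1, "2030-01-15")],
)

# Update: the fee is stored once, so a single change applies to every booking on that date.
conn.execute("UPDATE fees SET fee = 90 WHERE booking_date = '2030-01-15'")
joined = conn.execute(
    "SELECT b.performer_id, f.fee FROM bookings b "
    "JOIN fees f ON f.booking_date = b.booking_date ORDER BY b.performer_id"
).fetchall()

# Deletion: removing every booking does not lose the fee for the date.
conn.execute("DELETE FROM bookings")
remaining_fee = conn.execute(
    "SELECT fee FROM fees WHERE booking_date = '2030-01-15'"
).fetchone()[0]

print(joined, remaining_fee)
```

Both bookings pick up the new fee after one update, and the fee is still stored after the bookings have been deleted.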
Review questions
1. Give an example of a functional dependency of the form A → B that holds true for the attributes of a
university such as UCT.
2. Give an example of a functional dependency of the form A → B that is not true for the attributes of a
university such as UCT.
4. Give an example of a functional dependency of the form A,B → C that holds for the attributes of a university
such as UCT, where C is fully functionally dependent on A and B.
5. Give an example of a transitive dependency that holds for the attributes of a university such as UCT.
6. Given the functional dependencies below, answer the questions that follow:
• W,P → C
• J,P → W
• J → F
• F → T
6B. If these attributes were in a single relation R(W, P, C, J, F, T) would R be in 2nd normal form? Give a
reason for your answer.
6C. If these attributes were in a single relation R(W, P, C, J, F, T) it would not be in 3rd normal form. Design
a 3NF scheme to replace R.
6D. If these attributes were in a single relation R(W, P, C, J, F, T), give an example of any one insertion,
deletion or update problem that could arise.