An Introduction To Databases With Web Applications - Martyn Prigmore
Databases are about information sharing. They allow many people to access data at the same
time, and also to mine that data for useful information. However, many database applications
were limited to one organization, and usually to a single physical location. Only the biggest
companies had the necessary networking resources to allow their people to access information
from multiple locations.
The evolution of the world wide web, and the information sharing ethos it promotes, has
smashed these boundaries. Now, any organization can use the web to allow its people to
access data from home, or to share its data with clients. This book shows you how to use web
and database technologies, and the latest design techniques, to do just that.
An Introduction to Databases with Web Applications takes a unique approach in that it gives
equal weight to the key concepts of both database development and web technology and
presents them in a single, introductory textbook. By setting database development in the
context of the web the author overcomes the problem of learning each aspect in isolation and
then trying to ‘bolt on’ one to the other. Instead, he gives you a genuine understanding and
solid grounding in this important field.
This book will be invaluable to undergraduate students studying databases as part of a
business computing and information systems programme or a computer science degree; and
those on wider business courses as they will all
KEY FEATURES
● Covers the core of a traditional introductory database module, but sets the development
in the context of the web.
● Covers scripting languages first so you understand how a database application is
implemented before trying to design it.
● The chapter on XML presents some of the possibilities this non-relational alternative may
offer the world of databases.
● Review questions and exercises allow you to test your knowledge, whilst investigations at
the end of each chapter prompt you to delve even deeper into the topics presented.
DR MARTYN PRIGMORE is a senior lecturer at the University of Huddersfield. He has eight
years’ experience teaching web databases and e-commerce.
Cover images © Getty Images
www.pearson-books.com
IDWA_FM.QXD 10/5/07 11:54 AM Page ii
An Introduction to
Databases with
Web Applications
Martyn Prigmore
Edinburgh Gate
Harlow
Essex CM20 2JE
England
The right of Martyn Prigmore to be identified as author of this work has been asserted by him
in accordance with the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without either the prior written permission of the publisher or a
licence permitting restricted copying in the United Kingdom issued by the Copyright
Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.
ISBN: 978-0-321-26359-9
For Andrew
Brief contents
Preface
3 Data models
12 XML
Contents
Preface
Aims of the book
How to use this book
Thanks
Acknowledgements
Chapter objectives
Chapter outline
1.1 Developing information systems
1.2 Case studies of information systems
1.2.1 The Staff Directory case study
1.3 Business rules
1.4 Databases and database technology
1.5 The Web and web technology
1.6 Web database applications
1.6.1 Data duplication
1.6.2 Program–data dependence
1.6.3 Fixed queries
1.6.4 Data integrity
1.6.5 Concurrent access
1.7 Roles
Chapter summary
Further reading
Review questions
Exercises
Investigations
Chapter objectives
Chapter outline
2.1 Client-server computing
2.2 Networks, interconnected networks and the Internet
2.3 Database technology
2.4 Web technology
2.5 Web database applications
Chapter summary
Further reading
Review questions
Exercises
Investigations
3 Data models
Chapter objectives
Chapter outline
3.1 Representing data items
3.2 The relational data model
3.2.1 Organizing data in the relational data model
3.2.2 Integrity constraints in the relational data model
3.2.3 Relational data languages
3.3 Semistructured data models and XML
3.3.1 Organizing data in XML
3.3.2 Integrity constraints in XML
3.3.3 Data languages for XML
3.3.4 Some comments on XML as a data model
Chapter summary
Further reading
Review questions
Exercises
Investigations
5.4 Creating, and removing, databases and tables with SQL DDL
5.5 Altering tables with SQL DDL
Chapter summary
Further reading
Review questions
Exercises
Investigations
12 XML
Glossary
References and bibliography
Index
Preface
Databases form the heart of most contemporary information systems, providing timely
access to accurate data for people in a range of organizations. They allow many people
to share the same data at the same time and mine that data for useful information.
Understanding databases and their associated technologies is essential for students on
courses in business computing and information systems. Students on general computing
and business courses benefit from studying them, too, as both will encounter databases
in their professional lives.
Traditional database applications were often limited to users in one organization
and usually to those at a single physical location. Only large organizations could afford
the proprietary networking technologies needed to allow databases in one town
to be accessed by staff in another. The development of the Internet and, subsequently,
of the World Wide Web, changed that. The Internet, and particularly the World Wide
Web, promote information sharing without boundaries. Anyone, anywhere can access
information at a time that suits them. To share its data, an organization simply needs
to hook up its network to the Internet and put a web front-end on its existing database.
More and more organizations are seeking to do this, so it has become increasingly
important for business computing and information systems graduates to understand
the technical challenges this seemingly simple requirement poses.
Another recent development is the appearance of XML as an alternative to relational
databases. Compared to relational database technologies, XML-based technologies
are immature. It is still not clear whether XML databases will emerge as a fully
fledged parallel database technology or be subsumed into the existing relational database
technologies. Graduates need to understand the strengths and weaknesses of
XML as a format for organizing data and the limitations of current XML database
technologies.
chapters cover the underlying concepts of both database and web technologies, providing
a sound theoretical basis for students to build on.
Chapter 1 gives a general introduction to the key ideas. Chapter 2 introduces the
network, database and web technologies that underpin every web database application.
Chapter 3 discusses the theory of data models in detail, covering both relational
and semistructured data models.
Chapters 4–7 cover scripting, for both web and database servers. They introduce
web development with server-side scripting languages, using the data language SQL
and the server-side scripting language PHP.
Chapter 4 starts from the problem of how to use a web page to deliver data drawn
from a database and uses this to introduce the basics of HTML and PHP (with a tiny bit
of SQL).
Chapters 5 and 6 cover the SQL data language in detail. They use the ISO SQL:2003
standard, but explain many DBMS-specific features.
Chapter 7 discusses standard techniques of web database development, including
how to deliver the results of database queries over multiple pages, how to maintain
application state and how to secure your application. Again, the examples use PHP but
the focus is on the techniques, not the scripting language.
Chapters 8–11 discuss database design. The traditional topics of ER diagrams and
normalization are covered in Chapters 8 and 9.
Chapter 10 explores issues that arise during physical database design. This chapter
includes discussion of controlled redundancy, surrogate keys, data storage structures
and how to estimate a database application’s likely performance.
Chapter 11 extends the traditional discussion of database design to include recent
work on designing web database applications.
Chapter 12 concludes the book with an exploration of XML and its related technologies,
focusing on its use for data organization. The chapter covers XML schema
languages, for describing the structure of XML documents, and approaches to locating
data within an XML document. It also explains how to use an ER diagram to design
the XML schema.
Although the book sets databases in the context of the Web, it covers the traditional
core of an introductory database module. To use it in this way, simply omit Chapters 4,
7 and 11 and the web technology section of Chapter 2.
The book covers scripting languages before it covers design. This is because it is useful
to understand how an application will be implemented before trying to design it.
However, it’s possible to read most of the design chapters before covering implementation.
Only Chapter 10 and the final section of Chapter 11, on physical design, require
an understanding of SQL.
This leads to two suggested ‘routes’ through the chapters. To use the book as written,
start from Chapter 1 and read the chapters in order. To cover design before implementation,
read the chapters in this order: 1, 2, 3, 8, 9, 11, but not Section 11.3, then
4, 5, 6, 7, 10, Section 11.3 and 12.
Thanks
Thanks to Andrew Hawkett for his support and encouragement while I wrote this book
and for criticizing the early drafts. To my parents, Brian and Jean Prigmore, and my sister,
Alison Prigmore, for tea and sympathy. To Kate Brewin, Owen Knight and Simon
Plumtree, all at various times editors of the book, for their help and advice. Thanks
also to the anonymous reviewers whose critical comments on the first draft were very
useful.
Acknowledgements
Adobe and Acrobat Reader are either registered trademarks or trademarks of Adobe
Systems Incorporated in the United States and/or other countries.
IBM and DB2 are trademarks of International Business Machines Corporation in the
United States, other countries or both.
Macromedia and ColdFusion are trademarks or registered trademarks of Macromedia,
Inc. in the United States and/or other countries.
Microsoft, Access and Windows are either registered trademarks or trademarks of
the Microsoft Corporation in the United States and/or other countries.
MySQL is a registered trademark of MySQL AB in the United States, the European
Union and other countries.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
W3C® is a registered trademark of the Massachusetts Institute of Technology, European
Research Consortium for Informatics and Mathematics or Keio University on behalf
of the World Wide Web Consortium.
IDWA_C01.QXD 10/5/07 11:01 AM Page 1
Chapter objectives
➔ To review the information systems lifecycle and the place of application development
within that lifecycle.
➔ To discuss business rules and their importance for application development.
➔ To introduce key themes in the study of web database applications and associated
technologies.
➔ To introduce the case studies.
Chapter outline
Databases are at the heart of many modern information systems. They store data,
basic facts, that are processed to provide information, some useful combination of
facts. The Web offers one way to deliver information to users of the database. This
chapter gives an overview of information systems, databases and the Web, examines
some benefits of using these technologies and mentions some of the difficulties. Much
of the material serves as a gentle introduction to the more detailed discussion of subsequent
chapters.
Examples help in any textbook and in this one they come from the fictional Pennine
University. Like most universities, the Pennine University uses a number of information
systems, including a staff directory. The staff directory is a particularly simple information
system that, nevertheless, has some interesting features. It forms a running
example through the book, with other university applications providing examples of
more complex situations.
■ Section 1.1 introduces the notion of an information system and the software technologies
that support computerized information systems. The information systems
lifecycle is introduced as a way to approach the complex task of developing computerized
information systems.
■ Section 1.2 introduces the main case studies, using use cases, and explains how to
read a use case diagram.
■ Section 1.3 discusses the important topic of business rules. It explores the different
kinds of business rules there are and discusses how they can be enforced.
■ Section 1.4 introduces the notion of a database and discusses database technology. It
also introduces the relational data model as an example of one approach to organizing
data.
■ The World Wide Web and its associated technologies are discussed in Section 1.5
and web database applications in Section 1.6. In particular, Section 1.6 discusses
the classic problems that beset applications not using a database.
■ Section 1.7 ends the chapter with a quick run through of the different sorts of people
who are involved in developing web database applications.
1.1 Developing information systems
Information systems are the means by which people and organizations, utilizing technologies,
gather, process, store, use and disseminate information. (UKAIS, 1999, p. 1)
So, information systems provide individuals, businesses, governments and other organizations
with timely access to accurate information. They support people as they
work through the procedures that their organization follows, making their work easier
and more efficient. Most contemporary information systems use computer technologies
to achieve this. For example, a university may ask students to submit their assignments
electronically via a website and check the document for plagiarism using
computer technology. The actual marking is done by a person, but then the marks are
recorded and collated using a computerized information system.
The technologies that support the computerized parts of an information system include
both hardware and the software that runs on it. There are two broad categories
of software:
● system software
● application software
System software – for example, operating systems and printer drivers – runs the computer
and its peripheral hardware. It forms the basic software infrastructure on
which to develop application software, such as word processors and e-mail clients. It is
the application software that people interact with most of the time. The focus of this
book is on developing application software using web and database technologies. Such
application software is usually called a web database application.
Developing application software is only one part, though an important part, of developing
an information system. It is useful to set application software development
in the broader lifecycle of an information system. A model of the information systems
lifecycle describes a phased approach to planning, developing and maintaining an
information system. Each phase focuses on specific activities and produces a range of
outputs that feed into the subsequent phases. The collection of phases describes the full
lifecycle of an information system – from the initial idea, through development and
into its day-to-day use.
There are a number of different information systems lifecycle models, though two
linked ideas are common to contemporary descriptions. First, software development is
an incremental process. That is, it moves forward in small steps, with each step building
on what has already been achieved. Second, it is an iterative process. This means that
the information system is a work in progress, with different parts revisited and polished
throughout the lifecycle. Figure 1.1 presents one view of the information systems
lifecycle. Beginning with the analysis of the problem, it moves into the development
phase, with its design–build–test cycle familiar to application software developers,
then into the final phase of implementation and maintenance. The circling arrows indicate
that the development phase is iterative and incremental. This means that a design
is realized by building software, the software is tested and flaws found, which
means rethinking the design. Other cycles occur, but each turn around a cycle leads to
a better design or better software, so takes the developers closer to a working information
system.
[Figure 1.1: diagram of the information systems lifecycle, not reproduced. Labels and notes
recovered from the figure:]
Develop the information system
Build system: Produce the system
• Create development environment
• Build bespoke components
• Assemble packaged components
Test system: Does it do what it should?
• Create test environment
• Test individual components
• Test integration of components
• Test system as a whole
• Ensure users are happy with final product
Implement and maintain system: Deliver to the end users
• Create production environment
• Train users
• Migrate legacy data
• Fix bugs
• Make minor enhancements
The notes in Figure 1.1 indicate some of the activities that are carried out in each
phase. Only the development phase is discussed in detail in this book and then only
selected activities. The focus is on the development of bespoke (written from scratch)
web database applications and the key activities in these development projects.
Looking at the information systems lifecycle in more detail, the problem analysis
phase usually generates a list of requirements, which state what the information system
must do, but not how to do it. There are two broad categories of requirements.
● Functional requirements are what information is to be held and what processes will
be needed, the data and behaviour of the information system.
● Non-functional requirements are any other requirements of the information system –
a requirement to use a particular operating system, respond to users’ requests within
a set time or comply with certain regulations, for example.
Regulatory requirements are becoming more important than they used to be. For example,
any organization that gathers information about people within the EU must comply
with EU data protection laws. Recognizing this requirement in advance and building
information systems that automate compliance can save users time and effort when they
are in daily use. However, data protection requirements are not always essential to the
functioning of an information system. An organization could choose to handle data
protection requirements by other means, so simplifying the problem analysis.
Once a list of requirements has been agreed, the development of the information
system begins. Three subphases of development are highlighted. The first, designing a
solution, includes understanding what system software is required and choosing how
the application software will utilize this infrastructure, which is discussed in Chapter 2.
Designing the application software components is the main focus of Chapters 8–11,
which discuss approaches to both database and web application design and include
some discussion of hardware.
User experience design deals with presentation, describing the look and feel of the
application and the navigation paths through it. Good user experience design ensures
a consistent approach to visual effects and navigation through the whole application.
It is particularly associated with web applications, where creating visually appealing,
even exciting, web pages has been a major concern.
User experience design is usually a separate design activity, undertaken parallel to
the design of the application software components themselves. The design phase produces
a set of designs, often as graphical models, ready to be built. The software components
are then built with the look and feel set out in the user experience design and
the functionality set out in the application software design.
Chapters 4–7 develop the necessary skills to build bespoke application software
components that realize these designs using web and database technologies.
Testing tends to be done both during and after the build stage. While building the
solution, individual developers will test their own components. Once all the components
have been built, the system is tested as a whole to ensure that the various parts
fit together correctly and work effectively.
The final stage of the lifecycle – implementation and maintenance – sees the information
system being used on a day-to-day basis. Notice that this phase includes the
activity ‘Make minor enhancements’. What counts as ‘minor’ varies depending on the
organization. The key point is that major enhancements mean starting a new run
through the lifecycle, with the existing information system as one of the inputs for the
analysis phase. Major enhancements could include upgrading the system software,
moving to new hardware or making significant changes to the behaviour of the application
software. In such cases, it is important to analyse the problem before developing
and implementing a suitable solution.
Examples of information systems are easy to find. Automatic teller machines
(ATM), parking meters and the dashboard of a car are all information systems. On the
Web, online retailers use web pages to deliver information about their stock and gather
payment information from customers.
E-commerce retail information systems have provided the main examples of web
database applications in the past, but, increasingly, there are more sophisticated ones.
The Associated Board of the Royal Schools of Music (ABRSM) now offer Music Medals,
which are music examinations aimed at school children that ‘make maximum use of
online technologies’ (ABRSM, no date). Registering for the exam, delivery of the exam
materials and notification of results are all done over the Web. Another interesting web
database application is Wikipedia – ‘a Web-based, multi-language, free-content encyclopedia
written collaboratively by volunteers’ (Wikipedia, 2005). These ‘volunteers’
add new articles or edit existing ones.
Developing sophisticated web database applications such as these requires considerable
skill and experience. To learn the skills, and gain the experience, it is best to start
small, so the next section introduces the Staff Directory case study – a nice, small information
system.
1.2 Case studies of information systems
The requirements for an information system can be documented as use cases. A use
case is simply a description of one way in which an information system is used. In particular,
it describes how the information system behaves in response to a user request.
Often the information system is used by people, but it may be used by other application
software. For example, a mail-merge document may get its data from a spreadsheet,
so, here, the word processor is a user of the spreadsheet application.
The term actor is used to describe a role that a user of the information system, human
or otherwise, might take. Actors are not themselves part of the information system.
The terms use case and actor are part of the Unified Modeling Language (UML) – ‘a
visual language for specifying, constructing and documenting the artifacts of systems’
(OMG, 2003, p. 20). The official UML definitions of use case and actor are as follows.
Actor A construct that is employed in use cases that define a role that a user or
any other system plays when interacting with the system under consideration.
(OMG, 2003, p. 4)
UML is primarily intended as a way to model object-orientated systems, which is not the
subject of this book, but two facts strongly suggest that UML notation should be used
where possible. First, application software developers are likely to be presented with use
case diagrams as part of the requirements specification, so ought to be able to read them.
For this reason, the case studies in this book are presented as use cases. It isn’t necessary
to understand how these were created, only to be able to read them. Second, the main
diagramming tool for database design – entity-relationship diagrams – can be written
quite effectively using UML notation. Chapter 8 describes in detail how this is done.
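Figure 1.2 Use case diagram for the Staff Directory application.
[Diagram not reproduced. Labels recovered from the figure: the system boundary Staff Directory;
the actors University member, Academic and Visitor; an «extend» relationship.]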
drawn as a dashed line with the keyword «extend» written next to it and an open arrowhead
pointing at the extended use case – that is the one the behaviour of which is
being extended. Any user of Get contact details (the extended use case) has the option
to use Get support sessions (the extending use case), so there is no need to associate
University member and Visitor with Get support sessions. In other words, an association
between the actors and Get support sessions is implied by the extend relationship.
The actor Academic is only associated with the use case Manage own support sessions,
which allows the days and times of support sessions to be changed. This seems
odd as academics may also want to get contact details for their colleagues, so might be
expected to be associated with Get contact details. This illustrates the difference between
an actor (Academic) and a user (who happens to be an academic). Consider Paul
Smith, who lectures in the School of Computing. Taking the role of Academic, he may
change his support session details. He could also use the Staff Directory like any other
member of staff, to find contact details for a colleague. When he does so, he is acting
in the role of University member. One person is acting in two different roles that allow
him to use the information system in two different ways.
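The difference between an actor and a user can be sketched in a few lines of Python. This is only
an illustration, not part of the Staff Directory design: the names ASSOCIATIONS, EXTENDED_BY and
available_use_cases are invented here, and the sketch records only the associations mentioned in
the discussion of Figure 1.2 above.

```python
# Use cases directly associated with each actor, as described for Figure 1.2.
# (Only the associations mentioned in the text are recorded here.)
ASSOCIATIONS = {
    "University member": {"Get contact details"},
    "Visitor": {"Get contact details"},
    "Academic": {"Manage own support sessions"},
}

# Extend relationships: a user of the key may optionally use each value.
EXTENDED_BY = {"Get contact details": {"Get support sessions"}}


def available_use_cases(roles):
    """Return every use case open to a user acting in the given roles."""
    available = set()
    for role in roles:
        for use_case in ASSOCIATIONS[role]:
            available.add(use_case)
            # The extend relationship implies access to the extending use case.
            available |= EXTENDED_BY.get(use_case, set())
    return available


# One user, Paul Smith, acting in two roles.
paul_smith = available_use_cases(["Academic", "University member"])
academic_only = available_use_cases(["Academic"])
```

Acting only as Academic, a user reaches just Manage own support sessions. It is the second role,
University member, that opens up Get contact details and, through the extend relationship, Get
support sessions.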
Each use case has a written description, or use case specification, that describes the
expected behaviour associated with it. Together, the use case specifications form the
formal definition of the users’ requirements.
There is a range of styles for writing use case specifications. The style adopted for
the case studies in this book is informal and the closest to a traditional textbook case
study. The use case for Get staff contact details shown below is typical. Notice that use
cases provide considerably more detail about behaviour than the summary given in
the first paragraph of this section. Appendix A includes a fuller account of how these
use case specifications were derived. However, to use this book, you only need to know
how to read a use case diagram and the use case specifications, not how to write them.
The full contact details for an academic provide access to a list of support sessions
offered by him or her.
There is also an option to view a ‘help’ page, giving instructions on how to use the
search facility.
Business rules are usually identified during the analysis phase and can be documented
in the use cases. Here are some examples of business rules at Pennine University.
(b) the status of an assessment may change from draft to moderated or from moderated
to issued. No other changes are permitted.
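Rule (b) restricts how a value may change, not just which values it may take. One way to enforce
such a rule in application software is a lookup table of permitted changes. The sketch below is
an assumption about how this might be written in Python; the names ALLOWED_CHANGES and
change_status are invented for illustration.

```python
# Permitted status changes for an assessment, taken from rule (b).
ALLOWED_CHANGES = {
    "draft": {"moderated"},
    "moderated": {"issued"},
    "issued": set(),  # no change away from issued is permitted
}


def change_status(current: str, new: str) -> str:
    """Return the new status, or raise ValueError if the change is not permitted."""
    if new not in ALLOWED_CHANGES.get(current, set()):
        raise ValueError(f"status may not change from {current!r} to {new!r}")
    return new


status = change_status("draft", "moderated")   # permitted
status = change_status(status, "issued")       # permitted
```

Any other request, such as changing draft directly to issued, raises an error, so the rule is
enforced wherever the function is used.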
1.4 Databases and database technology
A database is a collection of data, though not just any collection counts as a database.
Usually the data are related in some way. The database for the Staff Directory holds
data about staff; it does not hold data about students. The data in a database is persistent,
meaning that, once stored, it is available until it is deleted. This means that the
data is usually stored on a hard disk, rather than in a computer’s memory.
A database is structured to make it easy to use the data. Data about staff are gathered
into a single data structure, which clearly distinguishes the different data items. Thus,
first and last names are held separately and phone numbers are kept distinct from
e-mail addresses.
A description of each data structure is also held in the database – what data it holds
and whether or not the business rules place constraints on the data. This sort of data is
called meta-data. This is because it is data about the Staff Directory database, not data
about the staff. Here’s a succinct definition of a database.
The description of the database structure is usually called the database schema. The
actual data in the database at any particular time is called the database instance. Typically,
the database schema does not change very often, while the database instance
changes frequently.
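The difference between schema and instance can be seen with a small experiment. The sketch below
uses Python with SQLite, chosen only because it ships with Python; the cut-down Staff table and
the staff number are invented for illustration. SQLite keeps its meta-data in a table called
sqlite_master, which holds the text of each CREATE statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
conn.execute("CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, lName TEXT)")


def schema(c):
    # Meta-data: the stored description of the Staff table.
    return c.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'Staff'"
    ).fetchone()[0]


def instance(c):
    # The data held in the Staff table at this moment.
    return c.execute("SELECT staffNo, fName, lName FROM Staff ORDER BY staffNo").fetchall()


schema_before, instance_before = schema(conn), instance(conn)
conn.execute("INSERT INTO Staff VALUES ('S001', 'Selma', 'Hutchins')")  # invented staff number
schema_after, instance_after = schema(conn), instance(conn)
```

Adding a row changes the instance but leaves the schema exactly as it was.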
A database is designed to meet the need of an organization for timely access to accurate
information. It does this by acting as a central repository for data about the
things of interest to that organization. These data can be combined to provide the required
information.
The Staff Directory database holds data about two sorts of thing: staff contact details
and support sessions. The data held relating to staff contact details includes the name,
staff number, phone number, e-mail address and a photograph of the member of staff.
For support sessions, the day, start and end times are held on the database. The database
also provides information about a relationship between these two sorts of thing – that
a particular support session is offered by a particular member of staff.
Different approaches to organizing the data in the database will capture this information
in different ways. Some approaches have one structure to capture information
about a thing, and a different structure to capture a relationship between things. Others
use a single data structure to capture both kinds of information. These different general
approaches to organizing the data in a database are called data models.
A data model has three aims, which are to:
The first aim of a data model is to ensure a consistent approach to data organization:
the same basic structures are used to organize data about all those things that the information
system holds data about. The data model does not describe how the data is
physically arranged on disk. It is concerned with organization structures rather than
storage structures.
The second aim is to ensure that database users can interact with the database, manipulating
both data and data structures.
The third aim seeks to ensure that the data in the database is consistent so that it
does not give different answers to the same question. Also, constraints can be used to
implement the business rules. When a database constraint enforces a business rule, it
ensures that the database accurately reflects the business.
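As a minimal sketch of a constraint enforcing a business rule, the Python example below uses the standard-library sqlite3 module rather than the MySQL DBMS used by the Staff Directory. The business rule (a support session must end after it starts), the column types and the helper name add_session are assumptions made purely for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the Staff Directory's DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE SupportSession (
        staffNo   INTEGER NOT NULL,
        dayOfWeek TEXT    NOT NULL,
        startTime TEXT    NOT NULL,
        endTime   TEXT    NOT NULL,
        -- Assumed business rule: a session must end after it starts.
        CHECK (endTime > startTime)
    )
    """
)


def add_session(staff_no, day, start, end):
    """Try to add a session; return True only if the DBMS accepts the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO SupportSession VALUES (?, ?, ?, ?)",
                (staff_no, day, start, end),
            )
        return True
    except sqlite3.IntegrityError:
        return False


valid_accepted = add_session(35054, "Monday", "09:00:00", "10:00:00")
invalid_accepted = add_session(35054, "Monday", "16:00:00", "15:00:00")
row_count = conn.execute("SELECT COUNT(*) FROM SupportSession").fetchone()[0]
```

The second call breaks the assumed rule, so the DBMS refuses the row and the table still reflects the business accurately. The application does not need to repeat the check itself.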
The relational data model is a popular data model used to organize many databases.
It has a simple solution to organizing data, which is that all data are gathered into ta-
bles with rows and columns.
In the Staff Directory database, a Staff table holds contact details for members of
staff. Each row of the Staff table holds the contact details for one member of staff.
Every row has exactly the same structure – the six columns staffNo, fName, lName,
phone, email and photo. On a given row, the columns hold items of data about a
particular member of staff. A typical instance of this table is shown in Figure 1.3.
Notice that there is no data in the phone column for the fourth row of the table,
which holds contact details for Paul Smith. This is because he does not have a
phone number. Even so, this row still has a phone column – it simply has nothing in
it. In the relational data model, a column that has nothing in it on a particular row is
said to be null for that row. Another important feature is that the values in the
staffNo column are different for every row of the Staff table. This is because that
column holds the staff number, which is a unique identifier for each member of staff.
The relationships between members of staff and their support sessions are captured
by including the staff number on the appropriate rows of both tables. For example, the
first and third rows of the support session table hold data about support sessions of-
fered by Selma Hutchins. So, these rows have the same value in the staffNo column
as the ninth row of the Staff table, as shown by the arrows in Figure 1.3. It is impor-
tant to realize that there are no special data structures linking a row of the Staff
table to the associated rows of the SupportSession table. Rather, it is the matching
of values in the corresponding staffNo columns that captures this relationship.
Chapter 3 discusses the relational data model in detail.
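The following Python sketch shows these ideas using the standard-library sqlite3 module. It copies a few rows from Figure 1.3 and leaves out the email and photo columns. The column types are assumptions, and SQLite stands in for the MySQL DBMS used by the Staff Directory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Staff (
        staffNo INTEGER PRIMARY KEY,  -- unique identifier for each row
        fName   TEXT,
        lName   TEXT,
        phone   TEXT                  -- may be null
    );
    CREATE TABLE SupportSession (
        staffNo   INTEGER,
        dayOfWeek TEXT,
        startTime TEXT,
        endTime   TEXT
    );
    """
)

# A few rows copied from Figure 1.3.
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?, ?, ?)",
    [
        (31210, "Paul", "Smith", None),  # no phone number, so the column is null
        (35054, "Selma", "Hutchins", "8706"),
        (45965, "Mikhail", "Sudbin", "5553"),
    ],
)
conn.executemany(
    "INSERT INTO SupportSession VALUES (?, ?, ?, ?)",
    [
        (35054, "Monday", "09:00:00", "10:00:00"),
        (31210, "Wednesday", "11:00:00", "13:00:00"),
        (35054, "Monday", "15:00:00", "16:00:00"),
    ],
)

# The relationship is recovered purely by matching staffNo values.
selma_sessions = conn.execute(
    """
    SELECT s.fName, s.lName, ss.dayOfWeek, ss.startTime
    FROM Staff AS s
    JOIN SupportSession AS ss ON ss.staffNo = s.staffNo
    WHERE s.lName = 'Hutchins'
    ORDER BY ss.startTime
    """
).fetchall()

# A null column comes back to Python as None.
paul_phone = conn.execute(
    "SELECT phone FROM Staff WHERE staffNo = 31210"
).fetchone()[0]
```

Nothing in either table points at a row in the other. The join condition simply compares the two staffNo columns.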
Whether a database is organized according to the relational data model or not, cre-
ating and maintaining it is a complex task. The database schema must be organized ac-
cording to the underlying data model, business rules must be enforced and multiple
users must have access to the database at the same time. The main software technology
that supports this is the database management system (DBMS). Examples of DBMS
that use the relational data model to organize their databases include the MySQL®
DBMS, the Oracle® database (which is actually a DBMS, despite the name), the
Microsoft® SQL Server DBMS and the IBM® DB2® DBMS.
A DBMS manages all interactions with the database, whether they involve manipu-
lating the database schema or the database instance. The users never directly access
the files where the database is stored. Instead, they ask the DBMS to manipulate these
files on their behalf. In fact, as DBMS are complex and often difficult to use, database
applications usually add another software application between the user and the DBMS.
This is illustrated in Figure 1.4, where the database user uses application software – the
database client – developed specifically for this database. The database client applica-
tion software interacts with the DBMS, which in turn manages the database. For exam-
ple, in the Staff Directory web database application, the database client application
software consists of a collection of web pages, the DBMS is a MySQL DBMS and the
database consists of the two tables shown in Figure 1.3.
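The layering can be sketched in a few lines of Python. Here find_phone plays the part of the database client, and the sqlite3 module plays the part of the DBMS. SQLite is an embedded library rather than a server such as MySQL, so this only approximates the arrangement described above. The function names and the returned messages are hypothetical.

```python
import sqlite3


def open_directory():
    """Build a tiny stand-in database holding two rows from Figure 1.3."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Staff ("
        "staffNo INTEGER PRIMARY KEY, fName TEXT, lName TEXT, phone TEXT)"
    )
    conn.executemany(
        "INSERT INTO Staff VALUES (?, ?, ?, ?)",
        [
            (31210, "Paul", "Smith", None),
            (35054, "Selma", "Hutchins", "8706"),
        ],
    )
    return conn


def find_phone(conn, staff_no):
    """Database client: turn a user's question into a request to the DBMS."""
    row = conn.execute(
        "SELECT phone FROM Staff WHERE staffNo = ?", (staff_no,)
    ).fetchone()
    if row is None:
        return "No such member of staff"
    if row[0] is None:
        return "No phone number recorded"
    return row[0]


directory = open_directory()
selma = find_phone(directory, 35054)
paul = find_phone(directory, 31210)
unknown = find_phone(directory, 1)
```

The user of find_phone never sees SQL or the files behind the database. Only the client function talks to the DBMS.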
Figure 1.3 The Staff and SupportSession tables, with one null column value and one
set of matching rows indicated.

Staff
staffNo  fName        lName      phone  email              photo
10780    John         Smith             [email protected]  [BLOB]
25447    John         Smith      5104   [email protected]  [BLOB]
25448    Judith Anne  Smith      7709   [email protected]  [BLOB]
31210    Paul         Smith             [email protected]  [BLOB]
77712    Frank        Rose       8871   [email protected]  [BLOB]
14443    Helen        Abbot      8032   [email protected]  [BLOB]
23257    Freya        Stark      8660   [email protected]  [BLOB]
33935    Padma        Brar       6641   [email protected]  [BLOB]
35054    Selma        Hutchins   8706   [email protected]  [BLOB]
45965    Mikhail      Sudbin     5553   [email protected]  [BLOB]
35155    Helene       Chirac            [email protected]  [BLOB]
55776    Gurpreet     Choudhury  5454   [email protected]  [BLOB]
56893    Ruth         Bapetsi    8022   [email protected]  [BLOB]
56673    Joshua       Bittaye    7782   [email protected]  [BLOB]
89987    Dan          Lin        8514   [email protected]  [BLOB]
78893    Jo Karen     O’Connor   8871   [email protected]  [BLOB]
33509    Helen        Timms      8661   [email protected]  [BLOB]

SupportSession
staffNo  dayOfWeek  startTime  endTime
35054    Monday     09:00:00   10:00:00
31210    Wednesday  11:00:00   13:00:00
35054    Monday     15:00:00   16:00:00
45965    Monday     11:00:00   12:00:00
45965    Wednesday  11:00:00   12:00:00
23257    Monday     15:00:00   16:00:00
55776    Monday     14:00:00   16:00:00
56893    Tuesday    14:00:00   15:00:00
56893    Thursday   09:00:00   10:00:00
56673    Thursday   15:30:00   16:30:00
56673    Friday     10:00:00   11:00:00
Figure 1.4 The database user uses the database client, which interacts with the DBMS,
which in turn manages the database.
The World Wide Web (or more simply just the Web) is one of several applications
based on Internet technology that have become important in recent years. Other im-
portant Internet applications include e-mail, file transfer and newsgroups.
For information systems developers, however, the Web has proved to be even more
useful. The Web allows information, in the form of documents, to be shared easily on
a global scale, using cheap (often free) computer software. This has led to increasing
numbers of businesses using websites to market themselves and buy and sell goods
and services (National Statistics, 2004). The Web is now seen as an important platform
for information systems and, as such, is increasingly used to deliver documents that
contain information derived from databases. This is particularly useful for documents
the contents of which are changed frequently.
Web documents are usually referred to as web pages. They are multimedia docu-
ments, so can contain different media types, such as text, images, sound and video.
The order in which someone views the content of a web page is not fixed. Hyperlinks
allow users to view document content in an order that suits them, by linking one part
of a web page to another or one web page to another.
The basic technologies that support web applications are the web browser and web
server. The web browser is able to display a range of Internet content, though primar-
ily it is used to display web pages. As web pages can contain multimedia content, web
browsers can display this as well as text. As well as displaying web pages, the web
browser allows users to ‘browse’ through any web pages available via the Internet.
Users can either follow hyperlinks within web pages or tell the web browser the ad-
dress of the web page they want.
There are a number of web browsers available. Popular graphical web browsers in-
clude Mozilla Firefox, Microsoft® Internet Explorer, Netscape® Navigator, as well as
Opera from Opera Software ASA.
The role of the web server is similar to that of the DBMS as it manages interaction
with a collection of web pages. The Apache web server (an open source project of the
Apache Software Foundation) and Microsoft® Internet Information Services (IIS) are
the two most common web servers.
The web server listens for requests from web browsers for documents in its collec-
tion and responds to those requests by sending a copy of the document to the web
browser. A web server does not maintain a persistent connection with the web
browsers that display its web pages. Instead, the connection between web browser and
web server is closed as soon as the web browser receives the web page. Therefore, sub-
sequent requests from the web browser to the same web server must establish a con-
nection from scratch. This contrasts with the DBMS, which does maintain a persistent
connection with the application software that uses its databases, allowing the DBMS
to track multiple requests. To extend the restaurant analogy, a web server is more like
a self-service café. Staff in the café respond to direct requests from customers, but don’t
keep an eye on them. If customers want a glass of water or another chocolate pudding,
then they must return to the counter and attract the attention of a member of staff.
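The request and response cycle can be tried out with Python's standard library. The sketch below starts a small web server on a free local port and asks it for the same page twice. The handler class name, the page content and the use of the loopback address are assumptions for illustration. The base handler class speaks HTTP/1.0 by default, so each connection is closed once the response has been sent.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = (
    b"<html><head><title>Welcome to Pennine University</title></head>"
    b"<body></body></html>"
)


class PageHandler(BaseHTTPRequestHandler):
    """Send a copy of the same document in response to every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_twice():
    """Request the page twice, opening a fresh connection each time."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        bodies = []
        for _ in range(2):
            connection = http.client.HTTPConnection(
                "127.0.0.1", server.server_port, timeout=5
            )
            connection.request("GET", "/")
            bodies.append(connection.getresponse().read())
            connection.close()
        return bodies
    finally:
        server.shutdown()
        server.server_close()


bodies = fetch_twice()
```

The server keeps no record of the first request when it handles the second. Like the café staff, it simply answers whoever is at the counter.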
Web pages can be shared because they all have the same document model. A docu-
ment model defines a general approach to structuring documents. For example, the
chapters in this book all contain a title, objectives, outline, one or more sections, sum-
mary, glossary, further reading and exercises – in that order. Each of these document
structures can be specified in more detail. For example, a section contains a title,
which is followed by a mix of paragraphs, lists, figures and tables.
Specifying a document model for a book ensures consistency between chapters, as an
editor can check that each chapter follows the correct structure. It is particularly useful
for books with several authors, who might otherwise produce chapters that looked very
different from each other. The document model defines what structures can be used to
organize a document, while the document itself defines the content of each part.
Staying with the example of this textbook, the actual chapter title is ‘The web data-
base environment’, there is an actual list of objectives, the actual text for the outline
and so on. Once the structure has been specified and the content created, a stylesheet
can define how the chapter should be presented. For example, chapter titles in
14 point Helvetica bold, objectives in 10 point Garamond, with diamond bullet
points, 6 points after each line and with a pale blue background; standard text in 10
point Garamond and so on.
This approach is what gives the Web its global scope. Content producers – people
who write web pages – can produce documents to a common document model,
though each document can have different content.
Software developers can produce software applications that can read any of these
documents easily as their basic structure is known. The presentation of the content in
the different document structures, however, can be adapted as required. For example,
we could change the objectives from a bulleted to a numbered list, have no back-
ground colour and enclose it in a box.
The document model for web pages is defined by the specification of the Hypertext
Mark-up Language (HTML). The HTML mark-up describes the content of the web page
and the web browser interprets this description and displays the content appropriately.
For example, the sixth line in Figure 1.5 (a) – <title>Welcome to Pennine
University</title> – tells the web browser to display the text ‘Welcome to
Pennine University’ as part of the title bar of the browser window. Figure 1.5 (b) shows
how the Firefox web browser interprets this instruction.
The HTML document model is written using the Standard Generalized Mark-up
Language (SGML). SGML is used to define different document models for different
purposes – HTML for writing web pages, DocBook for writing technical documents
such as software manuals. An SGML document model is often called an SGML appli-
cation as it applies SGML to solve a particular problem – namely, how to structure the
content of a particular kind of document.
SGML document models do not define how the content is presented. There is no re-
striction on the font, indentation or spacing between paragraphs, nor on how lists are
presented. The document model even allows presentation using more than one medium,
so document content may be presented as written text or spoken words. This separation
of structure, content and presentation is an important feature of modern document
models. Thus, the document model describes the general structure of a document, the
document captures the content and a stylesheet describes how to present that content.
The claim that SGML document models and, hence, HTML do not define presenta-
tion is not quite correct. They should not define presentation. HTML has physical style
elements specifically to alter the presentation of text, such as the <b> element for bold
type. These make it difficult to control the presentation of document content as the
style is then ‘hard-coded’. It is better to use HTML simply to define the document
structure. For example, use the <em> or <strong> element instead to indicate text
that should be emphasized. A stylesheet can instruct the web browser to italicize text
within an <em> element and make any text within a <strong> element bold. If it’s
decided that isn’t quite enough, the stylesheet can change the colour, font, font size
and so on. It can also indicate increased volume for emphasized content when a web
page is presented using a page reader (technology that literally reads out the contents
of a web page).
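A small Python sketch can make the separation concrete. The same structural mark-up is combined with two different stylesheets, and only the presentation rules change. The sentence used as content, the function name build_page and the choice of CSS as the stylesheet language are all assumptions for illustration.

```python
# Structural mark-up only: <em> and <strong> say what the text is,
# not how it should look. The sentence itself is hypothetical.
CONTENT = "<p>Please <em>book</em> a support session <strong>in advance</strong>.</p>"

# Two alternative presentations of the same structure, written in CSS.
DEFAULT_STYLE = "em { font-style: italic; } strong { font-weight: bold; }"
ALTERNATIVE_STYLE = "em { color: red; } strong { font-size: 120%; }"


def build_page(content, stylesheet):
    """Wrap unchanged content in a page that carries the given stylesheet."""
    return (
        f"<html><head><style>{stylesheet}</style></head>"
        f"<body>{content}</body></html>"
    )


page_a = build_page(CONTENT, DEFAULT_STYLE)
page_b = build_page(CONTENT, ALTERNATIVE_STYLE)
```

Had the content used the <b> element, the bold type would be fixed in the document itself and no stylesheet change could turn it into a different form of emphasis.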
HTML and SGML are important technologies, but both have their weaknesses.
SGML is a very powerful language, enabling the definition of document models for all
kinds of purposes. This power is achieved as a result of complexity, so SGML can be
difficult to use. This would not be a drawback except that it is also difficult to write
software tools that allow users to easily create SGML document models and documents
based on those document models. Consequently, these tools are expensive. In particu-
lar, it would be extremely difficult to write a web browser that could understand docu-
ments with any SGML document model. That is why HTML was created.
As all HTML documents have the same document model, it is much easier for con-
tent producers to produce HTML documents and for application developers to write a
web browser that can understand them. The downside of having just one document
model is that it is very restrictive. For example, it is not that easy to create a ‘chapter,
section, subsection’ structure in an HTML document. The HTML physical style ele-
ments are another problem, as they mean the mark-up to define content and the
mark-up to define presentation are mixed together.
These problems led to the development of a new language for defining document
models, called the Extensible Mark-up Language (XML). XML was developed by the
World Wide Web Consortium (W3C) and work continues on defining the XML stan-
dard and a set of related technologies that support the use of XML for sharing docu-
ments across the Internet.
There has been a lot of excitement over XML in recent years, some of which is jus-
tified. Like SGML, XML is a language for defining document models, but it is much
simpler than SGML. Because it is simpler, it is easier for content producers to extend
(hence the ‘Extensible’ in its name) an existing document model to meet their needs
or to create their own from scratch. It is also possible for application developers to cre-
ate web browsers that can understand documents with different document models ex-
pressed in XML, not just HTML.
The HTML document model itself has been rewritten using XML instead of SGML
and renamed XHTML (most XML-related technologies have names beginning with
‘X’). A web page written to the XHTML standard is an XML document (as well as an
XHTML document). XHTML itself is an XML application. There are quite a lot of XML
applications.
The Extensible Stylesheet Language (XSL) is another XML application. An XSL document
is a kind of stylesheet (there are other stylesheet languages). It instructs a web browser
to present an XHTML document exactly as you want (at least, it will when web
browsers can understand XSL). So an XSL document tells a web browser how to pre-
sent the content of a document structured using the XHTML document model.
This illustrates how a lot of the XML-related technologies work. One XML docu-
ment, written using one XML document model, is used to tell an application what to
do with some other XML document, written using a different XML document model.
In fact, XSL can instruct any application how to present any kind of XML document, as
long as the application understands XSL.
This is an important point. XML does not actually do anything. XML document
models are descriptions of how to structure certain kinds of document. XML documents
are descriptions of their content. If something needs to be done with an XML docu-
ment, then someone has to write a software application that reads the document, un-
derstands the description and carries out appropriate actions. In the XHTML and XSL
example, someone must write a web browser that understands both document models
and can use the XSL document content as instructions on how to format the XHTML
document content.
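To see that the document only describes and the application acts, consider this Python sketch, which uses the standard-library xml.etree.ElementTree module. The element names staffMember, name and phone belong to a hypothetical document model invented for this example.

```python
import xml.etree.ElementTree as ET

# A document written to a hypothetical document model.
DOCUMENT = """
<staffMember>
  <name>Selma Hutchins</name>
  <phone>8706</phone>
</staffMember>
"""


def describe(xml_text):
    """Read the document, interpret its elements and act on them."""
    root = ET.fromstring(xml_text.strip())
    name = root.findtext("name")
    phone = root.findtext("phone", default="no phone number")
    return f"{name}: {phone}"


summary = describe(DOCUMENT)
```

The XML text is inert. The decision to treat the name element as a person's name and to build a summary line from it lives entirely in the describe function.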
The phrase XML document is used to refer to a document written using a particular
XML document model, such as a web page written using XHTML. Keep in mind that
XML is actually a language for defining document models, not writing documents.
Behind every XML document is a document model defined with XML.
Although XML was created with documents in mind, it wasn’t long before people real-
ized that XML documents could also be used to store data. The definition of a database
in Section 1.4 says that a database is a persistent, self-describing, structured collection
of related items of data. An XML document is all these things. It is usually a file, so it
is persistent. It is self-describing, as the different elements clearly indicate what kind of
content they hold. It is structured according to its document model and the data in it
are obviously related – why else would they be in the same document? This illustrates
the problem of defining concepts in computing. Few database practitioners would re-
gard a single XML document as being a database. Instead, they’d say that it is a data
file. A collection of XML documents might seem more like a database. One problem
with using XML documents like a database is that there is no single data model for
XML. Chapter 3 discusses one data model for XML. Chapter 12 covers database design
for XML and discusses XML applications and related technologies that are particularly
important from the database perspective.
The content of printed documents is always static. The content of this chapter will not
change, unless a new edition of the book is printed. Web pages may have both static
content (also called boiler-plate text) and dynamic content.
Static content is the same every time a copy of the web page is delivered to a web
browser. Dynamic content, however, is added to a web page’s static content by the web
server before sending out a copy of the web page. Thus, different users may receive dif-
ferent versions of the same web page, depending on what dynamic content was added.
For example, the Staff Directory needs a web page to display the contact details for
each member of staff. All these web pages have the same structure. The one for Selma
Hutchins’ contact details is shown in Figure 1.6 (a). It would be very time-consuming
to create, and maintain, a different web page for each member of staff. Instead, it
makes sense to have a single, skeleton web page to hold the static content and pull the
dynamic contact details for a particular member of staff from the database. Figure 1.6 (b)
shows the static content of the Staff Contact Details web page. Each time a user asks to
see Selma Hutchins’ details, they are retrieved from a database and merged with this
static content to create the web page.
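The merging of static and dynamic content can be sketched in Python with string.Template and sqlite3, both from the standard library. The skeleton mark-up, the placeholder names and the function contact_page are assumptions for illustration. The Staff Directory's actual pages are not reproduced here.

```python
import html
import sqlite3
from string import Template

# Static content: identical for every member of staff.
SKELETON = Template(
    "<html><head><title>Staff contact details</title></head><body>"
    "<h1>$fName $lName</h1>"
    "<p>Phone: $phone</p>"
    "</body></html>"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staff ("
    "staffNo INTEGER PRIMARY KEY, fName TEXT, lName TEXT, phone TEXT)"
)
conn.execute("INSERT INTO Staff VALUES (35054, 'Selma', 'Hutchins', '8706')")


def contact_page(conn, staff_no):
    """Pull the dynamic content from the database and merge it into the skeleton."""
    row = conn.execute(
        "SELECT fName, lName, phone FROM Staff WHERE staffNo = ?", (staff_no,)
    ).fetchone()
    if row is None:
        return None
    f_name, l_name, phone = row
    return SKELETON.substitute(
        fName=html.escape(f_name),
        lName=html.escape(l_name),
        phone=html.escape(phone) if phone is not None else "not available",
    )


page = contact_page(conn, 35054)
```

One skeleton serves every member of staff. A change to a phone number in the database appears in the next copy of the page without anyone editing the page itself.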
Notice that in the actual web page shown in Figure 1.6 (a), there is an image and a
link to the member of staff’s support sessions, but these are not shown in the skeleton
page in Figure 1.6 (b). The image is derived from the database, so, like the name,
phone number and e-mail address, is dynamic content. This behaviour will be particularly
useful when Selma Hutchins’ photo is taken as the image file will automatically
be added to the static content, replacing the default ‘No photo’ image. The link to
support sessions should only appear for academic staff, so this link, and its associated
text, count as dynamic content.
Figure 1.6 (b) The static content of the staff contact details web page.
There are several well-known problems with managing dynamic content. A bit of
the history of web applications at the Pennine University will illustrate these prob-
lems. Like most universities, it was an early adopter of web technologies, so some de-
partments already had applications similar to the Staff Directory. Figure 1.7 shows the
School of Computing’s staff list for academic staff. This application allows users to
browse a list of staff members who work in the School of Computing. Users can view
all members of staff or just those with a particular job type – namely, academics, ad-
ministrators, technical staff and researchers. Allowing users to browse the staff list
would not be an option for the whole university as the lists would be far too long,
even if they were split up by job type. Hence, the Staff Directory application did not
follow this approach. Comparing Figure 1.7 with Figure 1.6 (a), it is also apparent
that the School of Computing’s staff list provides slightly different information from
that in the Staff Directory. The staff name includes their title (‘Dr’, ‘Prof’, and so on),
but only gives the first initial, not the full forename. Also, their job title is included,
but there is no photo.
Like the web page in Figure 1.6 (a), the web page in Figure 1.7 includes dynamic
content, but the data is stored in a text file, not on a database. A sample of data from
the text file is shown in Figure 1.8. It is an example of a ‘comma separated values’ file.
Each line holds the contact details for one member of staff, with commas separating
the individual data values from one another. Notice that, in the second line, there are
two commas with only a space character between them immediately after the staff
name. This is to indicate that Dr P Smith does not have a telephone number.
Figure 1.8 A sample of data from the Staff List data file.
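A file in this style can be read with Python's standard csv module. The two sample lines, the e-mail addresses and the order of the fields are assumptions, as the actual contents of the data file are not reproduced here. As described above, an empty value between two commas is treated as a missing phone number.

```python
import csv
import io

# Hypothetical lines in the style described; the field order is assumed.
SAMPLE = (
    "35054, Ms S Hutchins, 8706, s.hutchins@example.org, Senior Lecturer, Academic\n"
    "31210, Dr P Smith, , p.smith@example.org, Senior Lecturer, Academic\n"
)
FIELDS = ["staffNo", "name", "phone", "email", "jobTitle", "jobType"]


def read_staff(text):
    """Split each line at the commas and label the values by position."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    staff = []
    for values in reader:
        record = dict(zip(FIELDS, (value.strip() for value in values)))
        if record["phone"] == "":
            record["phone"] = None  # nothing between the commas: no phone
        staff.append(record)
    return staff


staff = read_staff(SAMPLE)
```

The meaning of each value depends entirely on its position in the line. The file itself carries no description of its structure.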
To create the web page shown in Figure 1.7 from the text file shown in Figure 1.8, the
Computing Staff List web application performs the following actions:
This seems fairly simple, but there are problems with file-based web applications. The
following subsections discuss these problems and explain how the database approach
solves them.
Staff
staffNo  fName        lName      phone  email              photo   department  title  jobType         jobTitle
10780    John         Smith      NULL   [email protected]  [BLOB]  Catering    Mr     Support         Chef
25447    John         Smith      5104   [email protected]  [BLOB]  Music       Mr     Administration  Secretary
25448    Judith Anne  Smith      7709   [email protected]  [BLOB]  Estates     Mrs    Support         Estates Manager
31210    Paul         Smith      NULL   [email protected]  [BLOB]  Computing   Dr     Academic        Senior Lecturer
77712    Frank        Rose       8871   [email protected]  [BLOB]  Computing   Mr     Technical       Technician
14443    Helen        Abbot      8032   [email protected]  [BLOB]  Computing   Mrs    Administration  Secretary
23257    Freya        Stark      8660   [email protected]  [BLOB]  Computing   Prof   Academic        Dean
33935    Padma        Brar       6641   [email protected]  [BLOB]  Health      Ms     Administration  Administrator
35054    Selma        Hutchins   8706   [email protected]  [BLOB]  Computing   Ms     Academic        Senior Lecturer
45965    Mikhail      Sudbin     5553   [email protected]  [BLOB]  Music       Mr     Academic        Lecturer
35155    Helene       Chirac     NULL   [email protected]  [BLOB]  Health      Miss   Technical       Technician
55776    Gurpreet     Choudhury  5454   [email protected]  [BLOB]  Music       Dr     Academic        Senior Lecturer
56893    Ruth         Bapetsi    8022   [email protected]  [BLOB]  Health      Mrs    Academic        Senior Lecturer
56673    Joshua       Bittaye    7782   [email protected]  [BLOB]  Computing   Mr     Academic        Lecturer
89987    Dan          Lin        8514   [email protected]  [BLOB]  Health      Dr     Administration  Senior Administrator
78893    Jo Karen     O’Connor   8871   [email protected]  [BLOB]  Health      Miss   Administration  Administrator
33509    Helen        Timms      8661   [email protected]  [BLOB]  Music       Mrs    Technical       Technician
type and job title, as well as the school or department they work for, as in Figure 1.9,
would allow both the Computing Staff List and the Staff Directory to use this table.
However, this change will not affect the Staff Directory as it will use only the original
columns. It simply does not see the new columns, so does not see any change at all.
This ability to make some changes to a database’s structure without affecting the appli-
cations that use it is called program–data independence. It is a key advantage of the
database approach.
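Program–data independence can be demonstrated with a short Python sketch using sqlite3. The function directory_lookup is a hypothetical stand-in for the Staff Directory's query, and the column types are assumptions. The query names only the original columns, so adding new ones leaves its result unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staff ("
    "staffNo INTEGER PRIMARY KEY, fName TEXT, lName TEXT, phone TEXT)"
)
conn.execute("INSERT INTO Staff VALUES (35054, 'Selma', 'Hutchins', '8706')")


def directory_lookup(conn, staff_no):
    """Stand-in for the Staff Directory: it names only the original columns."""
    return conn.execute(
        "SELECT fName, lName, phone FROM Staff WHERE staffNo = ?", (staff_no,)
    ).fetchone()


before = directory_lookup(conn, 35054)

# Extend the table with the columns the Computing Staff List needs.
for column in ("department", "title", "jobType", "jobTitle"):
    conn.execute(f"ALTER TABLE Staff ADD COLUMN {column} TEXT")
conn.execute(
    "UPDATE Staff SET department = 'Computing', title = 'Ms', "
    "jobType = 'Academic', jobTitle = 'Senior Lecturer' WHERE staffNo = 35054"
)

after = directory_lookup(conn, 35054)
```

A query written as SELECT * would see the new columns, which is one reason applications usually name the columns they need.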
Note that in Figure 1.9, any columns that are null have the word ‘NULL’ in them,
rather than being left blank as in Figure 1.3. It is common practice to indicate nulls in
this way, but be aware that the actual database does not have the word NULL in
these columns. The column doesn’t have any value at all; that is what null means. A
similar comment applies to the photo column. Every row has the word ‘[BLOB]’ in it,
meaning binary large object, rather than the actual picture of the staff member.
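The difference between what is stored and what is displayed can be sketched in Python with sqlite3. The display function is a hypothetical formatter that mimics the convention used in Figure 1.9, and the photo bytes are a placeholder rather than a real image.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staff ("
    "staffNo INTEGER PRIMARY KEY, lName TEXT, phone TEXT, photo BLOB)"
)
# The phone column is null; the photo bytes are a placeholder, not a real image.
conn.execute(
    "INSERT INTO Staff VALUES (?, ?, ?, ?)", (10780, "Smith", None, b"\x89PNG")
)


def display(value):
    """Format a stored value the way Figure 1.9 prints it."""
    if value is None:
        return "NULL"
    if isinstance(value, bytes):
        return "[BLOB]"
    return str(value)


stored = conn.execute("SELECT staffNo, lName, phone, photo FROM Staff").fetchone()
shown = [display(value) for value in stored]

# The column holds no value at all, so it does not match the text 'NULL'.
null_rows = conn.execute(
    "SELECT COUNT(*) FROM Staff WHERE phone IS NULL"
).fetchone()[0]
text_null_rows = conn.execute(
    "SELECT COUNT(*) FROM Staff WHERE phone = 'NULL'"
).fetchone()[0]
```

The word NULL appears only in the formatted output. Inside the database the phone column is empty, which is why the test for null uses IS NULL rather than a comparison with a text value.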
(b) The School of Computing Staff List showing effect of corrupt data.
Table 1.1 Benefits of the database approach over the file-based approach.
of the data file only has her staff number and name. It is the third line where the real
problems occur. The file-based application interprets the first data value, ‘6154’, as a
staff number, ‘s.hutchins @ pennine.ac.uk’ as her name, ‘Senior Lecturer’ as her phone
number and ‘Academic’ as her e-mail address.
Maintaining data integrity is, thus, problematic. How do you ensure that the data
accurately reflects the real world? In a database application, the DBMS manages all
changes to the data and can prevent accidental corruption.
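The contrast can be sketched in Python. The two corrupt lines below are hypothetical reconstructions of the kind of damage described above, where a stray line break splits one record in two. The field order and the constraints on the table are assumptions chosen for illustration. Reading the file by position accepts the damaged data without complaint, whereas the DBMS rejects rows that break its constraints.

```python
import sqlite3

FIELDS = ["staffNo", "name", "phone", "email"]

# Hypothetical corrupt lines: one record split in two by a stray line break.
CORRUPT_LINES = [
    "35054, Ms S Hutchins",
    "6154, s.hutchins @ pennine.ac.uk, Senior Lecturer, Academic",
]


def parse_by_position(line):
    """File-based approach: trust that values appear in the expected order."""
    values = [value.strip() for value in line.split(",")]
    record = dict.fromkeys(FIELDS)  # missing fields default to None
    record.update(zip(FIELDS, values))
    return record


misread = parse_by_position(CORRUPT_LINES[1])

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Staff (
        staffNo INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        phone   TEXT,
        email   TEXT NOT NULL CHECK (email LIKE '%@%')  -- assumed constraint
    )
    """
)


def try_insert(record):
    """Database approach: the DBMS checks every row against its constraints."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Staff (staffNo, name, phone, email) "
                "VALUES (:staffNo, :name, :phone, :email)",
                record,
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [try_insert(parse_by_position(line)) for line in CORRUPT_LINES]
stored_rows = conn.execute("SELECT COUNT(*) FROM Staff").fetchone()[0]
```

Constraints catch only the errors they are written to detect, so this is a partial safeguard rather than a guarantee of data integrity.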
1.7 Roles
This last section considers who is involved in developing and using a web database ap-
plication. The end users are the people for whom the database is originally conceived
and created. Staff, students and visitors who need to look up someone’s phone number
are among the end users of the Staff Directory. So are members of staff who keep their
contact details up to date. End users only access the database instance using specially
written application software to retrieve or modify data; they don’t modify the database
schema.
Application developers also use the database, as they create the application software
that provides easy access to the database for end users. Application developers need to
use the database instance and understand the structure of the database schema. They
may also need to request changes to the database schema in order to support the needs
of the application software they write. In web database applications, the application
developer will also need to understand web technologies and how to connect web
pages to the database.
The design of the database – its data structures and constraints – is the task of the
database designer. There are several stages in database design and a particular data-
base designer may be involved in every stage. They talk to end users to ensure that
there are appropriate data structures, and design database constraints to enforce
business rules. Database designers also design the storage structures, which are the
DBMS internal view of the database. The efficiency of the database will depend on
there being an effective design of the storage structures. Different DBMS offer differ-
ent data storage structures, so the database designer must be familiar with the DBMS
being used.
Once a design is available, the database – its data structures and constraints –
must be created. This is one of the tasks of the database administrator (DBA). The
DBA is an expert user of the DBMS. This often means that the DBA may also be in-
volved in designing the storage structures. The main goal of the DBA is to use the
DBMS as effectively and efficiently as possible, to realize the aims of the database
designer and meet the needs of the end users and application developers. They also
manage user access and database security, as well as day-to-day maintenance (such
as taking a back-up of the database) and performance tuning (improving the speed
of response to queries, for example).
The web administrator has a parallel role to that of the DBA, being an expert in the
particular web server used and associated Internet and web technologies. The web ad-
ministrator’s main goals are to use the web server as effectively and efficiently as pos-
sible, realize the aims of the website designer and meet the needs of the end users and
application developers. Again, security, maintenance and performance tuning form
important parts of their role.
As organizations have gathered more and more data, it has become increasingly im-
portant to their success. Organizations have recognized that the technical expertise of
the DBA needs to be supplemented by someone with a good understanding of how the
data is used. This is the role of the data administrator (DA).
Unlike the DBA, the DA does not need to be an expert in a particular DBMS. In-
stead, he or she is an expert on the organization and uses this expertise to ensure that
the organization’s data resources are used efficiently and effectively.
The DA works with the DBA and database designer, planning the database imple-
mentation, and is responsible for setting standards (such as naming conventions).
They also ensure compliance with any laws or regulations affecting the organization’s
use of data. For example, the Pennine University must comply with the UK’s Data
Protection Act 1998. This gives individuals a right to know what data about them is
held by the University. The Pennine University also has its own rules on who can use
data. Academics cannot access data about a student’s fee payments, and finance staff
cannot access a student’s results. Such internal policies are written by the DA after con-
sulting with management.
Table 1.2 summarizes the tasks taken on by each role. In reality, one person may
take on several roles. For example, the DA at the Pennine University uses the Staff
Directory, so is also an end user.
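Table 1.2 Tasks taken on by each role (table body not recovered; the surviving task
labels are data entry, application development, database design, database planning and
setting standards).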
Chapter summary
■ This chapter has discussed the importance of information systems and outlined
some of the challenges that face those developing them. It has discussed the need
for web application developers to understand users’ requirements and business
rules and introduced use cases as a way of documenting these.
■ The core of the chapter was a discussion of databases and the Web and their en-
abling technologies – data models, database management systems, document mod-
els, web browsers and web servers.
■ The chapter finished with a discussion of web database applications, the advantages
they have over file-based approaches, and the people involved in developing them.
■ While the discussion has been reasonably rigorous, it has not gone into details.
These details follow in subsequent chapters.
Further reading
Avison and Fitzgerald (2003) provides a good introduction to information systems de-
velopment, discussing a variety of approaches and describing a range of tools and
techniques.
Skidmore and Eva (2004) focus on tools and techniques rather than general approaches,
but cover similar material.
Holt (2001) includes a nice introduction to systems and systems engineering in general
before moving on to discuss the UML.
Bennett et al. (2005) provide a good introduction to UML, with lots of examples. Note,
though, that there is no real need to delve deeply into UML at this stage.
Ideas about business rules and the problems with the file-based approach to database
applications are covered in most database textbooks.
The discussion of the concepts that support database and web technologies, and of the
technologies themselves, continues in the next two chapters.
Review questions
1.3 What is a business rule? Describe the two main kinds of business rule.
1.5 Explain what distinguishes a database from a simple collection of data files.
1.6 What is a data model? What are the three aims of a data model?
1.7 What is a document model? What are the aims of a document model?
1.9 Explain what is meant by saying that the Web is a distributed collection of hypermedia
documents.
1.10 Explain the difference between static and dynamic web content.
Exercises
1.11 What problems of file-based applications does the database approach overcome?
1.12 Explain why the database instance will change more frequently than the database
schema.
1.13 Using an example, describe the general approach to structuring data used in the
relational data model. Briefly explain how the relational data model deals with miss-
ing information and a relationship between two organization structures.
1.14 Explain what is meant by the phrase ‘separation of content from presentation’ in the
context of the Web. Why is it a good idea?
1.15 Use a diagram to explain, in general terms, how static and dynamic content are
merged to produce the final web page in a web database application.
1.16 Summarize the different roles played by users of a database, then consider an orga-
nization you know well and identify which of these roles are played by people in that
organization. If no one plays a particular role, explain how the tasks associated with
that role are achieved.
Investigations
1.17 Contemporary models of the information systems lifecycle tend to be iterative and
incremental. What approach did earlier models take and why did information sys-
tems developers move away from these models? (Avison and Fitzgerald (2003) is a
good initial source.)
1.18 The use of nulls to indicate missing data in relational databases is controversial.
Summarize the arguments for and against nulls. Why do you think that they continue
to be used? (Date (2004) is a good initial source.)
1.19 The relational data model is only one of the data models in use today (though it is the
one most widely used). Investigate what other data models are in use, describing
how each meets the three aims of a data model identified in Section 1.4. (Elmasri and
Navathe (2000a, 2000b, 2007) are good initial sources.)
1.20 Elizabeth Castro notes that, ‘Lately (mid 2006), there has been a crescendo of rising
voices complaining about the W3C’s slow pace, overemphasis of [sic] the abstract
and lack of concrete results.’ (Castro, 2007, p. 17). Investigate these criticisms. How
do you think these problems affect web database developers? (Castro indicates the
following sources: http://microformats.org/;
http://www.zeldman.com/2006/07/17/an-angry-fix/;
http://joshuaink.com/blog/753/no-i-am-not-bloody-sorry. Warning, free
speech zone! These websites may contain language that some of you may find
offensive.)
2 Network, database
and web technology
Chapter objectives
Chapter outline
1 Database technologies allow data to be stored and managed effectively and effi-
ciently.
2 Web technologies deliver data to the end user without the need to install spe-
cialized software on their computer – any web browser will do.
3 Network technologies are responsible for transporting the information between
database and end user.
It is useful for a web database developer to understand these three distinct technolo-
gies before they begin to develop web database applications.
Web database applications are examples of client-server computing. One computer,
the client, requests services of another, the server. Client-server computing is
asymmetrical: the client makes requests and the server decides whether or not to grant
them. Not all computer software uses the client-server approach. The software used
to allow communication over the Internet uses the peer-to-peer approach, which
means that each participant has equal status in the interaction and chooses how to
respond to requests from its peers. However, for now, the client-server approach re-
mains the dominant approach for web database applications. The web database ap-
plication developer needs to understand this approach in order to use it effectively.
■ Section 2.1 introduces client-server computing. It discusses the standard two- and
three-tier architectures for such applications, and explains the importance of as-
signing the functionality to the appropriate tier.
■ Section 2.2 discusses networks, with an emphasis on how information, in the form
of network messages, is transported across the network. It examines the problems
that arise when distinct networks are interconnected to allow information sharing
between networks and how the TCP/IP suite of protocols overcomes these problems.
■ Section 2.3 looks at database technologies. It focuses first on the server-side tech-
nologies, in particular the database management system (DBMS), and introduces
the ANSI/SPARC database systems architecture. The section concludes by discussing
the client-side database technologies, focusing on how the client communicates its
requests to the server.
■ Section 2.4 examines the server-side and client-side technologies that support web
applications. It includes a discussion of the core web technologies – the HTTP ap-
plication protocol and HTML – and the need to augment these to allow web
applications to interact with a database.
■ Section 2.5 concludes the chapter by returning to the discussion of client-server
computing and how web database applications utilize the 3-tier architecture.
[Figure 2.1: (a) a Client interacts with a Server; (b) several ‘:Client’ instances interact with one ‘:Server’.]
The notion of a ‘tier’ is usually extended to include the hardware and system software.
Typically, the client and server tiers of an application with a two-tier architecture are
implemented on two different computers and communicate via a network. For exam-
ple, the Pennine University’s website can be regarded as a two-tier client-server appli-
cation. The server tier’s application software is the collection of web pages and the web
server software; its hardware the computer they are installed on and its network con-
nection; and its system software is that computer’s operating system and the network-
ing software. The client tier’s application software is a web browser; its hardware the
computer it is installed on; and its system software that computer’s operating system
and networking software. Clearly, any personal computer with Internet access could
be a client of the Pennine University’s web server. This highlights an important bene-
fit of a client-server architecture: the ability of one server to interact with many clients.
This situation is shown in Figure 2.1 (b). The notation ‘:Client’ inside a box indicates
that this represents a particular instance of the client, rather than the general concept
of a ‘Client’ shown in Figure 2.1 (a).
Occasionally, it is useful to set up a client-server application with both client and serv-
er software running on the same machine. Application developers sometimes do this to
avoid the expense and complexity of setting up a network environment. For example,
it is cheaper and easier to install a web server on a standalone computer and develop and
test web pages on the one machine than it is to set up separate client and server machines
connected by a network. However, a client-server application developed in this way
should also be installed and tested on a network before being put into daily use.
In the early days of client-server computing, two-tier architectures were common-
place. However, more complex applications were developed with three or more tiers.
Figure 2.2 (a) shows the three-tier client-server architecture. The first tier of a three-tier
architecture is still called the Client. It only interacts with the server of the second tier,
named ‘Tier 2 Server’ in Figure 2.2 (a). The added complication is that the Tier 2 Server
can itself act as a client to the Tier 3 Server. This allows greater scope for application
partitioning. One typical use of a three-tier architecture is to enforce a stricter partition
of responsibilities, so that the Client deals only with presentation, the Tier 2 Server en-
forces the business rules and the Tier 3 Server manages the data.
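The partition of responsibilities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three classes, the staff record and the business rule are all invented for the example, and each tier is an ordinary object rather than a program on its own computer.

```python
class DataTier:
    """Tier 3 Server: manages the data (a stand-in for a DBMS)."""

    def __init__(self):
        # A made-up record, used only for illustration.
        self._staff = {"S001": {"name": "A. Lecturer", "phone": "1234"}}

    def fetch(self, staff_id):
        return self._staff.get(staff_id)


class BusinessTier:
    """Tier 2 Server: a server to the Client and a client of Tier 3."""

    def __init__(self, data_tier):
        self._data = data_tier

    def staff_entry(self, staff_id):
        # A made-up business rule: staff identifiers start with 'S'.
        if not staff_id.startswith("S"):
            raise ValueError("not a staff identifier")
        record = self._data.fetch(staff_id)
        if record is None:
            raise LookupError("no such member of staff")
        return record


class Client:
    """Tier 1: deals only with presentation."""

    def __init__(self, server):
        self._server = server

    def show(self, staff_id):
        try:
            entry = self._server.staff_entry(staff_id)
        except (ValueError, LookupError) as problem:
            return f"Sorry: {problem}"
        return f"{entry['name']} (tel. {entry['phone']})"


client = Client(BusinessTier(DataTier()))
print(client.show("S001"))
print(client.show("X9"))
```

Notice that the Client never touches the data tier directly; every request passes through the middle tier, which is where the rule is enforced.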
A three-tier architecture is the natural architecture for web database applications. The
database server occupies the third tier. It includes the actual database and the DBMS,
which manages all interaction with the database. The web server occupies the middle
tier, dealing with requests for web pages and integrating dynamic and static web page
content. The web browser occupies the position of the client, handling presentation
and accepting the user’s instructions. Connecting the three tiers is a range of
networking technologies.
The remainder of this chapter discusses the web, database and networking tech-
nologies that support web database applications and application partitioning.
[Figure 2.3: a Network linking a Workstation, FileServer, EmailServer, DatabaseServer and a PrintServer that manages a Printer.]
linked to it would not be much use. The network in Figure 2.3 supports the most com-
mon networked client-server applications, which are shared file storage, shared printers,
e-mail and a database.
The ‘Workstation’ represents end users’ computers connected to the ‘Network’. A
workstation would typically be a PC or a UNIX workstation with a cable connecting it
to the network, but might be a mobile computing device with a wireless connection. It
will have the client application software for a number of client-server applications
installed on it. For example, the workstation will have a client application to present
information about the files stored on the file server – their names, sizes, locations and
so on. Windows Explorer does this under the Windows operating system, allowing
users to share documents, spreadsheets, images and so on simply by storing them on
the file server.
The workstation also has a client application to communicate with the ‘PrintServer’,
which manages at least one ‘Printer’ (the arrowhead next to the word ‘Manages’ indi-
cates that the print server manages the printer, not the other way round!). Notice that
a printer is a separate computing device from the print server, with its own connection
to the network. This allows multiple users to print documents on the same printer.
Managing these printing jobs is done by the print server. Each workstation will have a
client application allowing users to request that the print server pause or cancel print
jobs. In this respect, the print server plays a role similar to the file server, though the
print server manages access to shared hardware, rather than shared information.
The computing devices connected to the network communicate by sending mes-
sages to each other. As a network is a multi-user system, there will be lots of messages
traversing the network, with different destinations. To identify the different destina-
tions, each device has a unique network address.
Consider a user who wishes to retrieve some data from a database. He or she will use
the database client application on a workstation to create a message requesting the data.
The message holds three key pieces of information – the network address of the work-
station, the actual request and the network address of the database server. That
message is then broadcast to the entire network. The database server hears the broad-
cast, sees it is addressed to itself and so reads the message. Other devices also hear the
broadcast, but, seeing that the message is not for them, ignore it. With the message
delivered to the database server, it retrieves the requested data and bundles it up into a
response message. The database server uses the address of the workstation as the desti-
nation address and broadcasts the response message back across the network.
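The request and response just described can be mimicked with a small simulation. It is a sketch only: the device addresses and message text are invented, and a Python list stands in for the shared network.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    source: str        # network address of the sender
    destination: str   # network address of the intended receiver
    body: str          # the actual request or response


@dataclass
class Device:
    address: str
    inbox: list = field(default_factory=list)

    def hear(self, message):
        # Every device hears the broadcast, but only the addressee reads it.
        if message.destination == self.address:
            self.inbox.append(message)


def broadcast(network, message):
    """Deliver one message to every device attached to the network."""
    for device in network:
        device.hear(message)


workstation = Device("ws-17")
database_server = Device("db-01")
print_server = Device("print-01")
network = [workstation, database_server, print_server]

request = Message("ws-17", "db-01", "staff directory entry for P. Smith")
broadcast(network, request)

# The server uses the workstation's address as the destination of its response.
response = Message(request.destination, request.source, "P. Smith, ext. 1234")
broadcast(network, response)
```

After both broadcasts, the database server holds only the request, the workstation holds only the response and the print server has ignored both messages.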
Broadcasting messages works reasonably well for a local area network (LAN). LANs
use networking technology that limits their size to around 1 kilometre from end to
end. They are typically used by a single organization (often within one building) for
file and resource sharing, so have a limited number of computing devices attached.
Even so, many LANs are split into smaller subnetworks to make them easier to man-
age. Figure 2.4 shows a LAN that consists of two subnetworks linked by a bridge. Each
device broadcasts messages only to the subnetwork that it is attached to. The bridge
is attached to both subnetworks so sees all the messages and knows which network
addresses are on which subnetwork.
Suppose that a workstation on the top subnetwork of Figure 2.4 broadcasts a mes-
sage for the e-mail server. The bridge knows that the e-mail server is on the other
subnetwork, so it rebroadcasts the message to the other subnetwork. However, if a
workstation on the lower subnetwork broadcasts a message for the e-mail server, then
the bridge will not rebroadcast that message to the top subnetwork. There is no need,
because the e-mail server has already heard it.
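The bridge's decision can be expressed as a simple rule. The sketch below assumes an invented set of device names on each subnetwork (only the e-mail server's position on the lower subnetwork comes from the text) and reports which subnetworks end up carrying a given message.

```python
# Hypothetical devices; the bridge knows which addresses are on which subnetwork.
SUBNETWORKS = {
    "top": {"workstation-a", "file-server"},
    "lower": {"workstation-b", "email-server"},
}


def subnetwork_of(address):
    for name, members in SUBNETWORKS.items():
        if address in members:
            return name
    raise LookupError(f"unknown network address: {address}")


def broadcasts_for(source, destination):
    """List the subnetworks on which the message is broadcast."""
    origin = subnetwork_of(source)
    target = subnetwork_of(destination)
    heard_on = [origin]            # the sender always broadcasts locally
    if target != origin:
        heard_on.append(target)    # the bridge rebroadcasts only when needed
    return heard_on


print(broadcasts_for("workstation-a", "email-server"))
print(broadcasts_for("workstation-b", "email-server"))
```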
This simple scheme can cut the number of messages each network device has to
check. Although there are a number of different technologies for building networks,
the subnetworks of a LAN all use the same networking technology. This means that all
the messages use the same language, and bridges only need to rebroadcast the original
message; no translation is required.
LANs bring benefits to individual organizations in the form of information and re-
source sharing. There are further benefits to be gained by interconnecting many orga-
nizational networks to create a network of networks. Such an interconnected network
is called an internet (note the common noun). The most famous internet is the
Internet (note the proper noun). The Internet is the global, publicly accessible internet
used by organizations and individuals throughout the world to, for example, send e-
mails, browse web pages and download files. E-mail, the Web and file downloading are
Internet applications. Each is a client-server application using Internet technologies to
allow Internet users to share information and collaborate regardless of the physical
distance separating them. Wikipedia is a good example of both information sharing –
it is an encyclopaedia – and of a collaborative effort by Internet users, as anyone can
contribute an article (Wikipedia, 2005). The huge impact of the Internet meant that
the term ‘internet’ was often confused with ‘Internet’, so the word intranet was coined
to mean a private internet, typically used within a single organization. A private inter-
net used by an organization and its partners is sometimes called an extranet.
There are some obvious problems with the idea of interconnected networks sharing
information and resources around the globe. No one wants their confidential files or
data to be accessible to everyone in the whole world. Similarly, there are few benefits
to a person living in the UK being able to print a document on a printer attached to a
network in some unknown location. Consequently, networks only expose a portion of
themselves to the Internet and often even restrict access to this.
Figure 2.5 (a) shows the devices that are typically exposed to the Internet by a net-
work. Notice that a database server is not usually exposed to the Internet. Instead, a
three-tier web database application is developed and a web server mediates all requests
for information from the database. Clients connecting via the Internet cannot actually
see the database server and may not be aware that they are interacting with a database,
even though it is on the same network as the web server.
A larger problem that arises from interconnecting networks is how to design an
efficient connection architecture.
Each network connected to the Internet does so using a router (Figure 2.5 (b)). Each
router has connections to several other routers and, through them, their networks.
(In fact, the object labelled ‘Internet’ in Figure 2.5 (b) is really just this network of con-
nections between routers.) When a network message has a destination address that
is not for a device on the network, it is handled by the router. The router reads the
destination address and decides which of the other routers it is connected to is closest to
the destination. It then sends the message to that router and only that router. Routers do
not broadcast messages. To see why, consider Figure 2.6, which shows a small intranet of
five imaginary university networks and the connections between their routers.
Suppose that a user on the Highland network sends a message to a user on the Islands
network. If the Highland router broadcasts this message, then both the Islands and
Seaboard routers receive the broadcast. The Islands router knows that the message is
for its network, so does not rebroadcast. The Seaboard router knows that the message
is not for its network, so it rebroadcasts the message, sending it to Lowlands and Pennine
(assuming that the routers are smart enough not to send a message back to the origina-
tor). Both Lowlands and Pennine also rebroadcast, even though Islands already has the
message! This is known as flooding, because unnecessary copies of the original mes-
sage flood the intranet.
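The cost of flooding can be counted with a short simulation. Two assumptions are made here that the text does not state: the router connections are those implied by the description of Figure 2.6, and each router rebroadcasts a given message only the first time it receives it (otherwise copies would circulate between Lowlands, Pennine and Seaboard indefinitely).

```python
from collections import deque

# Router connections inferred from the discussion of Figure 2.6 (an assumption).
LINKS = {
    "Highland": {"Islands", "Seaboard"},
    "Islands": {"Highland", "Pennine"},
    "Seaboard": {"Highland", "Lowlands", "Pennine"},
    "Lowlands": {"Pennine", "Seaboard"},
    "Pennine": {"Islands", "Lowlands", "Seaboard"},
}


def flood(source, destination):
    """Return (copies sent between routers, copies reaching the destination)."""
    sent = arrived = 0
    seen = {source}
    queue = deque()
    for neighbour in sorted(LINKS[source]):
        queue.append((neighbour, source))
        sent += 1
    while queue:
        router, sender = queue.popleft()
        if router == destination:
            arrived += 1           # the destination does not rebroadcast
            continue
        if router in seen:
            continue               # assumed: rebroadcast only on first receipt
        seen.add(router)
        # Never send the message straight back to the router it came from.
        for neighbour in sorted(LINKS[router] - {sender}):
            queue.append((neighbour, router))
            sent += 1
    return sent, arrived


print(flood("Highland", "Islands"))
```

Under these assumptions the call returns (7, 2): seven copies cross the intranet and Islands receives the message twice, whereas a single copy sent directly from the Highland router to the Islands router would have sufficed.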
Figure 2.6 illustrates another important point about internets, which is that there can
be more than one route between two networks. Suppose a computer on the Lowlands
network sends a message to a computer on the Islands network. The message goes to
the Lowlands router, which can send it to the Pennine or Seaboard routers. The best
choice is the Pennine router as it has a direct connection to the Islands router and
Seaboard does not. In general, routers will use the destination address to identify the
best route for a particular message. The details of how routers identify a ‘best route’ for
a message to take are beyond the scope of this book.
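Although real route selection is outside the scope of the discussion, one simple stand-in for ‘best route’ is the route that passes through the fewest routers. The sketch below uses that criterion purely as an assumption, together with router connections inferred from the description of Figure 2.6; it is not how Internet routers are claimed to work.

```python
from collections import deque

# Router connections inferred from the discussion of Figure 2.6 (an assumption).
LINKS = {
    "Highland": {"Islands", "Seaboard"},
    "Islands": {"Highland", "Pennine"},
    "Seaboard": {"Highland", "Lowlands", "Pennine"},
    "Lowlands": {"Pennine", "Seaboard"},
    "Pennine": {"Islands", "Lowlands", "Seaboard"},
}


def shortest_route(links, source, destination):
    """Breadth-first search for a route with the fewest routers."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        router = queue.popleft()
        if router == destination:
            route = []
            while router is not None:      # walk back to the source
                route.append(router)
                router = previous[router]
            return route[::-1]
        for neighbour in sorted(links[router]):
            if neighbour not in previous:
                previous[neighbour] = router
                queue.append(neighbour)
    return None                            # no route exists


route = shortest_route(LINKS, "Lowlands", "Islands")
print(route)
```

For the Lowlands to Islands example, the search passes the message to the Pennine router, which agrees with the choice made in the text.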
There is a third problem with interconnected networks. The devices in a network
communicate by sending messages, so must be able to understand those messages. This
is achieved by protocols, which define a common language that is sufficient to achieve
a specific task. The protocols used to facilitate client-server computing over an inter-
connected network have a layered structure (Figure 2.7), with each protocol relying on
services provided by the layer below it. At the highest level, the application layer in-
cludes protocols that allow a client and server to understand one another. At the lowest
level, the network layer has protocols specific to each particular network technology, al-
lowing messages to be broadcast to a network. Often the client and server are in differ-
ent networks, built with different network technologies, which are likely to use
different protocols. The protocols of the internetwork layer allow messages to pass be-
tween these two networks. The layered structure means that the intricacies of dealing
with the physical network infrastructure can be hidden from the application developer,
who focuses on the application protocols relevant for his or her specific applications.
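The idea that each layer relies on the one below it can be pictured as wrapping and unwrapping. In this sketch each layer simply places the message it is given inside an envelope of its own; the dictionary keys and the sample request are invented, and real protocol headers carry far more detail.

```python
# The layers named in the text, from highest to lowest.
LAYERS = ["application", "internetwork", "network"]


def send(data):
    """Wrap the data once per layer, working down the stack."""
    message = data
    for layer in LAYERS:
        message = {"layer": layer, "payload": message}
    return message


def receive(message):
    """Unwrap the message once per layer, working back up the stack."""
    for layer in reversed(LAYERS):
        if message["layer"] != layer:
            raise ValueError(f"expected a {layer} layer envelope")
        message = message["payload"]
    return message


wrapped = send("request for index.html")
print(wrapped["layer"])      # the outermost envelope belongs to the network layer
print(receive(wrapped))
```

The application at each end sees only its own data; the envelopes added by the lower layers are removed before the message reaches it.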
Note that the system software implementing the various internetwork and network
protocols is installed on each network device. So, when a client application wishes to
communicate with its server, it uses a local copy of the networking system software to
create a network message. When this network message arrives at the network device
that hosts the application server, the local copy of the networking system software
passes the message to the correct server application. It is useful to have a clear under-
standing of how the protocols of the internetwork layer achieve message delivery
between client and server.
The Internet and its applications use a standard suite of protocols in the internet-
work layer, but the most important of these are the Transmission Control Protocol
(TCP) and the Internet Protocol (IP). These are so important that the whole suite of
protocols is called TCP/IP, even though it includes others.
Strictly speaking, neither TCP nor IP uses the client-server architecture. They are
both peer-to-peer protocols as the source and destination of a message are treated
equally. However, client-server applications can use TCP/IP to send messages be-
tween the client and server. The application protocol will encapsulate the different
roles of client and server.
As indicated in Figure 2.7, TCP and IP work at different levels. IP interconnects the
networks on the Internet by providing a common addressing scheme and mecha-
nisms to identify a route across the Internet from the source to the destination ad-
dress. Each IP address of a network device has two parts (Figure 2.8). The network
number identifies which network the device is connected to. The host number iden-
tifies a particular device in that network. Note that routers, and other devices that are
connected to two or more networks, have a different IP address for each network they
are connected to. The IP address is a 32-bit binary number, usually written as four
integers (each between 0 and 255) separated by dots, so 192.168.7.94 is a valid IP
address. Note that the lengths of the network and host numbers vary (Rodriguez
et al. (2001) give a more detailed description of IP addresses).
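Python's standard `ipaddress` module makes the two views of an IP address easy to inspect. The example assumes, purely for illustration, that the first 24 bits of 192.168.7.94 form the network number; as noted above, the split varies from network to network.

```python
import ipaddress

address = ipaddress.ip_address("192.168.7.94")

as_integer = int(address)                 # the single 32-bit number
as_bits = format(as_integer, "032b")      # the same number in binary
print(as_bits)

# Assumption: a 24-bit network number, leaving 8 bits for the host number.
interface = ipaddress.ip_interface("192.168.7.94/24")
network_number = interface.network.network_address
host_number = int(interface.ip) & int(interface.network.hostmask)

print(network_number, host_number)
```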
In general, network devices must have a unique IP address. However, two uncon-
nected intranets may use the same IP addresses for their network devices as devices on
one intranet will never interact with devices on the other. Thus, the standards govern-
ing the IP protocol reserve a set of IP addresses for use on intranets (192.168.7.94 is an
IP address for an intranet device, rather than an Internet one).
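Whether an address falls in one of the reserved ranges can be checked mechanically. The sketch below tests against the three IPv4 blocks set aside for private use (10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16); the function name is invented for the example.

```python
import ipaddress

# The three IPv4 address blocks reserved for private networks.
PRIVATE_BLOCKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_intranet_address(text):
    """True if the address lies in one of the reserved private blocks."""
    address = ipaddress.ip_address(text)
    return any(address in block for block in PRIVATE_BLOCKS)


print(is_intranet_address("192.168.7.94"))
print(is_intranet_address("172.32.0.1"))
```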
IP sends messages as IP datagrams. Each IP datagram includes the actual data being
sent and a header. The header holds important information, including:
● the actual data being sent, which is passed to the higher-level protocol once at its
destination
● time to live.
The last of these needs further explanation, and the explanation is tied up with IP
routing – how IP datagrams are routed from their source to their destination.
Suppose an IP datagram is sent from a device in the Pennine network in Figure 2.6. If
the destination is also in the Pennine network, the best route is directly across the local
network, so IP sends the datagram that way. If the destination is not in the Pennine
network, the IP datagram is sent to the Pennine router. The IP system software on the
router checks whether or not the network number is one that it recognizes. This may
be a network with a router directly connected to the Pennine router (the Lowlands,
Seaboard or Islands routers we saw in Figure 2.6) or one connected via a chain of two
or more routers (the Highland network, reached via the Seaboard router in Figure 2.6). If so, the message is
sent directly to the first router in the chain. Otherwise, it is sent to a default router.
In this sense, IP is a best effort protocol as the IP system software on each router
does its best to find the destination, but, if it fails, hands the problem on to another
router. Note that a router’s list of known networks is updated regularly, so the ‘best’
route may change.
IP is also a connectionless protocol – there is no attempt to notify the destination
address that a datagram is on the way. Indeed, the first router may not know if the des-
tination address even exists. This is where the ‘time to live’ comes in. This is the num-
ber of routers the datagram can pass through before IP assumes that the destination
address is wrong and throws the datagram away.
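The effect of the ‘time to live’ can be shown with a counter that falls by one at each router. The sketch follows the simplified description given here (the value is the number of routers a datagram may pass through); the function name and the example path are assumptions.

```python
def send(routers_on_path, time_to_live):
    """Return (delivered?, number of routers the datagram passed through)."""
    passed = 0
    for router in routers_on_path:
        if time_to_live == 0:
            return False, passed     # this router throws the datagram away
        time_to_live -= 1            # one more router used up
        passed += 1
    return True, passed


path = ["Pennine", "Seaboard", "Highland"]
print(send(path, 3))
print(send(path, 2))
```

With a time to live of 3 the datagram passes all three routers; with a time to live of 2 it is discarded at the third.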
workstation, or a server, a unique local identifier called a port number. The combination
of IP address and port number is called a socket and allows TCP to route messages di-
rectly between a client and server (or, more generally, between any two computer pro-
grams that need to communicate).
Internet application servers tend to use well-known port numbers between 1 and
1023, and each type of server normally keeps the same one. For example, web servers
are normally given TCP port 80 (see IANA (2005) for a full list). Clients tend to use an
ephemeral port number (1024 or greater), which is not fixed for each one. This is be-
cause most TCP connections are initiated by the client requesting a service from the
server. The client needs to know the IP address and port number of its server, but can
tell the server its own IP address and port number in its initial message.
The combination of the source and destination sockets forms the TCP connection
(Figure 2.9). TCP establishes a connection between the source of a message and the in-
tended destination, which, typically, are the client and server of a client-server appli-
cation. The TCP system software running on the source computer sends a message to
the destination. Once it has been acknowledged by the TCP system software on the
destination computer, the connection is established. If the destination IP address or
the destination port number is not recognized, TCP reports an error.
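The two sockets that make up a TCP connection can be observed using Python's standard `socket` module. This is a sketch under some assumptions: client and server run on the same machine using the loopback address 127.0.0.1, and the server asks the operating system for any free port rather than a well-known one, since binding to ports below 1024 usually needs special privileges.

```python
import socket

# Server side: port 0 asks the operating system to choose a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server_socket = server.getsockname()          # (IP address, port number)

# Client side: it must know the server's socket; its own port is ephemeral.
client = socket.create_connection(server_socket)
accepted, client_seen_by_server = server.accept()

# Source IP address, source port, destination IP address, destination port.
connection = client.getsockname() + client.getpeername()
print(connection)

client.close()
accepted.close()
server.close()
```

The server learns the client's IP address and port number only when the connection is made, which is why the client's port does not need to be fixed in advance.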
TCP uses the connection to provide reliable delivery of IP datagrams. The TCP sys-
tem software on the destination computer acknowledges receipt of each IP datagram.
If the TCP system software on the source computer does not receive acknowledgement
within a set time, it will resend the lost IP datagram.
TCP also deals with the problem of large messages by splitting them into segments
and sending each segment in its own IP datagram. The TCP system software on the
destination computer collects all the segments, sets them into the correct order and
reassembles the original message. This is known as stream data transfer as, to the
application using TCP, its data appears to have been transmitted in a continuous
stream. This is often essential. For example, when a database client sends a sequence of
instructions to the DBMS, they must be carried out in the correct order.
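Splitting and reassembly can be sketched as follows. Each segment is labelled with its starting position in the original message, which is enough to restore the order at the destination; the segment size, function names and message text are assumptions made for the example.

```python
def split_into_segments(message, size):
    """Cut the message into (starting position, data) pairs."""
    return [
        (start, message[start:start + size])
        for start in range(0, len(message), size)
    ]


def reassemble(segments):
    """Put the segments back in order and join them up."""
    in_order = sorted(segments, key=lambda segment: segment[0])
    return b"".join(data for _, data in in_order)


message = b"first instruction; second instruction; third instruction"
segments = split_into_segments(message, 16)

# Pretend the segments arrived in the opposite order to that in which they were sent.
arrived = segments[::-1]
print(reassemble(arrived) == message)
```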
TCP provides three other, slightly more technical, features. First, it is a full duplex
protocol – that is, the source and destination can be sending messages simultaneously.
This is a little like two people talking over one another, but still being able to under-
stand everything that is said.
Second, as each application using TCP gets its own port number, TCP can manage
communications for multiple applications, concurrently – this is called multiplexing.
Finally, the TCP system software on the destination computer can let the TCP sys-
tem software on the source computer know how busy it is. This allows TCP to slow
TCP Connection ⬅ Source IP address Source port Destination IP address Destination port
40
IDWA_C02.QXD 10/5/07 11:06 AM Page 41
down the rate at which it sends data through the connection, to ease the load at the
other end. This is called flow control.
Numeric IP addresses are not easy for people to remember. They are all right for
workstations, which hold only client software, but users need to remember the loca-
tion of application servers (especially web servers). A domain name is an easily remem-
bered alternative to an IP address. Domain names have a hierarchical structure and
should be read from right to left. Consider the domain name www.pennine.ac.uk. The
‘.uk’ is the top-level domain – in this case, indicating that the server is managed by an
organization in the United Kingdom. There is a top-level domain for each country in
the world and they are called country code top-level domains (ccTLD). There are also
several generic top-level domains (gTLD), such as .com and .org, which are not associ-
ated with any particular country.
Originally there were only gTLD. Only when the Internet spread out from its origins in
the USA were ccTLD added. Most organizations in the USA have domain names with
a gTLD. The USA’s ccTLD is ‘.us’, but it is not widely used.
The domain name system (DNS) is currently overseen by the Internet Corporation
for Assigned Names and Numbers (ICANN). It decides which top-level domains
should be available. Further information on domain names and the DNS is available
from its website at www.icann.org.
Note that, although all the examples so far have had four parts to their domain names,
this is not a requirement. For example, ‘www.wikipedia.org’ has only three, and
‘fr.comp.pennine.ac.uk’ has five (it happens to be the French language version of the
School of Computing’s website). There is no direct connection between the four num-
bers in the IP address and the different parts of a domain name.
The domain name system (DNS) allows an Internet application to translate a do-
main name, which is used by the application users, into the corresponding IP ad-
dress, which is used by the router. The idea is that each top-level domain has a
name server that maintains a database of domain names and their corresponding
IP addresses. So, for example, there will be a name server dealing with domain
names ending ‘.uk’. As this would be a very large database, the ‘.uk’ domain (and
every other top-level domain) is divided up into zones. For example, there might be
a ‘.ac.uk’ zone and a ‘.co.uk’ zone, each with its own name server. Zones are defined
in such a way that the associated name server is not overloaded with requests for IP
addresses. Thus, when a user gives a domain name to an Internet application, the
application looks at the top-level domain and asks the name server for this domain
to supply the IP address. The top-level domain name server may refer the request to
the name server of one of its zones, but eventually the request is fulfilled. Note that
this is done via the Internet; there is no need for a physical connection to the name
servers.
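The referral from a top-level domain's name server to the name server of one of its zones can be imitated with a few dictionaries. Everything in this sketch is made up for illustration: the name-server tables, the IP address given for www.pennine.ac.uk and the function name. A real lookup would, of course, send requests across the Internet.

```python
# Each name server holds some addresses and may refer requests to its zones.
SERVERS = {
    "uk": {"addresses": {}, "zones": {"ac.uk": "ac.uk", "co.uk": "co.uk"}},
    "ac.uk": {"addresses": {"www.pennine.ac.uk": "192.0.2.10"}, "zones": {}},
    "co.uk": {"addresses": {}, "zones": {}},
}


def resolve(domain_name):
    """Return (IP address, list of name servers asked)."""
    server = domain_name.split(".")[-1]    # read right to left: top level first
    asked = []
    while True:
        records = SERVERS.get(server)
        if records is None:
            raise LookupError(f"no name server for {server}")
        asked.append(server)
        if domain_name in records["addresses"]:
            return records["addresses"][domain_name], asked
        for zone, zone_server in records["zones"].items():
            if domain_name.endswith("." + zone):
                server = zone_server       # refer the request to the zone
                break
        else:
            raise LookupError(f"unknown domain name: {domain_name}")


print(resolve("www.pennine.ac.uk"))
```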
IP addresses, domain names and ports are all concepts that allow the networking
technology to route messages between a client and server. The uniform resource loca-
tor (URL) allows the client to specify exactly which resource it wants from a particular
server. For example, a web browser uses a URL to tell the web server which web page
(resource) it wants.
URLs are used by a number of different Internet applications, called schemes, but
have a standard format. Figure 2.10 (a) shows the general structure for a URL. The
name of the scheme (typically the application protocol name) is first, followed by a
colon – ‘:’ – then some scheme-specific information. The structure of the scheme-spe-
cific information varies from scheme to scheme, but Figure 2.10 (b) shows the struc-
ture commonly used by application protocols that are based on IP. First come two
forward slashes – ‘//’. Immediately following these comes the login information – the
username and password, separated by a colon and terminated by an ‘@’ symbol. Login
information is optional and is usually omitted, because including usernames and passwords
in a URL is now recognized as being very insecure. The first compulsory part is
the ‘host’, which is the IP address, or domain name, of the application server. Follow-
ing this is the optional TCP port information, preceded by a colon. As most Internet
applications have a default port, this is usually omitted, too. Next comes a single for-
ward slash – ‘/’ – followed by the optional URL path. The URL path indicates how to
access the resource and may include data that the client is sending to the server, so is
specific to the scheme.
Consider the URL ‘http://www.pennine.ac.uk:80/index.html’. The scheme is HTTP,
which indicates a web application. The host is ‘www.pennine.ac.uk’, the Pennine University’s
web server. This server is reached on TCP port 80, the standard port number
for web servers. Finally, the URL path is ‘index.html’. For web URLs, this indicates a
particular file managed by the web server. Note that login information (user and password)
is not allowed in a URL with the HTTP scheme (Berners-Lee, 1994).
[Figure 2.10 (b): structure of the scheme-specific information: // User name : Password @ Host : Port / URL path]
Figure 2.11 Structure of the URL path for the HTTP scheme.
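As a brief illustration, the parts of a URL described above can be picked out with Python's standard urllib.parse module. Python is used here only as a convenient notation; the FTP URL, with its username and password, is made up to show the optional login information.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.pennine.ac.uk:80/index.html")
print(parts.scheme)    # http
print(parts.hostname)  # www.pennine.ac.uk
print(parts.port)      # 80
print(parts.path)      # /index.html

# The general structure also allows login information before the host.
# This FTP URL, including its username and password, is invented.
ftp = urlsplit("ftp://alice:secret@ftp.pennine.ac.uk/reports/2007.txt")
print(ftp.username, ftp.password, ftp.hostname, ftp.port)
# alice secret ftp.pennine.ac.uk None
```

Note that this module reports the path with its leading slash, and that an omitted port is reported as None, leaving the application to fall back on the scheme's default port.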
A URL is one form of the uniform resource identifier (URI). The URL identifies the net-
work address of a resource, so, if a resource moves, its URL changes.
The other form of URI is the uniform resource name (URN). A URN defines a unique
name for the resource, rather than its network location. The idea is to translate a URN
into a URL, much as a domain name is translated to an IP address. This allows the
URN to stay the same even when the resource is moved, so provides a constant way
to locate the resource – no more broken links on a website. Unfortunately, the tech-
nology to support URN has not been widely implemented, so, for now, the URL is the
main way to locate a resource.
The URL path for the HTTP scheme can be more complex than a simple file name. For
web database applications, the important components of the URL path are shown in
Figure 2.11. The path component indicates where the requested resource is stored in the
web server’s file system. The query (or query string) component is used to pass information
to the DBMS. For example, the URL ‘http://www.pennine.ac.uk/StaffDirectory/StaffDetail.php?staffNo=31210’
has a path component ‘StaffDirectory/StaffDetail.php’,
which indicates a PHP script file held in the StaffDirectory folder. The query compo-
nent – ‘staffNo=31210’ – passes a staff number to the PHP script, which it uses to query
the database.
Strictly speaking, the fragment identifier (shown in Figure 2.11) is not actually part
of the URL. Its purpose is to identify a particular section within the resource. For HTML
documents, the fragment identifier tells the web browser which part of a long web page
to display in the browser window. Users can simply scroll up or down to see the rest.
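The path, query and fragment components can be separated in the same way. The sketch below uses Python's standard urllib.parse module; the fragment identifier ‘contact’ is a hypothetical addition to the example URL.

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://www.pennine.ac.uk/StaffDirectory/StaffDetail.php"
       "?staffNo=31210#contact")   # '#contact' is a made-up fragment
parts = urlsplit(url)

print(parts.path)      # /StaffDirectory/StaffDetail.php
print(parts.query)     # staffNo=31210
print(parts.fragment)  # contact

# The query string is a set of name=value pairs.
query = parse_qs(parts.query)
staff_no = query["staffNo"][0]
print(staff_no)        # 31210
```

The staff number arrives as text, so a script would normally validate it before using it in a database query.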
[Figure 2.12: the database server. Components: network system software, operating system software, data communications manager (DCM), DBMS, data dictionary and database. The DCM manages communication for the DBMS; the data dictionary describes the database.]
Figure 2.12 highlights the major software and data components of the server tier. The
database server includes the operating system (OS) software, which manipulates the
physical files and computer’s memory according to instructions from the DBMS. The
role of the network system software has already been discussed. The interface between
the DBMS and the network system software is quite complex, so is usually treated as a
separate component – the data communications manager (DCM) in Figure 2.12.
The main application software is the database management system (DBMS). This
manages the actual database and its data dictionary by means of a range of functions.
The server’s software is not usually written by database application developers. In-
stead, they use an existing DBMS (with its associated DCM) to create a database (with
its associated data dictionary) that meets the end user’s requirements. The DBMS has a
range of functions that allow application developers to do this.
The DBMS is classed as application software as its functions allow end users to carry
out tasks that benefit them directly. However, it is now rarely used directly by end
users. Instead, they interact with increasingly sophisticated database client applica-
tions and these interact with the DBMS. It could be argued that, as the DBMS sinks
into the background, it becomes more like system software.
for a particular database. Most DBMS can manage databases based on the relational
data model and some support other data models.
Data definition functions are used to define the data structures and constraints, not
the actual data.
Once the data structures have been created, users will want to fill them with data. After
all, an empty database is not of much use. The DBMS provides data manipulation
functions to manage the actual data that forms the database instance. Users can add
data and modify or remove existing data. They can also ask to view data from the data-
base. A request to view selected data held in the database is called a database query.
The DBMS also provides data integrity functions that monitor users’ requests to modify
the data and ensure that these changes do not violate the database’s constraints. This
means that the database instance is always consistent.
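The three kinds of function can be seen side by side in a short sketch. It uses SQLite through Python's standard sqlite3 module as a stand-in for a DBMS; the cut-down Staff table and the new phone number are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: describe the structure and constraints, not the data.
conn.execute("""
    CREATE TABLE Staff (
        staffNo    INTEGER PRIMARY KEY,
        fName      TEXT NOT NULL,
        lName      TEXT NOT NULL,
        phone      TEXT,
        department TEXT NOT NULL
    )
""")

# Data manipulation: add, modify and query the database instance.
conn.execute("INSERT INTO Staff VALUES (31210, 'Paul', 'Smith', NULL, 'Computing')")
conn.execute("INSERT INTO Staff VALUES (23257, 'Freya', 'Stark', '8660', 'Computing')")
conn.execute("UPDATE Staff SET phone = '8700' WHERE staffNo = 31210")  # made-up number
rows = conn.execute(
    "SELECT staffNo, lName, phone FROM Staff ORDER BY staffNo"
).fetchall()

# Data integrity: a change that violates a constraint is rejected.
try:
    conn.execute("INSERT INTO Staff VALUES (31210, 'John', 'Smith', NULL, 'Catering')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows)      # [(23257, 'Stark', '8660'), (31210, 'Smith', '8700')]
print(rejected)  # True
```

The final insert reuses an existing staffNo, so it violates the primary key constraint and the database instance is left unchanged.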
A database is a shared collection of data, so the DBMS must be able to deal with
many users making different requests at the same time. It provides concurrency func-
tions to keep users’ interactions separate.
A DBMS also provides a number of security functions related to users. When a user
asks to use the database, the DBMS must check that he or she is allowed access to it.
Identification and authentication functions achieve this. The usual approach is
to identify users by a username, which is authenticated by a password. If a user fails to
identify him or herself correctly (wrong username) or fails to authenticate his or
her identity (wrong password), the DBMS denies access to the database.
Once users have gained access, they request that the DBMS carry out actions on
their behalf. Authorization functions check that the user is authorized to make each
request. For example, in the University Staff Directory database, Paul Smith is not
authorized to change Freya Stark’s staff details, though he is authorized to view them.
The DBMS must store all this security information – usernames, passwords and autho-
rization details. Although the exact data will vary from DBMS to DBMS, most store it
in the data dictionary. Thus, the data dictionary is a central repository of the data and
meta-data that the DBMS needs to carry out its functions.
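A minimal sketch of these security checks follows. It is not how any particular DBMS stores or checks its security data; the username, password, salt and permission table are all invented, and a real system would use a separate random salt for each user.

```python
import hashlib
import hmac


def _hash(password, salt):
    # Store a salted hash rather than the password itself.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


# Hypothetical security data of the kind a data dictionary might hold.
SALT = b"demo-salt"
USERS = {"psmith": _hash("correct horse", SALT)}
PERMISSIONS = {"psmith": {("Staff", "view")}}   # may view, may not modify


def authenticate(username, password):
    stored = USERS.get(username)                # identification
    if stored is None:
        return False
    return hmac.compare_digest(stored, _hash(password, SALT))  # authentication


def authorized(username, table, action):
    return (table, action) in PERMISSIONS.get(username, set())


print(authenticate("psmith", "correct horse"))   # True
print(authenticate("psmith", "wrong password"))  # False
print(authorized("psmith", "Staff", "view"))     # True
print(authorized("psmith", "Staff", "modify"))   # False
```

Authentication is checked once, when access is requested; authorization is checked again for every request the user makes.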
The database is not usually located on the user’s own machine, so a DBMS must
have a set of communication functions allowing it to send and receive messages. This
may or may not involve sending messages over a network. For example, some database
clients are installed on the same computer as the database server.
In addition, database technology was widely adopted before the rise of internet-
working technology and the TCP/IP protocol suite, so most DBMS have their own
approaches to client-server communication over a network. Strictly speaking, it is not
the DBMS that handles this, but the data communications manager (DCM). This soft-
ware works alongside the DBMS to integrate it with local clients or the network system
software. However, for most purposes, it is reasonable to treat the DCM as being part of
the DBMS.
The concurrency, security and communications functions allow the DBMS to man-
age its users. Together with the data definition functions, which manage the database
schema, and data manipulation and integrity functions, which manage the database
instance, they form the core functionality of the DBMS, summarized in Table 2.1.
Most DBMS also provide some utility functions, making the DBMS itself more use-
ful. A DBMS will usually provide backup functions. These create a copy of the data-
base, both the data and the meta-data. The database backup can be stored safely in
case the actual database is lost if, for example, the disk crashes or a virus damages the
database. If the database is lost or damaged, the DBMS recovery functions can recreate
the database from the backup copy. This is the simplest form of recovery. Many com-
mercial DBMS have more sophisticated recovery functions. For example, a power fail-
ure will stop the DBMS working immediately. If it was part way through writing data
to the disk, then it’s possible that one or more integrity constraints would be violated
so the database instance would be inconsistent. When the power returns, some DBMS
can automatically return the database instance to a consistent state.
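The simplest form of backup and recovery can be sketched with SQLite's backup facility, available through Python's standard sqlite3 module. In-memory databases stand in for real database files here, and the one-row Staff table is an assumption made for the example.

```python
import sqlite3

live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE Staff (staffNo INTEGER PRIMARY KEY, lName TEXT)")
live.execute("INSERT INTO Staff VALUES (31210, 'Smith')")
live.commit()

# Backup: copy both the data structures (meta-data) and the data.
backup_copy = sqlite3.connect(":memory:")
live.backup(backup_copy)

# Simulate losing the live database: an in-memory database
# disappears when its connection is closed.
live.close()

# Recovery: recreate the database from the backup copy.
restored = sqlite3.connect(":memory:")
backup_copy.backup(restored)
recovered = restored.execute("SELECT staffNo, lName FROM Staff").fetchall()
print(recovered)   # [(31210, 'Smith')]
```

Any changes made after the backup was taken would be lost, which is why commercial DBMS offer the more sophisticated recovery functions mentioned above.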
A database may replace an existing information system. It is unlikely that the users
will want to retype all the existing data, so a DBMS is usually able to import this data
directly into the new data structures. Similarly, it can export data ready to import into
a different database. These import/export functions allow data exchange between
databases that use different data models.
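As a rough illustration of export and import, the sketch below moves data from an old structure into a new one by way of CSV text. The People table, its columns and the rule for splitting a full name are hypothetical.

```python
import csv
import io
import sqlite3

# An existing system holding staff data in its own structure (hypothetical).
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE People (id INTEGER, full_name TEXT)")
old.executemany("INSERT INTO People VALUES (?, ?)",
                [(31210, "Paul Smith"), (23257, "Freya Stark")])

# Export: write the rows out in a neutral format (CSV text here).
buffer = io.StringIO()
csv.writer(buffer).writerows(
    old.execute("SELECT id, full_name FROM People ORDER BY id"))
exported = buffer.getvalue()

# Import: read the exported rows into the new data structures.
new = sqlite3.connect(":memory:")
new.execute("CREATE TABLE Staff (staffNo INTEGER PRIMARY KEY, fName TEXT, lName TEXT)")
for staff_no, full_name in csv.reader(io.StringIO(exported)):
    f_name, l_name = full_name.split(" ", 1)
    new.execute("INSERT INTO Staff VALUES (?, ?, ?)", (int(staff_no), f_name, l_name))

imported = new.execute("SELECT * FROM Staff ORDER BY staffNo").fetchall()
print(imported)   # [(23257, 'Freya', 'Stark'), (31210, 'Paul', 'Smith')]
```

The import step is where the old structure is translated into the new one; in practice this translation is usually the hard part.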
All the above functions must be performed efficiently, so a DBMS usually has a set
of performance and optimization functions. These allow users to monitor performance
and identify suitable optimizations. Different DBMS may offer more or fewer utility
functions. The particular needs of the database application will determine which func-
tions are used and which are not.
Chapter 1 discussed the problem of program–data dependence and noted that an
important advantage of database applications is that they do not suffer from this prob-
lem. A database system architecture gives a high-level description of how to achieve
program–data independence and other benefits.
The best-known of these is the ANSI/SPARC database system architecture (Figure
2.13). It identifies three views of the data structures and data that comprise a particular
database – external, logical and internal. Each is a view of the same database, so it is possible
to move between them. The idea is that each view provides an image of the database
appropriate to a different group of database users. At the highest level are the views
seen by the end users. Few end users need to see the whole database, so it is sensible to
hide any irrelevant parts from them. Thus, each external view provides a tailored view
of the data structures and data for a group of end users with similar information needs.
Often, a different external view will be defined for each client application, as Figure 2.13
indicates by showing three external views.
[Figure 2.13: the ANSI/SPARC database system architecture. Three external views are each linked by an external-logical mapping to a single logical view, which is linked by a logical-internal mapping to the internal view; the physical level lies beneath the internal view.]
Below the external views is the logical view. It provides a complete description of
the data structures and data in the database, so there is a single logical view for each
database. The logical view is often called the community view as it serves the whole
community of database users. Application developers and database administrators use
the logical view to understand what data is available and how it can be used to meet
the information needs of end users.
For each external view, they develop an external-logical mapping. In Figure 2.13,
these are shown as the lines joining the external views to the logical view. Each external-
logical mapping describes how the data structures in a particular external view are
derived from the data structures of the logical view.
Both the logical and external views provide descriptions of the database in terms of
the underlying data model. This is a useful, human-friendly way of thinking about the
data structures and data in the database. However, the focus is on organizing the data
into meaningful structures. Data models usually ignore data storage issues – what
space is available for data storage, how the data is stored and what techniques are
used to ensure data can be accessed efficiently.
There are various data storage structures available to deal with these issues. The inter-
nal view of the ANSI/SPARC database systems architecture describes which of these
data storage structures are used and how. Each data structure from the logical view is
linked to data storage structures of the internal view. The logical-internal mapping
shown in Figure 2.13 describes these links.
An example of a particular data storage structure is clustered data. Figure 2.14 shows
an instance of the Staff table, with data clustered by jobType – all members of staff
with the same type of job being stored in a continuous chunk. This makes it easy to
retrieve a list of academics, for example. You just read through the academics, noting
their details, and, when you reach the first administrator, you stop. In contrast, to find
all staff in the School of Health, you must read through all the data or you’d miss Helene
Chirac, the department’s technician, as she’s neither an academic nor an administrator.
Figure 2.14 Data from the Staff table clustered by job type.
Staff
staffNo | fName       | lName     | phone | email             | photo  | department | title | jobType        | jobTitle
35054   | Selma       | Hutchins  | 8706  | [email protected] | [BLOB] | Computing  | Ms    | Academic       | Senior Lecturer
55776   | Gurpreet    | Choudhury | 5454  | [email protected] | [BLOB] | Music      | Dr    | Academic       | Senior Lecturer
56893   | Ruth        | Bapetsi   | 8022  | [email protected] | [BLOB] | Health     | Mrs   | Academic       | Senior Lecturer
31210   | Paul        | Smith     | NULL  | [email protected] | [BLOB] | Computing  | Dr    | Academic       | Senior Lecturer
89987   | Dan         | Lin       | 8514  | [email protected] | [BLOB] | Health     | Dr    | Administration | Senior Administrator
78893   | Jo Karen    | O’Connor  | 8871  | [email protected] | [BLOB] | Health     | Miss  | Administration | Administrator
25447   | John        | Smith     | 5104  | [email protected] | [BLOB] | Music      | Mr    | Administration | Secretary
14443   | Helen       | Abbot     | 8032  | [email protected] | [BLOB] | Computing  | Mrs   | Administration | Secretary
33935   | Padma       | Brar      | 6641  | [email protected] | [BLOB] | Health     | Ms    | Administration | Administrator
25448   | Judith Anne | Smith     | 7709  | [email protected] | [BLOB] | Estates    | Mrs   | Support        | Estates Manager
10780   | John        | Smith     | NULL  | [email protected] | [BLOB] | Catering   | Mr    | Support        | Chef
33509   | Helen       | Timms     | 8661  | [email protected] | [BLOB] | Music      | Mrs   | Technical      | Technician
77712   | Frank       | Rose      | 8871  | [email protected] | [BLOB] | Computing  | Mr    | Technical      | Technician
35155   | Helene      | Chirac    | NULL  | [email protected] | [BLOB] | Health     | Miss  | Technical      | Technician
It’s important to understand that clustering data is about more than simply order-
ing the rows. For example, the DBMS keeps track of the starting point for each cluster,
so can skip through the stored data directly to the first row with jobType ‘Adminis-
tration’. Data storage structures are discussed in more detail in Chapter 10.
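The difference between reading one cluster and reading everything can be sketched as follows. The rows follow Figure 2.14 with most columns left out; the dictionary of cluster starting points is only a guess at the kind of bookkeeping a DBMS might do, not a description of a real storage structure.

```python
# Rows stored in clusters by jobType: (staffNo, lName, department, jobType).
ROWS = [
    (35054, "Hutchins", "Computing", "Academic"),
    (55776, "Choudhury", "Music", "Academic"),
    (56893, "Bapetsi", "Health", "Academic"),
    (31210, "Smith", "Computing", "Academic"),
    (89987, "Lin", "Health", "Administration"),
    (78893, "O'Connor", "Health", "Administration"),
    (25447, "Smith", "Music", "Administration"),
    (14443, "Abbot", "Computing", "Administration"),
    (33935, "Brar", "Health", "Administration"),
    (25448, "Smith", "Estates", "Support"),
    (10780, "Smith", "Catering", "Support"),
    (33509, "Timms", "Music", "Technical"),
    (77712, "Rose", "Computing", "Technical"),
    (35155, "Chirac", "Health", "Technical"),
]

# The starting point of each cluster, recorded once.
CLUSTER_START = {}
for position, row in enumerate(ROWS):
    CLUSTER_START.setdefault(row[3], position)


def staff_with_job_type(job_type):
    """Jump to the cluster start, then read until the job type changes."""
    found, rows_read = [], 0
    start = CLUSTER_START.get(job_type, len(ROWS))
    for row in ROWS[start:]:
        rows_read += 1
        if row[3] != job_type:
            break
        found.append(row[0])
    return found, rows_read


def staff_in_department(department):
    """No cluster helps here, so every row must be read."""
    found = [row[0] for row in ROWS if row[2] == department]
    return found, len(ROWS)


print(staff_with_job_type("Administration"))
# ([89987, 78893, 25447, 14443, 33935], 6)
print(staff_in_department("Health"))
# ([56893, 89987, 78893, 33935, 35155], 14)
```

Finding the administrators reads six rows (five matches plus the first Support row, which signals the end of the cluster), whereas finding the Health staff reads all fourteen.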
Figure 2.15 presents an image of the data and data structures in the Staff Directory
database at the three different levels of the ANSI/SPARC database systems architecture.
It shows the external views for two applications – the Staff Directory itself (labelled
StaffDirectory : ExternalView) and the School of Computing’s Staff List (labelled
ComputingStaffList : ExternalView). Both these applications use the same Staff Directo-
ry database, but have different views of the data. For example, in the external view for
the Staff Directory, the Staff table does not hold the titles of members of staff, nor
any details about their jobs. Similarly, the external view for the School of Computing’s
Staff List has no details of support sessions.
All the data in these two external views is derived from the data held in the logical
view (labelled StaffData : LogicalView), although the derivation is not always a straight
copy. For example, in the external view for the School of Computing’s Staff List the
name column of the StaffList table is a combination of data from three columns
of the logical view (see Figure 2.16).
In the internal view, the data for the Staff and SupportSession tables is
stored in an appropriate format. The data storage structures shown in StaffData : Internal-
View in Figure 2.15 are explained in Chapter 10 (note that not all data is shown for the
internal view).
It is worth analysing more closely the mapping between the two external views and
the logical view. The external view for the Staff Directory includes the SupportSession
table unaltered – all rows and columns present in the logical view are visible in the ex-
ternal view. The Staff table in the external view does not have the final four
columns of that in the logical view, but includes all the rows. Figure 2.15 shows these
differences.
The mapping between the external view for the School of Computing’s Staff List
and the logical view is more complex. First, there is no data from the logical view’s
SupportSession table in this external view. The data in the StaffList table in
the external view is drawn solely from the logical view’s Staff table. However, the
StaffList table does not include data from every row of the logical view’s Staff
table, only from those rows in which the department column has the value
‘Computing’ (check the staffNo column values shown in Figures 2.15 or 2.16 to ver-
ify this). This makes sense from the application’s perspective as it is only supposed to
show staff from the School of Computing.
Note that the StaffList table does not include all the columns of the logical
view’s Staff table either. The photo, department, title, fName and lName
columns are all absent.
Finally, the data value in the name column of the StaffList table in the external
view is a combination of data from the title, fName and lName columns in the
logical view’s Staff table. Figure 2.16 shows how this data value has been derived. To
support this mapping, the DBMS must be able to strip off the first letter of fName and
combine it with title and lName into a single data value.
Figure 2.15 Staff and support session data at the different levels of the ANSI/SPARC architecture.
[For the external views, only the rows legible in the figure are listed.]
StaffDirectory : ExternalView
Staff
staffNo | fName       | lName    | phone | email             | photo
10780   | John        | Smith    | NULL  | [email protected] | [BLOB]
25447   | John        | Smith    | 5104  | [email protected] | [BLOB]
25448   | Judith Anne | Smith    | 7709  | [email protected] | [BLOB]
31210   | Paul        | Smith    | NULL  | [email protected] | [BLOB]
77712   | Frank       | Rose     | 8871  | [email protected] | [BLOB]
14443   | Helen       | Abbot    | 8032  | [email protected] | [BLOB]
23257   | Freya       | Stark    | 8660  | [email protected] | [BLOB]
33935   | Padma       | Brar     | 6641  | [email protected] | [BLOB]
35054   | Selma       | Hutchins | 8706  | [email protected] | [BLOB]
SupportSession
staffNo | dayOfWeek | startTime | endTime
35054   | Monday    | 09:00:00  | 10:00:00
31210   | Wednesday | 11:00:00  | 13:00:00
35054   | Monday    | 15:00:00  | 16:00:00
45965   | Monday    | 11:00:00  | 12:00:00
45965   | Wednesday | 11:00:00  | 12:00:00
23257   | Monday    | 15:00:00  | 16:00:00
55776   | Monday    | 14:00:00  | 16:00:00
56893   | Tuesday   | 14:00:00  | 15:00:00
56893   | Thursday  | 09:00:00  | 10:00:00
ComputingStaffList : ExternalView
StaffList
staffNo | name       | phone | email             | jobTitle        | jobType
31210   | Dr P Smith | NULL  | [email protected] | Senior Lecturer | Academic
77712   | Mr F Rose  | 8871  | [email protected] | Technician      | Technical
StaffData : LogicalView
Staff
staffNo | fName       | lName     | phone | email             | photo  | department | title | jobType        | jobTitle
10780   | John        | Smith     | NULL  | [email protected] | [BLOB] | Catering   | Mr    | Support        | Chef
25447   | John        | Smith     | 5104  | [email protected] | [BLOB] | Music      | Mr    | Administration | Secretary
25448   | Judith Anne | Smith     | 7709  | [email protected] | [BLOB] | Estates    | Mrs   | Support        | Estates Manager
31210   | Paul        | Smith     | NULL  | [email protected] | [BLOB] | Computing  | Dr    | Academic       | Senior Lecturer
77712   | Frank       | Rose      | 8871  | [email protected] | [BLOB] | Computing  | Mr    | Technical      | Technician
14443   | Helen       | Abbot     | 8032  | [email protected] | [BLOB] | Computing  | Mrs   | Administration | Secretary
23257   | Freya       | Stark     | 8660  | [email protected] | [BLOB] | Computing  | Prof  | Academic       | Dean
33935   | Padma       | Brar      | 6641  | [email protected] | [BLOB] | Health     | Ms    | Administration | Administrator
35054   | Selma       | Hutchins  | 8706  | [email protected] | [BLOB] | Computing  | Ms    | Academic       | Senior Lecturer
45965   | Mikhail     | Sudbin    | 5553  | [email protected] | [BLOB] | Music      | Mr    | Academic       | Lecturer
35155   | Helene      | Chirac    | NULL  | [email protected] | [BLOB] | Health     | Miss  | Technical      | Technician
55776   | Gurpreet    | Choudhury | 5454  | [email protected] | [BLOB] | Music      | Dr    | Academic       | Senior Lecturer
56893   | Ruth        | Bapetsi   | 8022  | [email protected] | [BLOB] | Health     | Mrs   | Academic       | Senior Lecturer
56673   | Joshua      | Bittaye   | 7782  | [email protected] | [BLOB] | Computing  | Mr    | Academic       | Lecturer
89987   | Dan         | Lin       | 8514  | [email protected] | [BLOB] | Health     | Dr    | Administration | Senior Administrator
78893   | Jo Karen    | O’Connor  | 8871  | [email protected] | [BLOB] | Health     | Miss  | Administration | Administrator
33509   | Helen       | Timms     | 8661  | [email protected] | [BLOB] | Music      | Mrs   | Technical      | Technician
SupportSession
staffNo | dayOfWeek | startTime | endTime
35054   | Monday    | 09:00:00  | 10:00:00
31210   | Wednesday | 11:00:00  | 13:00:00
35054   | Monday    | 15:00:00  | 16:00:00
45965   | Monday    | 11:00:00  | 12:00:00
45965   | Wednesday | 11:00:00  | 12:00:00
23257   | Monday    | 15:00:00  | 16:00:00
55776   | Monday    | 14:00:00  | 16:00:00
56893   | Tuesday   | 14:00:00  | 15:00:00
56893   | Thursday  | 09:00:00  | 10:00:00
56673   | Thursday  | 15:30:00  | 16:30:00
56673   | Friday    | 10:00:00  | 11:00:00
StaffData : InternalView
[data storage structures not shown]
Figure 2.16 The external-logical mapping: a column value in the external view is derived from three of the column values in the logical view.
ComputingStaffList : ExternalView
StaffList
staffNo | name       | phone | email             | jobTitle        | jobType
31210   | Dr P Smith | NULL  | [email protected] | Senior Lecturer | Academic
StaffData : LogicalView
Staff (source row only; the Staff and SupportSession data are the same as in the logical view of Figure 2.15)
staffNo | fName | lName | phone | email             | photo  | department | title | jobType  | jobTitle
31210   | Paul  | Smith | NULL  | [email protected] | [BLOB] | Computing  | Dr    | Academic | Senior Lecturer
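One way to express such an external-logical mapping is as an SQL view. The sketch below uses SQLite through Python's standard sqlite3 module as a stand-in DBMS; it is an illustration, not necessarily the mechanism the Staff Directory uses. Only three rows are loaded, and the email and photo columns are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Staff (
        staffNo INTEGER PRIMARY KEY, fName TEXT, lName TEXT, phone TEXT,
        department TEXT, title TEXT, jobType TEXT, jobTitle TEXT
    )
""")
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (31210, "Paul", "Smith", None, "Computing", "Dr", "Academic", "Senior Lecturer"),
        (77712, "Frank", "Rose", "8871", "Computing", "Mr", "Technical", "Technician"),
        (33935, "Padma", "Brar", "6641", "Health", "Ms", "Administration", "Administrator"),
    ],
)

# External-logical mapping for StaffList: keep only the Computing rows,
# drop some columns, and build name from title, fName and lName.
conn.execute("""
    CREATE VIEW StaffList AS
    SELECT staffNo,
           title || ' ' || substr(fName, 1, 1) || ' ' || lName AS name,
           phone, jobTitle, jobType
    FROM Staff
    WHERE department = 'Computing'
""")

staff_list = conn.execute("SELECT * FROM StaffList ORDER BY staffNo").fetchall()
for row in staff_list:
    print(row)
# (31210, 'Dr P Smith', None, 'Senior Lecturer', 'Academic')
# (77712, 'Mr F Rose', '8871', 'Technician', 'Technical')
```

The WHERE clause selects the rows, the column list selects the columns, and the expression using substr and || derives the name value.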
The main goal of all this is data independence – a key feature of the database approach.
Separating the external and logical views achieves logical data independence – that
is, data structures in the logical view can change without necessarily affecting the
external views. Some changes to the logical view will affect the external views, but
such changes can be kept to a minimum. Separating the logical and internal levels
achieves physical data independence – that is, the data storage structures of the inter-
nal view can change without necessarily affecting the logical view.
Logical data independence This insulates the external view from changes to
the logical view.
Physical data independence This insulates the logical view from changes to
the internal view.
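Both kinds of data independence can be demonstrated on a small scale. In this sketch, again using SQLite through Python's sqlite3 module as a stand-in, an index represents a change to the data storage structures and a new column represents a change to the logical view. The index name, the view name and the office column are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (staffNo INTEGER PRIMARY KEY, lName TEXT, jobType TEXT)")
conn.executemany("INSERT INTO Staff VALUES (?, ?, ?)", [
    (35054, "Hutchins", "Academic"),
    (89987, "Lin", "Administration"),
    (33509, "Timms", "Technical"),
])

# Physical data independence: changing a storage structure (adding an
# index) does not change what a query on the logical view returns.
QUERY = "SELECT staffNo FROM Staff WHERE jobType = 'Academic' ORDER BY staffNo"
before = conn.execute(QUERY).fetchall()
conn.execute("CREATE INDEX idx_staff_jobtype ON Staff (jobType)")
after = conn.execute(QUERY).fetchall()

# Logical data independence: adding a column to the logical view does
# not change what an existing external view shows.
conn.execute("CREATE VIEW StaffNames AS SELECT staffNo, lName FROM Staff")
view_before = conn.execute("SELECT * FROM StaffNames ORDER BY staffNo").fetchall()
conn.execute("ALTER TABLE Staff ADD COLUMN office TEXT")
view_after = conn.execute("SELECT * FROM StaffNames ORDER BY staffNo").fetchall()

print(before == after)            # True
print(view_before == view_after)  # True
```

Removing or renaming a column that a view depends on would, of course, affect that view; this is the kind of change that should be kept to a minimum.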
The ANSI/SPARC database systems architecture is concerned with managing the data-
base. It is not concerned with how the data is used, nor with how it is presented. Its
main achievement is to provide an architecture for database systems with data indepen-
dence as a built-in feature.
The internal view is the lowest level of the ANSI/SPARC database systems architec-
ture. Beneath it (in Figure 2.13) is the physical level – the actual physical files that im-
plement the data storage structures on disk. In most cases, the computer’s operating
system (such as Windows or UNIX) is used to read from and write to these physical
files. The physical level is not really part of the ANSI/SPARC database systems architec-
ture, though it is usually shown for completeness.
Like the database server, the database client includes both system software and appli-
cation software (as shown in Figure 2.17). In contrast to the server software, the appli-
cation developer does create the database client application software and this forms
the focus of much of the rest of the book.
[Figure 2.17: the database client. Components: network system software, operating system software, database connectivity software and database client application software. The database connectivity software manages communication for the database client application software.]
[Figure 2.18: two routes from database client application software to the MySQL DBMS: directly through the PHP API, or through the CLI API and the MyODBC ODBC driver.]
[Figure 2.19: UML sequence diagram ‘sd Database connection’ between a : DatabaseClient and a : DatabaseServer. Message labels recovered from the figure: Connection requested, User authenticated, Connection granted.]
connecting database client application software to the MySQL DBMS PHP API directly
and using ODBC to achieve the connection.
To round off this section, it is worth noting that, for applications written using the
Java programming language, there is JDBC. This provides similar functionality to that
of ODBC, but only for the Java language.
The client and database server communicate by sending messages. To begin com-
munication, the client must connect to the database server. This can be a complex
process, so Figure 2.19 presents a much simplified view, highlighting the main messages
sent between a database client and database server. The diagram (a UML sequence
diagram, indicated by the ‘sd’ in the top left corner) shows a particular database client
and its database server. The dashed lines below them are timelines, indicating the pas-
sage of time. The rectangles overlaying the timelines indicate whether the client or the
server currently has the initiative. Initially, the database client has the initiative, as it is
the client that must request a connection. It does this by sending a message, indicated
by the arrow labelled ‘Connection requested’. Now the database server has the initia-
tive. It may choose to grant a connection immediately, request user authentication or
even reject the request outright. In the diagram, the database server requests user au-
thentication and so the initiative passes back to the database client, which must get
the username and password from the end user. The third message sends these from the
database client to the database server, which will check them against a list of autho-
rized users before it sends a fourth message granting (or denying) the user a connec-
tion to the database server. In this case the request is granted.
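The exchange of messages can be mimicked with a toy server object. This is a sketch of the simplified sequence described above, not of any real database connection protocol; the class, the reply strings and the credentials are all made up.

```python
# Toy database server following the four messages described above.
AUTHORIZED_USERS = {"psmith": "secret"}   # made-up credentials


class ToyDatabaseServer:
    def __init__(self):
        self.connected = False

    def request_connection(self):
        # Message 1 arrives; the server replies with message 2.
        return "authentication required"

    def authenticate(self, username, password):
        # Message 3 arrives; the server replies with message 4.
        if AUTHORIZED_USERS.get(username) == password:
            self.connected = True
            return "connection granted"
        return "connection denied"


def connect(server, username, password):
    """Play the client's part and return the list of server replies."""
    replies = [server.request_connection()]
    if replies[-1] == "authentication required":
        replies.append(server.authenticate(username, password))
    return replies


server = ToyDatabaseServer()
print(connect(server, "psmith", "secret"))
# ['authentication required', 'connection granted']
```

Each call from the client hands the initiative to the server, and each reply hands it back, mirroring the alternating rectangles on the timelines in the sequence diagram.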
The computing time devoted to establishing a connection is significant, so the
communication channel is kept open, allowing many different messages to be sent.
Recall the restaurant analogy used to explain this in Chapter 1 – the DBMS keeps lis-
tening for further requests from each of its clients. Eventually, either the client re-
quests that the connection is closed (the user logs out, for example) or the server closes
the connection, perhaps because no messages have been sent for a long time.
An open connection allows the database server to keep track of all requests sent by
a particular client. When an application maintains a record of the current and past
state of its communications, it is called stateful. One useful feature of stateful applications
is the ability to roll back any changes, rather like the ‘undo’ facility of a word
processor. The client can request a series of modifications to the database instance, but
only commit (‘save’) them once all have been completed successfully. Note that the
rollback facility of database applications usually applies only to the database instance.
Any change to the database schema is generally made permanent immediately.
A series of modifications to the database instance that can only be committed if all
of them succeed (they must be rolled back otherwise) is called a transaction. For exam-
ple, if Paul Smith leaves the university, his staff contact details and associated support
sessions should be removed from the Staff Directory database. Consider what happens
if the DBMS successfully removes his staff contact details (shown greyed out in Fig-
ure 2.20), but cannot remove the data about his support sessions. The database will be
inconsistent, as shown, with a row in the SupportSession table where the value
for the staffNo column does not match the value in any (undeleted) row in the
Staff table. In this situation, the row deleted from the Staff table should be rein-
stated, using a rollback.
The messages sent between database client and server can be quite complex. If the
request is a database query, the client must specify exactly what data is required, and
the server’s response must include a copy of the requested data. The broad range of
functions offered by a DBMS discussed above must all be accessible to the client. This
means that an application protocol focused on a specific task is insufficient to support
communication between database client and server. Instead, the database client and
server use a data language to communicate.
A data language is a computer language designed to allow the manipulation of the
data and data structures in a database. It allows database users to tell the DBMS what
they want it to do without necessarily having to explain how to do it.
These languages are initially defined as part of the data model, though typically this
only includes statements to manipulate the data and data structures. When a data lan-
guage is implemented as the language used by a DBMS, it usually includes statements for
manipulating data storage structures, creating usernames and passwords for database
users, starting and stopping the DBMS software itself and so on. Most relational DBMS,
and many others, implement the SQL data language, examined in Chapters 5 and 6.
Data languages need not be full programming languages. Instead, application devel-
opers use data language statements within a program written in a full programming
language, such as Java, Visual Basic or PHP. Hence, they are sometimes called data
sublanguages. Figure 2.21 shows an SQL statement used within a fragment of PHP
code. The SQL statement is:
Figure 2.20 Deleted row in Staff table leaving unmatched row in SupportSession table.
(Figure annotations: ‘Deleted row’, ‘Unmatched row’.)

Staff
staffNo  fName        lName      phone  email              photo
10780    John         Smith      NULL   [email protected]  [BLOB]
25447    John         Smith      5104   [email protected]  [BLOB]
25448    Judith Anne  Smith      7709   [email protected]  [BLOB]
31210    Paul         Smith      NULL   [email protected]  [BLOB]
77712    Frank        Rose       8871   [email protected]  [BLOB]
14443    Helen        Abbot      8032   [email protected]  [BLOB]
23257    Freya        Stark      8660   [email protected]  [BLOB]
33935    Padma        Brar       6641   [email protected]  [BLOB]
35054    Selma        Hutchins   8706   [email protected]  [BLOB]
45965    Mikhail      Sudbin     5553   [email protected]  [BLOB]
35155    Helene       Chirac     NULL   [email protected]  [BLOB]
55776    Gurpreet     Choudhury  5454   [email protected]  [BLOB]
56893    Ruth         Bapetsi    8022   [email protected]  [BLOB]
56673    Joshua       Bittaye    7782   [email protected]  [BLOB]
89987    Dan          Lin        8514   [email protected]  [BLOB]
78893    Jo Karen     O’Connor   8871   [email protected]  [BLOB]
33509    Helen        Timms      8661   [email protected]  [BLOB]

SupportSession
staffNo  dayOfWeek  startTime  endTime
35054    Monday     09:00:00   10:00:00
31210    Wednesday  11:00:00   13:00:00
35054    Monday     15:00:00   16:00:00
45965    Monday     11:00:00   12:00:00
45965    Wednesday  11:00:00   12:00:00
23257    Monday     15:00:00   16:00:00
55776    Monday     14:00:00   16:00:00
56893    Tuesday    14:00:00   15:00:00
56893    Thursday   09:00:00   10:00:00
56673    Thursday   15:30:00   16:30:00
56673    Friday     10:00:00   11:00:00
…
$result = mysql_query("SELECT fName, lName,
                              phone, email
                       FROM Staff", $connection);
$row = mysql_fetch_array($result);
…
[Figure 2.21: an SQL statement used within a fragment of PHP code. The figure labels the PHP variables.]
Embedding data language statements in procedural application code can be less effi-
cient than running procedural code directly on the DBMS. Recognizing this, many
DBMS support stored procedures by extending the data sublanguage to include the
usual programming language constructs, such as if statements, for loops and so on.
SQL is an example of a data language that has grown into a full programming lan-
guage in this way (see Chapters 10 and 11 for a discussion of these SQL features).
Within a data language, the statements used to manipulate the database instance
form the data manipulation language (DML). These statements allow users to instruct
the DBMS to retrieve, add, modify and remove data. Clearly, the DML lets users tell the
DBMS what they want it to do with the data. With a procedural DML, users must also
tell the DBMS how to retrieve, or manipulate, the data. A non-procedural DML lets the
DBMS work this out for itself, so is simpler to learn. It’s rather like the difference be-
tween the schoolteacher who tells students exactly what to do and the university lec-
turer who makes vague statements about learning outcomes and suggests that you
read a few books.
The data definition language (DDL) is the set of statements for creating and modify-
ing the database schema. These statements allow users to create data structures and
constraints and usually include statements for dealing with data storage structures,
too. Most DBMS will include further management statements in the DDL, such as
statements to create usernames and passwords for database users, start and stop the
DBMS itself or set default character sets.
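The distinction between DDL and DML statements, and the idea of a data sublanguage embedded in a host language, can be sketched in a few lines. The sketch below is only an illustration: it uses Python and its built-in SQLite module rather than the PHP and MySQL used elsewhere in this book, the function name is hypothetical and the changed phone number is made-up sample data.

```python
import sqlite3


def demo_ddl_and_dml():
    """Run a few DDL and DML statements against an in-memory database."""
    connection = sqlite3.connect(":memory:")

    # DDL: create a data structure (part of the database schema).
    connection.execute(
        "CREATE TABLE Staff ("
        " staffNo TEXT PRIMARY KEY,"
        " fName TEXT, lName TEXT, phone TEXT)"
    )

    # DML: add, modify and remove data (the database instance).
    connection.executemany(
        "INSERT INTO Staff (staffNo, fName, lName, phone) VALUES (?, ?, ?, ?)",
        [("35054", "Selma", "Hutchins", "8706"),
         ("45965", "Mikhail", "Sudbin", "5553")],
    )
    connection.execute(
        "UPDATE Staff SET phone = ? WHERE staffNo = ?", ("8707", "35054")
    )
    connection.execute("DELETE FROM Staff WHERE staffNo = ?", ("45965",))

    # DML: retrieve data. The statement says what is wanted, not how to find it.
    rows = connection.execute(
        "SELECT fName, lName, phone FROM Staff ORDER BY staffNo"
    ).fetchall()
    connection.close()
    return rows
```

Each SQL statement is passed to the DBMS as a string from within the host program, which is the sense in which SQL acts as a data sublanguage. None of the statements explains how rows should be located, which is what makes this DML non-procedural.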
[Figure 2.22: the major software and document components of a web server (WebServer): network system software, operating system software, HTTP server, application server, log and website; association shown: LogUseOf.]
These documents, called web pages, are written using the hypertext mark-up language
(HTML), discussed briefly in Chapter 1.
At its simplest, a web server manages access to a set of web pages. In this respect, it
is similar to a file server, in this case allowing users access to a structured collection of
files called a website. Figure 2.22 shows the major software and document compo-
nents of a web server. The HTTP server is the web server application software and is
what most people call ‘the web server’. However, ‘web server’ is also used to mean a
particular computer with the appropriate software installed on it. Where the difference
is important, the term ‘HTTP server’ is used instead.
The HTTP server is the main application software that manages access to the web
pages. It does not usually provide services that allow web pages to be modified or to
modify the structure of the document collection (what folders there are). In this sense,
it has less functionality than a DBMS, which manages all aspects of its databases.
As with the DBMS, few application developers write their own HTTP server. Instead,
they use an existing HTTP server to manage access to a custom-built website. The
HTTP server also keeps a log of all requests to access documents on the website and its
responses to those requests (such as denying access).
The final software component shown in Figure 2.22 is the application server. Most
websites include dynamic web pages and these may need to be processed by a special-
ized application server on behalf of the HTTP server before the HTTP server can pass
the resource to the web client.
Most DBMS keep log files, too, but this is seen as a minor part of their functionality.
HTTP servers offer much less functionality than a DBMS and, consequently, the log-
ging function has greater prominence.
HTTP servers provide a range of functions. The most basic are the connection func-
tions, allowing the client and server to communicate. The HTTP server listens for re-
quests from web clients to establish a TCP connection (see Figure 2.9). Usually an
HTTP server is assigned TCP port 80, so only hears TCP/IP messages with destination
port 80. Note that it is the TCP system software that actually establishes the connec-
tion, but only if the HTTP server accepts the connection request. The HTTP server may
choose to reject a connection request, in which case no TCP connection is established.
For example, some HTTP servers only accept requests from authorized clients, rejecting
any connection requests from unrecognized IP addresses. (Remember, the IP address of
the client is part of the information sent in the IP datagram, so is known to the TCP
system software and, hence, also to the HTTP server.) The HTTP server then keeps track
of all open connections.
The HTTP server’s request-handling functions determine what the client is request-
ing and identify the resources required to fulfil that request. For example, a client may
request a copy of a resource from the website or to add a resource to the website or re-
move a resource from the website or it may request information about a resource or
about the web server itself.
Requests involving website resources entail locating the resource before preparing a
response. The simplest approach stores all the website’s resources in a single folder
called the document root (docroot). The combination of docroot and URL path gives a
pathname for the associated file (see Figure 2.23). For example, the docroot of the Pen-
nine University’s website is ‘/usr/local/httpd/files’. Consider a resource with URL ‘http://
www.pennine.ac.uk/StaffDirectory/search.html’. The URL path is ‘StaffDirectory/search.
html’, so the actual file is located at ‘/usr/local/httpd/files/StaffDirectory/search.html’.
A single HTTP server can manage more than one website by allocating each website
its own docroot.
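The mapping from URL to pathname described above can be sketched as follows. This is a minimal illustration in Python, not the way any particular HTTP server implements it; the function name `resolve_resource` is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_resource(docroot, url):
    """Combine a docroot and the path part of a URL into a file pathname."""
    url_path = urlsplit(url).path or "/"
    # Normalise the path so that '.' and '..' segments cannot climb
    # above the docroot, then make it relative to the docroot.
    relative = posixpath.normpath(url_path).lstrip("/")
    return posixpath.join(docroot, relative)
```

For the Pennine University example, `resolve_resource("/usr/local/httpd/files", "http://www.pennine.ac.uk/StaffDirectory/search.html")` gives the pathname quoted in the text. A server managing several websites could keep one docroot per domain name and choose between them before calling such a function.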
The HTTP server’s response-handling functions allow it to perform the requested
action. It ensures that any processing is carried out, builds the response message
and sends it. Once the response has been sent, the HTTP server closes the TCP/IP con-
nection. This is an important point. A standard TCP/IP connection is closed as soon as
the response has been sent. Once closed, any information associated with that connec-
tion is lost, including the application state. This means that the communication be-
tween web client and web server is stateless.
The HTTP server usually also has logging functions, used to maintain a log of client
requests and its responses, and security functions to allow for user authentication and
authorization.
These last complete the list of an HTTP server’s core functionality (summarized in
Table 2.2), although, as with DBMS, the more powerful HTTP servers offer additional
utility functions. These may include functions to help the web administrator manage
the site or the web server cope with surges in demand from clients.
Figure 2.23 Comparing the web URL and the pathname of a web resource.
Websites may include a range of resources. Requests for different kinds of resource need
to be processed in different ways. The HTTP server must know how to process requests
for each type of resource it manages. The processing can be quite simple. For example,
processing the request to view a static web page or image file merely involves taking a
copy of the requested file. However, some web pages include dynamic content. It is im-
portant to understand that dynamic content does not mean animated content. A
video clip includes moving pictures, but that is not dynamic content. All the data
needed by a media player to play the video clip is included in the video clip file – it, in
fact, has only static content. However, a web page that draws some of its content from
a database has dynamic content as the actual file stored on the website does not in-
clude all the data needed to display the web page.
There are two kinds of dynamic content that can be included on a web page:
Figure 2.24 Relationships between the web client, HTTP server and application server.
[Figure 2.24 components: web client, HTTP server and application server, linked by the RequestsScriptFrom and ProcessesScriptFor associations. Panel (b): HTTP and application servers dealing with a request for a dynamic web page (request processing).]
surrounding static content. The resulting HTML file is sent back to the HTTP server,
which delivers it to the web client.
Thus, when an application server is involved in generating web pages, the web ap-
plication already uses a three-tier architecture (Figure 2.24 (a)). Interestingly, in this
three-tier architecture there will be one HTTP server in the second tier, but there may
be many application servers in the third tier.
The HTTP server needs some mechanism to recognize which web pages need pro-
cessing by which application server. The simplest approach is to have a separate folder
under the docroot associated with each available application server. The HTTP server
knows to pass resources from these folders to the associated application server. Alter-
natively, the file extension can identify files that require special processing.
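Both mechanisms, a dedicated folder and a file extension, amount to a simple lookup. The sketch below shows one possible way to express that lookup in Python; the folder name, the extensions and the handler names are all hypothetical and are not taken from any real HTTP server's configuration.

```python
import posixpath

# Hypothetical configuration: which application server handles what.
HANDLERS_BY_FOLDER = {"cgi-bin": "cgi-program"}
HANDLERS_BY_EXTENSION = {
    ".php": "php-application-server",
    ".cfm": "coldfusion-application-server",
}


def choose_handler(url_path):
    """Decide which application server, if any, should process a resource."""
    first_folder = url_path.strip("/").split("/")[0]
    if first_folder in HANDLERS_BY_FOLDER:
        # The resource lives in a folder reserved for one application server.
        return HANDLERS_BY_FOLDER[first_folder]
    extension = posixpath.splitext(url_path)[1].lower()
    # Anything unrecognized is treated as a static file served directly.
    return HANDLERS_BY_EXTENSION.get(extension, "static-file")
```

With this configuration, a request for `/StaffDirectory/SearchName.php` would be passed to the PHP application server, while `/StaffDirectory/search.html` would be served as a static file.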
Augmenting the web server with one or more application servers is the more pow-
erful and flexible solution to the problem of server-side dynamic content. It is more
powerful because application servers can be built to offer as much, or as little, addi-
tional functionality as is required. They offer flexibility as the website developer can
bolt on additional application servers with the required processing capabilities with-
out necessarily having to rewrite existing web pages.
Two issues must be addressed in order to implement this approach effectively. The
first is to decide how to define the server-side dynamic content of a web page. HTML is
not a programming language, so cannot be used for this purpose. One approach is to
use a standard programming language, such as Java or C++, to write a program that
actually writes out the HTML to a new file, line by line. This program becomes the
website resource and the programming language’s runtime environment is the appli-
cation server. Because a program can easily generate dynamic content, this approach is
very powerful. Unfortunately the program must also generate all the static content
and any client-side dynamic content.
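The drawback is easy to see in a small sketch. The Python function below (a hypothetical example, not code from this book) writes out a complete web page line by line; only two of its lines contain dynamic content, yet every line of static HTML still has to be produced by the program.

```python
from html import escape


def staff_page(name, phone):
    """Write out a complete HTML page, line by line, for one member of staff."""
    lines = [
        "<html>",
        "<head>",
        "<title> Pennine University Staff Directory </title>",
        "</head>",
        "<body>",
        "<h1> Staff Details </h1>",
        # The only dynamic content: values that would come from the database.
        "<p>Name: " + escape(name),
        "<br />Phone: " + escape(phone),
        "</p>",
        "</body>",
        "</html>",
    ]
    return "\n".join(lines)
```

Calling `staff_page("Selma Hutchins", "8706")` returns the whole page as a single string. The call to `escape` is an addition for safety, ensuring that characters such as `<` in the data are not mistaken for HTML tags.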
An alternative is to extend the HTML language, adding tags that turn it into a pro-
gramming language. These non-standard tags allow for standard programming lan-
guage capabilities – assigning variables, writing ‘if’ statements and loops and so on.
The Macromedia® ColdFusion® web development environment takes this approach,
with the ColdFusion application server dealing with non-standard tags on behalf of
the HTTP server.
A third approach, and the one taken in this book, is to interleave static HTML with
program code. Unlike a Java or C++ program, the program code is not compiled. Instead,
the application server interprets the code as it is written by the application
developer. This is known as scripting.
There are several different scripting languages, but all work in the same way. Look at
the two files shown in Figure 2.25. On the left is a PHP script. The first section is pure
HTML, followed by a section of PHP (and some SQL!) enclosed within the special tag
'<?php...?>'. In the next section, some lines include both HTML and PHP. The
final section is pure HTML. On the right is an example of the HTML generated when
the PHP application server processes this script. It is this HTML file that is sent to
the web client, not the PHP script. Notice how the final PHP statement (circled in
Figure 2.25) has generated some dynamic content – in this case, Selma Hutchins’ e-
mail address – that has been merged with the static content to produce a line of pure
HTML. Chapter 4 introduces PHP scripting and it’s used in subsequent chapters to
solve a number of problems that arise when developing web database applications.
The second, more technical, issue is how to define the communication interface
between the HTTP server and the application servers. As one HTTP server may need to
communicate with many different application servers, the obvious solution is to develop
a suitable application programming interface (API) for the HTTP server. Then, each
application server can be written to use this API.
Probably the most widely used HTTP server API is the common gateway interface
(CGI). The CGI API defines how an HTTP server passes web resources to the application
server. In particular, it provides a mechanism for passing the value of any variables
(such as data entered by the end user in an HTML form) from the HTTP server to the
application server.
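As an illustration of this mechanism, data sent with a GET request reaches a CGI program in an environment variable called QUERY_STRING. The Python sketch below reads that variable and splits it into named values; the function name `read_form_values` is hypothetical, and a real CGI program would also need to write a response to its standard output.

```python
import os
from urllib.parse import parse_qs


def read_form_values(environ=None):
    """Return the form values that the HTTP server passed via CGI."""
    # The HTTP server places the URL's query string in QUERY_STRING.
    environ = os.environ if environ is None else environ
    query_string = environ.get("QUERY_STRING", "")
    # parse_qs returns a list of values per name; keep the first of each.
    return {name: values[0] for name, values in parse_qs(query_string).items()}
```

The optional `environ` argument exists only so that the function can be tried out without an HTTP server, for example `read_form_values({"QUERY_STRING": "surname=Smith&initial=J"})`.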
One weakness of the CGI API is that the HTTP server and application server must be
installed on the same machine, which may not be suitable for some web applications.
Also, the CGI API is mainly used by those writing their own application servers, using
a standard programming language (the first approach discussed above). It has the ben-
efit of portability as, once an application developer is skilled in using the CGI API, he
or she can work with most HTTP servers. The downside, however, is that you need to
learn the API.
[Figure 2.25: a PHP script and the HTML generated when the PHP application server processes it. In the figure, the sections of the script are labelled, in order, ‘Pure HTML’, ‘Pure PHP’, ‘HTML and PHP’ and ‘Pure HTML’.]

PHP script:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<title> Pennine University Staff Directory </title>
</head>
<body>
<!--Need to retrieve the staff contact details from the database -->
<?php
include '../MySqlConnection.inc'; //Sets up the database connection

$sql = "SELECT staffNo, fName, lName, phone, email
        FROM Staff
        WHERE staffNo = '" . $_GET["staffNo"] . "'";

$result = @ mysql_query($sql, $connection);
$row = mysql_fetch_array($result);
?>

<h1> Staff Details </h1>

<p>Name: <?php echo $row["fName"]. " ". $row["lName"];?>
<br />Phone: <?php echo $row["phone"] ?>
<br />Email: <?php echo $row["email"];?>
</p>
<hr />
<p>Author: <cite>Martyn Prigmore</cite>
<br />©2005 and beyond
</p>
</body>
</html>

HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
  "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<title> Pennine University Staff Directory </title>
</head>
<body>
<!--Need to retrieve the staff contact details from the database -->
<h1> Staff Details </h1>

<p>Name: Selma Hutchins
<br />Phone: 8706
<br />Email: [email protected]
</p>
<hr />
<p>Author: <cite>Martyn Prigmore</cite>
<br />©2005 and beyond
</p>
</body>
</html>

In contrast, scripting languages come with ready-made application servers, which
are often closely integrated with a range of HTTP servers by means of a server-specific
API. The application developer does not need to learn the details of the API, but uses a
range of functions and facilities provided by the scripting language to get data from
and pass it back to the HTTP server. In effect, someone else has already done the hard
work of developing a nice, easy-to-use layer of functions above the API. This makes
scripting languages easier to learn and use than the combination of CGI and a stan-
dard programming language and they certainly offer more built-in functionality. In
particular, they usually support database connectivity, as well as HTTP server connec-
tivity. So, learning just one scripting language allows the application developer to
build web database applications. These benefits make scripting languages a good
choice for first-time web database application developers.
The role of the web server is to store the web resources, and the role of the web
client is to use them. Figure 2.26 shows a typical web client. The main application soft-
ware on the web client is the web browser. It uses the network system software to es-
tablish a connection to the HTTP server, request web resources and display them to the
end user. Most of these resources are web pages and images, which the web browser is
able to display itself.
Some resources need plug-ins. Examples of plug-ins include the Macromedia®
Flash® Player software and the Adobe® Acrobat® Reader® software. These play a similar
role to application servers on the web server, providing additional functionality to that
offered by the web browser itself. While some web database applications make use of
resources that require plug-ins, they are not essential, so won’t be considered further.
Be aware that they form a part of the web client architecture, though.
Plug-ins are not the same as browser script. Browser script is like the server-side
scripting discussed above. Program code is interleaved with static HTML, only, in
this case, the program code is sent to the web browser along with the static HTML.
The web browser recognizes and interprets the browser script code itself – there is no
need for separate application servers on the web client to interpret the browser
script. Browser script can’t connect to databases, so is not directly relevant to the
topic of this book. For simplicity and clarity, browser script isn’t used in the exam-
ples or exercises.
[Figure 2.26: a typical web client (WebClient): network system software, operating system software, web browser and plug-in.]
One last point about Figure 2.26. There are web clients that do not include web
browsers. For example, spiders explore websites automatically and can build an index
of keywords. Search engines use spiders to keep their indices up to date. There is a
range of such automated web clients, but the focus here is on the web browser and
how it can provide end users with access to a database.
The main function of a web browser is to display web pages. The language used to
define the structure and content of a web page is the hypertext mark-up language
(HTML), as already mentioned. The structure is defined using elements. Each element
contains a particular kind of content and is delimited by a start and an end tag. For ex-
ample, the title element contains the document’s title. Its start and end tags are
'<title>...</title>' (in the document head of Figure 2.27). The start and end
tags for all elements follow this same format – the element name enclosed in angle
brackets, with a forward slash distinguishing the end tag from the start tag. So, for ex-
ample, the body element will have start and end tags '<body> ... </body>'.
Every valid HTML document has the same basic outline as that highlighted in Fig-
ure 2.27. First, a document type declaration indicates which version of HTML is being
used (in Figure 2.27, it is XHTML 1.1).
The actual HTML document follows the document type declaration, enclosed in the
‘html’ element, and split into a document head and a document body.
The document head contains information about the document and is enclosed in
the head element. Every document head must include a title element. Most web
browsers display the title in the browser window’s title bar.
[Figure 2.27: an HTML document, with the document type declaration, document head and document body highlighted.]
The body element encloses the document body, which is the actual content of the
web page. The document body can include a range of elements to define the sort of
content on the web page. For example, in Figure 2.27, the '<p>...</p>' tags de-
limit paragraph elements, the '<h1>...</h1>' tags delimit a level-one heading
and the '<a...>...</a>' tags delimit an anchor (that is, a hyperlink). The start
tag for the anchor element includes an ‘href’ attribute. Attributes are name-value pairs
that provide additional information about the element – in this case, the URL of the
document that the hyperlink links to.
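One way to see elements, tags and attributes at work is to let a program read a small HTML document and report what it finds. The sketch below uses the HTML parser from Python's standard library; the sample document and the names `OutlineParser` and `outline` are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser

SAMPLE = """<html>
<head><title>Pennine University Staff Directory</title></head>
<body>
<h1>Staff Details</h1>
<p>Return to the <a href="http://www.pennine.ac.uk/">home page</a>.</p>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Record each start tag, the title text and any hyperlink URLs."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "title":
            self._in_title = True
        if tag == "a":
            # Attributes arrive as name-value pairs; href holds the URL.
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def outline(html_text):
    """Return (title, start tags in document order, hyperlink URLs)."""
    parser = OutlineParser()
    parser.feed(html_text)
    parser.close()
    return parser.title, parser.start_tags, parser.links
```

Calling `outline(SAMPLE)` reports the start tags in document order, which reflects the nesting of head and body inside the html element, together with the title text and the URL held in the anchor's href attribute.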
Some elements, like the html, head and body elements, can only contain other ele-
ments. They define the document’s high-level structure. The title, paragraph, heading
and anchor elements can all contain text (as well as certain other elements). This text
forms the document content. The web browser will present the text in a different for-
mat for different elements. For example, text in a level-one heading element is usually
black text, displayed bold and in a large font size, while text in an anchor element is
underlined and in a different colour (often blue). Users can instruct their web browser
to present each element’s content in a particular way, so the same document can look
quite different to two different users.
Cascading stylesheets (CSS) control presentation and, together with HTML, provide
web browsers with all the information they need to handle web pages.
There is only room for this brief sketch of HTML here. Readers not familiar with HTML
should consult one of the sources in the Further reading section below for more infor-
mation. The examples used in this book use plain HTML and the web browser’s
default presentation, rather than user-defined style sheets, so there is no need to
understand CSS in this book.
The web browser and web server communicate using an application protocol – the
hypertext transfer protocol (HTTP). HTTP plays a similar role in web communication
to the role played by SQL in database communications. It allows the client to request
specific services – called HTTP methods – from the server. For example, the HTTP GET
method allows a client to get a copy of a particular resource from the server. HTTP was
designed to utilize the TCP/IP suite of internetworking protocols. It sits in the top layer
of the protocol stack shown in Figure 2.7.
The main function of HTTP is to allow a web browser to request a web page and the web
server to respond by sending a copy of that page. The process of splitting the web
page into bite-sized chunks, sending these across the Internet and reassembling them
at the other end is performed by the network system software (TCP/IP).
Figure 2.28 illustrates what happens when a web browser requests a web page.
First, the web browser asks its local network system software to establish a connection
with the server. This involves creating and then sending a connection request
across the Internet to the server. The client network system software creates a TCP
socket and sends a TCP connection request to the server. The server’s network system
software receives the request and asks the HTTP server to either accept or reject
it. In Figure 2.28, the request is accepted and this message makes its way back across
the Internet to the web browser. At this point, the web browser sends the request for
the actual web page. Again, this goes via the local network system software. The request
is broken up into IP packets, ready to be sent across the Internet (note that the
connection request and response messages are also sent as one or more IP packets,
which is not shown on the diagram). The IP packets are reassembled into the original
request message by the server’s network system software and the message forwarded
to the HTTP server. All this occurs before the HTTP server even begins to
process the request.
[Figure 2.28: sequence diagram ‘HTTP request’. Messages shown, in order: Request connection; Create client socket; Request connection; Approve connection; Connection approved; Create connection; Confirm connection; Create IP packets; Send IP packets; Reassemble IP packets.]
There are just two kinds of message in HTTP. An HTTP request message is created
by a web client and sent to a web server. The web server then creates an HTTP
response message to send back to the web client. As indicated in Figure 2.28, HTTP
messages are actually sent as a number of IP packets, each of which includes the TCP
connection to identify which application generated the message and which application
it is targeted at. The interactions between the client and server network system software
shown in Figure 2.28 are invisible to the web browser and HTTP server, which
simply see the HTTP messages themselves.
[Figure 2.29: the structure of an HTTP message: a start line, a header section in which each header has the form ‘Header name : Header value CRLF’, a blank line (CRLF) and an optional entity body.]
Both kinds of HTTP message have the same structure (see Figure 2.29). The first
line is the start line, which always ends with the carriage return and linefeed (CRLF)
characters. Following the start line is the header section, with zero or more headers,
each terminated by a CRLF. The header section always ends with a blank line, even if
there were no headers in the message. Finally, the entity body includes the actual data
being sent in this message. The entity body is optional and many request messages
have no entity body. This is because many request messages are simple requests to
view a static web page, so there is no need to pass data to the HTTP server. In contrast,
most response messages do include an entity body – the HTML source of the requested
web page.
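This structure is regular enough to be taken apart with a few string operations. The following Python sketch is illustrative only: the function name `parse_http_message` is hypothetical and real HTTP software handles many cases that are ignored here. It splits a message into its start line, headers and entity body.

```python
def parse_http_message(raw):
    """Split raw HTTP message bytes into (start line, headers, entity body)."""
    # The blank line (CRLF CRLF) separates the header section from the body.
    head, _, entity_body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Each header is a name-value pair separated by a colon.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, entity_body
```

A request for a static web page typically yields an empty entity body, whereas a response yields the HTML source of the page. Header names are converted to lower case in this sketch because HTTP treats them as case-insensitive.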
The header section may contain zero or more headers. Each header takes the form
of a name-value pair separated by a colon – ‘:’ – as indicated in Figure 2.29. General
headers provide information about the message itself. For example, the date header
gives the date and time that the message was created. Entity headers describe the enti-
ty body or, when the entity body is missing, the requested resource. For example, the
allow header lets a server tell the client what methods are allowed to be used on this
resource, while the content-length header gives the size of the entity body. General
and entity headers may be used in both request and response messages.
Request headers are used only with request messages. They give the server informa-
tion about the client and its capabilities or about the request message itself. For exam-
ple, the accept header lets the server know what media types the client can accept and
the host header holds the domain name and port of the website the request message is
targeted at – useful if the HTTP server manages more than one website.
Response messages also have their own headers – the response headers. These can
give information about the response message itself or the HTTP server sending the
message. For example, the server header includes the name and version of the HTTP
server application software and the title header is used when the requested resource is
a web page, and includes the content of the HTML title element.
The start line for a request message has the structure shown in Figure 2.30 (a). First
comes the name of the HTTP method, followed by a single space and then the URL
of the requested resource to apply the method to. Another space is followed by the
HTTP version, indicating which version of HTTP the client has used to build the
request message.
A response line (Figure 2.30 (b)) begins by stating which version of HTTP was used
to build the response. Typically, both client and server would use HTTP 1.1. The
response line continues with a single space, then the status code, which indicates what
happened when the requested action was performed. The reason phrase is a descrip-
tion of the status. The start line of a typical request message might be:
is not considered a safe method. It usually passes data to an application server (see Fig-
ure 2.22), which may initiate changes on the web server other than simply logging the
request. For example, it may modify a database.
A third method that can be useful to web application developers is the HEAD
method. This is very similar to GET, but only the start line and headers are included in
the response message. The entity body is omitted. This allows the client to check
whether it really wants to request this resource or not. For example, the resource might
be in a media type that the web client cannot process, or the resource may be too big.
The HEAD method is considered safe.
Table 2.3 summarizes the safety and purpose of these three methods.
There are 41 status codes defined in the HTTP/1.1 specification. Table 2.4 lists some
of the common ones. Some status codes have headers associated with them. For example,
a response message with the status code ‘301 Moved Permanently’ informs the client that
the requested resource has moved from the location specified by the request URL. The
response message should include a location header, which gives the URL at which the
requested resource can now be found. Typically, the web client will send a second re-
quest message using this new URL.
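The client's side of this exchange can be sketched as a small decision. In the Python fragment below, the function name `follow_up_url` is hypothetical and only status code 301 is handled; a real web client deals with several other redirection codes as well.

```python
from http import HTTPStatus


def follow_up_url(status_code, headers):
    """Return the URL for a second request, or None if none is needed."""
    if status_code == HTTPStatus.MOVED_PERMANENTLY:  # status code 301
        # The location header gives the resource's new URL.
        return headers.get("Location")
    return None
```

Python's standard library also records the reason phrase that accompanies each status code, so `HTTPStatus(301).phrase` can be used to look up the phrase for code 301.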
Although both database applications and web applications can be developed with a
two-tier client-server architecture, web database applications require a three-tier archi-
tecture (see Figure 2.2). Figure 2.31 shows the three-tier architecture for web database
applications. The major components of the web server and database server, discussed
above, are included in the model. Notice that the ‘RequestData’ association in Fig-
ure 2.31 has the name ‘DatabaseClient’ at one end. This is simply a means of indicat-
ing that the application server is playing the role of a database client when it interacts
with the DBMS. For example, it may ask the DBMS to retrieve data from the database,
then integrate that data into a skeleton web page before the final web page is passed to
the HTTP server, ready to be sent to the web client.
Figure 2.31 highlights two problems for web database applications. First, the
DBMS only communicates directly with the web server, not with the web client,
which represents the real end users. If the DBMS is to distinguish between its many
different end users, then the web server must maintain a separate connection to the
DBMS for each web client. Notice that any database application with a three-tier archi-
tecture suffers from this problem.
One solution is to include in the middle tier some software known as a transac-
tion processing monitor. The transaction processing monitor keeps track of individ-
ual users, rather than the DBMS doing so. It maintains a separate connection with
each client and a pool of connections to the DBMS. This allows it to keep track of
requests from different clients and ensure that responses from the database server are
routed to the correct client. Transaction processing monitors can handle many hun-
dreds of concurrent users. However, there may be hundreds of thousands of web
clients accessing a web server at the same time. Consequently, in the web environment,
transaction processing monitors are not yet a viable general solution to this
problem.
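The idea of a pool of connections shared between many clients can be sketched in a few lines of Python. This is a much-simplified illustration, not a transaction processing monitor: the class name `ConnectionPool` is hypothetical, and in-memory SQLite databases stand in for connections to a real DBMS.

```python
import queue
import sqlite3


class ConnectionPool:
    """Share a small, fixed number of DBMS connections among many clients."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            # Stand-in for opening a connection to the real DBMS.
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    def run_query(self, sql, params=()):
        """Borrow a connection, run one query and hand the connection back."""
        connection = self._idle.get()  # waits if every connection is busy
        try:
            return connection.execute(sql, params).fetchall()
        finally:
            self._idle.put(connection)

    def idle_count(self):
        """Number of connections currently waiting to be borrowed."""
        return self._idle.qsize()
```

With `pool = ConnectionPool(2)`, any number of client requests can call `pool.run_query(...)` in turn, yet the DBMS never sees more than two connections. Routing each response back to the correct client, which is the other job the text assigns to the middle tier, is not shown.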
The second problem concerns communication protocols. The channel of communication
between the web client and the HTTP server is stateless. Once the HTTP server
has responded to the web client’s request, the connection is closed. However, a DBMS
expects to maintain an open connection with its end users. Even if it maintains an
open connection with the web server, it cannot maintain an open connection with
the real end users, represented by the web clients. (This is another reason for transaction
processing monitors not being easy to use in web database applications.)
[Figure 2.31: the three-tier architecture for web database applications. Components: web client; WebServer (HTTP server, application server); DatabaseServer (DBMS, database). Associations: RequestWebPage and RequestData, the latter with the role name DatabaseClient at one end.]
The design and development of web database applications must take account of
both these problems. Techniques for handling them effectively are discussed in the
following chapters.
Most web database applications are designed to allow an end user to access a data-
base via a web browser. It is worth tracing what happens to the data as it passes from
user to database server and back again. Consider the HTML form in Figure 2.32. This
form allows an end user to type the surname and first initial of a member of staff and
search the database for their contact details. The form element’s action attribute is the
URL ‘http://www.pennine.ac.uk/StaffDirectory/SearchName.php’. When the user clicks
on the ‘Submit Query’ button, the web browser’s first task is to extract the domain
name and translate it into an IP address. Each web browser keeps a local cache of such
translations, for URLs that have been accessed regularly by the end users. If the IP ad-
dress for ‘www.pennine.ac.uk’ is not in the local cache, the web browser will use the
domain name system to look it up. As discussed above, this involves sending a mes-
sage to the ‘.uk’ name server. Once the domain name has been resolved into an IP ad-
dress, the web browser opens a TCP connection to the HTTP server. At this point, the
actual HTTP request message can be sent.
HTML forms can use either the GET or POST methods to send data to the HTTP
server. The form in Figure 2.32 uses GET, so the web browser creates the HTTP request:
GET www.pennine.ac.uk/StaffDirectory
/SearchName.php?surname=Smith&initial=J HTTP/1.1
(Note that the HTTP request line will be a single line. Here, it has been split over two
lines to fit it on the page.)
The form data has been copied into name-value pairs in the query string of the URL.
In this simplified example there are no headers and no entity body (in practice, an
HTTP/1.1 request must carry at least a Host header). The HTTP request message is
sent, using TCP/IP, to the HTTP server.
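The way form data ends up in the query string can be sketched with Python's standard library. This is an illustration only: a web browser does this work itself, and the variable names below are assumptions made for the example.

```python
from urllib.parse import urlencode, urlsplit

# The two form fields from the example and the form's action URL.
form_data = {"surname": "Smith", "initial": "J"}
action = "http://www.pennine.ac.uk/StaffDirectory/SearchName.php"

# Each field becomes a name=value pair; pairs are joined with '&'.
query_string = urlencode(form_data)
request_url = f"{action}?{query_string}"

# The domain name is the part that must be resolved to an IP address.
parts = urlsplit(request_url)

# Characters with a special meaning in URLs are percent-encoded.
awkward = urlencode({"surname": "O'Neil"})
```

Here query_string holds the same name-value pairs that appear after the question mark in the request line above, and parts.netloc holds the domain name.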
When the HTTP server receives this message, it recognizes that the requested re-
source is a PHP script, not a plain HTML file. The HTTP server delegates the task of run-
ning the PHP script to an application server. It tells the application server the script file’s
location in the local file system – in our example, the file is called SearchName.php
and it is located in the StaffDirectory folder under the docroot. The HTTP server
also passes the name-value pairs from the URL of the HTTP request message. The appli-
cation server now runs the PHP script, using the name-value pairs as input parameters.
The SearchName.php script opens a connection to the DBMS on the database
server (see Figure 2.19) and uses an SQL query to ask the DBMS which staff have the
surname ‘Smith’ and first initial ‘J’. The DBMS searches the database and finds
three matching rows in the Staff table (see Figure 2.14). It passes these three rows to
the application server, which uses the data to create a web page listing these members
of staff. The application server passes the web page to the HTTP server, which sends it
as the entity body of an HTTP response message to the web browser. Finally, the HTTP
server closes the TCP connection. This process happens each time data is submitted to
an HTTP server and used to query or modify a database.
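The database step of this round trip can be sketched as follows. The book's script is written in PHP; Python and an in-memory SQLite database are used here purely as stand-ins, and the column names (staffNo, fName, lName, phone) and the sample rows are assumptions, not the actual Staff table.

```python
import html
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Staff (staffNo TEXT, fName TEXT, lName TEXT, phone TEXT)"
)
# Hypothetical sample data: three staff match surname 'Smith', initial 'J'.
connection.executemany(
    "INSERT INTO Staff VALUES (?, ?, ?, ?)",
    [
        ("10001", "Jane", "Smith", "5550"),
        ("10002", "John", "Smith", "5551"),
        ("10003", "Judith", "Smith", "5552"),
        ("10004", "Paul", "Smith", "5553"),
        ("10005", "Joe", "Jones", "5554"),
    ],
)


def search_staff(connection, surname, initial):
    """Use the name-value pairs from the request as query parameters."""
    sql = (
        "SELECT fName, lName, phone FROM Staff "
        "WHERE lName = ? AND substr(fName, 1, 1) = ? ORDER BY fName"
    )
    return connection.execute(sql, (surname, initial)).fetchall()


rows = search_staff(connection, "Smith", "J")

# Integrate the rows into a skeleton web page fragment.
items = "".join(
    f"<li>{html.escape(first)} {html.escape(last)}: {html.escape(phone)}</li>"
    for first, last, phone in rows
)
page = f"<ul>{items}</ul>"
```

Passing the values as parameters, rather than pasting them into the SQL text, keeps user input separate from the query itself.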
Chapter summary
■ The aim of this chapter has been to describe the concepts and technologies that
underlie web database applications. As web database applications are always client-
server applications, the first section discussed the client-server architecture in gen-
eral and the final section showed how this concept applies to web database
applications. Just as the client-server architecture forms a conceptual basis for web
database applications, networks and network technologies form the technical
basis. Without networks, there would be no World Wide Web (though there could
still be databases). It is hard to discuss networks without getting a little technical,
but understanding how networks work is well worth the effort.
■ The rest of the chapter introduced the two technologies that are explored in the
rest of the book – namely, database and web technology.
Further reading
All the above discuss the OSI seven-layer reference model for layered network architec-
tures. If this is not familiar to you, don’t worry, as, while important, it is not essential
knowledge for database specialists.
Berners-Lee (1994) is the original Internet ‘request for comments’ (RFC), detailing the
URL syntax, with Fielding (1995) describing the authoritative URL syntax for relative
URLs.
The classic reference for databases is Date (2004), with a critical discussion of standard
approaches and technologies. Elmasri and Navathe (2007) and Connolly and Begg
(2004) are good alternative texts. Much of the material covered in this book is also cov-
ered in these books, although web topics tend to be dealt with in isolation from the
main text.
Though now quite old, Wilde (1999) provides a good general introduction to Web and
Internet technologies. Gourley and Totty (2002) is a good reference for HTTP and the
web architecture, while Ince (2002) covers Java technologies and web robots.
Review questions
2.3 Define the terms internet, Internet, intranet and extranet, highlighting the similarities
and differences between them.
2.4 What is a protocol? Explain the purpose of each of the following protocols:
(a) IP
(b) TCP
(c) HTTP.
2.7 In the ANSI/SPARC architecture for database systems, what is meant by:
(a) logical data independence
(b) physical data independence?
2.9 Explain what a mark-up language is, clearly distinguishing between the terms
‘attribute’, ‘element’ and ‘tag’.
2.10 Explain why each of the four numbers in an IP address lies between 0 and 255.
Exercises
2.11 Describe the purpose of application partitioning and the different responsibilities
involved. Explain, using a diagram, how application partitioning in a two-tier data-
base application would differ from that in a three-tier database application.
2.12 Briefly describe the function of a bridge in a local area network. A router performs a
similar role in an internet, but functions differently. Describe this difference.
2.13 Explain the addressing mechanism used on the Internet. You should include an
explanation of the domain name system and uniform resource locator.
2.14 Describe the functions of a DBMS. Why does a DBMS vendor include utility func-
tions in its product?
2.15 Describe, with the aid of a diagram, the ANSI/SPARC database systems architec-
ture. You should explain the purpose of each of the various views, and the mappings
between them. What is the main purpose of this architecture?
Investigations
2.17 The database approach to storing data has a number of advantages over the file-
based approach (discussed in Chapter 1). Investigate how the functions of the
DBMS ensure that these advantages are realized in practice.
of web database applications compared to rich client (that is, traditional GUI) data-
base applications. One approach to this question is to consider what happens at the
application protocol level when a user submits data that is used to modify a data-
base instance, for both web and rich client database applications.
2.19 Using a diagram, explain the three-tier architecture for web database applications.
You should explain the purpose of each of the three tiers and the separation of appli-
cation functionality across the tiers. How useful is this approach in practice?
2.20 Most contemporary approaches to web development make use of the model–view–
controller design pattern. Investigate this approach to application development, its re-
lationship to the three-tier architecture and its use by PHP web application developers.
3 Data models
Chapter objectives
Chapter outline
1 a set of structures that are used to organize data about things that the end user
is interested in and which can capture relationships between data held in differ-
ent structures
2 a language that the end user can use to manipulate structures and the data they
contain
3 a collection of integrity constraints that limit what data can be held or how that
data can be changed.
The structures a data model provides to organize data and describe relationships
should be sufficiently flexible to represent any information a database user might wish
to keep. They should be sufficiently simple, too, so that software engineers can create
a DBMS capable of managing these structures efficiently. This need for a balance be-
tween flexibility and simplicity means that there is only a small number of widely
used data models. The most widely used of these is the relational data model and the
one currently generating most interest is the semi-structured data model of XML.
Both these data models are covered in detail in this chapter.
Each data model has at least one data language defined for it. Some have more than
one. The relational data model, for example, has two formal languages, the relational
■ Here, Section 3.1 discusses data items and how they are represented on a computer.
The discussion centres on database applications, but applies more generally to all
computer software. It also discusses data types and their operators – the first steps
towards a data model and its data language respectively.
■ Section 3.2 discusses the relational data model. It describes the table – the structure
used to organize data – in some detail. The consequences of allowing columns to be
null are explored. Detailed discussion of the data language is postponed to Chap-
ters 5 and 6, but the main relational integrity constraints are explained and some
consequences of enforcing them explored.
■ Section 3.3 discusses XML, which can be viewed as a semistructured data model.
Data items represent some fact about a thing of interest to an organization. For exam-
ple, in our case study, they can represent the phone number of a member of staff or
the day and start time of a support session. In a database, data items are stored on a
computer as values. Each value represents some fact about a particular thing. The facts
‘Paul Smith is the name of a member of staff’ and ‘[email protected] is Paul
Smith’s e-mail address’ are represented by storing appropriate values in the appropriate
data structure.
When a human being enters data into a database, he or she passes the value as a
literal. A literal is a human-readable representation of a particular value. The literals
‘21 January 2005’, ‘21-JAN-2005’ and ‘2005-01-21’ all represent the same value: the
twenty-first of January in the year 2005 of the common era. Even this simple example
shows that there can be many literals representing the same value. The same is true of
numbers. Thus, 1.5 is the same as 1 1⁄2 which is the same as 3/2. In each case, the actu-
al value stored on the database would be the same, though the data model does not
specify the details of these internal storage formats.
Most DBMS have a preferred format for literals. Numeric literals are usually decimal
numbers, not fractions. Literals for character strings are natural language text. Typically,
character string literals are enclosed in single or double quotes to distinguish them
from any surrounding text. For example, ‘abc xyz’ is a character string literal with
seven characters – six letters and a single space character. This raises the interesting
problem of how to store the name Rosalind O’Neil without the apostrophe being
mistaken for the ‘end of string’ single quote. The SQL language solves this problem by
the convention that two consecutive single quote marks represent an apostrophe, so the
literal 'Rosalind O''Neil' will be stored as the data value Rosalind O’Neil.
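The doubled-quote convention is easy to try out. The sketch below uses Python's sqlite3 module as a convenient SQL engine; the one-column Staff table is a hypothetical example, not the case study schema.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Two single quotes inside the literal stand for one apostrophe in the value.
stored = connection.execute("SELECT 'Rosalind O''Neil'").fetchone()[0]

# Passing the value as a parameter avoids writing the literal by hand at all.
connection.execute("CREATE TABLE Staff (name TEXT)")
connection.execute("INSERT INTO Staff VALUES (?)", ("Rosalind O'Neil",))
via_parameter = connection.execute("SELECT name FROM Staff").fetchone()[0]
```

Both routes store the same value, with a single apostrophe.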
The preferred format for date literals varies. The Oracle DBMS prefers ‘21-JAN-2005’,
while the MySQL DBMS prefers ‘2005-01-21’. This is because there is no internationally
agreed format for date literals. For example, in the USA, ‘3/4/2005’ is March the fourth
2005, while in Europe it is the third of April 2005. The DBMS data language will pro-
vide ways to convert date literals from one format to another. The application
programmer can use these to ensure that dates entered by the end user are correctly in-
terpreted by the DBMS. Again, date literals are enclosed in single quotes.
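The difference between a literal and the value it represents can be illustrated with Python's datetime module, used here only as a stand-in for a DBMS's own conversion functions. The format strings are assumptions chosen to match the literals in the text, and parsing month names assumes the default (English) locale.

```python
from datetime import date, datetime

# Three literals, each paired with the format needed to interpret it.
literals = {
    "21 January 2005": "%d %B %Y",
    "21-JAN-2005": "%d-%b-%Y",
    "2005-01-21": "%Y-%m-%d",
}
# Collecting the parsed values in a set shows they are all the same value.
values = {datetime.strptime(text, fmt).date() for text, fmt in literals.items()}

# The same literal read with the US and the European conventions.
us_reading = datetime.strptime("3/4/2005", "%m/%d/%Y").date()
european_reading = datetime.strptime("3/4/2005", "%d/%m/%Y").date()
```

The set values contains a single date, while the ambiguous literal ‘3/4/2005’ yields two different dates depending on the format the programmer chooses.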
A variable is a named container for data values. At any one time, a variable holds a
single value, but the value can be changed. Consider the following database query put
to the Staff Directory database:
Here, staffSurname is a variable that can take the value of any surname. The Staff
Directory Search Page (Figure 3.1) asks this question. The value of staffSurname is
set to the literal entered in the text box labelled ‘Name’. This means that a single data-
base query can be used to search for the contact details of any member of staff.
Dates, character strings and numbers are different kinds of values for data items and
are used in different ways. Numbers can be added, multiplied, averaged or even have
their cube root taken. Dates can be compared, but not multiplied. It makes sense to ask
whether 21 January 2005 is earlier or later than 3 August 2004, but not to multiply the
dates together. Strings can be sorted into alphabetical order or concatenated (added to-
gether). Concatenating the string ‘This is part’ with the string ‘of a sentence’, gives the
string ‘This is partof a sentence.’ Notice that, because there was no space character
after the last character of ‘This is part’ or before the first character of ‘of a sentence’,
the concatenated string is not correct English – ‘partof’ should be ‘part of’. This is a
common error made by programmers when assembling sentences from string data
held on a database.
A reasonable first step in organizing data is to identify the different kinds of data
values. A data type does this by giving a name to a set of values that share some com-
mon characteristics.
For example, emailAddress would make a sensible data type. Every e-mail address iden-
tifies an electronic mailbox, so they are clearly the same sort of data. However,
emailAddress is not usually one of the data types defined by a data model. It is too spe-
cific as not all database applications need to store data about e-mail addresses. Instead,
a data model usually allows user-defined data types. The database designer can define
data types such as emailAddress if they are needed. Some data types are so common
that data models almost always include them. The commonest system-defined data
types are listed in Table 3.1, together with a brief description of their values.
The numeric data type includes the sorts of numbers used in basic mathematics. If
values of a particular data item can sensibly be added together or sorted into numeri-
cal order, then its data type is numeric. Clearly, phone numbers are not numeric. Staff
numbers at the Pennine University might be numeric. Although it doesn’t make sense
to add them together, it might make sense to sort them into numerical order. Before
deciding the data type for staff numbers, the database designer should check if this is
the case.
A data item that is numeric can be given a precision (the number of digits used) and
a scale (the number of digits after the decimal point). Salaries at the Pennine Univer-
sity range up to £135,000.00, which means a precision of 8 and a scale of 2.
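Precision and scale can be read off a decimal literal mechanically. The helper below is a hypothetical illustration using Python's decimal module, not part of any DBMS.

```python
from decimal import Decimal


def precision_and_scale(literal):
    """Return (total number of digits, digits after the decimal point)."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    scale = max(-exponent, 0)
    precision = max(len(digits), scale)
    return precision, scale


top_salary = precision_and_scale("135000.00")
whole_number = precision_and_scale("57")
```

For the top salary this gives eight digits in total, two of them after the decimal point, matching the figures quoted above.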
If it does not make sense to sort staff numbers into numerical order, then their data
type will be character string. Character strings are simply strings of characters. The
data model may allow the database designer to specify which character sets are
supported – Western European, Cyrillic or Arabic, for example. Character strings can
be of varying length or fixed length. Telephone numbers are varying length character
strings. For example, ‘+44 (0) 1484 042526’ and ‘01484 042526’ are both the same
phone number, with the first written out for international dialling. A staff number is
always five characters long, so is of fixed length. The abbreviation Varchar is used to
indicate a varying length character string, and Char a fixed-length character string.
The Boolean, date and time data types are straightforward. Like dates, the Boolean
data type has a variety of literals. Depending on the DBMS, the value for true may be
represented by the literals TRUE, 1 or 0. Like character strings, the literals for dates and
times are usually enclosed in single quotes.
The interval data type has values that measure an interval of time. There are two
different sorts of interval or, rather, there are two different interval data types. One
measures intervals of time in years and months, the other in days, hours, minutes and
seconds. The two cannot be mixed as months do not all have the same number of
days. If you buy a monthly bus pass, adding an extra 28 days on 1 February gets you a
whole month extra, while on 1 August, it gets you three days short of a month. The ir-
regular structure of the calendar was a real problem for early application developers.
Fortunately, the developers who wrote today’s DBMS have dealt with these problems.
If a database application needs to manipulate dates, then the easiest approach is to use
the DBMS date manipulation functions.
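The bus pass example can be checked with a few lines of date arithmetic. Python's datetime module stands in for the DBMS date functions here, and the year 2005 is an assumption (any non-leap year gives the same result).

```python
from datetime import date, timedelta

four_weeks = timedelta(days=28)

# In February of a non-leap year, 28 days is exactly one month.
february_pass = date(2005, 2, 1) + four_weeks

# In August, 28 days falls short of the first of the next month.
august_pass = date(2005, 8, 1) + four_weeks
days_short = (date(2005, 9, 1) - august_pass).days
```

A day-based interval therefore cannot stand in for a month-based one, which is why the two interval data types are kept apart.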
The data types in Table 3.1 are known as scalar data types (sometimes called atom-
ic or primitive data types). To the end user, each value of a scalar data type is complete
in itself – it has no subparts that are directly accessible to the end user. The numeric
value 57 is complete in itself. Fifty seven is always, and only ever, fifty seven. It is less
obvious that the date value ‘21 January 2005’ is complete in itself. Obviously it is al-
ways the same date, but it seems to have user-accessible subparts – the year, month
and day. In fact, this is simply one way of representing the underlying date value. An-
other possible representation would be to show a date as the number of seconds
elapsed since some fixed point in the past (the UNIX operating system uses this form
of representation). So, although the date literal ‘21 January 2005’ has subparts, the date
value it represents does not.
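The ‘seconds since a fixed point’ representation can be seen directly. This sketch assumes midnight UTC at the start of 21 January 2005 and uses the UNIX convention of counting from 1 January 1970.

```python
from datetime import datetime, timezone

# One date value, written with year, month and day subparts.
moment = datetime(2005, 1, 21, tzinfo=timezone.utc)

# The same value represented as a single number with no subparts.
seconds_since_epoch = int(moment.timestamp())

# Converting back recovers the original value.
round_trip = datetime.fromtimestamp(seconds_since_epoch, tz=timezone.utc)
```

Both representations denote the same value; only the literal form suggests that a date has parts.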
In contrast, non-scalar data types do have user-accessible subparts. Table 3.2 lists
some system-defined non-scalar data types. They are non-scalar because they allow the
end user to store a collection of data values as a single value, but retain access to the
values of the members of the collection. The non-scalar data types in Table 3.2 are par-
ticularly simple, as the members must be of the same type. For example, if staff at the
Pennine University can have more than one phone number, each of which is a charac-
ter string, then it makes sense to represent them as a set of character strings. The end
user can see the value of some, or all, of the members of this set. A set is the simplest
kind of non-scalar data type, with no implied ordering of members and no repeated
member values. This is because there is no point having two members representing the
same phone number, for example.
A multiset (sometimes called a bag, but multiset is the SQL terminology) is also an
unordered collection, but it allows repeated members. The list of books a lecturer has
could be represented as a multiset because he or she might have two copies of the
same textbook (one for the office, one for home, say).
An array is an ordered collection of members and, like a multiset, may include re-
peated member values. Although arrays are the most complex of the three non-scalar
data types considered here, they are actually the simplest to implement on a computer.
Accordingly, most programming languages include arrays, but many do not include
the set or multiset data types. The array data type was added to the SQL data language
in SQL:1999, but it was only in SQL:2003 that multiset was added. There is still no set
data type in SQL.
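The three kinds of collection can be compared using Python's built-in types as rough analogues: set for a set, collections.Counter for a multiset and list for an array. The phone numbers, book titles and session names are made-up sample data.

```python
from collections import Counter

# Set: unordered, and the repeated phone number is stored only once.
phone_numbers = {"01484 042526", "01484 042527", "01484 042526"}

# Multiset: unordered, but it records how many copies of each member exist.
books = Counter(["Database Systems", "Database Systems", "Web Design"])

# Array: ordered, so members are reached by position and may repeat.
sessions = ["Lecture", "Tutorial", "Lecture"]
third_session = sessions[2]
```

Only the array lets the user ask for ‘the third member’; the set and multiset have no notion of position.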
Each data type has a set of operators associated with it. An operator takes one or
more values, manipulates them in some way and outputs a single value. The output
value is not necessarily the same data type as the inputs. Operators actually combine
literals and variables as they are the representations of data values used by database
users. A valid combination of literals, variables and operators is called an expression.
Thus, ‘2 + 3’, ‘x × 7’ and ‘(3 × z) – 2’ are all valid numeric expressions (assuming x and
z are numeric variables). An expression is said to evaluate to its output value, so ‘2 + 3’
evaluates to 5 and ‘15 < 12’ evaluates to false. An expression that evaluates to a numeric
value is called a numeric expression, one that evaluates to a Boolean value is a
Boolean expression and so on.
Boolean expressions are extremely important in SQL and other data languages.
They allow the end user to place conditions on the actions of the DBMS: ‘only do this
action if this Boolean expression evaluates to true’. Most Boolean expressions involve
comparison operators. These compare two values of a given data type, but always
return a Boolean value.
Comparisons are often used when an application needs to make choices. For exam-
ple, comparisons are used when an end user wants the DBMS to retrieve rows from a
database table, but only those rows where the data values satisfy certain conditions.
Suppose a user of the Staff Directory database wants to know which support sessions
start before 10.00 am. The DBMS must look at each row of the SupportSession
table and, if that support session starts before 10.00 am, show the data to the end user.
The Boolean expression startTime<'10:00' evaluates to true if the support ses-
sion starts before 10 am and to false otherwise. The DBMS can evaluate this Boolean
expression, so the user can pose a query in the form:
‘Show me the data from those rows of the SupportSession table where the
Boolean expression startTime<'10:00' evaluates to True’
One of the difficulties many ordinary database users have is framing their question
in this form. Database application developers create client applications that make it
easier for end users to state what data they want, then translate these data require-
ments into a form that the DBMS can understand.
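In SQL, this kind of request is written with the Boolean expression in a WHERE clause. The sketch below runs such a query through Python's sqlite3 module; the column names and rows of the SupportSession table are assumptions, and start times are assumed to be stored as zero-padded ‘HH:MM’ text so that they compare correctly.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE SupportSession (staffNo TEXT, dayOfWeek TEXT, startTime TEXT)"
)
# Hypothetical support sessions.
connection.executemany(
    "INSERT INTO SupportSession VALUES (?, ?, ?)",
    [
        ("31210", "Monday", "09:00"),
        ("31210", "Wednesday", "11:00"),
        ("23257", "Friday", "09:30"),
        ("23257", "Monday", "14:00"),
    ],
)

# Only rows for which startTime < '10:00' evaluates to true are returned.
early_sessions = connection.execute(
    "SELECT staffNo, dayOfWeek, startTime FROM SupportSession "
    "WHERE startTime < '10:00' ORDER BY startTime"
).fetchall()
```

The DBMS evaluates the expression once for each row and keeps only the rows where the result is true.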
The ‘less than’ operator – ‘<’ – is a good example of a comparison operator. It com-
pares the relative positions of two values in the usual ordering for their data type. For
numeric values, 2 < 3, evaluates to true, while 7 < 5 evaluates to false. For character
strings, ‘apple’ < ’banana’ is true, because ‘apple’ comes before ‘banana’ in a standard
English dictionary.
Table 3.3 lists the other common comparison operators, their usual symbols and
the symbols used in SQL and most other computer languages. Computer languages use
different symbols because of the limitations inherent in computer keyboards. The
table also shows which data types the comparison operator can be used with. For
example, it doesn’t make sense to ask whether true is less than false, so this operator
doesn’t work with the Boolean data type.

Table 3.3
Comparison operator     SQL symbol   Familiar symbol   Used with data types
Equal to                =            =                 All data types.
Not equal to            <>, or !=    ≠                 All data types.
Less than               <            <                 Numeric, character string, date, time, interval.
More than               >            >                 Numeric, character string, date, time, interval.
Less than or equal to   <=           ≤                 Numeric, character string, date, time, interval.
More than or equal to   >=           ≥                 Numeric, character string, date, time, interval.
In most languages, it is not permitted to compare values of different data types. For
example, 5 < ‘one’ is not a valid Boolean expression as 5 is a numeric literal, but ‘one’
is a character string literal. A language that insists comparison operators only compare
values of the same type is said to be strongly typed. Not all languages are strongly
typed. For example, SQL is, but PHP is not. The advantage of strong typing is that it
can prevent misleading results. For example, in SQL the Boolean expression 5 < ‘one’
is simply not allowed. A DBMS encountering this expression would tell the user that
there was an error in the SQL statement. In contrast, PHP is not strongly typed, so it
evaluates the Boolean expression 5 < ‘one’ and, in versions before PHP 8, announces
that it is false. What does this mean? The two literals can only meaningfully be
compared if both have the same data type. These versions of PHP convert the character
string literal ‘one’ to the numeric literal 0 and evaluate the Boolean expression 5 < 0,
which is clearly false. Unfortunately, they evaluate 5 < ‘seven’ as false, too, as ‘seven’
is also converted to 0! (From PHP 8 onwards, a number is instead compared with a
non-numeric string as text.) This can be very misleading, but there is a reason for PHP
not being strongly typed.
Consider what happens when SQL is confronted with the Boolean expression ‘5’ <
‘17’. As both literals are character strings, SQL evaluates this expression as false. In a
dictionary ordering, all character strings beginning with ‘1’ come before those begin-
ning with ‘5’, so this is correct, but it is misleading. PHP automatically converts both
character string literals to numeric values and evaluates the Boolean expression 5 < 17,
getting the expected value true. The reason it does this is that, when data is passed
from an HTML form to PHP, any values entered in a text box are passed as character
strings, even if they are actually numbers. Because PHP can automatically convert
character string literals such as ‘5’ and ‘17’ to numeric values, it is easier to deal with
data from an HTML form. Even so, it is better to be sure of the data type of any vari-
ables to avoid misleading results.
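The effect of typing on comparisons can be demonstrated in Python, which, like SQL as described above, refuses to order a number against a character string. The helper function compare is a hypothetical name used only for this illustration.

```python
def compare(left, right):
    """Evaluate left < right, reporting a type error instead of guessing."""
    try:
        return left < right
    except TypeError as error:
        return f"error: {error}"


# A number and a string cannot be compared, so an error is reported.
mixed = compare(5, "one")

# Two strings are compared in dictionary order: '1' sorts before '5'.
as_strings = compare("5", "17")

# Converting explicitly to numbers gives the expected numeric answer.
as_numbers = compare(int("5"), int("17"))
```

Converting form data explicitly, as in the last line, makes the intended data type clear whatever the language does automatically.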
In contrast to comparison operators, the basic arithmetic operators take numeric
values as inputs and output another numeric value, so the numeric expression 2 + 3
evaluates to five. The ‘–’ operator can either combine two numeric values or modify a
single value, so 7 – 4 evaluates to three, while –17 evaluates to minus seventeen.
Table 3.4 Single data type operators for common system-defined types.
An operator that takes two input values, like ‘+’, is called a binary operator. Opera-
tors with only one input value are unary operators. The symbol ‘ – ’ can represent ei-
ther a unary operator or a binary operator – the context making it clear which is
intended. Table 3.4 lists the most common non-comparison operators usually provid-
ed for each data type listed in Tables 3.1 and 3.2, together with their more familiar
symbols, where these exist. All these operators take values of the given data type as in-
puts, and output a value of the same data type. The operators for dates, times and in-
tervals behave differently, so are dealt with below.
The concatenation operator – || (two vertical bar characters) – acts on character
strings (or arrays or BLOBs). It pastes the second character string (or array or BLOB) on
to the end of the first, so ‘the quick’ || ‘brown fox’ evaluates to ‘the quickbrown fox’.
Notice how there is no space between the words ‘quick’ and ‘brown’. This is correct as
neither of the input character string literals included a space at this position. Applica-
tion developers must be careful when concatenating character string data to ensure
that any necessary spaces are present or to add them themselves.
|| is not always used to denote concatenation. The MySQL DBMS uses || to denote
the Boolean OR operator. It provides the CONCAT() function to perform string con-
catenation. Check the DBMS manual to find the correct symbols to use for operators.
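The missing-space problem is easy to reproduce. SQLite is one DBMS that does use || for concatenation, so it is used here through Python's sqlite3 module; the helper function evaluate is a hypothetical convenience for this example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")


def evaluate(expression):
    """Evaluate a fixed SQL expression (not user input) and return its value."""
    return connection.execute(f"SELECT {expression}").fetchone()[0]


# Neither input contains a space at the join, so the words run together.
joined = evaluate("'the quick' || 'brown fox'")

# Concatenating an explicit space literal gives the intended text.
spaced = evaluate("'the quick' || ' ' || 'brown fox'")
```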
The Boolean data type has the usual logical operators. Truth tables for these are
shown in Figure 3.2. To use these, note that the Boolean variables p and q can be any
Boolean expression. The second line of the truth table for AND tells us that, whenever
p is true and q is false, then p AND q is also false and so on.

Figure 3.2
p      NOT p        p      q      p AND q        p      q      p OR q
TRUE   FALSE        TRUE   TRUE   TRUE           TRUE   TRUE   TRUE
FALSE  TRUE         TRUE   FALSE  FALSE          TRUE   FALSE  TRUE
                    FALSE  TRUE   FALSE          FALSE  TRUE   TRUE
                    FALSE  FALSE  FALSE          FALSE  FALSE  FALSE

The operators for multiset and set are UNION, INTERSECT and DIFFERENCE.
Note that the union of two set values is also a set, so duplicate values are removed. For
example:
As multisets allow duplicate values, the union operator for multisets does not eliminate
duplicates. In SQL, the union of two multisets looks like this:
[2, 7, 15] UNION [7, 7, 15, 22] = [2, 7, 7, 7, 15, 15, 22]
The intersection operator for sets (and multisets) identifies common members. This in-
cludes common repeated members for multisets. For example:
Finally, the DIFFERENCE operator identifies members in the first set (or multiset) that
are not in the second:
In mathematics, adding together all the repeated members from both multisets, as
the SQL UNION operator does, is called fusion. Multiset union takes the maximum
number of members in the two multisets:
[2, 7, 15, 15] ∪ [7, 7, 15, 22] = [2, 7, 7, 15, 15, 22].
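These operators can be tried out using Python's collections.Counter as a stand-in for a multiset and the built-in set type for a set. The Python operators are analogues chosen for illustration, not SQL syntax: + adds member counts (fusion, the behaviour described for the SQL UNION operator), | takes the maximum count (mathematical multiset union), & keeps common members and - removes members of the second collection from the first.

```python
from collections import Counter


def members(multiset):
    """List the members of a multiset in sorted order, repeats included."""
    return sorted(multiset.elements())


a = Counter([2, 7, 15])
b = Counter([7, 7, 15, 22])

sql_union = members(a + b)      # fusion: counts are added
common = members(a & b)         # intersection: smaller count of each member
difference = members(b - a)     # members of b left after removing those in a
math_union = members(Counter([2, 7, 15, 15]) | b)  # larger count of each member

# For sets, union simply discards duplicates.
set_union = {2, 7, 15} | {7, 15, 22}
```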
The date and time data types have only one operator – ‘–’. The ‘–’ operator is a little
unusual, because it takes two dates (or two times), but returns an interval. All the non-
comparison operators for date, time and interval data types involve a mix of these
three data types. For example, ‘21 January 2007’ – ‘17 January 2007’ evaluates to the
interval ‘4 days’. The Pennine University’s Web Timetable database can use this to cal-
culate the duration of a teaching session (which is clearly an interval) from its start
and end times. An alternative approach would be to store the start time (a time) and
the duration (an interval) and calculate the end time by adding them together.
Adding a year–month interval to a date, or subtracting one from it, will give another date.
Adding or subtracting a day–hour–minute–second interval to a time will give another
time. These two operators are usually supported whenever the interval data type is. A
summary of the different mixed data type operators involving the date, time and
interval data types is given in Table 3.5.
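The mixed-type behaviour of these operators can be sketched with Python's datetime module, where timedelta plays the part of a day–hour–minute–second interval. The session start time and duration are made-up values. Python cannot add an interval to a bare time, so the sketch attaches the time to a date first; a DBMS that supports intervals would not need this step.

```python
from datetime import date, datetime, time, timedelta

# Subtracting one date from another gives an interval, not a date.
gap = date(2007, 1, 21) - date(2007, 1, 17)

# Storing a start time and a duration, then calculating the end time.
start = time(9, 15)
duration = timedelta(hours=1, minutes=30)
end = (datetime.combine(date(2007, 1, 17), start) + duration).time()

# The reverse calculation: the duration recovered from start and end times.
recovered = datetime.combine(date(2007, 1, 17), end) - datetime.combine(
    date(2007, 1, 17), start
)
```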
Table 3.5 Mixed data type operators for date, time and interval types.
Organizing data values into data types is an obvious first step towards a data model
and defining the operators a first step towards defining a data language. Most data
models include one or more data types and allow the DBMS application developers to
define additional data types as they see fit. These additional data types will be part of a
data language for the data model. The data types discussed above are the most com-
mon ones used in business information systems. Geographical information systems
(GIS), which hold data about maps, have more complex data types – points, lines and
paths, for example (Elmasri and Navathe (2007) include a brief introduction to GIS).
The relational data model was developed by E.F. Codd in the late 1960s in response to
the difficulties existing data models had when dealing with large, shared databases. The
relational data model is quite simple, and remains one of the easiest data models to un-
derstand. It is also very flexible as any data at all can be represented by a relational data-
base. Not surprisingly, most commercial DBMS are based on the relational data model;
these are called RDBMS. Earlier data models (such as the hierarchic and network data
models) have largely faded away, though some important legacy applications still use
databases based on these data models. Later data models (such as object-orientated and
semistructured) are important in niche markets and their use is growing. For now,
though, it is the relational data model that every database professional must understand.
Although motivated by practical problems, Codd gave the relational data model a
firm mathematical foundation using set theory and predicate logic. In fact, the name
‘relational data model’ refers to the fact that all data is organized into relations, which
are the mathematical equivalent of tables. So, a relation is a table and ‘relational’
means ‘table-based’. It is worth emphasizing this, because ‘relation’ is often mistaken
as meaning ‘relationship’. It does not: ‘relation’ is just another name for ‘table’. To avoid
confusion, this book uses the word table throughout.
The relational data model requires only a single data type: Boolean. An RDBMS is free
to include other data types in its implementation of the relational data language. The
data types discussed above are supported by many RDBMS. The relational data model
can also accommodate any integrity constraint that can be written as a Boolean expres-
sion. Traditionally two sorts of integrity constraint have been important – candidate
key constraints and foreign key constraints. More general enterprise constraints are
often supported by commercial RDBMS. For example, the SQL standard includes check
constraints and assertions (see Chapters 5 and 6).
Table heading and table instance A table consists of a table heading and a table
instance. The table heading includes the table name and the column descriptions.
The table instance is the set of rows held in the table at a given time.
The relational data model also includes integrity constraints, to limit what data can be
held in a table instance. The description of the table, its columns and constraints is
called the table schema. The collection of all table schemas for a particular database is
the database schema. Figure 3.3 gives a picture of the SupportSession table with
the various features labelled. Note that each column description shows the column
name and data type. The column values all have the appropriate data type, which are
varying length character strings for the Varchar columns and times for the Time
columns. The table name is shown as part of the table heading, but the relational data
model includes both named and unnamed tables. For example, the RDBMS responds
to a database query by creating an unnamed table and passing this table to the data-
base client. This means that tables really are the only data structure in the relational
data model.
There are four properties that the table instance must satisfy:
1 for each row of the table instance, there is exactly one value for each column and
that value has the same data type as the column
2 there are no duplicate rows in the table instance
3 the order of the rows in the table instance is not significant
4 the order of the columns for a particular row is not significant.
[Figure 3.3: the SupportSession table, with the table name, table heading (column descriptions) and table instance (rows and column values) labelled.]
Table name: SupportSession
Table heading (column descriptions):
staffNo:Varchar dayOfWeek:Varchar startTime:Time endTime:Time
Table instance (rows of column values):
35054 Monday 09:00:00 10:00:00
31210 Wednesday 11:00:00 13:00:00
35054 Monday 15:00:00 16:00:00
45965 Monday 11:00:00 12:00:00
45965 Wednesday 11:00:00 12:00:00
23257 Monday 15:00:00 16:00:00
55776 Monday 14:00:00 16:00:00
56893 Tuesday 14:00:00 15:00:00
56893 Thursday 09:00:00 10:00:00
56673 Thursday 15:30:00 16:30:00
56673 Friday 10:00:00 11:00:00
Properties 3 and 4 mean that, for example, the two instances of the Staff table in
Figures 3.4 (a) and (b) are identical. The fact that the columns and rows are in different
orders is irrelevant. Both table instances hold exactly the same data about members of
staff.
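Properties 2, 3 and 4 follow naturally if a table instance is modelled as a set of rows, where each row is itself a set of (column name, value) pairs. The Python sketch below illustrates this idea only; it is not how an RDBMS stores data. The function name make_instance is invented for the example, and only two rows and four columns of the Staff table are used.

```python
def make_instance(rows):
    """Build a table instance as a set of rows.

    Each row becomes a frozenset of (column, value) pairs, so column order
    is irrelevant; the instance is a set, so row order is irrelevant and
    duplicate rows collapse into one.
    """
    return {frozenset(row.items()) for row in rows}


instance_a = make_instance([
    {"staffNo": "14443", "fName": "Helen", "lName": "Abbot", "phone": "8032"},
    {"staffNo": "56893", "fName": "Ruth", "lName": "Bapetsi", "phone": "8022"},
])

# The same data, with rows and columns written in a different order.
instance_b = make_instance([
    {"phone": "8022", "lName": "Bapetsi", "staffNo": "56893", "fName": "Ruth"},
    {"phone": "8032", "lName": "Abbot", "staffNo": "14443", "fName": "Helen"},
])

print(instance_a == instance_b)  # True
```

Because a set cannot hold the same member twice, this representation cannot even express a duplicate row, which is exactly what property 2 demands.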
Property 2 means that the table instance shown in Figure 3.5 is illegal, because the
third and last rows are identical. It is worth noting that most RDBMS do not enforce
this rule, nor does the ISO standard for SQL. Even so, it is good practice to use integrity
constraints to prevent duplicate rows in a table instance, unless there is a very good
reason for allowing them. Tables that hold audit information – who did what, when –
or that log errors are good examples of tables that might need to hold duplicate rows.
Interpreted strictly, property 1 means that the Staff table instance shown in Figure
3.4 (a) is also illegal. On the fifth row, the row of data about Helene Chirac, the
phone column is null (indicated explicitly by writing the keyword NULL in that col-
umn, rather than by leaving the column empty). Null is not a value, but an indication
that there is no value for this column, on the current row. Nulls are the mechanism
most RDBMS use to deal with missing data (though this approach has its critics – see,
for example, Date (2004)). Since null is not a value, an amended version of property 1
is needed:
1 (amended version) For each row of the table instance, each column either has exactly
one value or is null. When there is a value, it has the same data type as the column.
Notice that null, as it is not a value, does not have a data type. Although null is wide-
ly used to represent missing information, it does complicate things. Consider what
happens when a user asks of the Staff Directory whether any members of staff have the
Figure 3.4 Two identical instances of the Staff table.
Staff
staffNo:Varchar fName:Varchar lName:Varchar phone:Varchar email:Varchar photo:BLOB
14443 Helen Abbot 8032 [email protected] [BLOB]
56893 Ruth Bapetsi 8022 [email protected] [BLOB]
56673 Joshua Bittaye 7782 [email protected] [BLOB]
33935 Padma Brar 6641 [email protected] [BLOB]
35155 Helene Chirac NULL [email protected] [BLOB]
55776 Gurpreet Choudhury 5454 [email protected] [BLOB]
35054 Selma Hutchins 8706 [email protected] [BLOB]
89987 Dan Lin 8514 [email protected] [BLOB]
78893 Jo Karen O’Connor 8871 [email protected] [BLOB]
77712 Frank Rose 8871 [email protected] [BLOB]
25447 John Smith 5104 [email protected] [BLOB]
25448 Judith Anne Smith 7709 [email protected] [BLOB]
31210 Paul Smith NULL [email protected] [BLOB]
10780 John Smith NULL [email protected] [BLOB]
23257 Freya Stark 8660 [email protected] [BLOB]
45965 Mikhail Sudbin 5553 [email protected] [BLOB]
33509 Helen Timms 8661 [email protected] [BLOB]
Staff
staffNo:Varchar email:Varchar fName:Varchar lName:Varchar phone:Varchar photo:BLOB
10780 [email protected] John Smith NULL [BLOB]
14443 [email protected] Helen Abbot 8032 [BLOB]
23257 [email protected] Freya Stark 8660 [BLOB]
25447 [email protected] John Smith 5104 [BLOB]
25448 [email protected] Judith Anne Smith 7709 [BLOB]
31210 [email protected] Paul Smith NULL [BLOB]
33509 [email protected] Helen Timms 8661 [BLOB]
33935 [email protected] Padma Brar 6641 [BLOB]
35054 [email protected] Selma Hutchins 8706 [BLOB]
35155 [email protected] Helene Chirac NULL [BLOB]
45965 [email protected] Mikhail Sudbin 5553 [BLOB]
55776 [email protected] Gurpreet Choudhury 5454 [BLOB]
56673 [email protected] Joshua Bittaye 7782 [BLOB]
56893 [email protected] Ruth Bapetsi 8022 [BLOB]
77712 [email protected] Frank Rose 8871 [BLOB]
78893 [email protected] Jo Karen O’Connor 8871 [BLOB]
89987 [email protected] Dan Lin 8514 [BLOB]
(b) The same table instance with columns and rows in a different order.
same phone number as Helen Abbot. Given the Staff table instance in Figure 3.4 (a),
the obvious answer is ‘No’. Obvious, but wrong! The null in the phone column of the
rows representing Paul Smith, Helene Chirac and John Smith means their phone num-
ber is not known. Any one of them might share a phone number with Helen Abbot,
but, then again, they might not. The RDBMS should recognize this uncertainty and
answer ‘Unknown’. Formally, introducing null to represent missing information alters
the interpretation of Boolean expressions. There are now three Boolean values: TRUE,
FALSE and UNKNOWN. Thus, the truth tables shown in Figure 3.2 need to be amended.
Figure 3.6 shows the truth tables for this three-valued logic.
Consider the truth table for NOT. Whenever the Boolean expression p evaluates to
TRUE, it follows that the Boolean expression NOT p must evaluate to FALSE and vice
versa. To say that a Boolean expression p evaluates to UNKNOWN means that it cannot
be decided whether it is true or false. Consequently, NOT p must also evaluate to
UNKNOWN because, if, for example, NOT p was TRUE, then p would have to be FALSE,
which contradicts the assumption that p is UNKNOWN.
Staff
staffNo:Varchar fName:Varchar lName:Varchar phone:Varchar email:Varchar photo:BLOB
14443 Helen Abbot 8032 [email protected] [BLOB]
56893 Ruth Bapetsi 8022 [email protected] [BLOB]
55776 Gurpreet Choudhury 5454 [email protected] [BLOB]
56673 Joshua Bittaye 7782 [email protected] [BLOB]
33935 Padma Brar 6641 [email protected] [BLOB]
35155 Helene Chirac NULL [email protected] [BLOB]
55776 Gurpreet Choudhury 5454 [email protected] [BLOB]
(Duplicate row: the third and last rows are identical.)
Figure 3.6 Truth table for the Boolean operators using three-valued logic.

p        NOT p
TRUE     FALSE
FALSE    TRUE
UNKNOWN  UNKNOWN

p        q        p AND q
TRUE     TRUE     TRUE
TRUE     FALSE    FALSE
TRUE     UNKNOWN  UNKNOWN
FALSE    TRUE     FALSE
FALSE    FALSE    FALSE
FALSE    UNKNOWN  FALSE
UNKNOWN  TRUE     UNKNOWN
UNKNOWN  FALSE    FALSE
UNKNOWN  UNKNOWN  UNKNOWN

p        q        p OR q
TRUE     TRUE     TRUE
TRUE     FALSE    TRUE
TRUE     UNKNOWN  TRUE
FALSE    TRUE     TRUE
FALSE    FALSE    FALSE
FALSE    UNKNOWN  UNKNOWN
UNKNOWN  TRUE     TRUE
UNKNOWN  FALSE    UNKNOWN
UNKNOWN  UNKNOWN  UNKNOWN
Figure 3.7 The Staff table redefined so that the phone column has data type ‘set of character
strings’.
Staff
staffNo:Varchar fName:Varchar lName:Varchar phone:Varchar Set email:Varchar photo:BLOB
14443 Helen Abbot {'8032'} [email protected] [BLOB]
56893 Ruth Bapetsi {'8022'} [email protected] [BLOB]
56673 Joshua Bittaye {'7782'} [email protected] [BLOB]
33935 Padma Brar {'6641'} [email protected] [BLOB]
35155 Helene Chirac NULL [email protected] [BLOB]
55776 Gurpreet Choudhury {'5454', '5771'} [email protected] [BLOB]
35054 Selma Hutchins {'8706'} [email protected] [BLOB]
89987 Dan Lin {'8514'} [email protected] [BLOB]
78893 Jo Karen O’Connor {'8871'} [email protected] [BLOB]
77712 Frank Rose {'8871'} [email protected] [BLOB]
25447 John Smith {'5104'} [email protected] [BLOB]
25448 Judith Anne Smith {'7709'} [email protected] [BLOB]
31210 Paul Smith NULL [email protected] [BLOB]
10780 John Smith {} [email protected] [BLOB]
23257 Freya Stark {'8660', '8661'} [email protected] [BLOB]
45965 Mikhail Sudbin {'5553'} [email protected] [BLOB]
33509 Helen Timms {'8661'} [email protected] [BLOB]
For Boolean expressions involving AND, the truth table offers the obvious interpre-
tation: the Boolean expression p AND q evaluates to TRUE precisely when both p and q
are TRUE. When either one of the input values is FALSE, p AND q evaluates to
FALSE. In all other circumstances, one of the input values must be UNKNOWN and p
AND q evaluates to UNKNOWN. A similar case-by-case description can be given for the
Boolean operator OR.
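The three-valued truth tables can be written down as three short functions. The Python sketch below is an illustration only: it uses None to stand for UNKNOWN (and for null in a column), and the function names not3, and3, or3 and eq3 are invented for this example.

```python
from functools import reduce
from typing import Optional

# None stands in for UNKNOWN; True and False are the other two truth values.
Truth = Optional[bool]


def not3(p: Truth) -> Truth:
    return None if p is None else (not p)


def and3(p: Truth, q: Truth) -> Truth:
    if p is False or q is False:
        return False            # one FALSE input is enough to decide
    if p is None or q is None:
        return None             # otherwise any UNKNOWN leaves it undecided
    return True


def or3(p: Truth, q: Truth) -> Truth:
    if p is True or q is True:
        return True             # one TRUE input is enough to decide
    if p is None or q is None:
        return None
    return False


def eq3(a, b) -> Truth:
    """Compare two column values, where None represents null."""
    if a is None or b is None:
        return None
    return a == b


# Does anyone share Helen Abbot's phone number, '8032'? One phone is null.
other_phones = ["8022", "7782", "6641", None]
answer = reduce(or3, (eq3(phone, "8032") for phone in other_phones), False)
print(answer)  # None, that is, UNKNOWN
```

The final lines repeat the Staff Directory question from earlier in the section: because one of the phone values is null, the combined answer is UNKNOWN rather than FALSE.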
Both versions of property 1 state that, on any row, each column has a single value.
This has led many textbooks to insist that a column must have a scalar data type. The
argument is that non-scalar data types are designed to hold more than one value. For
example, suppose that the Staff table is defined so that the data type of the phone
column is ‘set of character strings’ rather than ‘character string’. A possible instance of
this version of the Staff table is shown in Figure 3.7. The value in the phone column
in the row for Freya Stark is a set with two members, the two character strings '8660'
and '8661'. However, the column’s data type is not Varchar, it is Varchar Set.
So, the column value is the whole set {'8660', '8661'} and this is a single set
value. Thus, the table instance in Figure 3.7 does still satisfy property 1.
The table also illustrates an important distinction between null and ‘no value’. The
fifth row from the top holds data for Helene Chirac and the entry in the phone column
for this row is null. This indicates that her phone number is not yet known. In contrast,
the fourth row from the bottom, which holds data about one of the two John Smiths
employed at the Pennine University, has the value {} in the phone column, which
indicates the empty set. This is a valid set value – specifically, it is the set with no mem-
bers. It tells us that John Smith is known to not have a phone number. This is not the same
as null, which indicates that it is unknown whether or not the person has a phone
number.
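The difference between null and the empty set can be made concrete with a small Python sketch. It is an illustration under stated assumptions: None plays the role of null, a frozenset plays the role of a Varchar Set value, and the function name has_phone is invented for this example.

```python
from typing import FrozenSet, Optional


def has_phone(phones: Optional[FrozenSet[str]]) -> Optional[bool]:
    """Return True or False when the answer is known, None when it is not."""
    if phones is None:
        return None             # null: nothing is known about this person's phones
    return len(phones) > 0      # a genuine set value, possibly the empty set


freya = frozenset({"8660", "8661"})  # a set with two members
john = frozenset()                   # the empty set: known to have no phone
helene = None                        # null: not yet known

print(has_phone(freya), has_phone(john), has_phone(helene))  # True False None
```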
Allowing columns to have a non-scalar data type opens up an interesting possibility.
A table heading can be thought of as declaring a user-defined data type. Consequently,
a column’s data type can be a table. A named table is simply a variable with this data
type and its table instance a particular value of the data type. Not all RDBMS support
table data types. Even when they do, it is a feature that should be used sparingly. Un-
necessary complexity is best avoided.
Integrity constraint A statement about the database instance that must always
be true. It is usually written as a Boolean expression.
The relational data model has two main integrity constraints, both based on the no-
tion of a relational key. A relational key is simply a set of columns that serve a particu-
lar role within the database schema. A candidate key is a set of columns the values of
which are unique to a particular row of the table. The candidate key acts as a unique
identifier for a row as, given the values for the columns of the candidate key, there can
be only one matching row in the table instance. A candidate key constraint is a
Boolean expression that tells the DBMS to ensure that the value of a candidate key is
unique within the table instance. A table may have several candidate keys. One of
these is chosen to be the main identifier for the table and is called the primary key.
The uniqueness condition is not quite enough to properly define a candidate key.
Consider the Staff table (shown in Figure 3.4 (a)). It has six columns, staffNo,
fName, lName, phone, email and photo. Clearly the set of all these columns satis-
fies the uniqueness property as every row in a table is unique. However, it is not much
help as an identifier, as you must know all the data values in a row before you can
identify that row. A candidate key should be a minimal set, in the sense that if you
removed any one of the columns, then the resulting set would not be unique. This
leads to the following definition.
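Candidate key A candidate key for a table is a set of columns that has two
properties:
1 uniqueness – the combination of values for these columns is different for each
row of the table instance
2 irreducibility – if any one of the columns is removed from the set, the remaining
columns no longer have the uniqueness property.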
Candidate key constraint States that the candidate key (the combination of
values of the candidate key columns) is unique within the table instance.
Sets of columns that have uniqueness but not irreducibility are sometimes called
super keys, meaning that they contain a candidate key as a subset.
The value of staffNo will be different for each row as, after all, the purpose of a staff
number is to uniquely identify a member of staff. Any set of columns with just one
column in it is obviously irreducible, so {staffNo} is a candidate key. Similarly,
{email} is a candidate key, because the Pennine University has a business rule that
each member of staff has exactly one e-mail address. E-mail addresses are usually
unique to an individual, but not always. Within an organization, individuals will
typically have their own e-mail address, though there may be shared e-mail addresses,
too. At the Pennine University, for example, the members of the Admissions team
each have their own e-mail address, but share responsibility for checking the
‘[email protected]’ e-mail address, used by applicants to request brochures,
make enquiries and so on. Similarly, some families share an e-mail address because
their Internet service provider only provides one e-mail address per connection. Using
a shared e-mail address to identify customers in a billing system is not a good idea.
Who settles the bill if the couple split up?
Notice that {fName}, {lName} and {fName, lName} are not candidate keys (check
the table instance shown in Figure 3.4 (a)). Names almost never work as candidate
keys, because different people often share the same name. Actors’ stage names are
unique, so can be used as candidate keys, but that is the only obvious example and is
pretty limited in application.
The phone column presents an interesting situation. Suppose that the Pennine
University has the following business rule:
This suggests that phone has the uniqueness property. However, it does not. To see
why, suppose the user adds a new row to the database, with the column values:
staffNo = '56890'
fName = 'Anna'
lName = 'Jones'
phone = '1229'
email = '[email protected]'
(The photo column will be null for this new row of the table instance.) The RDBMS
must check whether or not the value of the phone column for this new row will be
unique. It does this by looking at each of the existing rows and asking the question:
‘Is the value of the phone column on this row equal to ‘1229’?’
This is equivalent to evaluating the Boolean expression phone = '1229' for each
row in the table. With the data shown in Figure 3.4 (a), this Boolean expression evalu-
ates to FALSE for the first four rows. For the fifth row it evaluates to UNKNOWN, be-
cause the phone column is null for the fifth row. This means that it cannot be said
with certainty that the new phone number is unique, so it cannot be a candidate key.
In general, no column of a candidate key can be null.
SQL includes a constraint called UNIQUE. This does allow columns to be null, but
enforces uniqueness across rows where there are no null columns. This added flexi-
bility in defining integrity constraints can be useful.
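The behaviour of UNIQUE can be observed in a small experiment. The Python sketch below uses the sqlite3 module from the standard library as a convenient stand-in for an RDBMS; other products may treat nulls in UNIQUE columns differently, so this shows SQLite's behaviour only. The table is a cut-down version of Staff, and the final row (staff number '56890' with the phone number '8032') is a hypothetical value chosen to collide with an existing phone number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Staff (
        staffNo VARCHAR(5) PRIMARY KEY,
        lName   VARCHAR(30) NOT NULL,
        phone   VARCHAR(4) UNIQUE      -- may be null
    )
""")

# Two rows with a null phone are accepted: null is not equal to null.
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?, ?)",
    [("14443", "Abbot", "8032"),
     ("35155", "Chirac", None),
     ("31210", "Smith", None)],
)

# A second row with the non-null phone '8032' is rejected.
try:
    conn.execute("INSERT INTO Staff VALUES ('56890', 'Jones', '8032')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Staff").fetchone()[0]
print(duplicate_rejected, row_count)  # True 3
```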
A candidate key with a single column is called a simple candidate key, while ones with
two or more columns are composite candidate keys. The SupportSession table
shown in Figure 3.3 is an example of a table with no simple candidate keys. The set of
columns {staffNo, dayOfWeek, startTime} is a composite candidate key. It is
unique, as a member of staff can only offer one support session at any one time. Notice
that the uniqueness condition refers to the combination of values in all three columns.
The first and third rows of the table instance in Figure 3.3 have the same values in
staffNo and dayOfWeek, but differ in the startTime column, while the fourth and
fifth have the same values in startTime and staffNo, but differ in the dayOfWeek
column. Analysing all possible subsets of the candidate key shows that {staffNo,
dayOfWeek, startTime} is irreducible. {staffNo, dayOfWeek} cannot be guaranteed
to be unique as one member of staff may offer two support sessions on the same
day. Similarly, {staffNo, startTime} may not be unique as staff may offer support sessions
at the same time on different days. {dayOfWeek, startTime} is not unique as
two different members of staff may offer support sessions on the same day and at the
same time. Thus, none of the two-member subsets satisfies the requirement of uniqueness.
This is sufficient to establish irreducibility as, if any one-member subset had the
uniqueness property, so would the two-member subsets containing this member.
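The subset analysis above can be checked mechanically against the table instance in Figure 3.3. The Python sketch below is an illustration; the function names is_unique and is_irreducible are invented for this example. Note the limitation: a table instance can show that a set of columns is not unique, but it can never prove that a set of columns will always be unique. That guarantee comes from the business rules, not from the data that happens to be in the table.

```python
COLUMNS = ("staffNo", "dayOfWeek", "startTime", "endTime")
DATA = [
    ("35054", "Monday", "09:00:00", "10:00:00"),
    ("31210", "Wednesday", "11:00:00", "13:00:00"),
    ("35054", "Monday", "15:00:00", "16:00:00"),
    ("45965", "Monday", "11:00:00", "12:00:00"),
    ("45965", "Wednesday", "11:00:00", "12:00:00"),
    ("23257", "Monday", "15:00:00", "16:00:00"),
    ("55776", "Monday", "14:00:00", "16:00:00"),
    ("56893", "Tuesday", "14:00:00", "15:00:00"),
    ("56893", "Thursday", "09:00:00", "10:00:00"),
    ("56673", "Thursday", "15:30:00", "16:30:00"),
    ("56673", "Friday", "10:00:00", "11:00:00"),
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]


def is_unique(rows, columns):
    """True if no two rows share the same combination of values."""
    ordered = sorted(columns)
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in ordered)
        if key in seen:
            return False
        seen.add(key)
    return True


def is_irreducible(rows, columns):
    """True if removing any single column destroys uniqueness.

    Checking the subsets with one column removed is enough: if a smaller
    subset were unique, every set containing it would be unique too.
    """
    columns = set(columns)
    return all(not is_unique(rows, columns - {c}) for c in columns)


key = {"staffNo", "dayOfWeek", "startTime"}
print(is_unique(rows, key), is_irreducible(rows, key))  # True True
print(is_unique(rows, {"staffNo", "dayOfWeek"}))        # False
```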
The SupportSession table has another candidate key: {staffNo, dayOfWeek,
endTime}. For both the Staff and SupportSession tables, one of the candidate
keys must be chosen as the primary key – the main identifier for the table. The other
candidate keys are sometimes referred to as alternate keys. There are no fixed rules as
to which candidate key to choose as the primary key. One useful guide is to pick the
candidate key:
For the Staff table, staffNo probably makes more business sense than email.
Note that neither will change and both have only a single column, but that each value
of staffNo is shorter in length than a value of email (5 characters compared to
more than 14).
Of the two candidate keys for the SupportSession table, {staffNo, dayOfWeek,
startTime} makes more business sense as a primary key as it is more useful to know
when a support session starts than when it ends. Note that both the candidate keys for
the SupportSession table change rarely, have the same number of columns and
the same length for their data values. If both made equal business sense as primary
keys, then it wouldn’t matter which was chosen.
The candidate key allows an RDBMS to uniquely identify a row from a fragment of
the data held on that row. Candidate key constraints help the RDBMS to ensure that
there are no duplicate rows in a table instance. The second kind of key is not con-
cerned with individual tables, but with the relationships between them. Figure 3.8
shows how this works for the Staff and SupportSession tables. Each of these ta-
bles has a column called staffNo. For the Staff table, {staffNo} is a candidate
key (in fact, the primary key), so a particular value for staffNo identifies a single row
of the Staff table. For the SupportSession table, the staffNo column is not a
candidate key, so many rows can have the same value in the staffNo column. For
example, the two rows with the value ‘45965’ are indicated in Figure 3.8, along with
the matching candidate key value in the Staff table. This matching of data values
tells us that the two support sessions are offered by Mikhail Sudbin. This mechanism
for capturing a relationship is simple and elegant. No special data structures are
needed. Everything is dealt with through the relational data model’s only data structure,
the table.
There is quite a bit of terminology associated with the scenario in Figure 3.8. The set
of columns {staffNo} from the SupportSession table is called a foreign key.
Notice that a foreign key, like a candidate key, is a set of columns (although in the ex-
ample there is only one member in the set). In any row, the values of the foreign key
columns act as a key to related data from another (a foreign) table. The Staff table’s
candidate key {staffNo} is called the matching candidate key because it is the
column values of that candidate key, rather than any other, which will match those of
the foreign key. The table with the foreign key is called the referencing table and the
table with the matching candidate key the referenced table. That is because, for exam-
ple, rows of SupportSession refer back to rows of Staff.
The referencing table is sometimes called the child table and the referenced table the
parent table. The idea is that the rows of the child table ‘belong to’ the matching row
of the parent table.
The foreign key constraint (also called the referential integrity constraint) is a Boolean
expression that tells the RDBMS to ensure that the values entered for the foreign key
columns of the referencing table can be found as the column values of the matching
candidate key on some row in the referenced table. In the example above, the foreign
key constraint insists that, for each row of SupportSession, there is a row in
Staff such that the values in the staffNo columns of the two tables are the same.
Staff
staffNo:Varchar fName:Varchar lName:Varchar phone:Varchar email:Varchar photo:BLOB
14443 Helen Abbot 8032 [email protected] [BLOB]
56893 Ruth Bapetsi 8022 [email protected] [BLOB]
56673 Joshua Bittaye 7782 [email protected] [BLOB]
33935 Padma Brar 6641 [email protected] [BLOB]
35155 Helene Chirac NULL [email protected] [BLOB]
55776 Gurpreet Choudhury 5454 [email protected] [BLOB]
35054 Selma Hutchins 8706 [email protected] [BLOB]
89987 Dan Lin 8514 [email protected] [BLOB]
78893 Jo Karen O’Connor 8871 [email protected] [BLOB]
77712 Frank Rose 8871 [email protected] [BLOB]
25447 John Smith 5104 [email protected] [BLOB]
25448 Judith Anne Smith 7709 [email protected] [BLOB]
31210 Paul Smith NULL [email protected] [BLOB]
10780 John Smith NULL [email protected] [BLOB]
23257 Freya Stark 8660 [email protected] [BLOB]
45965 Mikhail Sudbin 5553 [email protected] [BLOB]
33509 Helen Timms 8661 [email protected] [BLOB]
SupportSession (referencing table)
staffNo:Varchar dayOfWeek:Varchar startTime:Time endTime:Time
35054 Monday 09:00:00 10:00:00
31210 Wednesday 11:00:00 13:00:00
35054 Monday 15:00:00 16:00:00
45965 Monday 11:00:00 12:00:00
45965 Wednesday 11:00:00 12:00:00
23257 Monday 15:00:00 16:00:00
55776 Monday 14:00:00 16:00:00
56893 Tuesday 14:00:00 15:00:00
56893 Thursday 09:00:00 10:00:00
56673 Thursday 15:30:00 16:30:00
56673 Friday 10:00:00 11:00:00
An RDBMS that ensures foreign key constraints are always satisfied is said to maintain
referential integrity.
Foreign key A set of columns in the referencing table, the values of which refer
to those of the matching candidate key columns in some row of the referenced
table.
Foreign key constraint States that, for each row in the referencing table, there
must be a row in the referenced table such that the value in the foreign key
columns is the same as the value of the matching candidate key columns.
As the values of the foreign key columns are identical to values from the matching
candidate key columns, they must have the same data type. When the two keys are
composite, the declaration of the foreign key constraint must specify which columns
of the foreign key match which of the candidate key. This is not always obvious as
there is no rule that insists on the column names being the same, only their data
types. Figure 3.9 shows the foreign key relationship between the Equipment and
Room tables. These tables are part of the Web Timetable database at the Pennine
University. The values in the Equipment table’s portable column indicate
whether the equipment is fixed in a particular room or can be moved from room to
room. For fixed equipment, such as ceiling-mounted projectors, the database captures
the link between the equipment and the room it is fixed in by enforcing a foreign key
constraint. The foreign key for the Equipment table is {building, room}, the
matching candidate key for the Room table is {building, roomNo}. These have dif-
ferent column names, so the foreign key constraint must make it clear that the match
is between data in the two building columns and in the room and roomNo
columns. This issue is addressed in the data language (see Chapter 5).
This may seem a minor point, but it is important. Remember, a DBMS is computer
software and computer software cannot make an educated guess. While to a person
it seems obvious to match room and roomNo, computer software has no knowledge
base of common sense from which to make such an inference.
Unlike a candidate key, the columns of a foreign key may be null. Figure 3.9 also shows
this situation. For example, the entries in the second row for both the foreign key
columns are null. This does not violate the foreign key constraint, even though there can
never be a row in the Room table for which the entries in the building and roomNo
columns are null. The reason for this is tied up with the interpretation of null and three-
valued Boolean logic. When a row is added to the Equipment table, the DBMS checks
that the foreign key constraint is satisfied by answering the following question:
Is there a row in the Room table for which the entries in the building and
roomNo columns have the same values as those supplied for the building and
room columns respectively of this new row in the Equipment table?
Figure 3.9 A composite foreign key, with different column names and nulls.
Equipment
assetNo:varchar assetType:varchar description:varchar building:varchar room:varchar portable:boolean
1570131 Data projector HP vp6100 Wilson 205 FALSE
1799131 Data projector HP vp6100 NULL NULL TRUE
4560293 OHP 3M OHP 1608 Locke 24 FALSE
4503993 PC Dell OptiPlex™ GX280 Wilson 205 FALSE
5010009 OHP 3M OHP 2000 NULL NULL TRUE
2992220 Data Projector ToshibaTDP-SW20 Priestley G12 FALSE
2892112 Laptop SONY X505 NULL NULL TRUE
Room
building:varchar roomNo:varchar capacity:numeric
Wilson 205 25
Wilson 113 25
The constraint is only violated if this statement evaluates to FALSE, in which case the
DBMS would not add the row to the table instance. When a null has been supplied for
one of the foreign key columns in the Equipment table, the statement evaluates to
UNKNOWN. The constraint has not been violated as UNKNOWN is not FALSE and so the
row is added to the table instance. A similar process of checking occurs when a row of
the Equipment table is modified.
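The three possible outcomes of the foreign key check (TRUE, UNKNOWN and FALSE) can be observed directly. The Python sketch below uses the standard sqlite3 module as a stand-in RDBMS, with cut-down versions of the Room and Equipment tables. Two assumptions are worth stating: SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, and the asset number '9999999' and room '999' are hypothetical values invented to trigger a violation. The helper name try_insert is also invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
conn.executescript("""
    CREATE TABLE Room (
        building VARCHAR(20),
        roomNo   VARCHAR(5),
        capacity NUMERIC,
        PRIMARY KEY (building, roomNo)
    );
    CREATE TABLE Equipment (
        assetNo  VARCHAR(7) PRIMARY KEY,
        building VARCHAR(20),
        room     VARCHAR(5),
        -- The column lists state which column matches which,
        -- because room and roomNo have different names.
        FOREIGN KEY (building, room) REFERENCES Room (building, roomNo)
    );
    INSERT INTO Room VALUES ('Wilson', '205', 25);
""")


def try_insert(asset_no, building, room):
    """Return True if the DBMS accepts the new Equipment row."""
    try:
        conn.execute("INSERT INTO Equipment VALUES (?, ?, ?)",
                     (asset_no, building, room))
        return True
    except sqlite3.IntegrityError:
        return False


matched = try_insert("1570131", "Wilson", "205")    # check is TRUE: accepted
nulls = try_insert("1799131", None, None)           # check is UNKNOWN: accepted
unmatched = try_insert("9999999", "Wilson", "999")  # check is FALSE: rejected
print(matched, nulls, unmatched)  # True True False
```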
It is possible to design any database so that foreign key columns do not allow null
and, indeed, many database practitioners prefer this approach (in particular, those who
never liked nulls in the first place). Figure 3.10 shows how the Equipment table could
be split into two separate tables – PortableEquipment and FixedEquipment.
There is now a foreign key linking FixedEquipment and Room that does not allow
nulls in the foreign key columns. There is no foreign key for PortableEquipment.
In addition, there is no longer a need for the portable column in either of the two
equipment tables.
The main use of foreign key constraints is to ensure that new rows in the referenc-
ing table have valid values in the foreign key columns. A second use is to check that
Figure 3.10 Redesigning the Equipment table to avoid nulls in the foreign
key columns.
FixedEquipment (foreign key {building, room} references Room)
assetNo:varchar assetType:varchar description:varchar building:varchar room:varchar
1570131 Data projector HP vp6100 Wilson 205
4560293 OHP 3M OHP 1608 Locke 24
PortableEquipment (no foreign key)
assetNo:varchar assetType:varchar description:varchar
5010009 OHP 3M OHP 2000
1799131 Data projector HP vp6100
2892112 Laptop SONY X505
changes to the column values of the matching candidate key do not violate referential
integrity. For example, suppose that the Wilson building at the Pennine University is
renamed the Heath building. The RDBMS must amend the data in the first three rows
of the Room table (see Figure 3.9) to reflect that change. In doing so, it would leave the
first and fourth rows of the Equipment table without a matching candidate key, so,
to maintain the foreign key constraint, the RDBMS must reject the requested change.
This is a little inconvenient, as, to make the change, three distinct steps are needed.
1 For each row in the Room table where the building column has the value
‘Wilson’, add a new row, identical to it except that the building column has the
value ‘Heath’.
2 For each row in the Equipment table where the building column has the value
‘Wilson’, change this value to ‘Heath’.
3 Delete all rows from the Room table where the building column has the value
‘Wilson’.
To avoid this rigmarole, there is a referential action associated with every foreign key
constraint. The referential action tells the RDBMS how to maintain referential integri-
ty when a user requests changes to matching candidate key values that would violate
the foreign key constraint. The default referential action is NO ACTION. This means
that there is no special action defined for the foreign key constraint. If a request to
change the referenced table instance would violate the foreign key constraint, then
the change will be rejected.
The CASCADE referential action says that changes to the referenced table instance
are cascaded down to the referencing table instance. In our example, this would mean
that changing the value in the building column from ‘Wilson’ to ‘Heath’ in the
Room table would automatically lead to a similar change being made in the
Equipment table.
Two further referential actions are widely supported. SET NULL says that when a
row of the referenced table is modified or removed, the foreign key columns of any
matching rows in the referencing table are set to null. SET DEFAULT has a similar
effect, except that the foreign key column values are set to their default value.
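The referential actions described above can be tried out with a small sketch. It is only an illustration, written in Python using the SQLite DBMS that ships with Python rather than any DBMS named in this chapter. The cut-down Room and Equipment tables, the column name roomNo and the helper function rename_wilson are assumptions made for the example.

```python
import sqlite3


def rename_wilson(action):
    """Rename the Wilson building in Room under the given ON UPDATE action."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys on request
    conn.execute(
        "CREATE TABLE Room (building TEXT, roomNo TEXT, "
        "PRIMARY KEY (building, roomNo))"  # roomNo is a hypothetical column name
    )
    conn.execute(
        "CREATE TABLE Equipment (assetNo TEXT PRIMARY KEY, building TEXT, room TEXT, "
        "FOREIGN KEY (building, room) REFERENCES Room (building, roomNo) "
        f"ON UPDATE {action})"
    )
    conn.execute("INSERT INTO Room VALUES ('Wilson', '205')")
    conn.execute("INSERT INTO Equipment VALUES ('1570131', 'Wilson', '205')")
    try:
        conn.execute("UPDATE Room SET building = 'Heath' WHERE building = 'Wilson'")
    except sqlite3.IntegrityError:
        return "rejected"  # the change would leave an unmatched foreign key
    # Report what the referencing row looks like after the change
    return conn.execute("SELECT building, room FROM Equipment").fetchone()


for action in ("NO ACTION", "CASCADE", "SET NULL"):
    print(action, "->", rename_wilson(action))
```

With NO ACTION the update should be rejected, with CASCADE the Equipment row should follow the change to ‘Heath’ and with SET NULL both foreign key columns of the Equipment row should become null.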
Referential actions are a good solution to the problem of maintaining referential in-
tegrity, while still allowing the user to make reasonable changes to the database in-
stance. They are applied automatically by the DBMS, with no need for the application
developer to code an exception-handling procedure. However, they only deal with the
situation where the user changes the value in the matching candidate key of a foreign
key constraint. A more general approach is to define a transaction. A transaction is a
series of database modifications that are treated as a single unit of work. During the
transaction, some integrity constraints may be temporarily violated – there may be un-
matched foreign keys or duplicate rows, for example. However, at the end of the trans-
action, these violations have been resolved and all integrity constraints are satisfied.
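A transaction that tolerates a temporary violation can be sketched as follows. This is a minimal illustration in Python with SQLite, not the mechanism of any particular DBMS discussed in this book. It assumes that the foreign key is declared DEFERRABLE INITIALLY DEFERRED, which asks SQLite to postpone the check until the end of the transaction. The table layout and the column name roomNo are also assumptions.

```python
import sqlite3

# isolation_level=None lets the sketch issue BEGIN and COMMIT explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE Room (building TEXT, roomNo TEXT, PRIMARY KEY (building, roomNo))"
)
conn.execute(
    "CREATE TABLE Equipment (assetNo TEXT PRIMARY KEY, building TEXT, room TEXT, "
    "FOREIGN KEY (building, room) REFERENCES Room (building, roomNo) "
    "DEFERRABLE INITIALLY DEFERRED)"
)
conn.execute("INSERT INTO Room VALUES ('Wilson', '205')")
conn.execute("INSERT INTO Equipment VALUES ('1570131', 'Wilson', '205')")

# One unit of work: the foreign key is unmatched after the first update,
# but matched again by the time the transaction ends.
conn.execute("BEGIN")
conn.execute("UPDATE Equipment SET building = 'Heath'")
conn.execute("UPDATE Room SET building = 'Heath'")
conn.execute("COMMIT")
buildings = conn.execute("SELECT building FROM Equipment").fetchall()

# A transaction that ends with the violation unresolved cannot be committed.
conn.execute("BEGIN")
conn.execute("UPDATE Equipment SET building = 'Locke'")
try:
    conn.execute("COMMIT")
    committed = True
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")
    committed = False
```

The first transaction does in two statements what the three-step workaround above did by hand. The second shows that the constraint is still enforced, just later.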
Up to this point, table schemas have all been given by showing a picture of the table
(table heading and table instance). Database designers use a standard format for written
table schemas (a written table schema is usually called a relation schema). Figure 3.11 shows
the general format for a written table schema. First comes the table name, followed by the list of
column descriptions in parentheses. In many cases, this is all that is given (sometimes,
even the column data types are omitted), but it is handy to include a description of the
main constraints. For the primary key constraint, the primary key column(s) can sim-
ply be listed, but, for the foreign key constraint, both the foreign key column(s) and
the matching candidate key column(s) need to be given. The written table schemas for
the Staff and SupportSession tables are shown below.
Figure 3.11 General format for a written table schema, with the column descriptions labelled.
An alternative way to show primary keys is to underline the primary key columns.
This works well enough, but the practice of showing foreign keys by placing an aster-
isk – ‘*’ – next to foreign key columns is less effective. Which table does the foreign key
refer to and, if there is more than one foreign key, how do you indicate this? In this
book, the fuller notation shown above is used instead.
Although the primary and foreign key constraints are the main ones in the relation-
al data model, most RDBMS include at least two other integrity constraints. The not-
null constraint states that a particular column must never be null; every row must have
a valid data value for this column. The unique constraint acts a little like the primary
key constraint. It applies to a set of columns and ensures that, in every row of the
table, the combination of column values is unique. The significant difference is that
any of the columns may be null, in which case the unique constraint is ignored. Candidate
keys can be enforced by placing a unique constraint on the candidate key and
not-null constraints on each of the candidate key columns.
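The way a unique constraint treats nulls can be seen in a short sketch, again using Python and SQLite purely for illustration. The email column and the address and staff number used for the duplicate are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, email TEXT UNIQUE)")

# Two rows with a null email: the unique constraint does not apply to them
conn.execute("INSERT INTO Staff VALUES ('10780', NULL)")
conn.execute("INSERT INTO Staff VALUES ('35054', NULL)")

# Two rows with the same non-null email: the second must be rejected
conn.execute("INSERT INTO Staff VALUES ('23257', 'f.stark@example.org')")
try:
    conn.execute("INSERT INTO Staff VALUES ('99999', 'f.stark@example.org')")
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

null_rows = conn.execute(
    "SELECT COUNT(*) FROM Staff WHERE email IS NULL"
).fetchone()[0]
```

Adding NOT NULL to the email column would close the gap, which is why a candidate key needs both kinds of constraint.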
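Figure 3.12 UML class diagram showing the associations between a table, its columns and constraints.
Table (name) is composed of Column (name, dataType, default).
Constraint (name, type) RestrictsInstanceOf Table and RestrictsValueOf Column.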
Each user must have his or her own connection to the DBMS and each connection has
an associated session. Any SQL statements sent over the user’s connection are execut-
ed within the associated session. The user can modify the properties of that session –
for instance, specifying a default database from the many managed by the DBMS. Most
DBMS provide their own way to associate a particular database with a session. For ex-
ample, the MySQL DBMS has the 'USE' statement. The main function of a connec-
tion and its associated session is to support the communication and concurrency
functions of the DBMS.
SQL is a database language for the relational model, so its main data structure is the
table, made up of one or more columns. A table may also have constraints. Figure 3.12
shows the associations between a table, its columns and constraints using a simple
UML class diagram. You will notice that there is a little more notation in this diagram
than in those seen in earlier chapters. For a start, each box has two compartments. The
upper compartment tells us what the thing is and the lower compartment tells us what
information is known about that thing. So, the only information known about a Table
is its name, while, for a Column, its name, dataType and default are known. For a con-
straint, the name and type are known. The black diamond symbol on the association
line between Table and Column means ‘is composed of’, so this association is read ‘a
table is composed of columns’.
Primary key and foreign key constraints can be defined in ISO SQL. The standard
also includes the not-null constraint, which states that a particular column cannot be
null in any row of the table; the unique constraint, which is used to enforce candidate
key constraints; and the check constraint, which places a more complex restriction on
the value of a particular column. Many state constraints can be implemented using the
SQL check constraint, but transition constraints must be implemented using SQL trig-
gers (see Chapter 10) or handled in the application code. These SQL constraints are all
discussed in Chapters 5 and 6.
Every SQL constraint restricts what rows can exist in a table instance. Consider
what happens when an SQL statement modifies a table instance (either adding a new
row or modifying an existing row). Before the DBMS modifies the table instance, it
checks that the changes will not violate any constraint. If they will, the statement is
rejected, with an appropriate error message being given. Error messages will include
the name of the constraint violated, as SQL allows the DBA to give constraints
meaningful names.
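A named constraint and the rejection of a statement that violates it might look like this. The sketch uses Python and SQLite. The capacity column and the constraint name positiveCapacity are invented for the example, and the exact wording of the error message differs from one DBMS to another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Room ("
    "building TEXT NOT NULL, roomNo TEXT NOT NULL, capacity INTEGER, "
    "CONSTRAINT positiveCapacity CHECK (capacity > 0))"  # hypothetical constraint
)
conn.execute("INSERT INTO Room VALUES ('Wilson', '205', 30)")

try:
    # Violates the check constraint, so the DBMS refuses to add the row
    conn.execute("INSERT INTO Room VALUES ('Locke', '24', 0)")
    message = None
except sqlite3.IntegrityError as err:
    message = str(err)  # in SQLite this is expected to mention the constraint name

row_count = conn.execute("SELECT COUNT(*) FROM Room").fetchone()[0]
print(message, row_count)
```

Only the first row reaches the table instance. The check happens before the table instance is modified, so there is nothing to undo.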
SQL constraints only restrict an individual table’s instance. The SQL standard also
defines assertions, which restrict what data can exist in a database instance. This is a
useful feature for implementing more complex enterprise constraints that involve two
or more tables. Unfortunately, it is not supported by many commercial DBMS, so is
not covered in this book.
The ISO SQL standard distinguishes between two kinds of constraints – table con-
straints and column constraints. An SQL column constraint is defined as part of the col-
umn definition, so can only involve a single column – not-null is the obvious example.
Table constraints are defined after the column definitions and may involve more than
one column, such as a primary key constraint for a composite primary key. This distinc-
tion is purely syntactic as both table and column constraints restrict what rows can exist
in a table instance. In fact, a column constraint is just shorthand for a table constraint.
The specific syntax for defining constraints is discussed in Chapters 5 and 6.
Data is added to tables as rows. The collection of rows held in a table at a particular
time is the table instance. SQL provides facilities to manipulate the table instance, sup-
porting the data manipulation function of the DBMS. One important feature of an
SQL table instance is that, if there is no primary key constraint on the table, it can in-
clude two rows that are identical. This means that, in SQL, a table instance is a multiset
of rows. In the relational data model, tables cannot have repeated rows – they are sets.
(It is important to remember this difference when reading books on the theory of rela-
tional databases.) In most cases, a table includes a primary key constraint, so there are
no repeated rows. However, sometimes repeated rows are necessary. The module evalu-
ation questionnaire used by the Pennine University includes a free text section for
‘any other comments’. This sort of data gives useful feedback, but cannot be analysed
easily, so is kept in a separate table. The table does not have, and does not need, a pri-
mary key. An instance of the table is shown in Figure 3.13. Notice that comments in
the first and fourth rows are identical. Those in the second and fifth have the same
meaning, but are not identical.
ModuleEvaluation
moduleCode:char  studentComment:varchar
17335            Too little lab time
17335            Great tutor
17335            Well-structured assessment, but need more feedback
17335            Too little lab time
17335            Brilliant tutor
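The multiset behaviour of this table instance can be checked with a short sketch in Python and SQLite. The column types are simplified to TEXT for the example. The table has no primary key, so the repeated row is accepted and only disappears when duplicates are removed explicitly with SELECT DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key, so identical rows are allowed
conn.execute("CREATE TABLE ModuleEvaluation (moduleCode TEXT, studentComment TEXT)")

comments = [
    "Too little lab time",
    "Great tutor",
    "Well-structured assessment, but need more feedback",
    "Too little lab time",
    "Brilliant tutor",
]
conn.executemany(
    "INSERT INTO ModuleEvaluation VALUES ('17335', ?)",
    [(comment,) for comment in comments],
)

total_rows = conn.execute("SELECT COUNT(*) FROM ModuleEvaluation").fetchone()[0]
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT moduleCode, studentComment FROM ModuleEvaluation)"
).fetchone()[0]
```

Here total_rows counts the multiset (five rows), while distinct_rows counts the corresponding set (four rows). ‘Great tutor’ and ‘Brilliant tutor’ remain separate rows because they are not identical.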
Figure 3.14 UML class diagram of the different sorts of SQL table: BaseTable and DerivedTable are specializations of Table.
There are two different sorts of table in SQL – base tables and derived tables. Both
have further specializations, as shown in Figure 3.14. This is another UML class dia-
gram and introduces another important piece of notation. It is the small triangle,
which indicates a generalization relationship. Often, things have general and more
specialized forms – cars can be hatchbacks, saloons and estates; people can be male or
female; animals can be reptiles, mammals, fish, birds and insects, for example. In such
cases, a particular instance of the more specialized form is also an instance of the gen-
eral form. So, a Skoda Fabia is a hatchback, but is also a car; Angela Merkel is a woman,
but is also a person; my pet cat Suki is a mammal, but is also an animal. Similarly, an
instance of a View is also a DerivedTable, which is also a Table.
Base tables have their table instance stored on disk. A persistent base table also has
its table definition stored in the data dictionary. Its table instance is shared by all data-
base users, so changes by one user are seen by other users. A temporary base table may
or may not have its table definition stored in the data dictionary. However, each ses-
sion (each user) gets its own instance of the table. Data is copied into the temporary
base table instance, then used exclusively by one session. Temporary base tables are
often used to assemble data for reports or export.
The other sort of table is the derived table. The data in a derived table is only ever
stored in memory; it is never written to permanent storage in the database. This means
that a derived table is not part of the database instance, even though its data is derived
from the database instance. When a database query is executed in SQL, the result is
always a derived table. A view is a special sort of derived table, the definition of which
is stored in the data dictionary. The view instance is derived from one or more base
tables and assembled on demand. A view’s instance is shared, so may change if another
user changes the data in the underlying base tables. SQL’s facilities to manipulate table
and view structures support the data definition function of the DBMS.
The ANSI/SPARC database system architecture is based on three different views of the
database. The data storage structures of the internal view are discussed in Chapter 10.
The external and logical views are concerned with organizing the data and both are
based on the underlying data model. In a relational DBMS that uses SQL, base tables
are the organization structures of the logical view. The organization structures of the
external views can be both base tables and derived tables, in particular relational views
(the two uses of the word ‘view’ can be confusing, so, in this section, ‘relational view’
and ‘external view’ are used). The idea of an external view in the ANSI/SPARC database
system architecture is that it delivers an image of the database instance tailored to
meet the needs of a particular group of end users. The obvious mechanism is the
derived table, both as a database query and a relational view. A DBMS responds to a
database query by creating an unnamed derived table and passing this derived table to
the database client. So, a query is clearly a tailored view of the database instance. How-
ever, each time a database query is run, the DBMS must interpret the instructions it
contains before carrying them out. A relational view is defined by a database query,
but the definition is named and stored in the data dictionary. This storing process
allows the DBMS to interpret the query once, then carry out its instructions many dif-
ferent times, improving performance.
For the end user, a relational view can be used much like a base table. A user can
query the database via a relational view. The RDBMS assembles the relational view
from the underlying base tables and then answers the query based on the data in the
relational view. Thus, the user is actually seeing the base tables through the relational
view rather than looking at them directly. To the end user, the relational view looks
just like a base table. Relational views are often used to present data in different for-
mats for different users or hide some data from certain users. They can also be used to
present a simplified version of the base tables when the logical view of the database is
particularly complex.
A familiar example of the view mechanism is the Microsoft Access select query. An
Access select query is defined using the relational data language Query By Example
(QBE) rather than SQL. The QBE definition of the view (that is, the MS Access select
query) named StaffSupport is shown in Figure 3.15. The StaffSupport view
combines data from the two base tables Staff and SupportSession. Instances
of the view and the base tables it is derived from are shown in Figure 3.16. The
StaffSupport view includes some, but not all, of the columns from each base table.
The data in the fullName column of the StaffSupport view is derived by
concatenating the data in the fName and lName columns in the Staff table. Note
that if the data in the underlying base tables changes, then so will the data in the view.
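In SQL, a view along the lines of StaffSupport might be defined as in the following sketch, which uses Python and SQLite. Only the fName, lName and fullName columns come from the description above. The staffNo and dayOfWeek columns and the sample rows are assumptions, as the full table schemas are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, lName TEXT)")
conn.execute("CREATE TABLE SupportSession (staffNo TEXT, dayOfWeek TEXT)")
conn.execute("INSERT INTO Staff VALUES ('35054', 'Selma', 'Hutchins')")
conn.execute("INSERT INTO SupportSession VALUES ('35054', 'Monday')")

# The view stores only this definition, never any data of its own
conn.execute(
    "CREATE VIEW StaffSupport AS "
    "SELECT fName || ' ' || lName AS fullName, dayOfWeek "
    "FROM Staff JOIN SupportSession USING (staffNo)"
)

before = conn.execute("SELECT fullName, dayOfWeek FROM StaffSupport").fetchall()

# A change to a base table shows up the next time the view is queried
conn.execute("INSERT INTO SupportSession VALUES ('35054', 'Thursday')")
after = conn.execute(
    "SELECT fullName, dayOfWeek FROM StaffSupport ORDER BY dayOfWeek"
).fetchall()
```

The view is queried exactly as if it were a base table, and its instance is assembled from Staff and SupportSession each time.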
Clearly, the relational view is one way to implement the external views of the
ANSI/SPARC database system architecture and, hence, achieve logical data indepen-
dence. An external view will consist of one or more relational views. Typically, users
will also have direct access to some of the persistent tables. While this violates logical
data independence, it is necessary because of the problem of modifying the database
instance through a relational view. Consider what would happen if Selma Hutchins
got married and changed her surname to ‘Hutchins Davis’. If the users could only see
the StaffSupport relational view, they might try to change the fullName from
‘Selma Hutchins’ to ‘Selma Hutchins Davis’. The RDBMS would then have a problem.
Should the value of the lName column be ‘Davis’ or ‘Hutchins Davis’? Clearly, this
modification cannot be done through the StaffSupport relational view.
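The sketch below attempts this change in Python and SQLite, using a cut-down, hypothetical version of the view with only the staffNo and fullName columns. One caveat matters here: SQLite rejects every data modification made through a view, whatever its definition, so the example shows the rejection itself and not the finer rules about which views can be updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, lName TEXT)")
conn.execute("INSERT INTO Staff VALUES ('35054', 'Selma', 'Hutchins')")
conn.execute(
    "CREATE VIEW StaffSupport AS "
    "SELECT staffNo, fName || ' ' || lName AS fullName FROM Staff"
)

try:
    # fullName is computed, so the DBMS cannot tell how to split the new value
    conn.execute(
        "UPDATE StaffSupport SET fullName = 'Selma Hutchins Davis' "
        "WHERE staffNo = '35054'"
    )
    outcome = "updated"
except sqlite3.OperationalError:
    outcome = "rejected"

# The change has to be made on the base table, where lName is unambiguous
conn.execute("UPDATE Staff SET lName = 'Hutchins Davis' WHERE staffNo = '35054'")
full_name = conn.execute("SELECT fullName FROM StaffSupport").fetchone()[0]
```

Going to the base table resolves the ambiguity because the user states directly which column receives which value.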
Under certain circumstances, it is possible to modify the underlying base table in-
stance through a relational view, either by adding a new row or modifying an existing
one. Such a relational view is called an updateable view. The rules for updateable views
are quite complex. In fact, the ISO SQL:2003 specification distinguishes at least four
different kinds of updateable view. As a general rule, a relational view is updateable
provided the following three conditions hold:
1 each column of the underlying base table appears at most once in the select list of
the database query defining the relational view
2 each row in the relational view instance can be traced back to a unique row in the
underlying base table – the underlying row
3 in each row of the relational view instance, the value of each column is taken di-
rectly from a single column in the underlying row, with no operators or functions
used.
One problem that all computer systems must address is the shortage of meaningful
names. The Staff Directory and Web Timetable databases at the Pennine University
both include a table named ‘Staff’, but these tables have different structures and hold
different data. Unless the DBMS can distinguish between two different tables with the
same name, creating databases will involve a constant search for synonyms – in this
case, calling one table ‘Staff’ and the other ‘Employee’. However, SQL provides a way
to organize tables that allows one DBMS to manage multiple databases without there
being conflicts over names. Each database corresponds to an SQL schema. Within a
particular SQL schema, each table must have a unique name, but tables in different
SQL schemas can have the same name.
An SQL catalog is a container for SQL schemas. The SQL standard does not define
a statement to create an SQL catalog. Each DBMS implements the catalog concept
in its own way. Inside every SQL catalog is a system-defined SQL schema called
INFORMATION_SCHEMA. This holds the meta-data for all other SQL schemas in the
SQL catalog. In effect, the INFORMATION_SCHEMA forms the DBMS data dictionary.
Figure 3.16 The StaffSupport view compared with the underlying base tables.
The DBMS must ensure that users can only see the meta-data in
INFORMATION_SCHEMA for the SQL schema (databases) they use. Figure 3.17 shows
the associations between SQL catalogs, SQL schemas and tables. Just as a table is composed
of columns, an SQL schema is composed of tables and an SQL catalog is
composed of SQL schemas. However, a table must have at least one column, whereas an
SQL schema can be empty.
All these database objects will have a name. An SQL identifier is a valid name for
a database object. It must begin with a letter but can include letters, digits and the
underscore character, ‘_’. SQL regular identifiers are not case-sensitive, so
SupportSession, supportsession and SUPPORTSESSION are all the same
to SQL, which can make names difficult to read. To avoid this, some developers use the
underscore character to separate the words, like this: support_session.
The ISO SQL standard also provides the delimited identifier to allow for case-sensitive
identifiers. To define a delimited identifier, simply enclose it in double quotes. Thus, the
identifier "SupportSession" is not the same identifier as "SUPPORTSESSION".
Some database applications may need to use delimited identifiers, but this book does
not. All SQL identifiers in this book are regular identifiers, written in mixed case, to
improve legibility.
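The difference between regular and delimited identifiers can be summarized in a few lines of Python. This is a simplified sketch, not a full implementation of the ISO SQL rules: it accepts only unaccented Latin letters, and the function name canonical is invented for the example. Folding regular identifiers to upper case is one way to make their case insignificant.

```python
import re

# A regular identifier: a letter, then any mix of letters, digits and underscores
REGULAR_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")


def canonical(identifier):
    """Return the form of an identifier used when comparing names."""
    if len(identifier) >= 2 and identifier[0] == '"' and identifier[-1] == '"':
        return identifier[1:-1]  # delimited: case is significant, keep as written
    if not REGULAR_IDENTIFIER.match(identifier):
        raise ValueError(f"not a regular identifier: {identifier!r}")
    return identifier.upper()  # regular: case is not significant


print(canonical("SupportSession") == canonical("supportsession"))
print(canonical('"SupportSession"') == canonical('"SUPPORTSESSION"'))
```

The first comparison should print True and the second False, matching the behaviour described above.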
One last point is that SQL supports Unicode character sets, so identifiers (and data)
can be written using any Unicode character set. For example, a database designed for
use in China will have identifiers and data written using Chinese ideographs. These in-
ternationalization aspects of SQL are beyond the scope of an introductory textbook like
this one, but one little quirk arising from character set support needs to be highlighted.
When an SQL statement is written in a text editor, the text editor saves the statement
using its default character set. Under Windows, this is usually the ISO-8859-1 character
set (also called Latin-1). When a DBMS reads a file, it assumes that the file is written in
its default character set and, for a DBMS used as part of a web database application, this
may well be UTF-8. Although these two character sets are the same for the most common
characters, so the SQL keywords use the same character codes, special characters,
such as ‘™’, have different character codes. Thus, the DBMS may not automatically recognize
the ‘™’ character when written as part of a string literal. Those DBMS that support
multiple character sets can translate literals written using one character set – say,
ISO-8859-1 – into their default character set, but need to be told to do so. The DBMS
manual provides advice on how to request such character set translations.
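The mismatch is easy to reproduce outside a DBMS. The following Python sketch encodes the same text in both character sets. It uses the pound sign ‘£’ as its special character, chosen for the example because it is certain to be part of ISO-8859-1.

```python
keyword = "SELECT"
literal = "£5"

# SQL keywords use only common characters, so both encodings agree on them
same_for_keywords = keyword.encode("iso-8859-1") == keyword.encode("utf-8")

latin1_bytes = literal.encode("iso-8859-1")  # as saved by the text editor
utf8_bytes = literal.encode("utf-8")         # as expected by the DBMS

# Reading the editor's bytes as if they were UTF-8 fails
try:
    latin1_bytes.decode("utf-8")
    misread = False
except UnicodeDecodeError:
    misread = True

# An explicit translation from one character set to the other succeeds
translated = latin1_bytes.decode("iso-8859-1").encode("utf-8")
```

The translation step at the end is the programmatic equivalent of telling the DBMS which character set the file was written in.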
The relational data model is capable of representing the information required for any
application by organizing data into tables. Tables have a fixed structure – every row in
the table instance has the same columns. This can seem a little restrictive, especially
when there is lots of missing data. Using nulls to represent missing data works, but
why not simply omit the column for that row?
Semistructured data models take this alternative approach. The organization structures
have a basic shape, but there is the flexibility to omit parts of the structure in particular in-
stances. This seems a cleaner solution to the problem of missing data than the relational
data model, especially as nulls lead to a rather complicated three-valued logic.
Semistructured data models also deal quite naturally with repetition. If a member of
staff has two phone numbers, for example, then both can be included within the orga-
nization structure for that member of staff. This is rather like allowing one or two
phoneNo columns on some rows of the Staff table and none on others.
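As an illustration of omission and repetition, the sketch below holds three members of staff in a small XML document and reads it with Python's standard ElementTree module. The names and phone numbers are borrowed from the staff document used later in this chapter. The variable names are chosen for the example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<staff>
  <staffMember><name>Paul Smith</name></staffMember>
  <staffMember><name>Selma Hutchins</name><phone>8706</phone></staffMember>
  <staffMember><name>Freya Stark</name><phone>8660</phone><phone>8661</phone></staffMember>
</staff>"""

root = ET.fromstring(DOCUMENT)

# Each member of staff has zero, one or two phone elements; nothing is null
phones = {
    member.findtext("name"): [phone.text for phone in member.findall("phone")]
    for member in root.findall("staffMember")
}
print(phones)
```

Paul Smith simply has no phone element, where a Staff table would need a null, and Freya Stark has two, where a Staff table would need a second column or a second table.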
This flexibility in the actual shape of the organization structures means that semi-
structured data models can easily represent quite complex data. Not being able to do
this was a particular weakness of early implementations of the relational data model,
which were limited to columns with an atomic data type and tables. Contemporary re-
lational DBMS do implement a more sophisticated version of the relational data model
that is able to represent complex data quite easily, using non-scalar data types, but,
even so, semistructured data models seem more flexible than the relational data
model. This has led some to suggest that they provide a superior solution to the
problem of database organization.
Another argument sometimes put against the relational data model is that its data
language, SQL, is not really a programming language. Many application developers
found SQL too different from the procedural programming languages they were famil-
iar with. They wanted to open the files and manipulate the data directly, not rely on a
DBMS to do this for them. This objection has also lost some of its force as the ISO
SQL:2003 standard defines a complete programming language, not just a data
language. Furthermore, the experience of web application developers with early, file-
based web applications has served to remind everyone of the significant advantages
offered by the database approach (see Chapter 1).
Perhaps the best argument in favour of an alternative to the relational data model is
the fact that, while it is a good fit for business data, the data for some specialist appli-
cations do not fit naturally into tables. Chemical formulae, musical scores and engi-
neering data are all areas where the relational data model is an awkward fit for the
information that must be held on the database. These niche applications often have
specialist database software using alternative data models to meet their particular
needs. However, for a long time, the relational data model remained the main ap-
proach to organizing data in databases. The release in 1998 of the XML specification
led to renewed interest in the possibility of a general-purpose semistructured data
model that could replace the relational data model. Before discussing the XML semi-
structured data model, it is worth reviewing its origins.
XML was not designed as a data model. As mentioned in Chapter 1, it is a language
for defining document models, in the form of mark-up languages, either from scratch
specified in the appropriate document model – that is, it is a valid instance of the ap-
propriate XML application. A validating XML parser needs a description of document
structures allowed by a particular document model, which is provided by a schema
document (just as a table schema describes the structure of a table in the relational
data model).
An XML document written using a particular document model is called an instance
document for that document model, and is said to be schema-valid – that is, meets all
the structural requirements set out in the document model’s schema. There will be
many different instance documents that conform to a given schema (just as there are
many instances of a table, all with the same general structure but holding different
data).
The original XML specification did not include a formal data model, though, within
a year, the Document Object Model (DOM) had been extended to cover XML docu-
ments. DOM is an object-orientated data model originally developed to allow
ECMAScript (browser-side scripting languages such as JavaScript and VBScript) pro-
grams to refer to the different parts of HTML documents. DOM is focused on manipu-
lating physical documents, not organizing data, so wasn’t considered a suitable basis
for a proper XML data model.
As new XML technologies were developed, it became clear that different groups,
even within the W3C, were interpreting the XML specification in subtly different
ways. This could have made different XML applications incompatible. To remedy this,
XML acquired a formal data model in 2001 with the adoption of the XML Information
Set (W3C, 2004b) as a W3C recommendation. This provided a formal description of
XML, ensuring a consistent interpretation across different XML initiatives. However,
its support for data types was poor and it still dealt with the physical organization of
XML documents. Thus, when the W3C began work on developing XML database tech-
nologies, another XML data model was developed. This data model, known as the
XPath 2.0 and XQuery 1.0 Data Model (XDM for short), is the one described here as it
is probably the most appropriate one for XML database applications. The description
focuses on those aspects of XML that can be regarded as relevant to the logical view of
the ANSI/SPARC architecture. In particular, the following important simplifications
have been made.
● An XML document can be split into two or more physical files. This is not some-
thing a data model needs to cover as the data model is concerned with the logical
structure of an XML document.
● XML documents can include processing instructions and comments. Again, these
are physical issues and are largely excluded from the discussion.
● The W3C’s XML namespaces recommendation allows one XML document to use
structures defined in two or more different document models without there being
any name conflicts. For example, a web page could include SVG and HTML content.
To simplify the discussion, it is assumed that each XML document has a single doc-
ument model and so uses a single namespace. (Chapter 12 discusses namespaces).
Figure 3.18 Part of the XML tree for an XML document. The attributes and children of the element nodes labelled 11 and 17 are not shown.
Tree nodes include 1:Document and 2:Element (node-name="staff").
appear as the result of queries on an XML document. For example, the query
‘List all the staffMember elements’ on the XML document in Figure 3.18 produces
an XML document fragment, the document node of which has three element
nodes as its children (the nodes labelled 3, 11 and 17). Contrast this with the relational
data model, where the result of a query is always a complete table, not a fragment of
a table.
Each element node has a node-name, usually called the element name. The element
name indicates what kind of element the node represents. It is an XML name, so
cannot contain spaces, nor can it begin with a colon (‘:’), a hyphen (‘-’) or a digit. Names
in XML serve the same purpose as SQL identifiers. Node 2 is an element node with the
node-name staff, indicated by the name-value pair in the lower compartment.
Hence, this node is a staff element. Other element nodes have different node
names. The node name needn’t be unique as there may be many elements of the same
kind. Nodes 3, 11 and 17 all have the same element name, staffMember, so are
staffMember elements.
Element nodes also have a type-name. The type-name identifies the data type of the
element’s content. This may be a complex type, consisting of a sequence of element
and text nodes, each with its own data type. (A sequence is simply an ordered set.)
It is important to distinguish between the element’s node-name, which indicates
what kind of element this is, and the element’s type-name, which indicates what sort
of data the element contains. Two different kinds of element can have the same data
type. For example the staffNo and name elements both contain character string
data. They are different kinds of element, but the same data type.
An XML name is the simplest kind of identifier in XML. There are quite a few others,
mostly connected with the XML namespaces recommendation. These are discussed
in Chapter 12.
An element node may have other element nodes and text nodes among its children.
As mentioned above, when XML documents are used to hold data, element nodes usually have
either element nodes or text nodes as children, not both. In Figure 3.18, node 3, a
staffMember element, has three element nodes as children – nodes 5, 7 and 9. Each
of these has a text node as its only child.
Rather annoyingly, an element’s attributes are not children of the element node.
Even so, the element node is considered to be the parent node for its attributes. Node 4
in Figure 3.18 is an attribute node and its parent is node 3, though node 4 is not among
the children of node 3. Attributes have a node-name and a value and that value has a
data type. Node 4 in Figure 3.18 has a node-name of lastUpdate and a value of
"2006-01-12", which has a data type of xs:date (one of the data types defined
in the W3C’s XML schema recommendation). An attribute node cannot have any chil-
dren of its own.
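Most XML programming interfaces reflect this separation between attributes and children. The sketch below uses Python's standard ElementTree module, which is not an implementation of the XDM but keeps attributes apart from child elements in much the same way. The variable names are chosen for the example.

```python
import xml.etree.ElementTree as ET

member = ET.fromstring(
    '<staffMember lastUpdate="2006-01-12">'
    "<staffNo>10780</staffNo><name>Paul Smith</name>"
    "</staffMember>"
)

# Iterating over an element visits its child elements only
child_names = [child.tag for child in member]

# Attributes are reached separately, as name-value pairs
attributes = dict(member.attrib)

print(child_names, attributes)
```

The lastUpdate attribute belongs to the staffMember element, yet it never appears when the element's children are listed.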
Strictly speaking, every attribute has two values – a typed value and a string value.
The typed value is the actual data value and the string value is a literal representing
this data value. The physical XML document can only contain Unicode characters, so
it is always the string value that is stored in the physical XML document. It is the re-
sponsibility of an XML parser to convert the literal to the appropriate value before ma-
nipulating it. This contrasts with a relational database, where the data value can be
stored using the representation the DBMS thinks is most efficient. The DBMS then
converts the data value to a suitable literal before showing it to the end user.
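The distinction between the string value and the typed value can be sketched in Python. Here the conversion is done by hand with the standard datetime module, standing in for the work a schema-aware XML parser would do. The comparison date is arbitrary.

```python
from datetime import date
import xml.etree.ElementTree as ET

member = ET.fromstring('<staffMember lastUpdate="2006-01-12"/>')

# The string value: the characters stored in the physical XML document
string_value = member.get("lastUpdate")

# The typed value: an actual date, obtained by converting the literal
typed_value = date.fromisoformat(string_value)

# Only the typed value supports date arithmetic
days_later = (date(2006, 1, 13) - typed_value).days
```

The string value can only be compared character by character, whereas the typed value behaves as a date.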
Text nodes represent the text content of elements. In a well-formed XML document,
the parent of a text node is always an element node, but in an XML document frag-
ment its parent may be the document node. Where an element has two text nodes,
they must have an element node (or a comment or processing instruction node) bet-
ween them, but such mixed content is rare in XML databases. The fact that element
nodes can appear ‘between’ text nodes indicates an important feature of the XDM. Un-
like the relational data model, where the order of columns is not significant, the order
of nodes is. In fact, the numbers of the nodes in Figure 3.18 indicate the document
order for this XML document. Again, there is a nasty quirk with regard to attribute
nodes. The attribute nodes come after their parent element node and before this ele-
ment node’s children, but there is no particular order for the set of attribute nodes.
Different XML processors may place the attribute nodes of a given element node in
different orders.
Document order only really makes sense when considering a physical XML docu-
ment, but, before discussing that, there is a little more tree terminology to cover. This
terminology is used to describe how to navigate the tree to retrieve data from the
document, so is important for the data language. So far, the terms root, child and
parent have been explained. The siblings of a node are the other children of its par-
ent – that is, all the nodes at the same level in the tree. In Figure 3.18, node 11 has
two siblings – node 3 and node 17. The descendants of a node are all its children,
their children and so on. The descendants of node 3 are nodes 5, 6, 7, 8, 9 and 10.
Node 4 is not a descendant of node 3 because attribute nodes are not children of their
parent element. The ancestors of a node are its parent, its parent’s parent and so on.
The ancestors of node 4 are nodes 3, 2 and 1 (note that an element is an ancestor of
its attributes). A leaf node is a node that has no children. Attribute and text nodes are
always leaf nodes.
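These definitions can be checked mechanically. The sketch below is an illustration only. It records, in a Python dictionary, the parent of each node mentioned in the text (nodes 1 to 11 and 17); the parent links for nodes 7 to 10 are inferred from the staff directory document rather than stated above, and the function names are invented for the example. Attribute nodes are listed separately because they have a parent but are not its children.

```python
# Parent of each node mentioned in the text. Node 1, the document node,
# has no parent. Links for nodes 7 to 10 are an inference, not a quotation.
PARENT = {2: 1, 3: 2, 4: 3, 5: 3, 6: 5, 7: 3, 8: 7, 9: 3, 10: 9, 11: 2, 17: 2}
ATTRIBUTES = {4}  # attribute nodes have a parent but are not its children


def children(node):
    """Child nodes in document order, leaving out attribute nodes."""
    return sorted(n for n, p in PARENT.items() if p == node and n not in ATTRIBUTES)


def siblings(node):
    """The other children of this node's parent."""
    return [n for n in children(PARENT[node]) if n != node]


def descendants(node):
    """Children, their children and so on, in document order."""
    found = []
    for child in children(node):
        found.append(child)
        found.extend(descendants(child))
    return found


def ancestors(node):
    """Parent, parent's parent and so on, up to the document node."""
    found = []
    while node in PARENT:
        node = PARENT[node]
        found.append(node)
    return found
```

With this data, siblings(11) gives nodes 3 and 17, descendants(3) gives nodes 5 to 10 and ancestors(4) gives nodes 3, 2 and 1, matching the examples in the text.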
Consider the physical XML document shown in Figure 3.19. This is a textual repre-
sentation of the same XML document that is represented by the tree diagram shown in
Figure 3.18 (so, Figures 3.18 and 3.19 are two different representations of the same
XML document). However, in Figure 3.18, there wasn’t room to show all the nodes, so
the parts of the document tree below nodes 11 and 17 were omitted (that is why nodes
12 to 16 are missing from that diagram).
Figure 3.19
<?xml version="1.0" encoding="utf-8"?>
<staff>
  <staffMember lastUpdate="2006-01-12">
    <staffNo>10780</staffNo>
    <name>Paul Smith</name>
    <email>[email protected]</email>
  </staffMember>
  <staffMember lastUpdate="2005-03-07">
    <staffNo>35054</staffNo>
    <name>Selma Hutchins</name>
    <phone>8706</phone>
    <email>[email protected]</email>
  </staffMember>
  <staffMember lastUpdate="2005-03-07">
    <staffNo>23257</staffNo>
    <name>Freya Stark</name>
    <phone>8660</phone>
    <phone>8661</phone>
    <email>[email protected]</email>
  </staffMember>
</staff>
Each physical XML document is a file containing Unicode characters and nothing else.
These characters represent two different sorts of text:
● mark-up is the term used to describe the structure and meta-data of the document,
such as elements and attributes, and appears between angle brackets
● character data is the actual content of the document – text nodes.
It is common practice to indent elements, as shown in Figure 3.19, though not strictly
necessary. In fact, the entire XML document could be written on a single line, with no
line breaks or indenting. However, as XML documents are supposed to be readable,
it makes sense to use white space, such as line breaks and spaces, to make it easier
for us to make out the document structure. White space is not significant and an XML
parser will ‘normalize’ it in text nodes. Every sequence of white space characters
(spaces, tabs, carriage returns and linefeeds) is replaced by a single space character
and any leading or trailing spaces are trimmed.
The first line of the physical XML document is the XML declaration. As it appears be-
tween angle brackets, it must be mark-up. An XML declaration is not strictly necessary
and not actually part of the logical structure of the XML document. However, it is
worth including in all XML documents as, without it, it is not clear whether the docu-
ment is intended to be XML, SGML or some proprietary mark-up language that hap-
pens to look a bit like XML. In a database, where the end user never looks at the actual
data files, such a statement is unnecessary as the DBMS knows the data model used for
its data and that is all that matters. However, XML documents are intended to be read
by humans, so a clear statement that the file is an XML document is necessary. When
included, the XML declaration must be the very first thing in the XML document.
Another good reason for including the XML declaration is that XML parsers use the
first five characters in the file to try to identify the Unicode character set being used.
This information is actually included in the encoding attribute. In Figure 3.19 the en-
coding attribute tells the XML parser that the file was written using the utf-8 character
set. However, the encoding attribute’s name and value are themselves written using
utf-8 characters, so the XML parser must know that the document is written in utf-8
before it can read the encoding attribute and find out that the document is written
using utf-8. It’s a little like including the statement ‘This document is written in
English’ at the start of a document and expecting non-English speakers to understand
what this means.
The second line in Figure 3.19 is also mark-up. It is the start tag for the staff ele-
ment. In a physical XML document, every element is indicated by start and end tags.
The start tag begins with a left angle bracket followed immediately (no spaces allowed)
by the element name and ends with the right angle bracket (spaces are allowed
between the element name and the right angle bracket).
The end tag also begins with a left angle bracket, but it is followed immediately by
a forward slash and the element name (again, no spaces) and ends with the right angle
bracket – so '</staff>' in the last line shown in Figure 3.19 is the end tag for the
staff element.
A document may have several instances of the same element. In Figure 3.19, there
are three instances of the staffMember element. Everything between the start and
end tags is the element content.
Because elements can contain other elements, they provide a natural way to repre-
sent complex data. The staff and staffMember elements only contain other ele-
ments and such elements are sometimes called complex elements. The staffNo,
name, phone and email elements only contain character data and these are called
simple elements.
In a well-formed XML document, every start tag has a matching end tag. Moreover,
if one element’s start tag appears as the content of another, then so must its end tag.
Put simply, one element can contain another, but they cannot overlap. So, for exam-
ple, this is not well-formed XML, because the staffNo and name elements overlap.
<staffMember lastUpdate="2006-01-12">
<staffNo>31210<name></staffNo>Paul Smith</name>
<email>[email protected]</email>
</staffMember>
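A parser will refuse such a document. The following sketch uses the ElementTree module from Python's standard library to show this. The helper name is_well_formed is invented for the example, and the email element is left out to keep the two test documents short.

```python
import xml.etree.ElementTree as ET

# The staffNo and name elements overlap, as in the example above.
OVERLAPPING = """<staffMember lastUpdate="2006-01-12">
  <staffNo>31210<name></staffNo>Paul Smith</name>
</staffMember>"""

# The same data with the elements properly nested.
NESTED = """<staffMember lastUpdate="2006-01-12">
  <staffNo>31210</staffNo><name>Paul Smith</name>
</staffMember>"""


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Here is_well_formed(OVERLAPPING) is False, because the parser meets the staffNo end tag while the name element is still open, whereas is_well_formed(NESTED) is True.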
An element’s attributes are listed as name-value pairs inside its start tag (the attribute’s
data type is not included in the XML instance documents). The attribute name comes
first, followed by an equals sign and the attribute value. If there are two or more attrib-
utes, then they are separated by spaces. In a well-formed XML document, attribute val-
ues will be character data enclosed in quotation marks. Also, the attribute name must
be unique within the element (though different elements can have attributes with the
same name).
Attributes were originally intended to hold meta-data, though there is some debate
as to what counts as meta-data. The staffMember elements in Figure 3.19 all have a
single attribute holding the date when the element content was last changed. This is
classic meta-data as it tells us nothing about the member of staff the element refers to,
only something about the element itself.
Many word processors include a custom, or smart, quotes feature. This replaces
straight quote marks – "like these" – with fancy, curly quote marks – “like these”. The
same holds for single quotes – 'straight quotes' are turned into ‘custom quotes’. Al-
though the two sorts of quote marks are equivalent in normal use, they are different
characters. The custom quote marks are not part of the standard ASCII character set
so only straight quotes can be used to delimit attribute values. So, if you write an
XML document in a word processor program, make sure you turn off the custom
quotes feature!
Character data within an element becomes a text node in the tree representation. The
first staffNo element contains the character data '10780', which becomes the con-
tent of the text node labelled 6 in Figure 3.18. In a well-formed XML document the
characters ‘<’ and ‘&’ cannot appear in character data. Use their character entity alternatives,
‘&lt;’ and ‘&amp;’, instead.
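In practice the substitution is usually left to a library. As a minimal sketch, the escape function in Python's standard library replaces these characters with their entity references before the text is placed in a document, and the parser turns them back when the document is read. The sample text and the note element are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "Smith & Jones < 10"  # hypothetical character data

# escape() replaces & with &amp; and < with &lt; (it also replaces >).
escaped = escape(raw_text)

# The escaped text can now appear safely as element content, and the
# parser restores the original characters in the text it hands back.
element = ET.fromstring(f"<note>{escaped}</note>")
```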
Returning to the subject of the document order, this is most easily understood by
describing how to take a textual representation of an XML document and turn it into
a tree representation. This involves reading through the document, top to bottom, left
to right, and creating the appropriate nodes, just as an XML parser does. During this
process each node is given a number and these numbers are the document ordering.
Here is a general description of how to turn a textual representation of a well-formed
XML document into an equivalent tree representation. As usual, namespaces, com-
ments and processing instructions are omitted, to simplify the discussion.
1. Create the document node and make it the current node.
2. Ignore the XML declaration, if there is one.
3. Read through the rest of the document and:
(a) on encountering the start tag of an element, create an element node with the
current node as its parent, then make this new element node the current node
(b) on encountering a name-value pair inside a start tag, create an attribute node
with the current node as its parent (the parent remains the current node)
(c) on encountering a block of text, create a text node with the current node as its
parent (the parent remains the current node)
(d) on encountering the end tag of an element, make this element’s parent the
current node.
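The process can be sketched in Python using the expat parser from the standard library, which reports start tags, blocks of text and end tags in the order it meets them. This is an illustration under stated assumptions rather than a description of any real XML processor: the function name number_nodes and the tuple layout are invented for the example, the sample is a cut-down version of the staff directory, and white space used only for layout is skipped.

```python
import xml.parsers.expat

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<staff>
  <staffMember lastUpdate="2006-01-12">
    <staffNo>10780</staffNo>
    <name>Paul Smith</name>
  </staffMember>
</staff>"""


def number_nodes(xml_text):
    """Return (number, kind, label, parent number) tuples in document order."""
    nodes = [(1, "document", "", None)]  # the document node is created first
    current = [1]  # stack of open nodes; current[-1] is the current node

    def add(kind, label):
        number = len(nodes) + 1
        nodes.append((number, kind, label, current[-1]))
        return number

    def start_tag(name, attributes):
        current.append(add("element", name))  # (a) new element becomes current
        for attr_name, value in attributes.items():
            add("attribute", f"{attr_name}={value}")  # (b) current node unchanged

    def text(data):
        if data.strip():  # skip white space used only for layout
            add("text", data.strip())  # (c) current node unchanged

    def end_tag(name):
        current.pop()  # (d) the parent becomes the current node again

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # ask expat to join adjacent pieces of text
    parser.StartElementHandler = start_tag
    parser.CharacterDataHandler = text
    parser.EndElementHandler = end_tag
    parser.Parse(xml_text, True)  # the XML declaration is simply not recorded
    return nodes
```

Calling number_nodes(SAMPLE) numbers the staff element 2, the first staffMember element 3, its lastUpdate attribute 4, the staffNo element 5 and the text 10780 as node 6. The stack plays the part of the ‘current node’: only the document node and element nodes are ever pushed onto it.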
Following this process, the parent of a text node is the element containing the text –
the element represented by the enclosing start and end tags. An attribute node’s parent
is the element the start tag for which the attribute appears in. An element node’s par-
ent is either the document node or the element node the start and end tags of which
contain the whole of the element. So, for example, in Figure 3.18, the staff ele-
ment’s parent is the document node, and the parent of all three staffMember ele-
ments is the staff element. The document order is simply the order in which the
nodes were created. Notice that only the document node and element nodes can
become the ‘current node’. Attribute and text nodes never become the current node
because they cannot have child nodes.
The document order is called the preorder in the mathematical theory of trees.
In this version, the document order for attribute nodes is simply the order in which the
attributes appear within each element start tag. As the XML specifications do not prescribe
an order for attribute nodes, this seems a reasonable choice. A more sophisticated
approach is to order them alphabetically by their names, but this process
is harder to describe.
This process is the reason why start and end tags for two different elements cannot
overlap. If they did, it might not be possible to determine the parent for a particular
node. For example, which node does the text ‘Who is my parent?’ in the following ex-
ample belong to?
<staffMember lastUpdate="2006-01-12">
<staffNo>31210<name>Who is my parent?</staffNo>Paul
Smith</name>
<email>[email protected]</email>
</staffMember>
It could belong to the staffNo element as it lies between that element’s start and
end tags. However, it also lies between the name element’s start and end tags, so could
have this element as parent. In the textual representation of an XML document,
elements must nest properly or else it is not possible to convert it into a tree.
As an example of the process, apply it to the textual representation of the well-
formed XML document shown in Figure 3.19.
Rule 1 says to create the document node and make it the current node. As this is the
first node created, it is node 1.
Now begin reading the document. The first thing in the document is the XML
declaration, but rule 2 says to ignore this.
The next thing in the document is the start tag of a staff element. Rule 3(a) says
to create an element node with the current node, node 1, as its parent. This is the sec-
ond node created, so is node 2. Make node 2 the current node.
Continuing to read the document, the next thing encountered is the start tag for a
staffMember element. Rule 3(a) applies again, so create a new element node with
the current node, node 2, as its parent. This is the third node created, so is node 3.
Make node 3 the current node.
Reading on, there is a name-value pair – lastUpdate="2006-01-12" – defined
in the start tag of the staffMember element. Rule 3(b) says to create an attribute
node with the current node, node 3, as parent. This is the fourth node created, so it is
node 4. The current node is still node 3, though.
Next comes a start tag for a staffNo element. Apply rule 3(a) and create an ele-
ment node with the current node, node 3, as its parent. This is the fifth node created,
so is node 5. Make node 5 the current node.
The next thing encountered is some text, 10780. Rule 3(c) says to create a text
node with the current node, node 5, as its parent. This is the sixth node created, so is
node 6. Node 5 remains the current node.
Next comes an end tag for the staffNo element, so rule 3(d) applies. The current
node is node 5. The parent of node 5 is node 3, so node 3 becomes the current node.
Continue to read through the document, applying the appropriate rule to each tag,
name-value pair and block of text. The result is the tree representation, part of which,
as we have seen, is shown in Figure 3.18, where each node has a number. This number-
ing is the document order.
The document order is important because it is used to define a third way of repre-
senting an XML document – as a sequence of nodes. The sequence of nodes that ap-
pear in the tree representation of an XML document provides a complete description
of that document, and one that is easier for computers to manipulate than text. Furthermore,
a sequence of nodes can also represent a document fragment. As the result
of a query on an XML document may be a document fragment, sequences provide the
XDM and its query languages with a consistent representation of the data. Just as
everything in the relational model is a table, everything in the XDM is a sequence of
nodes.
So far, the discussion has focused on organizing data about things that the end user
is interested in. XML documents can also capture relationships between two such
things quite naturally, by representing both things as elements and including one ele-
ment in the content of the other. At the Pennine University, as we have seen, members
of staff offer support sessions. Data about support sessions can be captured using
[...]
elements, then there will be data duplication – some staff teach many modules, for
example. The same holds if the module elements are included in the staff element –
some modules are taught by many staff.
Another problem is that including one element inside another makes it hard to ex-
tract data about the included element. The query ‘Are there any support sessions of-
fered on a Friday afternoon?’ doesn’t need any information about staff, but the staff
data must be parsed to get at the support session data. To alleviate some of these prob-
lems the XDM includes a foreign key mechanism as an alternative way to represent
relationships.
It’s clear that the XML data model is much more complicated than the relational data
model. This is odd as the organization structures of document, element and attribute
should be pretty easy to assemble into an elegant, tree-structured data model. The vari-
ous XML recommendations make a poor job of this task because they must incorporate
a lot of information about the physical representation of XML documents. There was
also a desire to ensure backwards compatibility with SGML, both of which are reasonable
goals. That said, there are one or two points at which the choices made about terminolo-
gy lead to an unnecessarily complex description, such as the fact that an attribute is not
a child of its parent element. However, the underlying idea is clean, simple and elegant.
A document is structured into elements, which may contain other elements or text. Ele-
ments can also have attributes, which capture meta-data. An XML document provides a
very flexible way to hold the data and meta-data needed for any database application.
If the staff element’s jobType attribute has the value ‘Academic’, then the
staff element must contain a supportSessions element containing at
least one session element. Otherwise, the staff element should not contain
a supportSessions element.
This sort of integrity constraint is obviously very powerful and there is nothing similar
to it in relational databases.
XML also allows the content of an element to be restricted to particular strings of
characters. For example an e-mail address must be a sequence of characters with an ‘@’
symbol somewhere in the middle. Regular expressions (see, for example,
http://www.regular-expressions.info) offer one way to define such integrity constraints. SQL:2003
also supports regular expressions by means of its SIMILAR TO comparison operator.
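As a rough sketch of the idea, the informal rule for e-mail addresses can be written as a regular expression in Python. The pattern is deliberately loose and is an assumption made for illustration: it only requires some characters, a single ‘@’ and some more characters, which falls well short of the full syntax of e-mail addresses. The function name and the sample address used to try it out are likewise invented.

```python
import re

# Some characters, one '@', some more characters, with no white space
# and no second '@' anywhere. This mirrors the informal constraint in
# the text and nothing more.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+")


def looks_like_email(text):
    """Return True if the whole string matches the loose e-mail pattern."""
    return EMAIL_PATTERN.fullmatch(text) is not None
```

For example, looks_like_email("p.smith@example.org") is True, while a string with no ‘@’, or with nothing before it, is rejected.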
As an XML document can be represented as a tree, there is also scope to place con-
straints on the permitted tree structures. Such constraints can be based on the notion
of a path. A path is simply a particular route through the tree – starting at a given
node, follow the parent–child links to some other node. There is no back-tracking al-
lowed: a path always moves down the tree (assuming the root is drawn at the top of
the diagram). It is possible to define structural constraints that require a particular
path to exist or forbid a particular path or require certain structures to appear at the
end of a particular path. Finally, it should also be possible to enforce more general
business rules, possibly using combinations of the mechanisms described above.
As mentioned above, XML also includes analogues of the relational data model’s
primary and foreign key constraints. These allow one element to refer to another and
are an alternative to the inclusion mechanism for defining relationships between ele-
ments. The analogue of the foreign key is actually more flexible as it can include more
than one value. This allows the referencing element to refer to several different refer-
enced elements. Using this mechanism, it would be possible to model the relationship
between staff and the modules they teach.
The breadth and generality of these integrity constraints should indicate that there
is currently no single, agreed set of integrity constraints for XML documents. Work is
under way at ISO (the body that has standardized the SQL language) to develop an in-
ternational standard for Document Schema Definition Languages (DSDL) for defining
both the underlying structure and the integrity constraints. Schema languages form
part of the suite of data languages for XML and this is the topic of the next subsection.
The solution to this problem is to use a validating XML parser to check the XML in-
stance document against a description of the document model. Any incorrect XML
instance documents can be rejected, usually with error messages stating what is wrong
with the XML instance document.
This way of working is quite different from relational databases. In a relational data-
base the DBMS controls all interactions with the database instance. One SQL state-
ment instructs the DBMS to create a table according to a relational table schema and
further SQL statements instruct the DBMS to add data to this table. There is no way
that the end user can add even a single row that does not conform to the table schema.
There are DBMS that use XML as their underlying data model, and they work in the
same way, but these technologies are not yet mature and there is no single approach to
‘creating’ a document.
It is worth taking a closer look at the split between the XML schema, which de-
scribes a document model, and the XML instance document, which realizes it. Con-
sider the XML instance document in Figure 3.19. This is just one possible instance
document for the Pennine University’s Staff Directory. As new staff join and old staff
leave, the data will change, but its basic structure will not. This basic structure is easy
enough to describe in English. There will always be a single staff element, with one
or more staffMember elements. Each staffMember element will have a single attribute,
lastUpdate, and some elements in its content: exactly one each of the staffNo,
name and email elements and zero or more phone elements. The staffNo, name,
email and phone elements will all have text content.
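To see what is involved in checking this description, here is a hand-written check in Python. It is a sketch only, and is not a schema language: the function name and the wording of the messages are invented, the e-mail address in the test document is hypothetical, and the order of the elements is not checked because the description above does not mention it.

```python
import xml.etree.ElementTree as ET

GOOD = """<staff>
  <staffMember lastUpdate="2005-03-07">
    <staffNo>23257</staffNo>
    <name>Freya Stark</name>
    <phone>8660</phone>
    <phone>8661</phone>
    <email>f.stark@example.org</email>
  </staffMember>
</staff>"""


def check_staff_directory(xml_text):
    """Return a list of problems; an empty list means the structure is as described."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "staff":
        problems.append("root element must be staff")
    members = root.findall("staffMember")
    if not members:
        problems.append("staff must contain at least one staffMember")
    for position, member in enumerate(members, start=1):
        where = f"staffMember {position}"
        if set(member.attrib) != {"lastUpdate"}:
            problems.append(f"{where}: needs exactly one attribute, lastUpdate")
        for name in ("staffNo", "name", "email"):
            if len(member.findall(name)) != 1:
                problems.append(f"{where}: needs exactly one {name}")
        for child in member:
            if child.tag not in ("staffNo", "name", "email", "phone"):
                problems.append(f"{where}: unexpected element {child.tag}")
            elif len(child) or not (child.text or "").strip():
                problems.append(f"{where}: {child.tag} must hold text only")
    return problems
```

Every change to the document model means rewriting code like this, which is the burden that a schema language and a validating parser are meant to remove.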
XML needs some way in which to express such descriptions so that a computer can
understand them. In the relational data model, this role is played by a single SQL state-
ment. Unfortunately, the added complexity of XML means that a single data language
statement simply isn’t expressive enough to describe any possible XML document. In-
stead, an entire language is required. A schema language for XML is a computer lan-
guage capable of describing the document model for a class of XML instance
documents. An XML schema is written in a particular schema language for XML and
describes the document structures allowed by a particular document model.
There are three general approaches to defining an XML schema, exemplified by
three different schema languages. The W3C’s XML Schema language (W3C 2004d and
2004e) (note the capital ‘S’ of ‘XML Schema language’, to distinguish this schema lan-
guage from a particular XML schema document) is based on describing the allowed con-
tent. Each element and attribute allowed by the document model is described in
detail. For example, an English language description of the session element in
Figure 3.20 would be: ‘The session element contains: a dayOfWeek element, which
contains text; a startTime element, which contains a time value; and an endTime
element, which contains a time value’. These descriptions are written in a special
language developed to allow schema writers to describe all the usual situations –
parent–child relationships, occurrence constraints and so on. The situations covered
are those that schema writers have, over the years, recognized as being important for
XML (and SGML) documents.
The clever thing about the W3C XML Schema is that the language is itself an XML
application and the schemas written in it are XML documents. This is a classic exam-
ple of XML’s flexibility: one XML document, the schema document, describes the al-
lowed content of a second XML document, the instance document. A validating XML
parser can use both documents to check that the instance document conforms to the
description set out in the schema document. Once this is done, a second XML proces-
sor can read the XML instance document, which describes its own content, and deliver
that content in an appropriate form to the end users.
There is a further advantage of using XML to develop a schema language. A schema
language for XML is intended as a way of describing document models for XML appli-
cations. As the W3C XML Schema language is itself an XML application, it can de-
scribe its own document model. The W3C has published an XML schema, written in
the W3C XML Schema language, which describes the document model for the W3C
XML Schema language. This is a little like publishing, in English, a book on English
grammar. As long as you can read English, you can check whether or not you are writ-
ing it properly.
This is quite a complex idea and it is fair to ask why anyone would do this. The rea-
son is that someone has to write each XML schema and it is pretty likely that mistakes
will be made. Because each schema document written using the W3C XML Schema
language is itself an instance document, a validating parser can check it for errors. All
the parser needs is a schema document that describes the document model for the
W3C XML Schema language. As long as the W3C get this ‘master’ XML schema right,
everyone else who writes an XML schema can check that theirs is a valid instance of
the W3C XML Schema language. Thus, there should never be any syntax errors in
an XML schema.
Unfortunately, the W3C XML Schema language is not actually up to the job of de-
scribing itself. Another problem with it is that it is fiendishly complex. It was devel-
oped as a replacement for the Document Type Definition (DTD) schema language of
SGML, so focused closely on replicating the facilities of that language. This led to a
complex and cumbersome syntax.
The DTD schema language itself is not really up to defining the structure of XML doc-
uments designed to hold data, so it will not be covered here or in subsequent chap-
ters. From the XML perspective, it may seem a little odd to ignore the DTD schema
language as it is very good for defining the structure of a narrative document and re-
mains the most widely used schema language. However, it simply does not make
sense to use it for XML documents designed to hold data when there are schema lan-
guages such as W3C XML Schema and RELAX NG available. DTD has limited sup-
port for data types and, then, only for attribute values. Worse, it does not really
support namespaces. Both W3C XML Schema and RELAX NG have a wide range of
data types, which can be applied to both attribute values and the content of simple
elements, and they support namespaces. They also deal more easily with unordered
element content, which is more common in databases than it is in documents.
Dissatisfaction with W3C XML Schema led to alternative proposals. Some of these
crystallized into RELAX NG (OASIS, 2001), an alternative schema language with a
much simpler syntax. It also has a solid basis in mathematics (something that the
W3C XML Schema language lacks). Like W3C XML Schema, RELAX NG schema docu-
ments are XML documents. However, rather than providing descriptions of elements
and attributes, RELAX NG schemas describe patterns. Each instance document is
checked to ensure that its structure matches the patterns defined in the RELAX NG
schema.
One significant advantage of using patterns rather than descriptions is that com-
plex patterns can be built from simpler patterns, following the underlying mathemat-
ical rules for combining patterns. Although the W3C XML Schema language does
provide facilities for combining descriptions, they add further complexity to an al-
ready complex schema language.
The elegance of RELAX NG owes a lot to its grounding in a mathematical understanding
of the structure of XML documents. As with the W3C XML Schema language,
there is a RELAX NG schema describing the document model of RELAX NG.
Again, however, this XML schema cannot describe the entire language.
The third approach to creating an XML schema is to provide a set of rules that the
instance document must comply with. Schematron (ISO/IEC JTC 1, 2004) takes this
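approach. Both RELAX NG and Schematron form part of the ISO/IEC proposed international
standard for schema definition languages. Rules in Schematron consist of two
parts: a location, which identifies nodes in the instance document, and the criteria
that those nodes must meet.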
In Schematron, both parts of the rule (location and criteria) are defined using the W3C
XPath language (see below). This language is not itself an XML application, so neither
is Schematron.
Another schema language for XML that takes the rule-based approach is the Docu-
ment Structure Description language, DSD2 (Møller, 2002). This rule-based schema lan-
guage is an XML application. Moreover, it is the only schema language for XML that
can fully describe its own syntax (at least, the only one the author has encountered).
Just as there is no single schema definition language, there is no single query lan-
guage for extracting data from an XML document. The W3C XML Path language,
XPath (W3C, 2005d), provides a way to extract data from an XML document by specifying
a path through the tree structure of the document to the required data. An XPath
expression may actually match more than one path through the document. At the end
of each matching path is a node and the result of an XPath expression is this sequence
of nodes, listed in document order. As the basic data structure of the XDM is a se-
quence of nodes, this means that XPath expressions always return values that the data
model can represent. For example, an XPath expression capturing the statement
‘move to the phone element of the last staffMember element’ would result in a se-
quence of two nodes, which would be the two phone element nodes in the instance
document shown in Figure 3.19 that represent Freya Stark’s phone numbers.
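This example can be tried out in Python, with one caveat: the ElementTree module in the standard library supports only a small subset of XPath and evaluates paths relative to an element rather than from the document node. The sketch below therefore starts from the staff element, and it uses a cut-down version of the staff directory with the email elements omitted.

```python
import xml.etree.ElementTree as ET

STAFF = """<staff>
  <staffMember lastUpdate="2005-03-07">
    <staffNo>35054</staffNo>
    <name>Selma Hutchins</name>
    <phone>8706</phone>
  </staffMember>
  <staffMember lastUpdate="2005-03-07">
    <staffNo>23257</staffNo>
    <name>Freya Stark</name>
    <phone>8660</phone>
    <phone>8661</phone>
  </staffMember>
</staff>"""

root = ET.fromstring(STAFF)  # root is the staff element

# 'Move to the phone element of the last staffMember element.'
phones = root.findall("./staffMember[last()]/phone")

# The result is a sequence of element nodes, listed in document order.
phone_numbers = [phone.text for phone in phones]
```

The path matches two routes through the tree, so the result holds two phone elements, 8660 followed by 8661. Selma Hutchins’s phone element is not selected because her staffMember element is not the last one.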
Although XPath can search through an XML document and generate a sequence of
nodes, it is more limited than the SQL select statement. For example, it can only ex-
tract data from one XML document at a time. Recognizing its limitations, the W3C has
also developed XQuery (W3C, 2005e), a query language for XML documents. (At the
time of writing XQuery was a W3C candidate recommendation rather than a formal
W3C recommendation, so is subject to changes. It seems likely to become a formal
W3C recommendation by the end of 2006 (Eisenberg and Melton, 2005).) XQuery is a
functional language that builds on XPath and provides the user with the ability to ex-
tract information from one or more XML documents. It is possible that XQuery will
become an important query language for databases that use XML as their underlying
data model, but it is not yet clear to what extent it will be adopted by DBMS vendors.
A third approach to querying XML documents that has already been implemented
by relational DBMS vendors is SQL/XML. SQL/XML forms part of the ISO SQL:2003 in-
ternational standard for the SQL data language. SQL/XML provides a way to move be-
tween representations of data as relational database tables and XML documents. First,
it adds a new XML data type to the relational data model. A column with this data
type can hold all, or part, of an XML document or even a collection of XML docu-
ments. Additional operators for this data type allow queries such as ‘return all rows
where the value of a specific column is a valid XML document’. It is also possible to
embed XQuery expressions in the SQL statement, allowing parts of the XML data held
in a column to be extracted and used in the SQL query. In effect, these SQL facilities
take XML documents and present them as relational data (remember, the result of
every SQL query is a table, even when some column values are XML documents).
SQL/XML also includes functions that take relational data and turn it into XML, re-
versing the process just discussed. Using these functions, it is possible to generate all or
part of an XML document from data held in ‘ordinary’ columns.
The W3C Extensible Stylesheet Language Transformations (XSLT) (W3C, 2005i) provides
a view mechanism for XML databases. XSLT is a powerful stylesheet language that can
take an XML document written to one document model and transform it into an XML
document written to another document model. A typical use of XSLT is to take an
XML document written using a proprietary document model, such as the Staff Direc-
tory shown in Figure 3.19, and transform it into an XHTML document. This sort of
transformation allows any XML document to be delivered using a web browser.
One problem with this approach to implementing a view mechanism is that it actu-
ally creates a new, physical XML document. There are two problems with this. First, if
the original document is modified while the transformed document is being viewed,
there is no mechanism for automatically updating the transformed document. Sec-
ond, if the transformed document is saved to disk, there will be data duplication. Rela-
tional views can avoid both problems because they assume a stateful connection
between the end user and the DBMS.
All the data languages discussed focus on querying an XML document or creating
new XML documents from existing ones. At the moment, the main way to update an
XML document is to open it in a text editor and type out the changes by hand. This
weakness should be addressed in the near future, as the W3C has published a working
draft of the XQuery Update Facility (W3C, 2006b). The lack of an update function
does at least mean that using XPath and XSLT as a view mechanism for an XML data-
base is currently reasonably safe as the original document can’t easily be changed.
Other features of SQL that are missing from the equivalent XML technologies are the
ability to manage users and their privileges and the ability to handle concurrent
access. Both issues will need to be addressed by XML DBMS once the XQuery Update
Facility has been approved.
are some good reasons for doing so. The tree structure of XML documents is a natural
fit for semistructured data. There are lots of pre-existing XML technologies, originally
developed to process documents, that can be adapted to process data. There is also the
fact that XML is a web technology, making it a good choice for putting databases on
the Web. These arguments are persuasive, but not conclusive as there are some serious
problems with using XML as a database technology.
One problem is that XML does not follow the ANSI/SPARC architecture for database
systems. There is a clear separation of the description of an XML document’s content
from the description of how to present that content – mark-up describes structure,
while stylesheets deal with presentation. This corresponds to the separation between
the external and logical views in the ANSI/SPARC architecture. However, the logical
and internal views both use the same structure – the XML document. Even worse,
from the database perspective, XML documents are specifically designed to mix logical
organization structures and the physical storage structures that lie beneath the three
levels of the ANSI/SPARC architecture. XML documents must contain only Unicode
characters and may include references to physical file locations, both file system issues
rather than database issues. The decision to mix logical and physical structures makes
sense in terms of the original motivation for XML, which was to produce a format for
web pages that both humans and machines could easily use. It remains a problem,
though, for those keen to use XML as a native format for databases as it can make pro-
cessing an XML document inefficient. In particular, the fact that all the data is stored
as strings, regardless of its actual data type, makes it harder for XML processors to ma-
nipulate the data. For example, a database using the XDM as its data model will store
all data items as strings of characters. This means that the database stores literals not
values. When the end user requests data from the database, the DBMS must first con-
vert the literals to values of the correct type before it can manipulate them.
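The difference between manipulating literals and manipulating values can be seen in a short PHP sketch. It assumes a made-up document whose element names (staffDirectory, member, surname, extension) are hypothetical rather than taken from Figure 3.19, and it uses PHP's SimpleXML extension purely for illustration. Every data item read from the document arrives as a character string, so the literals must be converted before they behave like numbers.

```php
<?php
// A made-up XML document; the element names are assumptions for illustration.
$xml = <<<XML
<staffDirectory>
  <member><surname>Smith</surname><extension>5401</extension></member>
  <member><surname>Jones</surname><extension>977</extension></member>
</staffDirectory>
XML;

$document = simplexml_load_string($xml);

// Everything read from the document is a literal: a string of characters.
$literals = [];
foreach ($document->member as $member) {
    $literals[] = (string) $member->extension;
}

// Sorting the literals compares them character by character.
$asStrings = $literals;
sort($asStrings, SORT_STRING);

// Converting each literal to an integer value allows a numeric comparison.
$asValues = array_map('intval', $literals);
sort($asValues, SORT_NUMERIC);

echo 'Sorted as strings: ', implode(', ', $asStrings), "\n";
echo 'Sorted as values:  ', implode(', ', $asValues), "\n";
```

Sorted as strings, '5401' comes before '977' because the character '5' precedes '9'. Only after conversion to integer values does 977 come first, which is the extra work the text describes.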
There is also the problem of updating an XML document. In a true database, end
users never get their hands on the physical data files, so cannot break the logical struc-
ture. Allowing end users to type mark-up directly into an XML document means that
it would be prone to structural errors. Validating XML parsers can find such errors, but
why allow users to put them there in the first place?
Perhaps the biggest problem for XML is that relational database technologies are sta-
ble, efficient and widely used. Native XML DBMS are a new technology and still devel-
oping. They lack concurrency, security and optimization technologies that could
match those of the commercial RDBMS. With the development of SQL/XML, they risk
being beaten on their own territory. Now that relational databases can deal with XML
data, there may be no need for a DBMS that uses the XDM as its underlying data model.
Chapter summary
■ This chapter has explored approaches to organizing data. The chapter began by ex-
amining data items – the individual facts from which useful information is created.
It emphasized the difference between values and literals, and introduced the no-
tion of a variable as a named container for data values. The first step taken towards
organizing data items into meaningful structures was to develop a range of data
types – sets of data values that share some common characteristic. The most com-
mon scalar and non-scalar data types were introduced, along with operators for
manipulating data values of specific data types. The section on data items also in-
troduced the important notion of a Boolean expression.
■ The discussion then moved on to the relational data model. This remains the most
commonly used data model today. The section began with a careful discussion of
the single data structure, the table. The consequences of allowing a column to be
null were explored, including the horrors of three-valued logic. Nulls also cropped
up in the section on relational integrity constraints, particularly in the discussion
of referential integrity. This section concluded by looking at the SQL data language
and the view mechanism for the relational data model.
■ The final section discussed XML as an emerging database technology. The prob-
lems of separating the physical and logical features of XML documents were dis-
cussed, along with the tree-structured data model provided by the XDM. The
immaturity of many of the data-focused aspects of XML technology meant that the
discussion of integrity constraints and data languages could not be as detailed as it
was for the relational data model. The section concluded with a brief discussion of
the pros and cons of using XML documents as the native organization structures of
a database.
Further reading
The classic reference for the relational data model is Date (2004), which discusses at
some length the many advantages of the relational approach and the several weaknesses
of the data language SQL and of commercial RDBMS. Its discussion of data types extends
to complex data types not covered here – user-defined types, table types and so on.
Hoffer et al. (2004) offer a more practical approach to the subject, focusing on business
applications.
Otherwise, any database textbook aimed at university students will cover the relational
model in reasonable detail. Database books aimed at practitioners tend to take a
programmer’s view of databases and generally cover the relational model very briefly, if
at all.
Discussions of XML tend to be from the practitioners’ perspective. Probably the best
introductory book on XML and related technologies is Møller and Schwartzbach
(2006).
Harold and Means (2004) is well thought of, though it is not really an introductory
text.
Bourret (2003) provides a nice, gentle introduction to the relevance of XML for data-
base practitioners, while the first chapter of Steegmans et al. (2004) provides a fuller
account of these issues.
Review questions
3.2 What is a data type? What distinguishes user-defined data types from system-
defined data types?
3.4 Explain the following terms used in the relational data model:
(a) candidate key
(b) candidate key constraint
(c) primary key
(d) simple key
(e) composite key.
3.6 What is the purpose of nulls? In what ways does the introduction of nulls complicate
the relational data model?
3.7 Explain the difference between base tables and derived tables. How does a view dif-
fer from an ordinary derived table?
3.8 Describe the rules for SQL identifiers (that is, the names for database objects).
Explain the differences between SQL regular and delimited identifiers.
3.10 Explain the following terms, each of which applies to nodes in an XML document
(you may need to draw a suitable diagram):
(a) root
(b) child
(c) parent
(d) sibling
(e) descendant
(f) ancestor
(g) leaf.
Exercises
3.11 Explain the difference between scalar and non-scalar data types. Give an example of
each kind of data type to illustrate your explanation.
3.12 Explain the connections between business rules, integrity constraints and Boolean
expressions.
3.13 Using a suitable example, explain how a relational DBMS would enforce a foreign
key constraint under the following circumstances:
(a) adding a new row to the referencing table
(b) removing an existing row from the referenced table
(c) changing the value of a matching candidate key.
Use the terminology introduced in this chapter and include a discussion of referential
actions in your answer.
3.14 Explain the purpose of views in the relational data model. Discuss how effectively
they implement the external views of the ANSI/SPARC database systems architec-
ture.
3.15 Describe how data is organized in the XPath 2.0 and XQuery 1.0 data model (XDM).
How does it deal with missing information?
3.16 Use the textual representation of the XML document shown in Figure 3.20 to pro-
duce a tree-structured representation of this same document, so that it is in the same
style as that shown in Figure 3.18.
Investigations
3.17 Research the ways in which SQL deviates from the relational data model. Critically
evaluate both the reasons for and the effects of these deviations. (Date (2004) is a
good initial source for this.)
3.18 Investigate the different approaches an RDBMS can take to implementing relational
views. What approaches do the major commercial RDBMS take?
3.19 Use a range of sources to compare and contrast the relational data model and the
semistructured XML data models, such as the XDM. Critically evaluate the various
arguments put forward in favour of each data model. Which side of the debate do
you find most convincing? Why?
3.20 Investigate one of the many data models not discussed in this chapter, critically
evaluating its potential for use in web database applications. Possible data models for
investigation include the associative data model (see www.lazysoft.com/index.html)
and the object-orientated and object-relational data models (Date (2004) or Elmasri &
Navathe (2007) are good starting points).
Chapter objectives
➔ Discuss how the PHP application server and HTTP server work together to
generate an HTTP response message.
➔ Review the use of HTML forms to gather user input.
➔ Introduce the main features of the PHP scripting language – variables and
statements.
➔ Discuss how to handle user input – validation and data cleansing.
➔ Discuss approaches to code reuse – functions and include files.
➔ Introduce the PHP functions for communicating with the MySQL DBMS.
Chapter outline
Any web database application needs user input. The users must tell the application
what they want it to do and provide any data values the application needs to complete
this task. Searching for data in a database is by far the commonest use of web database
applications. This chapter uses a simple database search to illustrate how data passes
from the web browser to the web server and on to the DBMS. The DBMS retrieves the
requested data from the database and passes it back to the web server, which uses this
data to generate a web page that is sent to the web browser.
For simple web database applications, the application developer can assume
that the networking and operating system technologies will work efficiently. More
complex applications may need to be written to make the most of the specific
technologies that support the application – in particular the TCP/IP suite of protocols.
Chapter 2 outlined the different approaches to handling server-side dynamic con-
tent. This chapter focuses on extending the capabilities of the web server with the PHP
application server.
The PHP application server – called the Zend Engine – is an open source develop-
ment, now managed by Zend Technologies Inc. (www.zend.com). It supports the PHP
language, a server-side scripting language designed to make it easy to embed code with-
in HTML mark-up and so speed up web application development. The PHP code, em-
bedded in the HTML mark-up, can generate additional HTML elements or content for
those elements. Once the PHP application server has processed the script, the HTTP
server sends the generated web page to the web browser. One important advantage of
server-side scripting is that the end user never sees the actual code used to generate a
web page, only the generated HTML.
Each PHP script is, in essence, a set of instructions for generating an HTTP response
message. When a web browser requests a PHP script, the HTTP server and the PHP ap-
plication server work together to produce this HTTP response message. The PHP code
is interpreted and any output it generates is merged into the HTML. The resulting
HTML is sent as the entity body of the HTTP response. The application developer can
even instruct the PHP application server to generate HTTP headers, though many PHP
scripts only generate the HTML of the entity body.
The power of server-side scripting is that a single PHP script can produce many dif-
ferent versions of a web page. PHP scripts are mostly used to generate web pages that
include content drawn from a database. As the database instance changes, so does the
web page generated by the PHP script. The PHP application server can also use data
gathered from the user, typically using an HTML form, as parameters to the PHP script.
A different web page is generated in response to different user data.
The PHP application server does not actually interpret the raw PHP code written
by the developer. This code is parsed, to ensure that it is correct PHP, then the in-
structions are compiled into an intermediate format. It is this intermediate format
that is executed by the PHP application server. The compiled PHP code is faster
to execute than raw, human-readable PHP. However, the description above
captures the essence of how PHP works.
■ Section 4.1 reviews the ways in which HTML forms gather data from the user and
pass this data to the web server. The example also serves to review the basics of the
HTML mark-up language, although readers not familiar with HTML should do
some background reading before tackling this chapter. See the Further reading sec-
tion at the end of this chapter for recommended sources of information on HTML.
■ Section 4.2 introduces the PHP server-side scripting language. It explains how to
embed PHP code within an HTML document using the special PHP tag <?php ... ?>;
include comments within PHP code to explain its function; generate HTML or
content for the web page using the PHP echo statement; use PHP variables to store
and manipulate data values; access parameter values passed to the PHP script from
an HTML form using the PHP superglobal variables; and use string literals.
■ Section 4.3 looks at how to handle data values passed to the PHP script from an HTML
form. It discusses the important topics of data validation and data cleansing, clearly
distinguishing between them. This section discusses the PHP if ... else ...
statement and PHP functions and so completes the introduction to the basic PHP
syntax.
■ Section 4.4 introduces the PHP functions that allow a web developer to instruct the
PHP application server to communicate with a MySQL DBMS. It describes a six-step
process for this, introducing the PHP loop constructs while, foreach and for.
The section ends by discussing how to deal with errors that occur when communi-
cating with the MySQL DBMS.
HTML forms are the obvious way to gather data from the end user and pass it to the
web server for processing. An HTML form can have a number of controls. Figure 4.1 (a)
shows a very simple HTML form, displayed in the Firefox web browser. It has two con-
trols: a text box and a submit button. Text boxes allow the user to type character
string, date and numeric literals, so can be used to gather these types of data. In Fig-
ure 4.1 (a), the form gathers a single character string literal from the user via the text
box labelled ‘Surname’ and here the literal that has been entered is ‘Smith’. Once the
user has typed in the literal he or she clicks on the submit button (the Firefox web
browser gives submit buttons the caption ‘Submit Query’ unless another value is spec-
ified in the HTML) and the browser sends the value to the web server. Obviously, the
intention is to search the Staff Directory database for staff whose surname is the same
as the value entered by the user. Data about all matching members of staff will be sent
back to the user as a web page.
To begin with, a simple ‘stub program’ is often written to allow the HTML form to
be tested without the need to connect to the DBMS. This stub program accepts the
data from the HTML form, then simply generates an appropriate message. Figure 4.1 (b)
shows a typical stub program response. Note that this stub program is a PHP script
called ‘SurnameStub.php’ (see the URL in Figure 4.1 (b) to confirm this). All the dy-
namic web pages in this chapter are generated by such PHP scripts.
An HTML form can use either of the HTTP methods GET and POST to pass data to
the web server. The POST method sends the data as part of the entity body of the HTTP
request. The GET method sends the form data in the query component of the URL.
This query component consists of a name-value pair for each control on the HTML
form. The web browser’s address line in Figure 4.1 (b) shows part of this URL. The full
URL is:
http://www.pennine.ac.uk/StaffDirectory/SurnameStub.php?surname=Smith
The query component is the part following the ‘?’ character. Here there is a single
name-value pair, with name ‘surname’, which is the name of the text box, and value ‘Smith’.
This is one weakness of the GET method – if there are lots of controls, the query
component can be very long. For simple forms gathering data that is neither
confidential nor business critical, and so can be sent across the Internet as plain text
in the URL, use the GET method. For forms with several controls or that gather
confidential data, use the POST method.
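A minimal PHP sketch shows how the query component of this URL breaks down into name-value pairs. It relies on the built-in functions parse_url, parse_str and http_build_query. The variable names are illustrative only, and the second pair in the final example ('forename') is hypothetical, as the form in the text has a single text box.

```php
<?php
// The full URL requested when 'Smith' is submitted from the form.
$url = 'http://www.pennine.ac.uk/StaffDirectory/SurnameStub.php?surname=Smith';

// Extract the query component: the part following the '?' character.
$query = parse_url($url, PHP_URL_QUERY);

// Split the query component into its name-value pairs.
parse_str($query, $pairs);

foreach ($pairs as $name => $value) {
    echo "name: $name, value: $value\n";
}

// Going the other way: build a query component from name-value pairs.
// Each extra control on a form adds another pair, so the URL grows longer.
$longer = http_build_query(['surname' => 'Smith', 'forename' => 'Jo']);
echo $longer, "\n";
```

The last line illustrates the weakness noted above: every additional control lengthens the query component.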
Figure 4.1 Illustrating the use of HTML forms and the HTTP GET method.
(a) A simple HTML form to gather data and submit it for processing.
(b) The response from the web server when the ‘Submit Query’ button is clicked on.
The actual HTML for the surname search web page is stored in the file
‘SurnameSearch.html’, shown in Figure 4.2. The line numbers down the left-hand side
of the image are not part of the HTML, but have been included to make it easier to
discuss the content of the file.
The very first thing in the file is the document type declaration in lines 1 and 2. This
declares that the file contains HTML mark-up written to the XHTML1.1 standard,
rather than to any other version of the HTML language. The <html> start tag in line
3 is the start tag for the html element. The html element ends with the </html> end
tag in line 28.
In XHTML1.1, all tags must be written with lower-case letters, so using the tags
<HTML> ... </HTML> to delimit (that is, mark the start and end of) the html ele-
ment is not valid XHTML1.1.
Every html element contains two other elements. The head element starts in line 4
with the <head> start tag and ends in line 7 with the </head> end tag. The body
element starts in line 8 with the <body> start tag and ends in line 27 with the
</body> end tag.
These three elements – html, head and body – define the basic structure of any
HTML document.
In XHTML1.1, the head element must include a title element, as shown in line 6 of
Figure 4.2. Failing to include a title element generates a validation error. Typically, the
content of the title element is displayed in the title bar of the web browser’s window,
often followed by the name of the web browser software. This is how the Firefox web
browser behaves, as shown in Figure 4.1 (a).
The meta element is used to provide additional information about the document.
The meta element in line 5 tells the web browser that this document contains text,
structured using html, and that the character set used is utf-8. (The utf-8 character set is
a Unicode character set recommended by the W3C as the one to use for web pages.)
Usually this information is included automatically by the HTTP server as HTTP response
message headers. Where the HTTP server states a character set in those headers, web
browsers generally give it precedence over the meta element, so the meta element
should agree with the server’s settings. If in doubt, omit it.
Different languages use different sets of characters. For example, English uses an
alphabet of 26 letters, while Greek uses a different alphabet of 24 letters, α, β, γ and
so on. Some character sets (such as ASCII) cannot represent the Greek letters, while
others can (utf-8, for example). So, a page written in Greek might use a different char-
acter set from one written in English. Korpela (2001) and W3C (2005a) are good intro-
ductory tutorials on character sets. (The official website for the Unicode Consortium
is www.unicode.org)
Note that it is not enough just to state that a web document uses a particular character
set; the document must actually be stored using that character set. Many web development
tools allow the user to choose which character set to use when saving a file, but
some do not. The Unicode Consortium maintains a list of Unicode enabled products
(Unicode Consortium, 2005).
The meta element is interesting as it cannot contain any content. In older versions of
HTML, such empty elements did not have an end tag. In XHTML all elements must
have an end tag. Strictly speaking, the correct form for the meta tag in line 5 is:
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </meta>
However, requiring web authors to type end tags for empty elements goes against
human nature, so the W3C included the shortened form in the XHTML specification:
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8" />
Notice that there is a space character before the /> combination that closes this tag.
This is necessary to avoid confusing older browsers, which might not understand this
shortened form for empty elements. Other examples of empty elements, written using
the short form, include the horizontal rule element – <hr/> in line 17 – and the break
element – <br/> in line 19. The horizontal rule element instructs the browser to draw
a line across the web page, while the break element instructs it to start a new line.
The body element can be empty, though there is little point in a web page with no
content in its body.
The body of the SurnameSearch.html file includes several elements. In line 9, the
heading1 element includes the main heading for this page. Notice that in the HTML
mark-up there is a leading space character in the element content – the space before
the word ‘Search’. The web browser ignores leading, and trailing, spaces in most
element content. It also ignores multiple spaces and new lines (that is why HTML in-
cludes the break element, to force a new line in the web browser). The content of the
heading1 element is presented in a large, bold, serif font in the web browser. This de-
fault presentation is not particularly good, from a visual design perspective, so some
web developers would use the heading3 (<h3> ... </h3>) element instead, as this
usually looks better. However, ‘Search by staff surname’ is the top-level heading on this
web page, so the heading1 element correctly identifies it as such. A tool to generate a
‘table of contents’ for a web page could easily be confused if the wrong heading ele-
ment was used here. To alter the way the web browser presents the element content,
use a stylesheet. Don’t use the wrong element just because the default presentation of
that element looks better.
Line 11 is the start tag for the form element. The form element has two required
attributes and they provide additional information about the element. An element’s
attributes are listed as name-value pairs inside the element’s start tag. The method at-
tribute tells the web browser whether to send the form’s data using the HTTP GET or
POST method. The form in Figure 4.2 uses the GET method. The action attribute tells
the web browser where to send the form data, so its value should be a valid URL. In
line 11, the action attribute has the value ‘SurnameStub.php’. This is just the file name
of the PHP script that processes the form data, not a full URL. The web browser as-
sumes that the file SurnameStub.php is located in the same directory as the current
web page. As the current page has the URL:
http://www.pennine.ac.uk/StaffDirectory/SurnameSearch.html
the web browser treats the action attribute as though it had been written in full:
action="http://www.pennine.ac.uk/StaffDirectory/SurnameStub.php"
The ability of a web browser to correctly interpret a relative URL makes maintaining a
website simpler. If all the Staff Directory files are moved to a new directory on the web
server, the form element’s action attribute will still point at the right PHP script as its
value is defined relative to the location of the SurnameSearch.html file.
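The way a bare file name is interpreted relative to the current page can be sketched in a few lines of PHP. The function resolveSimpleRelative is a hypothetical helper written for this illustration, not part of PHP, and it handles only the simple case discussed here: a file name in the same directory as the current page. The directory name 'Directory2' is likewise invented to show what happens when the files are moved.

```php
<?php
// Hypothetical helper: interprets a bare file name relative to the URL of
// the page that contains the form. It does not handle '../', a leading '/'
// or a full URL, all of which a real web browser must also deal with.
function resolveSimpleRelative(string $pageUrl, string $fileName): string
{
    // Everything up to and including the last '/' is the page's directory.
    $directory = substr($pageUrl, 0, strrpos($pageUrl, '/') + 1);
    return $directory . $fileName;
}

$page   = 'http://www.pennine.ac.uk/StaffDirectory/SurnameSearch.html';
$action = resolveSimpleRelative($page, 'SurnameStub.php');
echo $action, "\n";

// If both files move to another directory (an invented name here),
// the same relative value still identifies the right script.
$movedPage = 'http://www.pennine.ac.uk/Directory2/SurnameSearch.html';
$moved     = resolveSimpleRelative($movedPage, 'SurnameStub.php');
echo $moved, "\n";
```

In both cases the value of the action attribute is unchanged. Only the location of the page that contains the form differs.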
Within the form element is a paragraph element (the start tag is in line 12 and the
end tag is in line 14). This acts as a container for the form’s controls and any text used
to label those controls. In line 12, the text ‘Surname:’ acts as a label for the text box,
indicating to the user what sort of text is being gathered. Immediately after this text
comes an input element. The input element is an empty element that must be en-
closed within a form element. Each input element includes a type attribute. In line 12,
the type attribute has the value ‘text’, indicating that the web browser should display
a text box. In line 13, the type is ‘submit’, so the web browser will display a submit
button. Notice that, in Figure 4.1 (a), both input elements appear on the same line. A
web browser always ignores line breaks. To make the submit button appear below the
text box, use a break element (<br/>).
When the user clicks on the submit button, the value of the text box is sent to the
file identified by the form’s action attribute. Submit buttons don’t need any other at-
tribute values, but text boxes must include a name. This is used to create the name-
value pair sent to the server (see the address line of the web browser in Figure 4.1 (b)).
In line 12, two optional attributes are used to control the behaviour of the text box.
The size attribute sets the width of the text box to 25 characters and the maxlength at-
tribute means that, in this case, a maximum of 20 characters can be typed into the text
box by the user. The maxlength attribute is useful for limiting data values gathered
from the user to the maximum length of the corresponding data in the database (such
as a column of a table).
When the web browser receives the SurnameSearch.html file from the Pennine University’s
web server (host name www.pennine.ac.uk), it reads through the HTML and finds
the image element in lines 23 and 24. The image file is not delivered with the HTML
file, so the web browser creates a new HTTP request to retrieve this image. Note that
this second request is sent to a different web server, the W3C’s web server with host
name www.w3.org.
XHTML 1.1 is the latest stable version of the HTML language and will form the basis
for future development (W3C, 2001). It is important to ensure that any web pages in-
tended for public viewing are written in valid HTML as invalid HTML may be dis-
played incorrectly in some web browsers. A more detailed justification for validating
HTML pages is given in W3C (2004a).
Web pages with dynamic content are harder to validate than static web pages. With
a static web page, the structure and content of the document are fixed. With a dynamic
web page, both the structure (the HTML elements) and the content may change from
one request to the next. This is because a PHP script can output a range of HTML doc-
uments, depending on the precise processing involved. To ensure that a PHP script
generates valid HTML for each valid input, it must be tested. As with any computer
program, a range of typical values for the inputs is tested as it is not possible to test
all possible inputs. For example, the HTML form in Figure 4.1 (a) accepts a string of
up to 20 characters. There are around 100 different characters (letters, numerals,
punctuation marks and so on) that the user can type, so there are over 100^20 different
inputs to the SurnameStub.php script. Validating the HTML output by SurnameStub.php
for all these possible inputs is simply not feasible. Instead, a representative sample of test cases is
used. If the HTML generated for the input value of each test case is valid, then it is rea-
sonable to assume that the PHP script will always generate valid HTML.
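The idea of a representative sample of test cases can be sketched in PHP. Several things here are assumptions made for illustration. The function renderStubPage is a hypothetical stand-in for the script under test, the choice of test cases is one possible sample, and the check performed is only that the generated mark-up is well-formed XML, which is a weaker test than full validation against the XHTML document type. The stand-in also escapes special characters with htmlspecialchars, a form of data cleansing.

```php
<?php
// Hypothetical stand-in for the script under test: builds the page for one input.
function renderStubPage(string $surname): string
{
    // Escape characters that have a special meaning in mark-up.
    $safe = htmlspecialchars($surname, ENT_QUOTES, 'UTF-8');
    return '<html xmlns="http://www.w3.org/1999/xhtml">'
         . '<head><title>Surname stub</title></head>'
         . '<body><p>You searched for: ' . $safe . '</p></body></html>';
}

// Checks well-formedness only; it does not validate against a document type.
function isWellFormed(string $markup): bool
{
    $previous = libxml_use_internal_errors(true); // collect parser errors quietly
    $document = new DOMDocument();
    $ok = $document->loadXML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// A small, representative sample rather than every possible input.
$testCases = [
    'typical'        => 'Smith',
    'empty'          => '',
    'apostrophe'     => "O'Brien",
    'mark-up'        => '<b>Smith</b>',
    'ampersand'      => 'Smith & Jones',
    'maximum length' => str_repeat('x', 20),
];

$failures = [];
foreach ($testCases as $label => $input) {
    if (!isWellFormed(renderStubPage($input))) {
        $failures[] = $label;
    }
}

echo count($failures) === 0
    ? "All sample inputs gave well-formed mark-up.\n"
    : 'Failed test cases: ' . implode(', ', $failures) . "\n";
```

The sample deliberately includes inputs likely to cause trouble, such as characters with a special meaning in mark-up, alongside a typical value.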
The data entered into the HTML form in Figure 4.2 is passed to the web server
using the HTTP GET method. There, it becomes a parameter to the PHP script
SurnameStub.php, which uses it to generate a message. This message simply tells the
user what he or she typed into the form. Figure 4.1 (a) shows the HTML form with the
data ‘Smith’ entered into the text box and Figure 4.1 (b) the response generated from
the PHP script by the PHP application server when this data is submitted. The actual
PHP script SurnameStub.php is shown in Figure 4.3. The line numbers are not part of
PHP, but are used here to refer to specific lines in the script.
Figure 4.3 First draft of a PHP script to accept data from the SurnameSearch.html form.
It’s clear that HTML mark-up and PHP code are interleaved in this script. The PHP
code is easily identified, as it must be enclosed in the special PHP tag <?php ... ?>.
Although this PHP script is very simple, it illustrates many of the features of the PHP
language, so a line-by-line discussion of the script is worthwhile.
The script begins with a section of PHP code, between the <?php in line 1 and the ?>
in line 19. This long section does not actually do anything; it is just a comment. The
comment begins in line 2 with the PHP start-of-comment symbol /* and ends in line 18
with the end-of-comment symbol */. Everything between these two symbols is ignored
by the PHP application server. It reads through this whole section without generating
anything. Even so, this section provides useful information for the application
developer. It gives a title for this script file and a brief comment explaining its purpose.
Next comes some version control information – this file is version 0.1, created on 5 No-
vember 2005 by M. Prigmore. Later versions would include comments stating what has
been changed. Finally, there is a list of parameters that the script expects to receive. In
this case, there is just one – a parameter named ‘surname’, which is passed using the HTTP
GET method. It is not necessary to include a script header like this, but many application
developers find them useful. Trying to figure out what a script (or any program) does
simply by looking at the code can be tricky, so good comments, including script headers,
make scripts much easier to maintain.
The characters // also begin a PHP comment, but this sort of comment ends at the
end of the current line. Use the /* ... */ form for all multiline comments.
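The two comment styles can be sketched as follows. This is an illustrative reconstruction based on the description in the text, not a copy of Figure 4.3. The wording of the purpose line, the variable $message and the final echo statement are assumptions.

```php
<?php
/*
  SurnameStub.php
  Stub script: reports the surname received from SurnameSearch.html.

  Version 0.1  Created 5 November 2005 by M. Prigmore.

  Parameters:
    surname - passed using the HTTP GET method.
*/

// A single-line comment: it ends at the end of this line.
$message = 'Comments generate no output.'; // this form can follow a statement

/* The multiline form can also be used on a single line. */
echo $message, "\n";
```

Only the echo statement produces any output. Everything inside the comments is ignored by the PHP application server.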
Lines 20 to 28 are not PHP at all – they are plain HTML. When the PHP application
server comes across a line in the script file that is not enclosed within a PHP tag, it
writes this line as HTML directly into the entity body of the HTTP response message it
is generating from the script. Thus, the very first thing in the entity body is a doctype
declaration (line 20). This is just what is needed – every HTML document should begin
with a doctype declaration. The remaining lines, up to line 28, are also written directly
into the entity body.
This raises the question, what about the HTTP headers? In Chapter 2 it was made
clear that the HTTP headers must precede the entity body in any HTTP message.
Although the script has not instructed the PHP application server to generate any
HTTP headers, there will be some. The HTTP response message will include any default
headers generated by the HTTP server itself. Once the PHP application server starts to
generate the entity body, it cannot generate any further HTTP headers. Even if there is
a single blank line before the first PHP tag, attempting to generate an HTTP header in
the PHP that follows will cause an error. This can be a problem in advanced PHP
scripts, but, for now, the scripts are simple enough that there’s no need to worry about
the HTTP headers.
In line 29, the script slips back into PHP code and line 30 is a PHP echo statement.
Notice that the statement is terminated with a semicolon – ‘;’. Every PHP statement
must end with a semicolon, otherwise the PHP application server cannot properly
interpret the script.
The echo statement instructs the PHP application server to write the string '<p>'
directly into the entity body of the HTTP response. In other words, it generates the
start tag for an HTML paragraph element. The \n represents the new line character
and means that the <p> tag will be written in a new line of the entity body. Using new
line characters makes the HTML generated from the PHP script easier to read. Without
them, all the generated content would appear in one long line, rather than in successive
lines. Character combinations such as '\n' are called escape sequences (see the
PHP manual at www.php.net for a list of PHP escape sequences).
Missing out a semicolon is a common cause of errors in PHP scripts. If your script
does not produce anything at all, then check for missing semicolons. The PHP application
server will write an error message to the PHP error log when it encounters a missing
semicolon. The PHP error log is very useful for debugging PHP scripts. For
example, when the semicolon is missed off the end of line 30, the message is:
All error messages follow this format. First comes the date and time the error occurred,
followed by the type of error. A ‘PHP Parse error’ means that the code written by the
application developer breaks the syntax rules of PHP. The PHP application server then gives
an explanation of the error. In this case, it came across a variable (the $surname that starts
line 31) when it was expecting either a comma or a semicolon. As there are two
possibilities, the PHP application server cannot tell which was intended, so it stops compiling the
script. Notice that the line number is 31, which is the line after the one with the missing
semicolon. This is because it is only when the PHP application server encounters
the $ character of $surname in line 31 that it realizes something has gone wrong.
Chapter 7 covers PHP error reporting in more detail.
The error log is often just a text file. The PHP application server may create this text
file in the same directory as the script that generated the error, though this depends
on how PHP was installed. The PHP administrator will know the default error logging
behaviour of your PHP application server.
The <p> and </p> tags, generated in lines 30 and 33 respectively, delimit an HTML
paragraph element that encloses the content generated by the echo statement in line 32;
Figure 4.4 shows this paragraph element enclosing the generated content ‘You entered
the surname "Smith"’. It is important that any content generated by PHP is enclosed in
an appropriate HTML element, otherwise the generated HTML will not be valid.
Figure 4.4 The HTML file generated from the SurnameStub.php script,
with input parameter surname=Smith.
If your PHP script runs, but is blank where you expected text, check that you have
spelt variable names correctly. PHP actually generates a ‘Notice’ in the error log
when it encounters an undeclared variable. Using PHP’s error log to debug scripts is
discussed in Chapter 7.
The data type of a PHP variable is not fixed. The following code is valid PHP:
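As a minimal sketch of this idea, the hypothetical variable $value below holds an integer, then a string, then a floating point number. The built-in gettype() function reports the type each time (it reports floating point numbers as "double"):

```php
<?php
// $value is a hypothetical variable, used only for illustration.
$value = 42;            // $value holds an integer
$first = gettype($value);

$value = "forty-two";   // the same variable now holds a string
$second = gettype($value);

$value = 4.2;           // and now a floating point number
$third = gettype($value);

echo "$first, $second, $third\n";
```

No declaration or conversion is needed between the three assignments; the type simply follows the value most recently assigned.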
PHP is not Unicode enabled and only recognizes the 256 characters that can be defined
using a single byte. The underlying operating system will determine the specific
character encoding, though it is probably safe to assume ASCII. This means that the
meta element in line 24 of Figure 4.3 is not entirely honest. The output of this PHP
script will probably be in the ASCII character set, not utf-8. Fortunately, ASCII is a
subset of the utf-8 character set, so any character generated by PHP should be correctly
interpreted by the web browser.
In line 31 of Figure 4.3, the variable $surname is assigned the value
$_GET["surname"]. $_GET is one of PHP’s superglobal variables. The value of a
superglobal variable can be accessed from anywhere in the PHP script – superglobals automatically
have global scope. In contrast, the scope of other variables may be restricted to just a
small portion of the script. This is the case for variables declared within a function
(discussed below). The PHP application server stores the name-value pairs sent using
the HTTP GET method in the $_GET superglobal variable. $_GET is an associative
array, meaning that each element in the array has an index (its position in the array),
a key (its unique identifier) and a value. Each name-value pair is turned into an
element of $_GET and the name becomes the element key and the value becomes the
element value. For example, in Figure 4.1, the HTML form uses the HTTP GET method
to send a single name-value pair ‘surname=Smith’ to the PHP script SurnameStub.php.
The $_GET superglobal for SurnameStub.php will be:
          Index   0
$_GET  =  Key     surname
          Value   Smith
Thus, in this instance, the value referred to by the array expression
$_GET["surname"] is ‘Smith’, so that is the value assigned to $surname in line 31. If the
HTML form used the HTTP POST method, its name-value pairs would become the elements
of the $_POST superglobal. There are other superglobals, but $_GET and
$_POST are the two that contain data gathered from the user via an HTML form.
Always use the element key to refer to an element of a superglobal. There is no guarantee
that all the name-value pairs you are expecting have actually been supplied
and so there is no way to be sure what index is given to a particular name-value pair.
Line 32 of Figure 4.3 is a further example of the echo statement and PHP character
string literals. PHP allows character string literals to be enclosed either in single or double
quotes. Character string literals enclosed in double quotes can include single
quotes. Similarly, a character string literal enclosed in single quotes can include double
quotes. For example, the following echo statements work fine:
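A short sketch of the two cases, using made-up sentences, might look like this:

```php
<?php
// A double-quoted literal may contain single quotes (apostrophes)...
$first = "The surname O'Connor contains an apostrophe.";
// ...and a single-quoted literal may contain double quotes.
$second = 'He said "Hello" and left.';

echo $first, "\n", $second, "\n";
```

In both statements the inner quote marks are treated as ordinary characters, because they differ from the quote marks that delimit the literal.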
This is a nice feature, as names such as O’Connor, and sentences that include apostrophes,
can be represented using a double-quoted string literal without the need to
worry about how PHP will interpret the apostrophe.
Many word processors include a custom, or smart, quotes feature. This replaces
straight quote marks, "like these", with fancy quote marks, “like these”. The same
holds for single quotes, so 'straight quotes' are turned into ‘custom quotes’. Although
the two sorts of quote marks are equivalent in normal use, they are different
characters and the custom quote marks are not part of the standard ASCII character
set. This means that PHP cannot recognize custom quotes. Only straight quotes can
be used to delimit string literals. So, if you must write PHP code using a word processor,
make sure that you turn off the custom quotes feature!
String literals enclosed in single quotes are treated as simple strings of characters. Most
escape sequences are not recognized. For example, the new line character \n is not
treated as a new line character, but as a backslash followed by the letter ‘n’. The PHP
statements:
SQL only allows single-quoted string literals and does not use the same escape sequences
as PHP. See Chapter 5 for details.
When PHP encounters a double-quoted string literal, it recognizes all the escape sequences,
including \n for a new line and \" for a double quote mark. Both these
occur in the string literal in line 32 of Figure 4.3.
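The difference between the two kinds of literal can be seen by comparing string lengths. In this sketch, the variables $single and $double are hypothetical names chosen for illustration:

```php
<?php
// In single quotes, \n stays as two characters: a backslash and the letter n.
$single = 'a\nb';
// In double quotes, \n becomes a single new line character.
$double = "a\nb";

echo strlen($single), "\n";   // four characters: a, \, n, b
echo strlen($double), "\n";   // three characters: a, new line, b
```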
PHP also recognizes variable names within a double-quoted string literal and replaces
the variable name with its current value. Given that $surname has the value
‘Smith’, line 32 is equivalent to:
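The equivalence can be checked directly. In this sketch, $surname is set by hand rather than read from $_GET, and the names $interpolated and $literal are assumptions made for the comparison:

```php
<?php
$surname = "Smith";   // stands in for the value read from $_GET in line 31

// The string from line 32, with the variable name inside the literal...
$interpolated = "\nYou entered the surname \"$surname\".";
// ...and the same string with the value written out in full.
$literal = "\nYou entered the surname \"Smith\".";

var_dump($interpolated === $literal);
```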
Again, this is a nice feature, making it much easier to work with strings in PHP than in
some other languages. PHP can even replace a reference to an array element inside a
double-quoted string with that element’s value, but the array reference must be enclosed
in curly brackets. For example, lines 31 and 32:
$surname = $_GET["surname"];
echo "\nYou entered the surname \"$surname\".";
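With the array reference enclosed in curly brackets, those two lines might be collapsed into a single statement. The following is a sketch only: it sets $_GET by hand so that it can run outside a web request, and $message is a hypothetical variable added so the result can be inspected:

```php
<?php
// Simulate the form submission; in a real request PHP fills $_GET itself.
$_GET["surname"] = "Smith";

// The array element is referenced inside the string, within curly brackets.
$message = "\nYou entered the surname \"{$_GET['surname']}\".";
echo $message;
```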
This latter code is shorter, but not quite as easy to understand. Which to use is a
matter of personal style and may depend on exactly how the array is being used.
When a PHP script accepts data from an HTML form, there is no guarantee that the
data entered is the sort of data the PHP script expects. For example, if the user clicks on
the submit button on the SurnameSearch.html form before entering a surname, they
see the web page shown in Figure 4.5 (a). In this case, the result gives a poor impression
to the user rather than causing a problem in the script. Similar sorts of problem
occur when the user enters ‘30/02/2005’ for a date or ‘£3.50p’ for a price when the PHP
script expects a plain number. In these cases, the problem may cause the script to generate
the wrong result.
All these are examples of everyday mistakes that an end user might innocently
make. A PHP script can cope with such problems by performing basic data validation
on all user input. The goal of data validation is to correctly interpret the data entered
by the user and generate an appropriate response. This may mean asking the user to
try again.
A more serious problem occurs when the user deliberately sets out to hack a web
application. Figure 4.5 (b) shows what happens when the user enters the text:
<script>alert("Ha!");</script>
into the surname text box, then clicks on the submit button. While this particular example
is fairly harmless, the same isn’t true when the text box gathers data that is
stored in a database and used in future to generate further web pages. In that situation,
users other than the original user will see the message box and if the message is insulting
or offensive, it could prove very serious for an e-commerce business. Potential customers
would be put off using a website that was clearly insecure.
To avoid these problems, the PHP script should also perform data cleansing, to
remove any potentially malicious data before it is processed by the script. Data cleansing
is distinct from data validation. Data cleansing is about stripping away potentially
malicious data, while data validation is about generating an appropriate response to
invalid, confusing or unexpected data.
(a) The web page generated when the user leaves the surname box blank.
Actually, the version of SurnameSearch.html shown in Figure 4.2 can’t be hacked in this
way. The surname text box will only accept up to 20 characters and
<script>alert("Ha!");</script> has 30 characters. However, a knowledgeable user
could bypass the form and type the URL of the script directly into the web browser as
SurnameStub.php?surname="<script>alert("Ha!");</script>". This gets round the
limit placed on the text box, so data cleansing is still necessary.
Two technical points arise from Figure 4.5 (b). First, the data passed from the
SurnameSearch.html form has been encoded by the web browser: ‘<’ is represented
by ‘%3C’ and ‘>’ by ‘%3E’. This is shown in the address bar. Second, the ECMAScript
for the alert box appears part-way through the HTML, just after the first
double quote mark (see Figure 4.3). When the browser reaches the HTML script element,
it stops rendering the page until the alert has been responded to by the user.
This is why the second double quote mark, the horizontal rule, and the page footer
are all missing from the page in Figure 4.5 (b).
The simplest form of data validation is to check whether or not the user has entered
any data at all. The second version of the SurnameStub.php script (Figure 4.6) adds
this simple form of data validation. If the user does not enter a surname, then the
script generates a message to indicate this. Otherwise, it generates the same message
as before.
Version 0.2 of the SurnameStub.php script introduces three concepts not met in version
0.1 – the ‘if’ statement, code blocks and PHP built-in functions.
The if ... else ... statement is simply a way of instructing the PHP application
server to carry out one of two actions, based on some condition. The basic structure
is shown in Figure 4.7. This is the same basic structure as is used in most
programming languages. The statement begins with the keyword if, immediately
followed by the test condition. The test condition is a Boolean expression, so is either
true or false. In PHP, the test condition must be enclosed in parentheses, like so
(...). Next – typically this part is written on the following line of the script file –
comes a code block. A code block is simply a sequence of one or more PHP statements.
Any PHP statement can occur in such a code block. If there is more than one statement,
the code block must be enclosed in braces – {...} – but, if there is only one
statement, the braces can be omitted. The first code block in Figure 4.7 is enclosed in
braces and is the code executed when the condition evaluates to TRUE.
For some if ... else ... statements, there is nothing to do when the condition
is false. In this case, the else part is simply omitted. For example, the
statement:
if ($count == 0)
  echo "Nothing to count.";
will generate the message ‘Nothing to count.’ if the variable $count is zero. If
$count is not zero, then the code block is skipped. Note the double equals sign: == tests
for equality, whereas a single = would assign the value zero to $count.
More generally, the keyword else indicates that there is a code block to execute
when the condition is false, as well as when it is true. In Figure 4.7, the second code block
has a single PHP statement, so the braces are omitted. Notice that the statements in both
code blocks are indented slightly (by two spaces). Like comments, the PHP application
server ignores indentation, but it does make the code easier to read, especially when a
second if ... else ... statement occurs inside one of the code blocks.
The statement shown in Figure 4.7 is intended to highlight the structure of the PHP
if ... else ... statement. Neither the condition, nor the statements in the two
code blocks are valid PHP. It isn’t possible just to write English sentences as the PHP
application server will not understand them.
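A runnable counterpart to this structure might look like the sketch below. The variables $mark and $maxMark, and the rule that half marks earn a pass, are assumptions made purely for illustration:

```php
<?php
$mark = 62;      // hypothetical mark awarded
$maxMark = 80;   // hypothetical maximum mark

if ($mark >= $maxMark / 2)
{
  // First code block: two statements, so the braces are required.
  $percentage = ($mark / $maxMark) * 100;
  $message = "Pass with $percentage%.";
}
else
  // Second code block: a single statement, so the braces are omitted.
  $message = "Fail.";

echo $message;
```

Both code blocks are indented by two spaces, which has no effect on how the script runs but makes the structure easier to follow.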
Lines 33–39 in Figure 4.6 illustrate how to use an if ... else ... statement for
real. Notice that there is no semicolon at the end of line 33. That is because the
if ... else ... statement doesn’t end until the end of the final code block in line
39. Also, there is no semicolon after the } in line 39 as the closing brace itself indicates
the end of the code block and, hence, the end of the if ... else ... statement.
The condition in line 33 tests whether or not $_GET["surname"] has a value.
Even a single space character will do. If the user did type something, then the condition
would be FALSE so the PHP application server would skip past the first code block to
the else keyword and execute the second code block. The second code block is simply
the same code as in version 0.1 of the script (Figure 4.3). If there is no value for
$_GET["surname"], the user must have failed to type anything, so the message
‘You did not enter a surname.’ would be generated. No braces are needed to delimit this
code block as it has only the one statement. Note that in this case, the PHP application
server will now skip past the else keyword and its code block to line 40. Only one of
the if ... else ... statement’s two code blocks can be executed on any run
through the script.
Line 33 includes a call to one of PHP’s built-in functions – the empty() function.
A function accepts a list of parameters and returns a single value. Functions can be
used much like variables as they represent a value, but the value can change depending
on the values of the parameters. In general, a parameter can be a literal, variable,
function or any other kind of expression (though not always – see below). The PHP
user manual (PHP Documentation Group, 2005) describes all functions in the same
way (Figure 4.8). First comes the data type of the value returned by the function, then
the function’s name. The parameters are presented as a comma-separated list enclosed
in parentheses.
if (!empty($_GET["surname"]))
{
  $surname = $_GET["surname"];
  echo "\nYou entered the surname \"$surname\".";
}
else
  echo "\nYou did not enter a surname.";
Whether you ask ‘Is this empty?’ or ‘Is this non-empty?’ is, once again, a matter of
personal coding style (it used to affect performance, but this is less of an issue with
modern programming languages). Testing for the most likely situation seems sensible,
but often means coding with the ‘not’ operator (‘!’).
The code in Figure 4.6 deals with the situation where users submit the data without
entering anything, but not where they type a string of spaces. The data validation
should deal with this in the same way as an empty submission.
PHP includes the trim() function to strip away any leading or trailing spaces from
a string (it also strips away leading and trailing tab characters, new lines and other white
space). Leading spaces are spaces that appear before any other characters; trailing spaces are
ones that appear after them. The trim() function’s return value is a copy of the original
string, but with leading and trailing spaces stripped away. The description of this function
in the PHP manual is:
trailing spaces are not stripped. Some examples should clarify how the basic trim()
function works.
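The following sketch calls trim() with a single parameter. The sample strings and variable names are assumptions chosen for illustration:

```php
<?php
$padded = "  Smith \n";

$trimmed = trim($padded);      // leading and trailing white space removed
$inner   = trim(" Mc Coy ");   // white space inside the string is kept
$blank   = trim("     ");      // a string of spaces becomes the empty string

var_dump($trimmed, $inner, $blank);
```

The last case is the one that matters for data validation: after trimming, a surname made up entirely of spaces is indistinguishable from an empty one.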
if (empty($_GET["surname"]) or
    empty(trim($_GET["surname"])))
  echo "\nYou did not enter a surname.";
The condition first tests whether or not $_GET["surname"] is empty, then whether
or not trim($_GET["surname"]) is empty. If either the first or the second test
returns TRUE, then the whole condition is TRUE. Unfortunately, this won’t work. While
most functions can accept any valid expression as a parameter, empty() is intended
to ascertain whether a variable has a value or not, so it only works if the parameter is a
variable name. In the function call empty(trim($_GET["surname"])) the parameter
is a function return value. Calling the empty() function with any expression
other than a variable will mean that the PHP application server cannot run the script.
Instead, it writes a ‘PHP fatal error’ message to the error log. (This restriction applies to
the versions of PHP discussed here; from PHP 5.5 onwards, empty() also accepts arbitrary
expressions.) Always check the PHP user
manual (PHP Documentation Group, 2005) for the full description of a particular
built-in function and any restrictions on how it can be used.
It’s easy to get round this restriction on empty(). The following code does work:
$surname = trim($_GET["surname"]);
if (empty($_GET["surname"]) or empty($surname))
  echo "\nYou did not enter a surname.";
else
  echo "\nYou entered the surname \"$surname\".";
PHP includes two Boolean OR operators – or and ||. They behave alike, except in
terms of operator precedence. See PHP Documentation Group (2005) for details.
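The difference in precedence shows up when the result is assigned to a variable. In this sketch, $first and $second are hypothetical names:

```php
<?php
// = binds more tightly than or, so this is read as ($first = false) or true;
$first = false or true;
// || binds more tightly than =, so this is read as $second = (false || true);
$second = false || true;

var_dump($first);    // bool(false)
var_dump($second);   // bool(true)
```

Inside the parentheses of an if condition the two operators give the same result, which is why either can be used in the code above.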
There is a problem with the above code, though. When the request carries no surname
parameter at all (for example, when the script is requested without a query string), the
PHP application server recognizes that the array key surname does not exist for the
superglobal $_GET. It then writes the error message:
‘PHP Notice: Undefined index: surname in SurnameStub.php
on line 33’
to the error log. While such notices don’t cause the application to fail, some PHP programmers
like to eliminate them, as well as more serious errors. One way to do this is
always to use empty() to check that the user entered a value before using $_GET to
access that value. This means coding a nested if ... else ... statement:
if (empty($_GET["surname"]))
  echo "\nYou did not enter a surname.";
else
{
  $surname = trim($_GET["surname"]);
  if (empty($surname))
    echo "\nYou did not enter a surname.";
  else
    echo "\nYou entered the surname \"$surname\".";
}
The first if ... else ... statement tests whether $_GET["surname"] is empty.
If it is, it generates the usual message. If not, it then strips out leading and trailing
spaces and tests whether that string is empty.
Even this code is not ideal. There are two different points at which it has been ascertained
that the user entered a blank surname, so the same message is generated in two
different places. Some PHP developers would prefer to ignore PHP notices, and use the
simpler code above. It is possible to instruct the PHP application server to do just this,
so no notices are written to the error log, only more serious errors. See PHP Documentation
Group (2005) for details.
The PHP application server does not write an error message to the log when it
encounters the code empty($_GET["surname"]). As the purpose of the empty()
function is to test whether a variable has a value or not, it is unnecessary to generate
an error message for those variables that don’t. The function will tell the application
developer this directly, by returning the value TRUE.
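The behaviour of empty() can be explored with an ordinary array. In this sketch, $request is a hypothetical stand-in for $_GET, and its keys and values are made up:

```php
<?php
// $request stands in for $_GET so the sketch runs outside a web request.
$request = array("initial" => "", "count" => "0", "surname" => " ");

$missing = empty($request["forename"]); // key absent: TRUE, and no notice is logged
$blank   = empty($request["initial"]);  // empty string: TRUE
$zero    = empty($request["count"]);    // the string "0" also counts as empty: TRUE
$space   = empty($request["surname"]);  // a single space is a value: FALSE

var_dump($missing, $blank, $zero, $space);
```

The third case is worth remembering: a text box containing just the digit 0 is treated as empty by this function.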
Testing whether a string is or is not empty is very common. Another common test is
whether or not the user entered numeric data into a text box. For example, at the
Pennine University, staff use a web database application to record the marks awarded to
students for assignments. Figure 4.9 shows the HTML form that gathers marks for an individual
student. Notice that each assignment has a maximum mark. In the example
shown, the assignment is marked out of 80 rather than 100. This means that the PHP
script should convert the mark into a percentage before saving it in the database.
All data sent from an HTML form, whether it is sent using the HTTP GET or POST
method, arrives as a string. Thus, the mark awarded to Emma Kirkby arrives as the
string value "62", not as a numeric value 62. As it is a string that represents a numeric
literal, PHP can recognize it and change the type of the variable when it’s used in a
calculation. Assume that the text box is named ‘mark’ and form data is passed using
the GET method. Also assume that the PHP script has a local variable $maxMark that
stores the maximum mark for this assignment. The following code calculates the
percentage mark awarded for the assignment:
$mark = ($_GET["mark"]/$maxMark)*100;
Figure 4.9 The HTML form used to enter a student’s mark into the assessment system.
When the PHP application server encounters this code, it automatically converts the
string value of $_GET["mark"] to a numeric data type and calculates the correct
percentage.
Rather than rely on PHP’s automatic type conversion, the following code explicitly
changes the type of the string value of $_GET["mark"] to a floating point number
(to allow for marks such as 43.5) before doing the calculation:
$mark = (float)$_GET["mark"];
$mark = ($mark/$maxMark)*100;
The second assignment statement illustrates a common feature of computer languages –
a variable can appear on both sides of an assignment. This line says ‘Make the new
value of $mark equal to the old value of $mark divided by the value of $maxMark
then multiplied by 100’. The (float) syntax in the first statement instructs the PHP
application server to change the data type of the variable to floating point number.
Changing the type of a variable is known as type casting and may be familiar from
other programming languages. The PHP application server can correctly interpret
most numeric string literals, but if it does not recognize a string as representing a number,
then it casts it to the numeric value 0 (zero).
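A few casts illustrate both outcomes. This is a sketch in which the string values are written in directly rather than read from $_GET, and the variable names other than $mark and $maxMark are assumptions:

```php
<?php
$maxMark = 80;                 // maximum mark, as in the example for Figure 4.9

$mark = (float)"62";           // a numeric string becomes the float 62.0
$percentage = ($mark / $maxMark) * 100;

$half = (float)"43.5";         // fractional marks are preserved
$notANumber = (float)"Hello";  // an unrecognized string becomes 0.0

var_dump($percentage, $half, $notANumber);
```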
As there is no guarantee that the user won’t enter the string ‘Hello’ in the text box
in Figure 4.9, checking for numeric strings is another data validation issue. The following
code checks that a non-empty numeric value has been entered before performing
the calculation. The code also illustrates the use of PHP’s Boolean and operator:
if ((!empty($_GET["mark"])) and
    (is_numeric($_GET["mark"])))
{
  $mark = (float)$_GET["mark"];
  $mark = ($mark/$maxMark)*100;
}
else
  echo "\nYou did not enter a number.";
PHP includes two Boolean AND operators – and and &&. They behave alike, except
in terms of operator precedence. See PHP Documentation Group (2005) for details.
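The same test can be wrapped in a function so that it is easy to try with different inputs. Note that validate_mark() is a hypothetical helper written for this sketch; it is not a PHP built-in function:

```php
<?php
// validate_mark() is a hypothetical helper, not part of PHP.
// It returns the percentage mark, or FALSE when the input is empty or not numeric.
function validate_mark($input, $maxMark)
{
  if (!empty($input) and is_numeric($input))
    return ((float)$input / $maxMark) * 100;
  else
    return false;
}

var_dump(validate_mark("62", 80));     // a valid mark
var_dump(validate_mark("Hello", 80));  // not numeric
var_dump(validate_mark("", 80));       // nothing entered
```

Because empty() treats the string "0" as empty, a test written this way also rejects a mark of zero. Whether that is acceptable depends on the application.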
Date literals entered by the user should also be validated before being used. Unlike
most data languages, programming languages like PHP don’t usually include a date or
time data type. Instead, PHP stores combined datetime values as an integer – that is,
the number of seconds elapsed since midnight (UTC) on 1 January 1970. So, for example, ‘16 November
2005, 2:24:57 pm’ is stored as the integer 1132151097 because at this date and time
one billion one hundred and thirty two million one hundred and fifty one thousand
and ninety seven seconds have elapsed since 1 January 1970. Thankfully, PHP provides
a number of functions to manipulate such date values.
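Three of these functions are sketched below. The example treats the date and time in the text as UTC, which is the assumption under which the integer quoted above is reproduced; the variable names are chosen for illustration:

```php
<?php
// gmmktime() takes hour, minute, second, month, day, year and works in UTC.
$timestamp = gmmktime(14, 24, 57, 11, 16, 2005);

// gmdate() formats a timestamp as text, again in UTC.
$formatted = gmdate("j F Y, g:i:s a", $timestamp);

// checkdate() takes month, day, year and rejects impossible dates.
$validDate   = checkdate(11, 16, 2005);
$invalidDate = checkdate(2, 30, 2005);   // there is no 30 February

var_dump($timestamp, $formatted, $validDate, $invalidDate);
```

A function such as checkdate() is one way to catch an entry like ‘30/02/2005’ once the day, month and year have been separated out.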
Data validation deals with innocent mistakes. Data gathered from the user is validated
to see whether or not it is the sort of data expected and, if it is not, a suitable
message is generated. Data cleansing deals with potential attempts to hack the application.
All data gathered from users are stripped of any potentially harmful text. Only
when potentially harmful text has been removed is the data validated and any further
processing carried out.
As shown in Figure 4.5 (b), HTML tags in data gathered from users can cause problems
for web database applications. PHP includes the strip_tags() function to
remove HTML tags from a string:
echo strip_tags("<strong>bold</strong>");
//Generates the text "bold"
echo strip_tags("<?php echo 'Ha'; ?>");
//Generates an empty string
echo strip_tags("<strong><em>bold</em></strong>",
"<strong>");
//Generates the text "<strong>bold</strong>"
echo strip_tags("<?php echo '<strong>Ha</strong>'; ?>",
"<strong>");
//Generates an empty string
In the first statement, only the HTML <strong> tags are allowed, so the <em> tags
are removed. In the second statement, the <strong> tag is removed because it appears
inside a <?php ... ?> tag, which is not allowed. When a tag is removed,
everything between the < and >, which delimit the start and end of the tag, is removed.
For a special PHP tag, that means everything is removed.
A related problem occurs with characters that have a special meaning in HTML or cannot
easily be typed using a standard keyboard. The HTML mark-up language includes
HTML character entities to represent such characters within an HTML document.
One hard-to-type character is the copyright symbol – © – which has the HTML character
entity &copy;. This is used in the page footer of the HTML and PHP scripts in
this chapter (see, for example, Figure 4.2, line 19).
Special characters include the < character. A web browser will interpret the < character
as the start of an HTML tag. If the < character needs to be used on a web page –
for example, to show a mathematical expression – it must be written as the HTML
character entity &lt;.
Table 4.1 lists the four commonest HTML character entities. When data gathered from
users are written into a new HTML document – as the surname is – then any special characters
in the users’ data should be replaced by the corresponding HTML character entity.
The PHP function htmlspecialchars() replaces any occurrence of these four
characters in the parameter str with its equivalent HTML character entity:
See the PHP manual (PHP Documentation Group, 2005) for details of the optional
parameters.
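As a sketch of the function in use, the call below passes the optional flag ENT_COMPAT explicitly, so that exactly four characters (&, <, > and the double quote mark) are replaced whatever the PHP version. The input string and variable names are made up for illustration:

```php
<?php
$input = '<script>alert("Ha!");</script> & more';

// ENT_COMPAT converts double quotes but leaves single quotes alone.
$safe = htmlspecialchars($input, ENT_COMPAT);

echo $safe;
```

A browser that receives the converted string displays the original text, tags and all, instead of treating it as mark-up and running the script.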
An HTML character entity is nothing to do with an HTTP entity body, which is the part
of an HTTP message that carries data between the client and server. The word ‘entity’
is used to represent two quite different concepts.
Figure 4.10 shows the first half of SurnameStub.php Version 0.3, which adds data
cleansing functionality to the script. The script declares a user-defined function
cleanse_data(). A user-defined function works in just the same way as a PHP
function – it accepts a list of parameters and returns a single value. The user-defined
function cleanse_data() expects a single string parameter and returns a copy of
that string with special characters replaced by their HTML character entities and all
HTML and PHP tags removed. Using the same style as the PHP manual, the description
of cleanse_data() is:
string cleanse_data(string value)
Line 23 in Figure 4.10 shows how to declare a user-defined function in PHP. The keyword
‘function’ is followed by the name of the function and a list of parameters in
return htmlspecialchars(strip_tags($value));
This statement says ‘Pass the value of the variable $value to the PHP function
strip_tags(). Pass the value returned by strip_tags() to the PHP function
htmlspecialchars(). Finally, set the return value of cleanse_data() to the
value returned by htmlspecialchars()’. This single line is equivalent to the three
PHP statements:
$strip = strip_tags($value);
$replaceEntities = htmlspecialchars($strip);
return $replaceEntities;
Note that a ‘return’ statement terminates the function. No code after a ‘return’ statement
will ever be executed.
Line 55 of Figure 4.11 shows how to use the user-defined function
cleanse_data() in the main body of the PHP script. Notice that the value passed
to cleanse_data() has already been tested to see if it is empty. This seems to make
the test in line 33 of Figure 4.10 redundant. However, a function can be used many
times in a single script so there is no guarantee that every value passed to it will have
been tested in this way. The function should be written to cope with anything (well,
almost anything).
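Putting these pieces together, a function along these lines might be written as follows. This is a sketch, not the listing from Figure 4.10: in particular, returning an empty string when the value is empty is an assumption about how that case is handled:

```php
<?php
// A sketch of cleanse_data(); the handling of empty values is an assumption.
// Note that empty() also treats the string "0" as empty.
function cleanse_data($value)
{
  if (empty($value))
    return "";
  else
    return htmlspecialchars(strip_tags($value));
}

$clean = cleanse_data('<script>alert("Ha!");</script>');
echo $clean;
```

Here strip_tags() removes the two script tags but leaves the text between them, and htmlspecialchars() then converts the remaining double quote marks, so nothing executable reaches the browser.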
The cleanse_data() function will be needed in every script that handles data
gathered from users. As it stands, it is only available in the script SurnameStub.php. This
is because user-defined functions have local scope (they can only be used in the script
where they are declared). One solution is to declare the function separately at the start
of every script, but that is not very sensible. If the function must change – and most
functions are modified at some point – then every copy of the function, in every
script, must be changed individually.
A better solution is to use an include file. Include files are simply PHP scripts that
store useful code and can be included automatically in other scripts. Often they are
used to store utility functions. These are useful, general-purpose functions that will be
used in lots of different scripts. If a function will only ever be used in one script, then
it should be declared as a local function in that script. Include files are also used to
store web page header and footer sections as these, too, may need to change and it is
simpler if they are stored in a single file.
Figure 4.12 shows the PHP include file DataCleansing.inc. The ‘.inc’ file name extension
is usually used to indicate an include file. Notice that all the content of the file
is enclosed in a single PHP tag. This is because the PHP application server assumes that
the include file contains HTML. Without the PHP tag, it would simply write the con-
tents of the include file directly to the HTTP response. With the PHP tag, it knows that
they should be processed as PHP. An include file can have HTML mark-up, just like an
ordinary PHP script, though an include file that stores utility functions tends not to.
The actual function declaration is exactly the same as in Figure 4.10.
Line 24 of Figure 4.13 shows how to include the content of DataCleansing.inc in
the script file SurnameStub.php. You should always enclose the file name in quotes.
The include statement works by copying the included file, DataCleansing.inc, into
the script SurnameStub.php at line 24. It is important, when using include files, to
avoid declaring a function twice. In PHP5, doing so will cause a fatal error.
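One way to guard against a double declaration, sketched below, is PHP's include_once statement, which includes a given file only the first time it is requested. This statement is not used in the figures discussed here, so treat it as an alternative to the plain include statement rather than the book's method. The temporary file is created only so that the example runs on its own; in a real application DataCleansing.inc would already exist.

```php
<?php
// Create a throwaway include file so the example is self-contained.
$includeFile = sys_get_temp_dir() . '/DataCleansing_' . uniqid() . '.inc';
file_put_contents(
    $includeFile,
    "<?php\nfunction cleanse_data(\$value)\n{\n" .
    "    return htmlspecialchars(strip_tags(\$value));\n}\n"
);

// include_once copies the file in only the first time it is requested,
// so the function is never declared twice.
include_once $includeFile;
include_once $includeFile; // ignored: the file has already been included

echo cleanse_data('<i>Smith</i>'), "\n"; // prints: Smith

unlink($includeFile); // tidy up the temporary file
```

Had both statements been plain include statements, the second would have tried to declare cleanse_data() again and stopped the script with a fatal error.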
PHP provides a range of facilities to allow web database developers to retrieve data
from a database and merge this dynamic content with static content on a web page.
Figure 4.14 shows the architecture of a typical web database application. There are
three tiers. The first tier is the web client, typically a web browser. The middle tier is
the web server, which receives HTTP requests for web resources from the web client
and generates an HTTP response. The third tier is the database server, which receives
requests for data from the web server and provides that data. It includes the actual
database (where the data are stored) and the DBMS, which manages all access to the
database. The web server manages these two channels of communication with differ-
ent components. The HTTP server manages communication with the web client using
HTTP. The application server manages communication with the database server via
the DBMS API (see Chapter 2). These two components of the web server work together
to create the HTTP response.
The data validation in this version of the PHP script deals with the situation where
users submit the data without entering anything, but not where they type a string of
spaces.
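[Figure 4.14: architecture of a typical web database application. The web client sends a
‘request web page’ message to the HTTP server. The application server, acting as a database
client, sends a ‘request data’ message to the DBMS. The HTTP server and application server
make up the web server; the DBMS and database make up the database server.]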
Figure 4.15 Version 1.0 of SurnameSearch.php: querying a database and displaying the
result.
The code in Figure 4.15 illustrates each of these six steps.
Line 45 establishes a connection to the MySQL DBMS, using the PHP function
mysql_connect(). The function description is:
resource mysql_connect([string server[, string userName
[, string password]]])
PHP can connect to other DBMS, either directly or using ODBC. See the PHP user
manual (PHP Documentation Group, 2005).
One MySQL DBMS can manage several different databases. Each connection can be as-
sociated with a particular database so that any queries sent to the MySQL DBMS over
that connection will be run against the chosen database. The PHP function
mysql_select_db() associates a database with a connection. The description is:
bool mysql_select_db (string databaseName
[, resource connection])
The first parameter is mandatory and names a MySQL database. If the MySQL DBMS
doesn’t recognize the name, then the function returns the value FALSE, but otherwise
it returns TRUE. The second parameter is the connection to the MySQL DBMS, passed
using the variable $connection. If this is omitted, then the PHP application server
uses the most recent connection made. Line 47 shows this function in use. Notice that
there is no attempt to capture or test the return value. Even if the database name is
wrong, the script will continue to the next step. This is an obvious weakness and so
will be remedied in the next script.
Now that a database is associated with the connection to the DBMS, the script can
query this database. A database query is simply an instruction to the DBMS to retrieve
certain data from the database. Any such instruction must be made using the DBMS
data language. For the MySQL DBMS, this data language is SQL. The PHP function
mysql_query() instructs the PHP application server to pass an SQL statement to the
MySQL DBMS. The MySQL DBMS executes the SQL statement and passes the result back
to the PHP application server. This result is a derived table called the result set – in ef-
fect, it is a set of rows. The result set is another external resource. The mysql_query()
function returns a reference to this resource. The function description is:
resource mysql_query (string sqlStatement
[, resource connection])
Line 49 shows how to use this function. The first parameter is a string literal, defining
the SQL statement to pass to the MySQL DBMS. The second parameter to the function
is the connection to the MySQL DBMS established in line 45. The variable $result
will hold (a reference to) the result set returned by the DBMS.
The actual SQL statement needs to be explained. The first part is the select list:
SELECT fName, lName, email
This tells the DBMS to retrieve the values of the columns fName, lName and
email, but not the values of any other columns. The from clause:
FROM Staff
indicates that these columns come from the Staff table. The last part is the where
clause. The where clause specifies a condition on the data retrieval operation. Without
a where clause, the DBMS will retrieve the values of the columns in the select list for
every row of the Staff table. The where clause instructs it to retrieve the values of the
columns in the select list, but only for those rows where the condition is TRUE. The
where clause in line 49 is:
WHERE lName = '$surname'
This restricts the result to those rows of the Staff table where the value of the lName
column is exactly the same as the string literal '$surname'. Of course, the PHP
application server will automatically replace the variable name $surname with its
value before it passes the string literal to the MySQL DBMS. For example, if the value
of $surname is Smith, then:
"SELECT fName, lName, email
FROM Staff
WHERE lName = '$surname'"
becomes:
"SELECT fName, lName, email
FROM Staff
WHERE lName = 'Smith'"
This is the SQL select statement sent to the MySQL DBMS. Notice the use of single
quotes around the SQL string literal, ‘Smith’. SQL does not allow you to use double
quotes to delimit a string literal. Chapters 5 and 6 look in detail at the SQL data ma-
nipulation language and the SQL SELECT statement in particular. Note that column
values in SQL are case-sensitive, so the two statements:
Although the ISO SQL:2003 standard states that SQL is case-sensitive with respect
to column values, a DBMS may allow the DBA to alter this behaviour. For some
databases SQL may treat 'Smith' and 'SMITH' as identical values.
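The variable substitution described above can be checked without a database by building the SQL string and printing it. This is only a sketch: the value of $surname is set by hand here, whereas in the script it comes from the search form.

```php
<?php
// Illustrative value; in the script this would come from the search form.
$surname = 'Smith';

// Double quotes delimit the PHP string, so $surname is replaced by its value.
// The single quotes inside are part of the SQL and delimit the SQL string literal.
$sql = "SELECT fName, lName, email FROM Staff WHERE lName = '$surname'";
echo $sql, "\n";
// prints: SELECT fName, lName, email FROM Staff WHERE lName = 'Smith'

// With single quotes around the PHP string, no substitution takes place.
$literal = 'WHERE lName = $surname';
echo $literal, "\n"; // prints: WHERE lName = $surname
```

Printing the assembled statement in this way is a quick debugging aid when a query returns unexpected results.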
As the result set returned by the mysql_query() function in line 49 is another ex-
ternal resource it can’t be manipulated directly by PHP. To use the data from the result
set it must be fetched, one row at a time, into a PHP array. The PHP function
mysql_fetch_array() in line 51 does just this. The function description is:
array mysql_fetch_array (resource result
[, int result_type])
Notice that mysql_fetch_array() does not need the connection variable. This is
because the result set has already been passed from the DBMS to the PHP application
server. It is an external resource, but directly available to the PHP application server.
The second, optional, parameter specifies what sort of array to return. In line 51, the
PHP constant MYSQL_ASSOC tells the function to return an associative array. This as-
sociative array is stored in the variable $row. Notice that, in this case, $row really
does hold a PHP array and can be manipulated directly by PHP statements. The keys
are the column names from the query’s select-list and the values are the column val-
ues. This makes it very easy to work with result sets as the column names can often be
used as labels for their values.
It’s important to realize that mysql_fetch_array() only fetches one row at a
time. In line 51, it will fetch the first row from the result set. The data in this row needs
to be displayed to the end user, so must be written to the HTTP response. This means
writing out the key and value of each element of the array $row.
The simplest way to do this is with a PHP foreach statement, as shown on lines
53 and 54. A foreach statement loops through the elements of an array, from start to
finish, and executes the code block once for each element. The syntax is explained in
Figure 4.16. The keyword foreach begins the statement and is followed by a list of
parameters in parentheses. The array to loop through comes first (typically stored in a
variable, but this could be a function call that returns an array). The keyword as
comes next, then a variable that will hold the key for the current element. The symbol
=> comes next and it is followed by a variable that holds the value of the current
element. These two variables are intended for use inside the code block. PHP does not
actually discard them when the loop ends (they keep the key and value of the last
element), so it is safest not to rely on them outside the foreach statement.
The code block of the foreach statement, in line 54, is a single echo statement,
so does not need to be enclosed in braces. It simply writes the key and value for the
current element, separated by a colon and space. The web page generated by this script
is shown in Figure 4.17. Only one row of data is displayed because only the first row
was fetched from the result set. If there is more than one row, then the script should
make further calls to mysql_fetch_array().
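The foreach statement can be tried out on its own with a hand-built associative array standing in for a fetched row. The column values below are invented for illustration, and the output is collected in a variable (an addition for this sketch) before being written out.

```php
<?php
// Stand-in for one row returned by mysql_fetch_array($result, MYSQL_ASSOC);
// the values are invented for illustration.
$row = array(
    'fName' => 'John',
    'lName' => 'Smith',
    'email' => 'j.smith@example.org',
);

$output = '';
foreach ($row as $key => $value)
    $output .= "$key: $value<br />\n"; // single statement, so no braces needed

echo $output;
// prints:
// fName: John<br />
// lName: Smith<br />
// email: j.smith@example.org<br />
```

The keys come out in the order in which the columns appear in the select list, because that is the order in which the elements were added to the array.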
The final step in querying a database is to close the connection to the DBMS once
all processing has been completed. This is done in line 56, using the PHP function
mysql_close(). In fact, the connection could have been closed in line 50 as, once
the query has been run, there is no further communication with the DBMS. Whether
to close a connection as soon as possible or at the end of the script is largely a matter
of personal preference. However, if the script is likely to take a long time to run, with
no further DBMS communications, then it makes sense to close the connection as this
frees resources on the database server.
Version 1.0 of the PHP script SurnameSearch.php illustrates all six steps involved in
querying a database from a server-side web script. The main problem with this first at-
tempt is that it only ever displays data from the first row of the result set. This isn’t re-
ally satisfactory as users will want to see all matching rows of data. A fully working
script must loop through all the rows of a result set, fetching one row at a time into an
array. It can then use the foreach statement to loop through the data in this array
and display one element at a time.
Figure 4.18 (standard form of the PHP while statement):
while( stopCondition )
{
    do this statement;
    do this statement;
    change the stop condition;
}
stopCondition is a Boolean expression tested before each loop through the code block.
When the expression evaluates to FALSE, the looping stops. The code block (between the
braces) is executed once for each loop.
PHP provides both the standard loop statements while and for. The standard form
for the while statement is shown in Figure 4.18. After the keyword while comes a
Boolean expression stopCondition (in parentheses). Provided this expression
evaluates to TRUE, PHP executes the statements in the code block (enclosed in curly
brackets). It then checks the value of stopCondition again and, if it is still TRUE,
executes the code block for a second time. PHP continues to loop through this process
until stopCondition evaluates to FALSE. At this point, it skips past the code
block to the next line of code. Because PHP executes a while statement over and
over again until the value of the Boolean expression stopCondition becomes
false, at least one of the statements in the code block should affect the value of
stopCondition. Otherwise, the script will get stuck in an infinite loop. In this
case, the PHP application server eventually decides that the loop has gone on too
long, and terminates the script with a fatal error.
Version 1.1 of SurnameSearch.php (Figure 4.19) demonstrates how to use the PHP
while statement to loop through the rows of an SQL result set. In line 53 the first
row is fetched from the result set. The while statement starts in line 54 with the
keyword while followed by the stop condition $row <> FALSE. This expression
uses the not-equal-to comparison operator <> and is true when the value of $row is
not the Boolean value FALSE. This test seems a bit odd as $row ought to contain
the first row fetched from the result set (in line 53). However, if there is no row to
fetch, then the PHP function mysql_fetch_array() actually returns the
Boolean value FALSE. So, if the result set is empty, $row equals FALSE and the
stop condition evaluates to FALSE. This means that the code block of the while
statement is never executed. If there is a row to fetch from the result set, then, in
line 54, the value of $row will be an array. This is definitely not equal to the
Boolean value FALSE, so the code block of the while statement is executed. This
code block displays the data held in the array (lines 57 and 58), then calls the PHP
function mysql_fetch_array() again (line 60). This fetches the second row
from the result set, changing the value of $row. It is this statement that ensures the
while statement will eventually terminate as, once all the rows have been fetched
from the result set (and their data displayed), $row will be FALSE. The echo state-
ment in line 59 ensures that, in the generated web page, the data for each row of the
result set appears on a separate line (Figure 4.20).
Figure 4.19 Version 1.1 of SurnameSearch.php: using the PHP while statement to loop
through an SQL result set.
Using a PHP while statement to loop through an SQL result set always follows this
pattern:
1 fetch the first row from the result set into a variable
2 while this variable actually holds a row of data:
(a) display the data
(b) fetch the next row from the result set.
Fetching the first row before beginning the while statement ensures that, if there are
no rows in the result set, the code block of the while statement is never executed.
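The pattern above can be sketched without a database. The function fetch_row() below is hypothetical: it plays the part of mysql_fetch_array() by handing back the next row of an in-memory array, or FALSE when no rows are left. The sketch also uses the strict comparison !== rather than <>, which is a slightly more defensive choice than the one in Figure 4.19.

```php
<?php
// Hypothetical stand-in for mysql_fetch_array(): hands back the next row of
// an in-memory "result set", or FALSE once every row has been fetched.
function fetch_row(array &$resultSet)
{
    $row = array_shift($resultSet);
    return ($row === null) ? false : $row;
}

// Invented rows, for illustration only.
$resultSet = array(
    array('fName' => 'John', 'lName' => 'Smith'),
    array('fName' => 'Jane', 'lName' => 'Smith'),
);

$rowsShown = 0;
$row = fetch_row($resultSet);          // 1 fetch the first row
while ($row !== false) {               // 2 while there is a row of data:
    foreach ($row as $key => $value) { //   (a) display the data
        echo "$key: $value ";
    }
    echo "<br />\n";
    $rowsShown++;
    $row = fetch_row($resultSet);      //   (b) fetch the next row
}
echo "$rowsShown row(s) displayed\n";  // prints: 2 row(s) displayed
```

If $resultSet starts out empty, the first call to fetch_row() returns FALSE and the code block is skipped entirely.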
The for statement offers an alternative for those who don’t like the while state-
ment. A for statement is designed to execute a block of code a fixed number of times,
then stop. This contrasts with the while statement, which can get into an infinite
loop. The general format is shown in Figure 4.21. The keyword for is followed by
three expressions, in parentheses. The first declares a loop counter, which is a variable
that indicates the number of times the code block has been executed. The second indi-
cates when to stop – typically, this states that the loop counter must be less than a
fixed value. The third indicates how much to increase the loop counter by when the
code block has been executed, which usually is 1.
An example of how to use the for statement to loop through an SQL result set is
shown in Figure 4.22. Using a for statement, there is no need to fetch the first row
from the result set to check that there are some rows in there. Instead, in line 53, the
PHP function mysql_num_rows() is used to find out how many rows there are in
the result set. The function description is:
int mysql_num_rows (resource result)
Figure 4.22 Version 1.2 of SurnameSearch.php using the PHP for statement to loop
through an SQL result set.
The number of rows in the result set, stored in the variable $numRows, is used to place
a limit on the number of times the for statement in line 55 loops through its
code block. This statement declares a loop counter – the variable $i – and sets its ini-
tial value to 1. Before each execution of the code block, the value of $i is compared
with the value of the variable $numRows and the code block is executed, provided
$i<=$numRows (<= is the PHP less-than-or-equal-to comparison operator). If there are
no rows, then $numRows is 0 and, as 1<=0 is false, the code block is never executed. If
there are some rows, then the code block is executed once for each of them. After exe-
cuting each code block, the final expression $i=$i+1 instructs PHP to add 1 to the
loop counter $i. If there are four rows in the result set, then the for statement will
loop through its code block four times. The code block runs from line 56 to 62. In line
57, the PHP function mysql_fetch_array() fetches the next row from the result set.
The foreach statement in lines 59 and 60 displays the data in this row and the echo
statement in line 61 ensures that the data for different rows appear on separate lines in
the generated web page.
Figure 4.23 Basic formatting of search results, showing full staff names,
e-mail and a hyperlink to their full details.
There is no noticeable difference in performance between the while and for
statements. Which to use is a matter of personal style. The syntax of the while state-
ment is a little simpler, but the first row must be fetched before the loop starts and the
stop condition can be difficult to understand. The for statement has a more complex
syntax and the number of rows in the result set needs to be checked before starting the
loop, but it is easy to understand how many times the loop will be executed.
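For comparison, here is the for statement version of the loop, again sketched against an in-memory array so that it runs without a database. In this sketch count() plays the part of mysql_num_rows() and an array lookup plays the part of mysql_fetch_array(); the rows themselves are invented.

```php
<?php
// In-memory stand-in for an SQL result set (values invented for illustration).
$resultSet = array(
    array('fName' => 'John', 'lName' => 'Smith'),
    array('fName' => 'Jane', 'lName' => 'Smith'),
    array('fName' => 'Ann',  'lName' => 'Smith'),
);

// count() plays the part of mysql_num_rows() here.
$numRows = count($resultSet);

$rowsShown = 0;
for ($i = 1; $i <= $numRows; $i = $i + 1) {
    $row = $resultSet[$i - 1]; // plays the part of mysql_fetch_array()
    foreach ($row as $key => $value) {
        echo "$key: $value ";
    }
    echo "<br />\n";
    $rowsShown = $rowsShown + 1;
}
echo "$rowsShown of $numRows row(s) displayed\n"; // prints: 3 of 3 row(s) displayed
```

The number of loops is fixed before the loop starts, which is what makes the for version easy to reason about.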
Using a PHP foreach statement to display the name and value of each column
does not really deliver the data in an effective way. In particular, the point of listing
out the names and e-mail addresses is to allow the end user to pick one member of
staff and view his or her full contact details. One approach is simply to list the names
and e-mail addresses followed by a hyperlink, as shown in Figure 4.23. Compared with
the list in Figure 4.20, this list is much clearer. Each ‘full details’ hyperlink points
to the URL StaffDetail.php?staffNo=nnnnn, where nnnnn is the staff num-
ber of the member of staff. When the user clicks on a hyperlink, the web browser re-
quests the StaffDetail.php script, passing it the staff number of the member of staff
whose details are required. StaffDetail.php can then retrieve that person’s full contact
details and deliver them as another web page.
The code in SurnameSearch.php can easily be adapted to deliver the data in this
more useful format. Figure 4.24 shows the changes to the code for steps 3 and 4. In
line 53, the SQL select statement has been altered to retrieve the staffNo from the
database. In line 60, rather than churn out the column names and values with a
foreach loop, the script displays the values of the name and email columns of the
current row in a more natural format – full name followed by e-mail address. Line 61
assembles the URL for the ‘full details’ hyperlink. The URL path is just the name of the
PHP script. This is a relative URL, so the web browser will assume that the file
StaffDetail.php is located in the same directory as SurnameSearch.php.
The query string is more complex as it needs to include the value of the staffNo
column as part of a name-value pair. The value of a name-value pair may include char-
acters that have a special meaning in a URL. For example, the ‘&’ character separates
two different name-value pairs in the query string. If the value of a name-value pair
includes an ‘&’, then the web browser will get confused. So, for example, the ‘&’ (and
the spaces) in the name-value pair:
PHP includes two functions to encode parts of a URL that need to be encoded –
rawurlencode() and urlencode(). The only difference is that urlencode()
replaces each space character with a ‘+’ character instead of its hexadecimal
representation.
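The difference in how the two functions treat spaces is easy to see by encoding the same value with each. The sample value and the staff number below are invented for illustration.

```php
<?php
$value = 'Smith & Jones';

echo rawurlencode($value), "\n"; // prints: Smith%20%26%20Jones
echo urlencode($value), "\n";    // prints: Smith+%26+Jones

// Assembling a relative URL with an encoded query string value.
// The staff number is invented for illustration.
$staffNo = '31210';
$url = 'StaffDetail.php?staffNo=' . urlencode($staffNo);
echo $url, "\n"; // prints: StaffDetail.php?staffNo=31210
```

In both cases the ‘&’ becomes %26, so it can no longer be mistaken for the separator between two name-value pairs.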
The first step in communicating with the DBMS is to establish a connection. There
are two things that can go wrong when trying to connect to the DBMS. PHP may not
be able to find a MySQL DBMS on the given host if either the MySQL DBMS isn’t
running or the host name is wrong. If the host can be found, then the login details
(username or password or both) might be incorrect. In either case, the mysql_
connect() function in line 50 returns the value FALSE. Line 52 checks if
$connection is FALSE:
if ($connection == FALSE)
Notice that the equality comparison operator in PHP is two consecutive equals sym-
bols – ==. A single equals symbol is the variable assignment operator. (SQL uses the
single equals symbol for both these operators, which can be a little confusing.) Using
‘=’ when it should be ‘==’ is another common PHP bug. The assignment operation
($connection = FALSE) always evaluates to FALSE, so an if...else...
statement using this as its condition will always execute the else code block.
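The effect of this bug can be demonstrated with a short sketch. The string assigned to $connection simply stands in for a successful connection, so no database is needed.

```php
<?php
$connection = 'pretend connection'; // stands in for a successful connection

// Correct: == compares, and leaves $connection untouched.
if ($connection == FALSE) {
    $outcome = 'error branch';
} else {
    $outcome = 'connected branch';
}
echo $outcome, "\n"; // prints: connected branch

// Bug: = assigns FALSE to $connection, and the expression evaluates to FALSE,
// so the else block always runs and the connection value is lost.
if ($connection = FALSE) {
    $buggyOutcome = 'error branch';
} else {
    $buggyOutcome = 'connected branch';
}
echo $buggyOutcome, "\n"; // prints: connected branch
var_dump($connection);    // prints: bool(false)
```

Both versions take the same branch here, which is what makes the bug hard to spot: the damage only shows later, when the script tries to use a connection that has been overwritten with FALSE.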
The equality comparison operator ‘==’ can compare values that have different data
types. So, for example, TRUE == 1 is true as PHP regards the numeric value 1 as
equivalent to the Boolean value TRUE (in fact, any non-zero number is equivalent to
TRUE; see PHP Documentation Group (2005) for details). PHP includes the special
operator ‘===’ to mean ‘are equal and the same type’. TRUE === 1 is false as the
two operands are of different types. This is another situation where PHP’s relaxed
attitude to data typing means developers must write their code carefully.
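A few calls to var_dump() show the two operators side by side. This is a minimal sketch with no assumptions beyond the operators themselves.

```php
<?php
var_dump(TRUE == 1);   // bool(true): 1 is converted to TRUE before comparing
var_dump(TRUE === 1);  // bool(false): a Boolean and an integer differ in type
var_dump(FALSE == 0);  // bool(true)
var_dump(FALSE === 0); // bool(false)
```

The strict operator is the safer choice when a function can return either FALSE or a value that merely behaves like FALSE, such as 0.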
When an error occurs, the statement exit() (line 53) ends the script. Although
exit() looks like a function, it isn’t really; it is part of the core PHP language (like
echo, if...else... and so on). It instructs the PHP application server to write a
message (the string parameter) to the HTTP response, then end the script gracefully.
The PHP application server will not read any more of the script file after executing an
exit() statement. The message generated in line 53 begins with the string literal
“\nCould not connect to MySQL DBMS. \n<br />MySQL error”. Note
that this string literal includes an HTML <br /> element, so the message is format-
ted nicely on the final web page. Next comes a call to the function mysql_errno().
This function returns the error number associated with the last function to commu-
nicate with the MySQL DBMS, which, in this case, is mysql_connect(). This
number is concatenated on to the first part of the string. The string literal ‘:’ is
added next, purely to make the final message look better. Finally, the function
mysql_error() returns the error message associated with the last error. An exam-
ple is shown in Figure 4.27. Here, the hostname of the MySQL DBMS is incorrect –
‘locahost’ instead of ‘localhost’. Notice that the page footer is missing from this web
page (compare it with Figure 4.20). The HTML for the page footer comes after the
exit() statement, so, when an error occurs, it is not included in the web page
generated by the PHP script. If there were no errors connecting to the MySQL DBMS,
then the statement in line 55 would be executed. This simply writes an HTML
comment to the generated web page. Strictly speaking, this sort of thing is not neces-
sary, but can be useful during debugging.
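The way the error message is assembled can be sketched separately from the connection attempt. The helper function below is hypothetical, and the error number and text passed to it are invented sample values standing in for what mysql_errno() and mysql_error() would return. In the script itself the assembled string is passed to exit(); here it is simply printed so that the sketch runs to completion.

```php
<?php
// Hypothetical helper: assembles the message passed to exit() in the script.
// $errno and $error stand in for the values returned by mysql_errno() and
// mysql_error(); the sample values used below are invented for illustration.
function connection_error_message($errno, $error)
{
    return "\nCould not connect to MySQL DBMS. \n<br />MySQL error "
        . $errno . ': ' . $error;
}

$message = connection_error_message(2005, "Unknown MySQL server host 'locahost'");
echo $message, "\n";
```

Keeping the message format in one place means every error-handling point in a script can report problems in the same way.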
There is always an ‘error number’ when PHP communicates with the MySQL DBMS. If
the communication went smoothly, then the ‘error number’ would be set to zero (0).
Thus, if a call to mysql_errno() returns the error number 0, then that means there
were no problems communicating with the MySQL DBMS. If your error-handling
routines keep reporting an error of 0, then check the if...else... condition. You
have probably got the logic wrong.
There are three further points where the PHP script in Figure 4.26 communicates with
the DBMS. In line 57, the PHP script states which database it wishes to use and, in line
64, it asks the MySQL DBMS to run an SQL select statement for it. At both points, sim-
ilar error-handling code is included. In line 81, PHP communicates with the MySQL
DBMS for the final time as it tells the DBMS that it is closing the connection. There is
no error handling at this point. If the connection fails to close, there is nothing the
developer can do; the PHP application server will tidy everything up once the script
has finished. Even so, it is good practice to close any DBMS connections as soon as
they are done with – keeping them open takes up resources and can slow down the
database server’s response times.
Figure 4.28 shows the HTML generated by a successful execution of the script file
shown in Figure 4.26. Notice that the HTML has a line break before each <br /> tag.
This is because there is a \n character before the <br /> in line 76 of Figure 4.26 (the
<br /> itself only ensures that each row is displayed on a separate line of the web page). The three
HTML comments generated by the debug code are clearly visible. When there is no data,
it is clear that the SQL select statement has been executed, but no rows were returned.
One last point about DBMS communication errors relates to the
mysql_query() function. A MySQL DBMS communication error occurs with this
function only if the SQL statement is not valid. There are many reasons for an SQL
statement not being valid – the keyword SELECT might be spelt wrongly or one of the
column names in the select list might not appear on a table in the from clause, for ex-
ample. Sometimes, the SQL statement is valid, but wrong. For instance, if the variable
name $surname is spelt wrongly in line 64, the PHP application server will replace it
with an empty string rather than the value entered by the user into the search form.
The actual SQL statement passed to the MySQL DBMS will be:
SupportSession table and, if the modification succeeds, generates the web page
shown in Figure 4.29 (b).
The design is a little naïve. For one thing, there is no login procedure, so anyone
who knows a valid staff number can add support sessions for that member of staff.
This weakness is addressed in Chapter 7, which discusses ways to limit access to web
resources by forcing users to log in. Another weakness is the rather clunky visual de-
sign, but, as stated earlier, visual design issues aren’t addressed in this textbook.
Figure 4.30 shows the PHP code from version 0.1 of the SupportSessionInsert.php
server-side script. It illustrates the five steps needed to modify the database instance
from a PHP script:
1 open a connection to the DBMS
2 specify which database to use
3 pass a database modification statement to the DBMS and capture the result
4 inform the user that the modification succeeded, or display an error message if it
failed
5 close the connection to the DBMS.
These are largely the same steps as those used to query the database, though, in this
case, there is no need to scroll through the result set. For a data modification
statement, the PHP function mysql_query() always returns a Boolean value: TRUE
if the statement succeeded, and FALSE if it failed.
Chapter summary
■ This chapter has reviewed the use of HTML forms and discussed how to deal with
data sent from an HTML form to a PHP script. While doing this, the basics of the
PHP scripting language were introduced and explained. As this is a book about
databases, not programming, the code has been kept pretty simple. However, it
is robust enough to build many standard web database applications. Later chapters
will introduce more sophisticated programming techniques, building on the basics
introduced here.
■ The main purpose of the chapter has been to explain how a PHP script instructs the
PHP application server to communicate with a MySQL DBMS and use the data re-
trieved to generate an HTTP response message (a web page with dynamic content).
It is important to keep the roles of the three players clear:
1 the PHP script is a set of instructions written by a web application developer
2 the PHP application server follows these instructions, which may include
communicating with a DBMS
3 the DBMS responds to requests for information from the PHP application
server – these requests may ask the DBMS to retrieve data from the database
instance or make changes to it.
Further reading
The World Wide Web Consortium (W3C) oversees the HTML standard and related
technologies, such as cascading stylesheets (CSS). Its website (www.w3.org)
includes the formal standards as well as discussions of current developments and
some tutorials. The official PHP website (www.php.net, with a UK mirror at
http://uk.php.net) has a searchable online manual, which provides details on all the
functions covered in this chapter.
There are many good books on HTML and PHP and plenty of Web resources. The best
way to find a book you like is to head to the library, type ‘HTML’ or ‘PHP’ into the
search part of the catalogue, then browse the relevant shelves. The approach varies
from student-focused to practitioner-focused and from a strong emphasis on efficient
programming techniques to the lighter, conceptual touch of this chapter.
Some PHP books the author liked are Bulger et al. (2004), Williams and Lane (2004)
and Welling and Thompson (2005). A favourite introduction to HTML and CSS is
Castro (2007), with Gourley and Totty (2002) as a good reference book.
One good online source is the W3Schools website (www.w3schools.com), which has
introductory tutorials on HTML, SQL, PHP and CSS. Vaswani (2004) is a good online
introduction to PHP.
Review questions
4.1 What is the main difference between sending HTML form data using the HTTP GET
method and the HTTP POST method? How is the data accessed within a PHP
script?
4.2 For each of the following PHP code snippets, write down what you’d expect to be
displayed on the web page generated by PHP. Give a brief explanation of your
answer in each case.
(a) $name='Fred';
echo 'Hello $name.';
(b) $name='Fred';
echo "Hello $name.";
(c) $name='Fred';
echo "Hello $nane.";
(d) $name='Fred';
echo 'Hello', $name;
4.3 Explain the difference between data validation and data cleansing. In what ways are
they similar?
4.4 In HTML, what is the purpose of character entities? How do HTML character entities
differ from the character encoding used for URLs?
4.5 Explain the purpose of the PHP built-in functions strip_tags() and
htmlspecialchars().
4.6 List the six steps for querying a MySQL database, and the PHP DBMS communica-
tion functions associated with each step. If the step does not involve communicating
with the DBMS, then state this.
4.7 List the three PHP loop statements discussed in the chapter and explain their purpose.
4.8 List the PHP comparison operators used in the chapter. What others would you
expect PHP to include?
4.9 List the PHP Boolean operators used in the chapter. What others would you expect
PHP to include?
Exercises
Before attempting these exercises, you must create the Staff Directory database. Code for
this is included in Appendix A.
A number of these exercises introduce PHP functions not covered in the main text of
the chapter. Check the PHP online manual at www.php.net (or the UK mirror site at
http://uk.php.net) for information on these functions.
4.11 Using a diagram (a UML sequence diagram such as Figure 2.28, for example), explain
how data from an HTML form is sent to the database. Include the web browser,
HTTP server, PHP application server and DBMS in your diagram.
4.12 This exercise introduces some more of the PHP string manipulation functions. Using
SurnameSearch.html as a template, write an HTML form to gather a single string of
data from end users. Using version 0.1 of the SurnameStub.php script (Figure 4.3) as
a template, write a PHP stub program to process the data entered as described
below. In each case, display the results to end users.
(a) Use the built-in PHP function strlen() to find the length of the string entered.
(b) Use the built-in PHP functions strtolower() and strtoupper() to change
the case of the letters entered by the user.
(c) Use the built-in PHP function strpos() to find the position of the first occurrence
of the letter ‘e’ in the string. Then find the position of every letter ‘e’. (Hint:
You can use a loop statement to do this.)
(d) Use the built-in PHP functions strpos() and substr() to split the string at
the first space character and display the two halves on separate lines of the web
page.
(e) Suggest why these string manipulation functions can be useful in a web database
application.
4.13 This exercise extends the data validation to deal with literals for strings and numbers
(date validation is covered in Chapter 7). It provides practice with PHP built-in functions
and with if ... else statements. Using SurnameSearch.html as a template,
write an HTML form with two text boxes labelled ‘Character string’ and ‘Number’.
Using version 0.2 of the SurnameStub.php script (Figure 4.6) as a template, write a
PHP stub program to process the data entered and echo the values back to the user.
The data validation code should meet the following requirements.
(a) Use the built-in PHP function array_key_exists() to test that both the array
elements are present. What happens if the user submits the form with nothing
(not even a space character) in the text boxes? Under what circumstances would
the array_key_exists() function actually do some good?
(b) Use the built-in PHP functions trim() and empty() to check that the character
string text box has data in it, not just white space characters.
(c) Repeat this validation for the number text box. Then use the built-in PHP function
is_numeric() to test if the data in the number text box is numeric.
(d) Start a new PHP script. Turn this basic data validation into two functions that you
can reuse in other scripts – one to validate string literals and one to validate numeric
literals. The functions should return a Boolean value – true if the literal
passes validation, false if it fails. (Alternatively, create a single function capable of
validating both string and numeric literals.)
4.14 This exercise integrates the data validation code from Exercise 4.13 into a database
search and tidies up the presentation of the search results. Using version 1.1 of
SurnameSearch.php (Figure 4.19) as a basis, write a PHP script to implement the
following requirements. To test your script, you will need a copy of
SurnameSearch.html that submits its data to this PHP script.
(a) Use the validation function created in Exercise 4.13 (d) to validate the data entered
into the surname text box. For data that passes validation, remember to
trim off any leading or trailing spaces and to carry out data cleansing.
(b) Retrieve the fName, lName and email columns from the Staff table. At the start
of the list, display a count of the number of matching rows found. (Note: You will need
to pay careful attention to nesting of the HTML paragraph elements in your script.)
(c) Provided that some rows were found, display the rows in an HTML paragraph
element. Begin with a header row, displaying the text ‘Name, E-mail’ in bold.
(Hint: Use the HTML strong element, <strong>...</strong>.) Then, for each
row, show the full name and the e-mail address, using a break element to start
each row of data on a separate line.
(d) Start a new PHP script. Display the rows in an HTML table element, rather than a
paragraph element, so that the names and e-mail addresses of different rows line
up nicely. (Alternatively, if you are familiar with CSS, use it to control the layout.)
(e) Try validating the web pages generated by your PHP scripts using the W3C
HTML validator, located at http://validator.w3.org. Adjust your PHP script to eliminate
any HTML validation errors.
If your web pages are on a public web server, just type the URL of the page you wish
to validate into the ‘Validate by URL’ text box. If not, then view your web page in one
browser window, select ‘View source’ and copy the HTML. Open the HTML validator
in another browser window and paste your HTML into the ‘Validate by direct input’
text area.
4.15 This exercise deals with DBMS communication errors. Using your solution to
Exercise 4.14 (or version 1.3 of SurnameSearch.php – the actual code is shown in
Figure 4.26) add code to handle DBMS communication errors. Your code should
meet the following requirements.
(a) When an error occurs while communicating with the DBMS, the error number and
associated message are displayed to the end user.
(b) The script includes a PHP constant DEBUG. If no error occurs and DEBUG is TRUE,
then write a suitable notice as an HTML comment. (Hint: The PHP statement
define("DEBUG", true); names a constant called DEBUG and sets its value
to TRUE. To refer to the constant, simply use its name – for example, the PHP
statement echo DEBUG; writes the value of the constant DEBUG to the HTTP
response.)
(c) When querying the database, if DEBUG is TRUE, then the SQL statement should
be written out as an HTML comment whether or not there is a DBMS communication
error.
4.16 Use the experience gained in the previous exercises to improve the
SupportSessionInsert.php script. In particular:
● ensure that the staff number is a five-digit numeric literal and inform the user of
any errors
● check that the day is one of the string literals ‘Monday’, . . . ‘Friday’ and inform the
user of any errors
● only run the database query if the staff number and day pass validation
● handle any DBMS communication errors as in Exercise 4.15.
Investigations
4.17 String manipulation and data validation with the functions covered in this chapter are
rather cumbersome. Regular expressions provide a much more elegant solution.
Investigate regular expressions and the associated PHP functions.
4.18 Typing ECMAScript into a text box is only one way in which hackers can try to subvert
a PHP script. Investigate other ways hackers seek to subvert web applications.
The UK government agency Centre for the Protection of the National Infrastructure
(CPNI) has produced a briefing document on securing web applications (NISCC,
2006), available at www.cpni.gov.uk/docs/secureWebApps.pdf and it includes a
detailed technical discussion of ‘data validation’, which this book calls ‘data
cleansing’.
4.19 The functions used to connect to the MySQL DBMS in this chapter are actually quite
old (and, hence, reliable). The PHP language includes the MySQL Improved Extension
for connecting to the MySQL DBMS version 4.1 or later. This extension supports
both procedural and object-orientated approaches to coding (see
http://uk.php.net/manual/en/ref.mysqli.php). The PHP Extension and Application Repository
(PEAR) project provides a prebuilt set of database connectivity functions in the
MDB2 package (see http://pear.php.net/package/MDB2). Investigate one (or both) of
these, comparing them to the older set of functions used here with regard to ease of
use, maintainability and robustness.
4.20 PHP is only one of a number of server-side scripting languages. Others include Ruby
on Rails, Microsoft ASP.NET, JSP (Java Server Pages) and Macromedia ColdFusion.
Investigate one of the others, comparing it to PHP.
Chapter objectives
Chapter outline
SQL is the standard data language for relational DBMS and is widely used by non-relational
products. It evolved from a language developed by IBM in the mid 1970s
and was adopted as an international standard in the late 1980s. The International
Organization for Standardization (ISO) oversees development of the language and publishes
the definitive standard. At the time of writing, the latest version was
SQL:2003.
SQL is a large language and, since the mid 1990s, has been computationally complete.
The specification includes if ... else statements, loops and so on. In 1999,
it acquired object-orientated features and, in 2003, facilities to deal with XML. Some of
these advanced features are covered in later chapters. However, many DBMS do not implement
the full language and most database applications only use the basic data language
features.
This chapter focuses on the SQL data language, the original core of SQL. The SQL
data language has two parts:
1 the SQL data definition language for creating and modifying the database schema
2 the SQL data manipulation language for querying and modifying the database
instance.
The basic features of both are introduced in this chapter and more advanced features
are discussed in the following chapter. As all web database developers will need
to query the database instance, but some may never create the database schema, SQL
DML is covered first.
■ Section 5.1 introduces the SQL DML select statement for querying the database
instance. It explains the basic options for manipulating the result set, which are how
to define which columns to include, how to order the rows of the result set and
how to remove duplicate rows from a result set. It also discusses how to include
expressions, rather than simple column values, in the rows of a result set.
■ Section 5.2 discusses the where clause. This allows the developer to identify which
rows the DBMS should include in the result set.
■ Section 5.3 concludes this chapter’s look at the SQL DML by discussing how to modify
the database instance using the SQL DML insert, update and delete statements.
■ Section 5.4 introduces the SQL DDL statements to create, and remove, databases
and tables. These are create schema and create table.
■ Section 5.5 discusses the SQL DDL alter table statement, used to modify a table
definition.
The SQL select statement allows users to query the database. The simplest form of the
select statement is shown in Figure 5.1. The statement begins with the SQL keyword
SELECT, which is followed by a comma-separated list of column names. This is called
the select list. Immediately after the last column name in the select list comes the
from clause. The from clause indicates which database table the columns in the select
list come from. The result of any SQL select statement is a derived table, called the
result set. The data in a derived table is stored in memory, not on disk.
The Staff Directory web database application can use a simple SQL select statement
to retrieve a list of staff and their e-mail addresses from the database (Code listing
5.1.1, below).
The select list includes three columns – fName, lName, email. When the select
statement is executed, the from clause tells the DBMS which table to retrieve data
from. The select list tells the DBMS to retrieve the values of the columns fName,
lName and email, but not the values of any other columns.
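A simple select statement of this kind might look like the sketch below. StaffSample is a hypothetical, cut-down stand-in for the Staff table, and its rows and e-mail addresses are illustrative assumptions rather than the data of Figure 5.2.

```sql
-- Hypothetical stand-in for the Staff table; the rows are illustrative only.
DROP TABLE IF EXISTS StaffSample;
CREATE TABLE StaffSample (
  staffNo VARCHAR(5) PRIMARY KEY,
  fName   VARCHAR(20),
  lName   VARCHAR(20),
  email   VARCHAR(50),
  phone   VARCHAR(4)
);
INSERT INTO StaffSample (staffNo, fName, lName, email, phone) VALUES
  ('10780', 'John',  'Smith', 'j.smith@example.org', NULL),
  ('14443', 'Helen', 'Abbot', 'h.abbot@example.org', '8032');

-- The select list names three columns; the from clause names the table.
-- Each row of the table contributes one row to the result set, and the
-- staffNo and phone values are not retrieved.
SELECT fName, lName, email
FROM StaffSample;
```

To run the same query against the Staff Directory database itself, replace StaffSample with Staff in the from clause.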
Figure 5.2 (b) shows the result set (that is, the derived table) for the select statement
in Code listing 5.1.1. Because the derived table is generated in response to a database
query, it doesn’t have a name and there are no data types shown for the columns. The
data type of the derived table’s columns is determined by the underlying base table.
Figure 5.2 (a) shows the instance of the Staff table (the underlying base table)
being queried. Notice how every row of the Staff table instance is represented in the
result set. Simple select statements, with just a select list and a from clause, always
include the requested column values for every row of the underlying base table.
In the SQL statements in this book, SQL keywords are written in upper-case letters
and identifiers in mixed case. This is simply to make the code easier to read. ISO SQL
is not case-sensitive for keywords or identifiers. Some DBMS, including the MySQL
DBMS, can be case-sensitive for identifiers. For example, the identifiers STAFF and
Staff are the same in ISO SQL, but can be different in the version of SQL imple-
mented by the MySQL DBMS. The DBA can decide whether a particular instance of
a DBMS is, or is not, case-sensitive for identifiers.
Figure 5.2 Querying the Staff table. Examples of a table instance and query result.
Staff
staffNo:Varchar  fName:Varchar  lName:Varchar  email:Varchar      phone:Varchar  photo:BLOB
10780            John           Smith          [email protected]  NULL           [BLOB]
14443            Helen          Abbot          [email protected]  8032           [BLOB]
23257            Freya          Stark          [email protected]  8660           [BLOB]
25447            John           Smith          [email protected]  5104           [BLOB]
25448            Judith Anne    Smith          [email protected]  7709           [BLOB]
31210            Paul           Smith          [email protected]  NULL           [BLOB]
33509            Helen          Timms          [email protected]  8661           [BLOB]
33935            Padma          Brar           [email protected]  6641           [BLOB]
35054            Selma          Hutchins       [email protected]  8706           [BLOB]
35155            Helene         Chirac         [email protected]  NULL           [BLOB]
45965            Mikhail        Sudbin         [email protected]  5553           [BLOB]
55776            Gurpreet       Choudhury      [email protected]  5454           [BLOB]
56673            Joshua         Bittaye        [email protected]  7782           [BLOB]
56893            Ruth           Bapetsi        [email protected]  8022           [BLOB]
77712            Frank          Rose           [email protected]  8871           [BLOB]
78893            Jo Karen       O’Connor       [email protected]  8871           [BLOB]
89987            Dan            Lin            [email protected]  8514           [BLOB]
A user making a database query such as this one probably wants to see the results in alphabetical
order. The select statement’s order by clause allows the user to specify how
to order the rows returned by the select statement. The order by clause comes after the
from clause.
Code listing 5.1.2 shows how to use an order by clause. It asks the DBMS to retrieve
the values in the fName, lName and email columns of all rows from the Staff
table and order the rows in the result set by lName and fName.
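An order by clause of the kind described might be written as in the following sketch. StaffSample and its rows are hypothetical stand-ins for the Staff table, inserted in no particular order so that the effect of the clause is visible.

```sql
-- Hypothetical stand-in for the Staff table; the rows are illustrative only.
DROP TABLE IF EXISTS StaffSample;
CREATE TABLE StaffSample (
  staffNo VARCHAR(5) PRIMARY KEY,
  fName   VARCHAR(20),
  lName   VARCHAR(20),
  email   VARCHAR(50)
);
INSERT INTO StaffSample (staffNo, fName, lName, email) VALUES
  ('31210', 'Paul',  'Smith', 'p.smith@example.org'),
  ('14443', 'Helen', 'Abbot', 'h.abbot@example.org'),
  ('10780', 'John',  'Smith', 'j.smith@example.org');

-- Rows are sorted by lName first; rows with the same lName are then
-- sorted by fName. Result order: Helen Abbot, John Smith, Paul Smith.
SELECT fName, lName, email
FROM StaffSample
ORDER BY lName, fName;
```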
The result set for the select statement in Code listing 5.1.2 is shown in Figure 5.3.
Notice that the four staff members whose surname is ‘Smith’ now appear together, ordered
by their first names. The order by clause can include any columns from the table mentioned
in the from clause, not just those in the select list. So, Code listing 5.1.3 is also
a valid SQL statement.
Code listing 5.1.3 Ordering the result set by a column that isn’t in the
select list
SELECT fName, lName, email
FROM Staff
ORDER BY staffNo;
The select list is not limited to column names, but can include expressions. For example,
the list of staff and their e-mail addresses shown in Figure 5.3 may look a bit odd
as the first and last names are in separate columns. The SQL concatenation operator allows
the character string values of these two columns to be added together into a single
character string. Code listing 5.1.4 shows a first attempt at this and Figure 5.4 the
result set for this select statement. Notice that there are now just two columns in the
result set. For each row of the Staff table, the values from the fName and lName
columns have been merged to form a single value in the result set’s derived table.
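One plausible first attempt at such a query is sketched below; it is an illustration of the concatenation operator and not necessarily the statement in Code listing 5.1.4. StaffSample and its rows are hypothetical. The sketch assumes a DBMS in which || is the concatenation operator, as in ISO SQL; by default the MySQL DBMS treats || as Boolean OR.

```sql
-- Hypothetical stand-in for the Staff table; the rows are illustrative only.
DROP TABLE IF EXISTS StaffSample;
CREATE TABLE StaffSample (
  staffNo VARCHAR(5) PRIMARY KEY,
  fName   VARCHAR(20),
  lName   VARCHAR(20),
  email   VARCHAR(50)
);
INSERT INTO StaffSample (staffNo, fName, lName, email) VALUES
  ('14443', 'Helen', 'Abbot', 'h.abbot@example.org'),
  ('10780', 'John',  'Smith', 'j.smith@example.org');

-- The first column of the result set is an expression, not a column value.
-- Nothing separates the two names, so the values are run together,
-- for example 'HelenAbbot'.
SELECT fName || lName, email
FROM StaffSample
ORDER BY lName, fName;
```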
name email
Helen Abbot [email protected]
Ruth Bapetsi [email protected]
Joshua Bittaye [email protected]
Padma Brar [email protected]
Helene Chirac [email protected]
Gurpreet Choudhury [email protected]
Selma Hutchins [email protected]
Dan Lin [email protected]
Jo Karen O’Connor [email protected]
Frank Rose [email protected]
John Smith [email protected]
John Smith [email protected]
Judith Anne Smith [email protected]
Paul Smith [email protected]
Freya Stark [email protected]
Mikhail Sudbin [email protected]
Helen Timms [email protected]
aliases, the column names of the result set can be used directly to label column values
on the web page. For example, Code listing 5.1.6 gives the alias ‘name’ to the expression
fName || ' ' || lName. A PHP script can use this alias to label the column values.
Figure 5.5 shows the result set for Code listing 5.1.6. Note that both the problems
identified in Figure 5.4 have been corrected.
The concatenation operator for ISO SQL:2003 is ‘||’. However, this symbol is the
Boolean OR operator in the MySQL DBMS – one of the few cases where the dialect of
SQL implemented by the MySQL DBMS conflicts with the ISO standard. The MySQL
DBMS uses the CONCAT() function to concatenate string values. For example, the
MySQL equivalent of Code listing 5.1.6 is:
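A sketch of such a statement follows, assuming the alias ‘name’ and the ordering by surname and first name described in the text. StaffSample and its rows are hypothetical stand-ins for the Staff table.

```sql
-- Hypothetical stand-in for the Staff table; the rows are illustrative only.
DROP TABLE IF EXISTS StaffSample;
CREATE TABLE StaffSample (
  staffNo VARCHAR(5) PRIMARY KEY,
  fName   VARCHAR(20),
  lName   VARCHAR(20),
  email   VARCHAR(50)
);
INSERT INTO StaffSample (staffNo, fName, lName, email) VALUES
  ('14443', 'Helen', 'Abbot', 'h.abbot@example.org'),
  ('10780', 'John',  'Smith', 'j.smith@example.org');

-- CONCAT() joins its arguments in the order given; the ' ' literal puts a
-- space between the two names, and AS gives the expression the alias name.
SELECT CONCAT(fName, ' ', lName) AS name, email
FROM StaffSample
ORDER BY lName, fName;
```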
Figure 5.6 Listing the days on which staff offer support sessions.
(a) Result set for Code listing 5.1.7 (note the duplicate rows).
staffNo  dayOfWeek
23257    Monday
31210    Wednesday
35054    Monday
35054    Monday
45965    Monday
45965    Wednesday
55776    Monday
56673    Friday
56673    Friday
56673    Thursday
56893    Thursday
56893    Tuesday
(b) Result set for Code listing 5.1.8 (note that there are no duplicate rows).
staffNo  dayOfWeek
23257    Monday
31210    Wednesday
35054    Monday
45965    Monday
45965    Wednesday
55776    Monday
56673    Friday
56673    Thursday
56893    Thursday
56893    Tuesday
Suppose now that the user wants a list showing the days when members of staff offer
support sessions. The SupportSession table only includes the staff numbers as
the names are given in the Staff table. Section 6.1 investigates how to use SQL to
retrieve data simultaneously from two tables, but, for now, let us assume that the staff
number is sufficient.
Code listing 5.1.7 is a first attempt at this query. It asks the DBMS to retrieve the values
in the staffNo and dayOfWeek columns of all rows in the SupportSession
table, ordered by the staffNo. The result set for this query is shown in Figure 5.6 (a).
Code listing 5.1.7 Listing the days on which members of staff offer support
sessions
SELECT staffNo, dayOfWeek
FROM SupportSession
ORDER BY staffNo;
The problem here is that there are some duplicate rows in the result set. The staff member
with staffNo ‘35054’ offers two support sessions on a Monday and ‘56673’ offers two
on a Friday. The list should simply show which days a member of staff offers support sessions,
not how many. The duplicate rows need to be removed from the result set.
Placing the SQL keyword DISTINCT before the select list, as in Code listing 5.1.8,
instructs the DBMS to do exactly that (see the result set in Figure 5.6 (b)). Using
DISTINCT only instructs the DBMS to remove duplicate rows from the result set, so
there are still two different rows with staffNo ‘56673’ and four rows with dayOfWeek
‘Monday’, but none of these rows is a duplicate.
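The effect of DISTINCT can be sketched on a small example. SupportSessionSample is a hypothetical stand-in for the SupportSession table, holding just four illustrative rows, two of which are identical.

```sql
-- Hypothetical stand-in for the SupportSession table; rows are illustrative.
DROP TABLE IF EXISTS SupportSessionSample;
CREATE TABLE SupportSessionSample (
  staffNo   VARCHAR(5),
  dayOfWeek VARCHAR(9)
);
INSERT INTO SupportSessionSample (staffNo, dayOfWeek) VALUES
  ('35054', 'Monday'),
  ('35054', 'Monday'),
  ('56673', 'Friday'),
  ('56673', 'Thursday');

-- DISTINCT compares whole rows of the result set. The two identical
-- ('35054', 'Monday') rows collapse into one, but both rows for '56673'
-- remain because their dayOfWeek values differ: three rows in total.
SELECT DISTINCT staffNo, dayOfWeek
FROM SupportSessionSample
ORDER BY staffNo;
```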
Most database queries do not retrieve data from every row of a table. Instead, they instruct
the DBMS to retrieve data from only those rows where a given condition is true. This
condition is specified as a Boolean expression in the select statement’s where clause. A
typical condition is that a column value for the row must satisfy some restriction – the
value of the lName column must equal ‘Smith’, for example. When the DBMS executes
the select statement, it will check each row of the underlying base table in turn. If the
condition is true, then the DBMS retrieves values for the columns (and expressions) in the
select list. In this way, the DBMS constructs the result set row by row, scanning through
the underlying base table from top to bottom and adding a row to the result set for every
row of the underlying base table that satisfies the condition in the where clause.
The select statement is not the only SQL statement to use a where clause, so it is
worth covering this clause in detail. The general format for an SQL select statement
with a where clause is shown in Figure 5.7. Notice that the various parts of the select
statement must be placed in the correct sequence:
● select list
● from clause
● where clause
● order by clause.
The simplest condition is an equality condition – that is, the value in a particular
column must equal some given value. Code listing 5.2.1 uses an equality condition
to request a list of members of staff who have the surname ‘Smith’. The where clause
Figure 5.7 SQL select statement with a where clause and order by clause.
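The clause sequence, together with a simple equality condition, can be sketched as follows. StaffSample and its rows are hypothetical stand-ins for the Staff table.

```sql
-- Hypothetical stand-in for the Staff table; the rows are illustrative only.
DROP TABLE IF EXISTS StaffSample;
CREATE TABLE StaffSample (
  staffNo VARCHAR(5) PRIMARY KEY,
  fName   VARCHAR(20),
  lName   VARCHAR(20),
  email   VARCHAR(50)
);
INSERT INTO StaffSample (staffNo, fName, lName, email) VALUES
  ('31210', 'Paul',  'Smith', 'p.smith@example.org'),
  ('14443', 'Helen', 'Abbot', 'h.abbot@example.org'),
  ('10780', 'John',  'Smith', 'j.smith@example.org');

SELECT fName, lName, email   -- select list
FROM StaffSample             -- from clause
WHERE lName = 'Smith'        -- where clause: an equality condition
ORDER BY fName;              -- order by clause
-- Only the two rows whose lName equals 'Smith' are returned: John, then Paul.
```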
Code listing 5.2.2 A select statement that always returns an empty result set
SELECT fName, lName, email
FROM Staff
WHERE phone = NULL;
This is clearly a problem and SQL gets round it by introducing a special operator for
comparing values to null – IS NULL. This takes a single expression, of any data type,
and outputs TRUE if the expression is null or FALSE if it isn’t. The IS NULL operator
never outputs the value UNKNOWN. Usually, the expression is just a column name.
Code listing 5.2.3 shows how to use the IS NULL operator to list members of staff
for whom the phone number column is null. Nulls and three-valued logic make SQL
more complicated. Always think carefully about how a where clause will treat nulls
and whether or not this is the behaviour that’s required.
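A query using IS NULL in this way might look like the sketch below. StaffSample and its rows are hypothetical stand-ins for the Staff table.

```sql
-- Hypothetical stand-in for the Staff table; the rows are illustrative only.
DROP TABLE IF EXISTS StaffSample;
CREATE TABLE StaffSample (
  staffNo VARCHAR(5) PRIMARY KEY,
  fName   VARCHAR(20),
  lName   VARCHAR(20),
  email   VARCHAR(50),
  phone   VARCHAR(4)
);
INSERT INTO StaffSample (staffNo, fName, lName, email, phone) VALUES
  ('10780', 'John',   'Smith',  'j.smith@example.org',  NULL),
  ('14443', 'Helen',  'Abbot',  'h.abbot@example.org',  '8032'),
  ('35155', 'Helene', 'Chirac', 'h.chirac@example.org', NULL);

-- phone IS NULL is TRUE for the rows with no phone number and FALSE for
-- the others; it is never UNKNOWN. Two rows are returned here.
SELECT fName, lName, email
FROM StaffSample
WHERE phone IS NULL;
```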
● expression IS TRUE
● expression IS FALSE
● expression IS UNKNOWN.
These three operators output TRUE if the expression has the truth value stated or
FALSE if it doesn’t. It is important to use these IS operators in preference to the =
comparison operator for Boolean columns. This is because, in ISO SQL:2003, the
Boolean value represented by the literal UNKNOWN is the same as null (represented by
the SQL keyword NULL). This means that when a Boolean column is assigned the
value UNKNOWN, it is actually made null. That is not the most sensible idea. Any column,
no matter what its data type, can be null. Only a column the data type of which is
Boolean can hold the Boolean value UNKNOWN. Even worse, although the comparisons
columnName=TRUE and columnName=FALSE work as expected,
columnName=UNKNOWN does not! This is because UNKNOWN is the same as null, so
columnName=UNKNOWN is the same as columnName=NULL. As noted above, this always evaluates
to UNKNOWN. There is a good discussion of these issues in Date (2004, pp. 575–63).
Support for the ISO SQL:2003 Boolean data type in commercial DBMS is rather
patchy. The Oracle database does not include a Boolean data type and the MySQL
DBMS silently converts columns defined as BOOLEAN to TINYINT(1), a single-digit
numeric data type. The MySQL DBMS manual (MySQL AB, 2007) states that the numeric
value 0 equates to FALSE and all other numeric values equate to TRUE – there
is no equivalent of the ISO SQL:2003 Boolean value UNKNOWN. In practice, the
MySQL DBMS version 5.0.15 only treats 1 as TRUE; other numeric values are simply
not recognized as being either TRUE or FALSE.
Another common kind of database query asks whether the value of a column lies in a
particular range – Show support sessions that start between 09.00 and 12.00, for example.
Code listing 5.2.4 shows how to write this query. The result set will include values
from every column of the SupportSession table for those rows where the value of
the startTime column lies between 09.00 and 12.00. Syntax like this is what makes
SQL a nice language – it is almost, though not quite, plain English. Understand the
structure of the select statement and it is easy to write simple database queries.
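A range query of this kind can be sketched as follows. SupportSessionSample is a hypothetical stand-in for the SupportSession table; the TIME data type for startTime and the rows themselves are assumptions made for illustration.

```sql
-- Hypothetical stand-in for the SupportSession table; rows are illustrative.
DROP TABLE IF EXISTS SupportSessionSample;
CREATE TABLE SupportSessionSample (
  staffNo   VARCHAR(5),
  dayOfWeek VARCHAR(9),
  startTime TIME
);
INSERT INTO SupportSessionSample (staffNo, dayOfWeek, startTime) VALUES
  ('23257', 'Monday',    '09:00'),
  ('31210', 'Wednesday', '12:00'),
  ('45965', 'Monday',    '14:00');

-- BETWEEN includes both end points, so the 09:00 and 12:00 sessions are
-- returned and the 14:00 session is not.
SELECT *
FROM SupportSessionSample
WHERE startTime BETWEEN '09:00' AND '12:00';
```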
Both IS NULL and BETWEEN are called ‘predicates’ in the ISO SQL standard. A
predicate is simply a statement that a particular object (or objects) has a particular
property. In other words, it is a Boolean expression. IS NULL and BETWEEN are really
predicate symbols, used to form predicates such as phone IS NULL and
startTime BETWEEN '09:00' AND '12:00'. The distinction between an ‘operator’
and a ‘predicate symbol’ is important in mathematical logic, but not in web database
development. In this book, the term ‘operator’ is used for all these symbols.
When a select statement needs to retrieve data from every column of a table, the select
list can be replaced with an asterisk. The select statement:
● Placing unnecessary column values into a result set can slow things down as more
data is being transferred than is needed.
● If a PHP script is set up to expect four columns in a result set and suddenly finds that
there are five, this will generate unexpected results and could cause the script to fail.
A where clause can include more complex conditions than those seen so far by using
the Boolean operators NOT, AND and OR to create longer Boolean expressions. For
example, the range comparison expression:
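A range comparison can be read as two ordinary comparisons joined by AND, as the following sketch illustrates. SupportSessionSample, the TIME data type for startTime and the rows are hypothetical.

```sql
-- Hypothetical stand-in for the SupportSession table; rows are illustrative.
DROP TABLE IF EXISTS SupportSessionSample;
CREATE TABLE SupportSessionSample (
  staffNo   VARCHAR(5),
  dayOfWeek VARCHAR(9),
  startTime TIME
);
INSERT INTO SupportSessionSample (staffNo, dayOfWeek, startTime) VALUES
  ('23257', 'Monday',    '09:00'),
  ('31210', 'Wednesday', '12:00'),
  ('45965', 'Monday',    '14:00');

-- Equivalent to: WHERE startTime BETWEEN '09:00' AND '12:00'
-- Both comparisons must be TRUE for a row to be included.
SELECT *
FROM SupportSessionSample
WHERE startTime >= '09:00' AND startTime <= '12:00';
```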
Code listing 5.2.5 Using AND to test that two unrelated conditions
are both true
SELECT fName, lName, email
FROM Staff
WHERE (phone IS NULL) AND (lName='Smith');
The brackets in the where clause of Code listing 5.2.5 are not required, but can make the
where clause easier to understand. The DBMS uses the truth tables discussed in Chapter 3
to evaluate the value of the complex Boolean expression by first evaluating the values of
the two simpler Boolean expressions phone IS NULL and lName='Smith'. The
complex expression is TRUE when phone IS NULL is TRUE and lName='Smith'
is also TRUE, FALSE if either of these is FALSE and UNKNOWN in the remaining cases,
where neither is FALSE but at least one is UNKNOWN.
The result set is shown in Figure 5.9. Only two rows satisfy both the simpler expressions.
Other staff members satisfy one of the conditions, but are excluded from the
result set because they don’t satisfy the other. For example, Helene Chirac’s phone
number is null, but her surname is not Smith, so she is excluded from the result set.
Judith Anne Smith is also excluded from the result set because her phone number is
7709, which is clearly not null.
The difference between AND and OR is illustrated by Code listing 5.2.6 and Figure 5.10.
With OR, rows that satisfy either (or both) of the simpler expressions are included
in the result set. Rows for Helene Chirac and Judith Anne Smith are both included in
Figure 5.10 because they each satisfy one of the two conditions, though both were
excluded from Figure 5.9 as neither satisfies both conditions.
Code listing 5.2.6 Using OR to test that either one of the distinct
conditions is true
SELECT fName, lName, email
FROM Staff
WHERE (phone IS NULL) OR (lName='Smith');
Some database queries require both the AND and OR Boolean operators to define
which rows should be included in the result set. Suppose the user wants to see a list of
support sessions that are offered either on a Friday afternoon or Monday morning. The
Boolean expression:
dayOfWeek='Monday'
AND (startTime<'12:00' OR dayOfWeek='Friday')
AND startTime>='12:00'
No rows satisfy this expression. The first and final conditions mean that the support
session must run on a Monday afternoon, but the second condition means that it
must run on either a Friday or in the morning. Clearly, these conditions cannot all be
met. Always use parentheses to indicate the correct interpretation of a complex
Boolean expression.
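With parentheses making the intended grouping explicit, the query might be sketched as below. SupportSessionSample, the TIME data type for startTime and the rows are hypothetical stand-ins for the SupportSession table.

```sql
-- Hypothetical stand-in for the SupportSession table; rows are illustrative.
DROP TABLE IF EXISTS SupportSessionSample;
CREATE TABLE SupportSessionSample (
  staffNo   VARCHAR(5),
  dayOfWeek VARCHAR(9),
  startTime TIME
);
INSERT INTO SupportSessionSample (staffNo, dayOfWeek, startTime) VALUES
  ('23257', 'Monday', '10:00'),
  ('45965', 'Monday', '14:00'),
  ('56673', 'Friday', '09:00'),
  ('56673', 'Friday', '15:00');

-- Each pair of parentheses groups one day with its time-of-day condition.
-- The Monday 10:00 and Friday 15:00 sessions are returned; the Monday
-- afternoon and Friday morning sessions are not.
SELECT staffNo, dayOfWeek, startTime
FROM SupportSessionSample
WHERE (dayOfWeek = 'Monday' AND startTime < '12:00')
   OR (dayOfWeek = 'Friday' AND startTime >= '12:00');
```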
The third Boolean operator is NOT. The select statement in Code listing 5.2.8
lists all support sessions apart from those that run on a Monday morning.
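A query that uses NOT in this way can be sketched as follows. SupportSessionSample, the TIME data type for startTime and the rows are hypothetical stand-ins for the SupportSession table.

```sql
-- Hypothetical stand-in for the SupportSession table; rows are illustrative.
DROP TABLE IF EXISTS SupportSessionSample;
CREATE TABLE SupportSessionSample (
  staffNo   VARCHAR(5),
  dayOfWeek VARCHAR(9),
  startTime TIME
);
INSERT INTO SupportSessionSample (staffNo, dayOfWeek, startTime) VALUES
  ('23257', 'Monday', '10:00'),
  ('45965', 'Monday', '14:00'),
  ('56673', 'Friday', '09:00');

-- NOT reverses the truth value of the bracketed expression, so only the
-- Monday 10:00 session is left out of the result set.
SELECT staffNo, dayOfWeek, startTime
FROM SupportSessionSample
WHERE NOT (dayOfWeek = 'Monday' AND startTime < '12:00');
```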
NOT (surname='Smith')
is usually written using the not-equal-to operator – <>:
surname<>'Smith'
Range comparisons can also be negated and SQL allows a fairly natural way to write
such expressions. For example:
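A negated range comparison can be sketched as follows. SupportSessionSample, the TIME data type for startTime and the rows are hypothetical stand-ins for the SupportSession table.

```sql
-- Hypothetical stand-in for the SupportSession table; rows are illustrative.
DROP TABLE IF EXISTS SupportSessionSample;
CREATE TABLE SupportSessionSample (
  staffNo   VARCHAR(5),
  dayOfWeek VARCHAR(9),
  startTime TIME
);
INSERT INTO SupportSessionSample (staffNo, dayOfWeek, startTime) VALUES
  ('23257', 'Monday',    '09:00'),
  ('31210', 'Wednesday', '12:00'),
  ('45965', 'Monday',    '14:00');

-- Same meaning as NOT (startTime BETWEEN '09:00' AND '12:00').
-- Only the 14:00 session lies outside the range, so only it is returned.
SELECT staffNo, dayOfWeek, startTime
FROM SupportSessionSample
WHERE startTime NOT BETWEEN '09:00' AND '12:00';
```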
There is one more comparison operator provided by SQL that is regularly used in database
queries. The LIKE operator performs pattern matching for character strings. It recognizes
two wild card characters. The percentage character – % – matches any group of
characters. For example, the select statement in Code listing 5.2.9 lists all members of
staff whose e-mail addresses begin with the letter ‘j’. The result set is shown in
Figure 5.11. If the comparison expression is changed to email LIKE 'j.b%', then
only the first and fourth rows shown in Figure 5.11 would appear in the result set.
Code listing 5.2.9 Pattern matching with the LIKE string comparison operator
SELECT fName, lName, email
FROM Staff
WHERE email LIKE 'j%';
As with the range comparison expression, SQL allows a natural way to negate a LIKE
comparison. The Boolean expression:
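A negated pattern match might be sketched like this. StaffSample and its rows, including the e-mail addresses, are hypothetical; one row has a null e-mail address to show how nulls are treated.

```sql
-- Hypothetical stand-in for the Staff table; the rows are illustrative only.
DROP TABLE IF EXISTS StaffSample;
CREATE TABLE StaffSample (
  staffNo VARCHAR(5) PRIMARY KEY,
  fName   VARCHAR(20),
  lName   VARCHAR(20),
  email   VARCHAR(50)
);
INSERT INTO StaffSample (staffNo, fName, lName, email) VALUES
  ('10780', 'John',  'Smith', 'j.smith@example.org'),
  ('14443', 'Helen', 'Abbot', 'h.abbot@example.org'),
  ('89987', 'Dan',   'Lin',   NULL);

-- Same meaning as NOT (email LIKE 'j%'). Only Helen Abbot is returned:
-- John Smith's address matches the pattern, and for Dan Lin the comparison
-- with null evaluates to UNKNOWN, so that row is excluded as well.
SELECT fName, lName, email
FROM StaffSample
WHERE email NOT LIKE 'j%';
```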
Figure 5.13 An instance of the Equipment table from the Web Timetable database.
Equipment
assetNo:varchar  assetType:varchar  description:varchar   building:varchar  room:varchar  portable:boolean  cost:numeric  acquired:date
1570131          Data projector     HP vp6100             Wilson            205           FALSE             1300.00       2004-07-14
1799131          Data projector     HP vp6100             NULL              NULL          TRUE              1300.00       2005-01-12
4560293          OHP                3M OHP 1608           Locke             24            FALSE             NULL          NULL
4503993          PC                 Dell OptiPlex™ GX280  Wilson            205           FALSE             625.00        2004-04-30
5010009          OHP                3M OHP 2000           NULL              NULL          TRUE              185.00        2001-08-20
2992220          Data projector     Toshiba TDP-SW20      Priestley         G12           FALSE             1100.00       2005-01-12
2892112          Laptop             SONY X505             NULL              NULL          TRUE              NULL          NULL
‘[email protected]’ and the underscore only matches single characters (for the
same reason, John Smith, with the e-mail address ‘[email protected]’, also does
not appear in the result set). If the underscore were to be replaced with a ‘%’, then all
four Smiths would appear in the result set.
fourth and fifth rows of the table. However, if the date and time value for the first row
of the acquired column is actually '2004-07-14 09:45', then this row will not be
included in the result set for the query in Code listing 5.2.11. This problem does not
occur for DBMS that have separate data types for date and date and time values, as the
ISO SQL standard specifies. Section 6.2 looks at some of SQL’s built-in value functions,
including functions to manipulate date values, which can help deal with this problem.
The SQL DML includes three statements that modify the database instance (in fact,
each modifies the table instance of a particular table):
● the insert statement adds a new row to the table instance of a particular table
● the delete statement removes one or more rows from the table instance of a particular
table
● the update statement modifies one or more rows of the table instance of a particular
table.
In ISO SQL:2003, the only way to modify the table instances of two different tables is
to write two separate SQL statements.
The dialect of SQL used by the MySQL DBMS (Version 4.0.4 onwards) does allow
delete statements and update statements that affect multiple tables. See MySQL AB
(2007) for details.
The SQL DML insert statement adds a new row to a table instance. Figure 5.14 shows
the basic format of the insert statement. It starts with the keywords INSERT INTO
followed by the name of the table the new row will be added to. Next comes a comma-separated
list of the column names the insert statement is supplying data values for. If
no value is supplied for a particular column, it is set to the default value for that column
(or to null, if there is no default value specified).
The keyword VALUES comes next, followed by a comma-separated list of the values
being supplied for the columns in the column list. Obviously, the two lists must be the
same length. Code listing 5.3.1 shows how to use the insert statement to add contact
details for a new member of staff to the Staff Directory database’s Staff table. The
first value – '89988' – is the value for the staffNo column in the new row, the second
value, 'Hasif', is the value of the fName column in the new row, and so on.
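The statement described above can be sketched as follows. This is an illustration rather than the book's own listing: the column names are taken from the surrounding text and the e-mail address is a made-up placeholder. Because no value is supplied for the phone or photo columns, each is set to its default value (or to null, if no default is defined).

```sql
-- Sketch only: add a new member of staff, supplying four column values.
-- The e-mail address is a hypothetical placeholder.
INSERT INTO Staff (staffNo, fName, lName, email)
VALUES ('89988', 'Hasif', 'Choudhury', 'h.choudhury@example.org');
```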
Code listing 5.3.2 An equivalent insert statement to that in Code listing 5.3.1
INSERT INTO Staff (staffNo, fName, lName, email, phone)
VALUES ('89988', 'Hasif', 'Choudhury',
'[email protected]', NULL);
The column list is optional. Code listing 5.3.3 shows an insert statement with no
column list. If the column list is omitted, then the value list must include a value for
every column in the table. The first value is inserted into the first column, the second
value into the second column and so on.
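As a sketch of an insert statement with no column list, assume that the columns of the Staff table were defined in the order staffNo, fName, lName, phone, email, photo (an assumption; check the table definition before relying on it). Every column then needs a value, even if that value is null. The e-mail address is again a placeholder.

```sql
-- Sketch only: values are matched to columns by position.
-- Assumed column order: staffNo, fName, lName, phone, email, photo.
INSERT INTO Staff
VALUES ('89988', 'Hasif', 'Choudhury', NULL,
        'h.choudhury@example.org', NULL);
```

Because the match is purely positional, a statement like this breaks silently if the column order is not what the author expected, which is the argument for always writing the column list.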
Code listing 5.3.4 Adding a new row with all the data values in the wrong
columns
INSERT INTO Staff
VALUES ('Hasif', 'Choudhury', '[email protected]',
'89988', NULL, NULL);
Figure 5.15 General form of the SQL DML update statement (the assignments after SET form the update list).
UPDATE TableName
SET columnName1 = expression,
    columnName2 = expression
WHERE BooleanExpression;
The SQL DML update statement changes column values in selected rows of a table
instance. The general form is shown in Figure 5.15. The statement begins with the
keyword UPDATE, followed by the name of the table whose instance is being
modified. Next comes the keyword SET, followed by a comma-separated list of
assignment statements. Each assignment statement sets the value of one of the columns
in the table to the value given by the expression. Finally, there is a where clause. Just as a where
clause in a select statement tells the DBMS which rows to retrieve, the where clause in an
update statement tells the DBMS which rows of the table instance to modify. If there is
no where clause, then the update statement will affect every row in the table instance.
This is rarely what is required, so most update statements include a where clause.
Suppose that Judith Smith, with staff number 25448, gets married and changes her
surname to Smith-Atkins. This also means that her e-mail address changes. The Staff
table in the Staff Directory database must be updated to reflect these changes in the
real world.
Code listing 5.3.5 shows how the update is effected. The where clause ensures that
only one row of the table is affected – the row that holds data about this particular
member of staff. The update list mentions just two columns, so the column values for
the staffNo, fName, phone and photo columns are unchanged.
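A minimal sketch of the update just described is given here. The table, column names and staff number come from the text; the new e-mail address is a hypothetical placeholder.

```sql
-- Sketch only: change the surname and e-mail address of one member of staff.
-- The new e-mail address is a hypothetical placeholder.
UPDATE Staff
SET lName = 'Smith-Atkins',
    email = 'j.smith-atkins@example.org'
WHERE staffNo = '25448';
```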
5.4 Creating, and removing, databases and tables with SQL DDL
The delete statement removes one or more rows from a table instance. The general
format is shown in Figure 5.16. Notice that the delete statement only deletes whole rows –
you don’t ‘delete’ a single column value from a row. If the value of a column no longer
applies (say Judith Smith-Atkins has moved office and no longer has a telephone), set
the column to null using an update statement.
The delete statement begins with the SQL keywords DELETE FROM, followed by
the name of the table. This can be followed by a where clause, though this is optional.
Without a where clause, the delete statement will remove every row from the table.
For example, suppose that Judith Smith-Atkins has resigned. Code listing 5.3.7
shows how to remove the row that holds data about her from the Staff table
instance. Notice that, in this case, only one row is removed because the where clause
identifies a single row (the staffNo is the primary key for this table). If the where
clause identifies more than one row, then all are removed from the table instance.
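The two cases discussed above can be sketched as follows, assuming the staff number 25448 used earlier for Judith Smith-Atkins. The first statement clears a single column value with an update statement; the second removes her whole row.

```sql
-- Sketch only: a single column value is cleared with UPDATE, not DELETE.
UPDATE Staff
SET phone = NULL
WHERE staffNo = '25448';

-- Sketch only: remove the one row identified by the primary key value.
DELETE FROM Staff
WHERE staffNo = '25448';
```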
Code listing 5.3.8 Removing all rows from the SupportSession table
DELETE FROM SupportSession;
Databases and tables need to be created before data can be added to them or retrieved
from them. A database corresponds to an SQL schema. Figure 5.17 shows the basic
format of the SQL DDL create schema statement. The SQL keywords CREATE SCHEMA
are followed by the name of the new SQL schema. In fact, this is all that is required.
The authorization clause, which indicates which database user owns the new SQL
schema, is optional. If included, it begins with the SQL keyword AUTHORIZATION
(spelt with a ‘z’, not an ‘s’), followed by a valid authorization identifier. If the
authorization clause is omitted, then the DBMS assumes that the user executing the
statement owns the SQL schema.
Code listing 5.4.1 shows the SQL DDL statement to create an SQL schema called
WebTimetable, owned by the user with the authorization identifier
WebTimetableDba. The WebTimetable SQL schema holds all the tables and other
database objects needed by the Web Timetable application.
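Based on the description above, the statement would look something like this (a sketch, not a copy of the book's listing). Leaving out the AUTHORIZATION line would make the user executing the statement the owner of the schema.

```sql
-- Sketch only: create the schema and name its owner.
CREATE SCHEMA WebTimetable
  AUTHORIZATION WebTimetableDba;
```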
3 default clause – the keyword DEFAULT is followed by a default value for this
column, enclosed in parentheses, which is usually a literal, but can be a more complex
expression
4 column constraint – the keywords NOT NULL place a not-null constraint on the
column.
Only the column name and data type are required. In Figure 5.18, column1 has a
minimal column definition, while the other four include one or both of the two optional
characteristics. The column constraint for column3 is a not-null constraint. There are
other valid column constraints, but, as a matter of coding style, only not-null
constraints will be defined as column constraints. The column definition for column5
has all four characteristics in the correct order.
In the dialect of SQL used by the MySQL DBMS, the expression that defines the default
value should not be enclosed in parentheses. See MySQL AB (2007) for details.
The table constraint definition in Figure 5.18 defines a primary key constraint. It
begins with the keyword CONSTRAINT, followed by an SQL identifier that names the
constraint. This part (CONSTRAINT constraintName) is optional and, if it is omit-
ted, then the DBMS assigns the constraint a system-defined identifier. It is always best
to give a constraint a meaningful name, rather than relying on the DBMS to generate
one as it is often easier to remove a constraint later if its name is known.
After the constraint name, the keywords PRIMARY KEY indicate that this is a
primary key constraint. The columns that form the primary key are then listed in
parentheses. In this book, primary key constraint names consist of the table name prefixed
by pri. This is just another coding convention. Naming a constraint priRoom does
not make it a primary key; the keywords PRIMARY KEY do this.
Any valid SQL table constraint can be included in a create table statement, but, in
this book, only primary key constraints are defined in the create table statement.
Other constraints are added once the table has been created, using the SQL DDL alter
table statement (see below). This is another coding convention and makes it easier to
define foreign key constraints.
The Room and Equipment tables from the Web Timetable database use most of
the different available data types and include columns with default values and not-
null constraints. Sample table instances are shown in Figure 5.19. Note that the
Equipment table now includes both the costs of the equipment and the dates when
they were purchased. The SQL DDL statement to create the tables is shown in Code
listing 5.4.2.
SQL does not require the column and constraint definitions to appear on separate
lines, but it is much easier to read the statement when it is presented with line breaks.
Lining up the column names and data types is another good way to make the
statement more readable, but is not required.
Figure 5.19 Sample instances of the Equipment and Room tables from the Web Timetable database.
Equipment
assetNo:varchar | assetType:varchar | description:varchar  | building:varchar | room:varchar | portable:boolean | cost:numeric | acquired:date
1570131         | Data projector    | HP vp6100            | Wilson           | 205          | FALSE            | 1300.00      | 2004-07-14
1799131         | Data projector    | HP vp6100            | NULL             | NULL         | TRUE             | 1300.00      | 2005-01-12
4560293         | OHP               | 3M OHP 1608          | Locke            | 24           | FALSE            | NULL         | NULL
4503993         | PC                | Dell OptiPlex™ GX280 | Wilson           | 205          | FALSE            | 625.00       | 2004-04-30
5010009         | OHP               | 3M OHP 2000          | NULL             | NULL         | TRUE             | 185.00       | 2001-08-20
2992220         | Data projector    | Toshiba TDP-SW20     | Priestley        | G12          | FALSE            | 1100.00      | 2005-01-12
2892112         | Laptop            | SONY X505            | NULL             | NULL         | TRUE             | NULL         | NULL
Room
building:varchar | roomNo:varchar | capacity:numeric
Wilson           | 205            | 25
Wilson           | 113            | 25
Wilson           | 105            | 100
Priestley        | G12            | 150
Priestley        | 113A           | 20
Locke            | 24             | 30
Locke            | 27             | 25
Locke            | 14             | 30
The first create table statement in Code listing 5.4.2 creates the Room table. There are
three column definitions and a primary key constraint. The first column definition
defines a column called building with data type VARCHAR(15), so, in each row of
the table, the value of the building column will be a character string of varying
length, with a maximum length of 15 characters. The not-null constraint means that,
in every row of the Room table, the building column will have a value; it is never
null. Similarly, the value of the roomNo column will be a character string of varying
length with a maximum of four characters and is never null.
The capacity column will have the data type INTEGER. The values of this
numeric data type are integers. There is no precision defined by SQL (no limit on the
number of digits in an INTEGER value), though in practice the DBMS, and the
hardware and operating system it runs on, will impose a limit on the size of integers.
Obviously, there is no scale for an integer – there are no digits after the decimal point
(see Chapter 3 for a discussion of precision and scale). The capacity column is
allowed to be null, although, in the table instance shown in Figure 5.19, every row has
an integer value in the capacity column.
Finally, the table-constraint definition shows how to define a composite primary key.
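Putting the description above together, the create table statement for the Room table would look something like the following sketch. The data types and not-null constraints are those stated in the text; the constraint name priRoom follows the book's naming convention, and the layout is illustrative.

```sql
-- Sketch only: Room table with a composite primary key.
CREATE TABLE Room (
  building  VARCHAR(15)  NOT NULL,
  roomNo    VARCHAR(4)   NOT NULL,
  capacity  INTEGER,
  CONSTRAINT priRoom PRIMARY KEY (building, roomNo)
);
```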
The second create table statement in Code listing 5.4.2 creates the Equipment
table. Note that the data type and size of the Equipment table’s building and
room columns must exactly match those of the corresponding building and
roomNo columns of the Room table.
The first five column definitions should be familiar. The sixth column definition
defines the portable column to have its data type as Boolean. Values in this
column, therefore, will be Boolean values, which, in SQL, are TRUE, FALSE or UNKNOWN.
The default clause ensures that if a row is added to the Equipment table with no
value specified for portable, then the DBMS will use the default value FALSE. A
column that does not have a default clause, such as the description column of the
Equipment table, has a default of null.
The last part of the column definition for portable is a not-null constraint. This
means that it can never be null. In ISO SQL:2003, the truth value UNKNOWN is actually
represented by null (Date, 2004, p. 592), so the portable column can only be either
TRUE or FALSE.
A good rule of thumb is to always define Boolean columns to be not null. If a column
value needs three different states – for example, an assignment from a particular
student can be ‘submitted’, ‘marked’ or ‘unknown’ – then use a character data type and
apply a check constraint to restrict the column values.
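As a sketch of this rule of thumb, assume a hypothetical Assignment table with a status column (neither name comes from the Web Timetable database). The three states are held as character strings and a check constraint limits the column to those values.

```sql
-- Sketch only: Assignment, assignmentNo and status are hypothetical names.
CREATE TABLE Assignment (
  assignmentNo  VARCHAR(10)  NOT NULL,
  status        VARCHAR(9)   NOT NULL,
  CONSTRAINT priAssignment PRIMARY KEY (assignmentNo),
  CONSTRAINT chkAssignmentStatus
    CHECK (status IN ('submitted', 'marked', 'unknown'))
);
```

Here 'unknown' is an ordinary string value, so it takes part in comparisons like any other value and avoids the complications of three-valued logic.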
The column definition for cost has the data type NUMERIC(7,2). Here, the
precision is 7, meaning that the value in the cost column can have up to seven digits. The
scale is 2, so two of these digits must come after the decimal point. Thus, the cost
column can hold values from 0 to 99999.99. (In fact, it can also hold negative values in
the same absolute range.)
The last column defined is acquired with the data type DATE. Both cost and
acquired can be null, as shown in Figure 5.19.
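A sketch of the Equipment table definition described above follows. The sizes of the assetNo, assetType and description columns are not stated in the text, so the values used here are assumptions, as is the constraint name priEquipment (formed using the book's convention). Whether assetType and description carry not-null constraints is also not stated; they are left nullable here. The default value is written in parentheses as the text describes; some DBMS, such as MySQL, expect it without them.

```sql
-- Sketch only: VARCHAR sizes marked "assumed" are not given in the text.
CREATE TABLE Equipment (
  assetNo      VARCHAR(15)   NOT NULL,   -- size assumed
  assetType    VARCHAR(30),              -- size assumed
  description  VARCHAR(50),              -- size assumed; default is null
  building     VARCHAR(15),              -- must match Room.building
  room         VARCHAR(4),               -- must match Room.roomNo
  portable     BOOLEAN       DEFAULT (FALSE) NOT NULL,
  cost         NUMERIC(7,2),             -- up to 99999.99
  acquired     DATE,
  CONSTRAINT priEquipment PRIMARY KEY (assetNo)
);
```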
The SQL standard insists that no column forming part of a primary key can ever be
null. This means that the not-null column constraints on the building and roomNo
columns of the Room table, and on the assetNo column of the Equipment table, are
unnecessary. Including them does no harm, however, and makes clear that these
columns cannot be null, even to people who don’t understand SQL that well.
Depending on the DBMS, it may even be necessary, so always include them.
The distinction between SQL column and table constraints is purely syntactic: both
restrict what rows can exist in a table instance. In fact, a column constraint is just
shorthand for a table constraint. Code listing 5.4.3 shows how to define the Room
table using SQL check constraints on the table, rather than the not-null constraints on
the columns.
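A sketch of such a definition is given here. The check constraint names are hypothetical, chosen to follow the chk prefix convention used in this book; the rest mirrors the Room table as described earlier.

```sql
-- Sketch only: the not-null rules written as table (check) constraints.
-- chkRoomBuilding and chkRoomRoomNo are hypothetical constraint names.
CREATE TABLE Room (
  building  VARCHAR(15),
  roomNo    VARCHAR(4),
  capacity  INTEGER,
  CONSTRAINT priRoom PRIMARY KEY (building, roomNo),
  CONSTRAINT chkRoomBuilding CHECK (building IS NOT NULL),
  CONSTRAINT chkRoomRoomNo   CHECK (roomNo IS NOT NULL)
);
```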
belong to some SQL schema. The way it works is that there is an SQL schema associated
with each session as, when users log on to the DBMS, they must tell it what SQL
schema they will be working on. Any create table statement executed during that
session creates a table in the associated SQL schema. Users can change SQL schema
during a session – the MySQL DBMS includes the USE statement to do this, for example.
One final point about the create table statement: there is a lot more to it than
has been discussed here. However, most of the other options are not usually needed
and may not be implemented by some commercial DBMS. Furthermore, commercial
DBMS usually include additional options in their versions of the SQL create table
statement. Every database developer should read the manual for the DBMS being
used to become reasonably familiar with all the options of the create table statement
that are available. This goes for the other statements, too.
Sometimes, it is necessary to remove a table entirely from a database. Tables are
removed using the SQL DDL drop table statement. Removing a table has two effects:
1 all rows in the table instance are removed from the database
2 all meta-data relating to the table definition are removed from the data dictionary.
Code listing 5.4.5 shows how to remove the Room table. The RESTRICT keyword
means that if the Room table is referred to by a foreign key or check constraint on
another table, then the DROP statement will fail. The DBMS will instead display an
error message. This is the default behaviour, so the RESTRICT keyword is optional.
Code listing 5.4.5 Two equivalent SQL DDL statements to remove a table
(a) Explicitly using the RESTRICT option
DROP TABLE Room RESTRICT;
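The second of the two equivalent statements is not reproduced above. Since RESTRICT is the default behaviour, it is presumably the same statement with the keyword left out:

```sql
-- (b) Relying on the default behaviour (presumed form)
DROP TABLE Room;
```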
Code listing 5.4.6 Removing a table and all tables that refer to it
DROP TABLE Room CASCADE;
Sometimes, whole SQL schemas need to be removed. Code listing 5.4.7 shows how to
remove an SQL schema with the SQL DDL drop schema statement. As with the drop
table statement, the keyword RESTRICT instructs the DBMS to remove the SQL
schema only if there are no tables (or other database objects) in it.
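Assuming the schema created earlier, a sketch of the statement described here would be:

```sql
-- Sketch only: fails if the schema still contains tables or other objects.
DROP SCHEMA WebTimetable RESTRICT;
```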
Code listing 5.4.8 Removing an SQL schema and all related database objects
DROP SCHEMA WebTimetable CASCADE;
The SQL create table statements in this book do not include foreign key or enterprise
constraints. As discussed in Chapter 3, the columns {building, room} are a foreign
key on the Equipment table that refers to the Room table. To define the associated
foreign key constraint when the Equipment table is created, the Room table must
already exist. More generally, to define foreign key constraints as part of the create
table statement, the referenced tables must be created before the referencing tables.
With a large database, lots of tables and many foreign keys, working out which
tables to create first becomes a chore. It is much better if the initial create table
statements only include the not-null and primary key constraints. Other constraints are
added using the SQL DDL alter table statement once all the tables have been created.
The ISO SQL:2003 standard only allows one modification to a table schema in an
alter table statement. Some DBMS allow you to make several modifications in a single
alter table statement (the Oracle database and MySQL DBMS both allow this). All the
examples in this book stick to the ISO SQL:2003 standard.
The format of the alter table statement changes slightly depending on what
modifications are being made to the table. Figure 5.20 shows the format of the alter table
statement used to add a foreign key constraint to a table.
The statement begins with the SQL keywords ALTER TABLE, followed by the name
of the table being modified. The keywords ADD CONSTRAINT indicate that a new
table constraint is being added. They are followed by the constraint name.
The next two lines in Figure 5.20 form the constraint definition. The keywords
FOREIGN KEY indicate that a foreign key constraint is being added rather than any
other table constraint. Next, comes a list of the foreign key columns. Of course, if there
is only a single column in the foreign key, then there will be only a single column in
this list.
The keyword REFERENCES is followed by the name of the referenced table and a
list of the matching candidate key columns. The order of the columns in these lists is
significant. The first foreign key column is matched with the first matching candidate
key column and so on. Get the order wrong, and the foreign key constraint will probably
not work at all or, certainly, not as expected.
Figure 5.20 Adding a foreign key constraint using the SQL DDL alter table statement.
The constraint definition ends by specifying the referential actions. ON UPDATE tells
the DBMS what to do with rows in the referencing table when the matching row in the
referenced table is modified. Similarly, ON DELETE says what to do when a row of
the referenced table is removed. In both cases, the referential action (indicated by
updateAction and deleteAction in Figure 5.20), must be one of NO ACTION,
CASCADE, SET NULL, SET DEFAULT and RESTRICT.
The first four were discussed in Chapter 3, but RESTRICT is new. RESTRICT is
very similar to NO ACTION and the difference is in how the DBMS behaves while it is
running an SQL DML statement. Every modification to the database instance – for
example, to delete some rows from a table or modify column values in some rows of a
table – goes through three stages.
1 The first stage is the set-up stage. During set-up, the DBMS identifies the target
rows – those that will be modified directly by the SQL statement. Modifying these
rows may cause indirect modifications to other rows in the form of referential
actions and these matching rows are also identified.
2 Next comes the execution stage, during which the DBMS modifies the target rows,
but does not change the matching rows.
3 The final stage is clean-up. During clean-up, the matching rows are modified, based
on the relevant referential action.
The referential action NO ACTION says that the foreign key constraint should only be
checked at the end of the clean-up stage and, if it is violated at that stage, then all the
modifications must be undone. This means that the foreign key constraint may be
violated temporarily during the execution stage, because it is only checked right at the
end of the clean-up stage. In contrast, the RESTRICT referential action insists that the
foreign key constraint is never violated, not even temporarily, during the execution
stage. Table 5.1 summarizes the referential actions supported by SQL:2003 and explains
their effects when a row in the referenced table is modified or removed.
Code listing 5.5.1 shows how to add the foreign key constraint to the Equipment
table. The SQL identifier frnEquipmentRoom names the constraint and the name
indicates that the constraint is a foreign key on the Equipment table, which refer-
ences the Room table. Notice that the building column of the Equipment table is
matched with the building column of the Room table, while the room column of
the Equipment table is matched with the roomNo column of the Room table. Note
that NO ACTION is the default referential action, so the last line in Code listing 5.5.1
could be omitted.
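A sketch of the kind of statement described above is given here. The constraint name and the column matching come from the text; the ON UPDATE action is an assumption chosen for illustration, as the text only indicates that the last line specifies NO ACTION.

```sql
-- Sketch only: foreign key from Equipment (building, room)
-- to Room (building, roomNo). Column order in the two lists matters.
ALTER TABLE Equipment
  ADD CONSTRAINT frnEquipmentRoom
  FOREIGN KEY (building, room)
  REFERENCES Room (building, roomNo)
  ON UPDATE CASCADE       -- assumed action, for illustration
  ON DELETE NO ACTION;    -- the default, so this line could be omitted
```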
Figure 5.21 Adding a check constraint using the SQL DDL alter table
statement.
condition is the same as the Boolean expression in the where clause in an SQL DML
statement. This allows a wide range of business rules to be enforced, as the following
examples indicate. However, SQL does not support transition constraints, so all the
examples are of state constraints.
The first example uses a check constraint to enforce the business rule:
the building name must be ‘Wilson’, ‘Priestley’ or ‘Locke’.
There are two obvious ways to do this, illustrated by the two alter table statements
in Code listing 5.5.2. The first alter table statement creates a constraint called
chkBuildingName1 – using the prefix chk to begin the names of check constraints
is simply a coding convention. The check condition on this alter table statement uses
three equality comparisons joined by OR to express the business rule. It is true – and,
hence, the constraint is satisfied – if the value of the building column is one of the
three string literals ‘Wilson’, ‘Priestley’ and ‘Locke’.
The second alter table statement uses the SQL IN operator to form its check condition.
The Boolean expression building IN ('Wilson', 'Priestley', 'Locke')
asks the question, ‘Is it true that the value of building is one of the values in the list
('Wilson', 'Priestley', 'Locke')?’
Clearly, both these constraints enforce the business rule. Which to use is a matter of
personal preference, though some DBMS take longer to evaluate the IN operator.
Code listing 5.5.2 Enforcing the business rule ‘The building name must be
“Wilson”, “Priestley” or “Locke”’
ALTER TABLE Room
ADD CONSTRAINT chkBuildingName1
CHECK (building='Wilson'
OR building='Priestley'
OR building='Locke');
ALTER TABLE Room
ADD CONSTRAINT chkBuildingName2
CHECK (building IN ('Wilson', 'Priestley', 'Locke'));
It’s important to note that both these check constraints would allow a row with the
building column set to null. When a row is added, or modified, the DBMS
determines the truth of the statement ‘The value supplied for the building column is
equal to one of the allowed values.’ The constraint is only violated if this statement is
FALSE, in which case the DBMS rejects the change to the table instance. When the
column would be null, the truth value of the statement is UNKNOWN as any
comparison involving nulls evaluates to UNKNOWN (except those using the special IS NULL
operator). Thus, the constraint is not violated and the table instance would be
changed, except for the fact that there is also a not-null constraint on the building
column. It is this constraint that stops the building column from being null.
The MySQL DBMS does not enforce check constraints, although it does recognize
the syntax. Instead, it includes an enum data type. Like a check constraint, this data
type allows nulls, unless there is a not null constraint on the column. See MySQL AB
(2007) for details.
The second example uses a check constraint to enforce the business rule:
the value of the capacity column must lie in the range 10 to 250.
As discussed above, range comparisons can be written using the SQL BETWEEN
operator or the <= and >= comparison operators. Both versions are shown in Code listing
5.5.3. The check condition (capacity BETWEEN 10 AND 250) is self-explanatory.
Note that (capacity BETWEEN 250 AND 10) is a valid Boolean expression, but
never true as it states that the capacity must be at least 250 and at most 10,
which is impossible. Again, both forms of this check constraint allow capacity to be
null and, as there is no not-null constraint, the capacity of a particular room may not
be recorded in the database.
Code listing 5.5.3 Enforcing the business rule ‘The value of the capacity
column must lie in the range 10 to 250.’
ALTER TABLE Room
ADD CONSTRAINT chkRoomCapacity1
CHECK (capacity BETWEEN 10 AND 250);
ALTER TABLE Room
ADD CONSTRAINT chkRoomCapacity2
CHECK (capacity <= 250 AND capacity >= 10);
SQL also includes the unique constraint (Figure 5.22). This is used mostly to enforce
candidate key constraints other than the primary key constraint. The keyword UNIQUE
is followed by a list of columns. Like a primary key constraint, it is the combination of
column values that must be unique. At the Pennine University, the business rule:
every member of staff has an e-mail account with a unique e-mail address
can be enforced by a unique constraint. Code listing 5.5.4 adds this unique constraint
to the email column of the Staff table.
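A sketch of such a statement follows. The table and column names come from the text; the constraint name unqStaffEmail is a hypothetical one, chosen in the style of the pri, frn and chk prefixes used elsewhere in this book.

```sql
-- Sketch only: unqStaffEmail is a hypothetical constraint name.
ALTER TABLE Staff
  ADD CONSTRAINT unqStaffEmail
  UNIQUE (email);
```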
Figure 5.22 Adding a unique constraint using the SQL DDL alter
table statement.
Figure 5.23 Using the SQL DDL alter table statement to remove a constraint.
Some DBMS allow a constraint to be disabled and then enabled. This is less drastic
than removing and recreating it, so is a useful feature if there is lots of legacy data to
import.
Unlike removing a table or an SQL schema, removing a constraint never has knock-on
effects on other tables. Even removing a foreign key constraint will not affect the
referenced table. Even on the referencing table there is no immediate effect on the table
instance – the data in the table instance conforms to the constraint, whether it is
enforced by the DBMS or not. For this reason, there is no need to include the CASCADE
and RESTRICT keywords.
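For instance, the two check constraints on the building column defined earlier enforce the same rule, so one of them could be removed. A sketch, in the form the text describes (with no CASCADE or RESTRICT keyword; individual DBMS may differ on this point):

```sql
-- Sketch only: remove a constraint by name.
ALTER TABLE Room
  DROP CONSTRAINT chkBuildingName1;
```

This is where a meaningful constraint name pays off: a system-generated name would first have to be looked up in the data dictionary.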
The other use of the alter table statement is to add, modify or remove columns.
Adding a new column (Figure 5.24) is much like adding a constraint – the keywords
ADD COLUMN are followed by a column definition, in the same format as was the case
for a create table statement. Also as in the create table statement, only the column
name and data type are required. Code listing 5.5.6 adds a new column to the Room
table. Notice that the default value is zero. In this situation, zero indicates that the
floorspace is not known, but, usually, null is used for this purpose. Using a valid value
to indicate missing information is not usually a good idea as it can lead to ambiguity.
However, in some circumstances, such as this one, there is no possibility of confusion –
the floorspace in a room is always greater than zero.
Figure 5.24 Adding a column to a table using the SQL DDL alter table statement.
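A sketch of such a statement follows. The column name floorSpace and the default of zero come from the text; the data type NUMERIC(6,2) is an assumption, and the parentheses around the default follow the book's description of the default clause (some DBMS expect the bare literal).

```sql
-- Sketch only: data type assumed; zero stands for "not known".
ALTER TABLE Room
  ADD COLUMN floorSpace NUMERIC(6,2) DEFAULT (0);
```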
Even a column that was originally defined without a default clause can be given one
using the SET DEFAULT keywords in the alter table statement. Code listing 5.5.7
shows how to modify, and remove, the default value on the Room table’s floorSpace
column.
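The two operations can be sketched as follows, using the ALTER COLUMN clause of the alter table statement. The replacement default of 20 is an arbitrary value chosen for illustration.

```sql
-- Sketch only: change the default value of an existing column ...
ALTER TABLE Room
  ALTER COLUMN floorSpace SET DEFAULT 20;

-- ... or remove the default altogether (the default becomes null).
ALTER TABLE Room
  ALTER COLUMN floorSpace DROP DEFAULT;
```

Neither statement changes values already stored in the table instance; a default only applies to rows added afterwards.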
Figure 5.25 Using the SQL DDL alter table statement to modify a column.
Figure 5.26 Removing a column using the SQL DDL alter table statement.
The final modification supported by the alter table statement is that of removing an
entire column. Removing a column, like removing a table, can have knock-on effects.
For example, if the column is part of a primary key, then removing the column means
that the primary key constraint must also be removed as the remaining columns
cannot form a candidate key for the table (remember: all candidate keys are irreducible).
For this reason, the CASCADE and RESTRICT keywords are part of the DROP COLUMN
clause, as shown in Figure 5.26. Code listing 5.5.8 shows how to remove a column.
The first statement will fail if there are knock-on effects, as RESTRICT is the default.
The second will always succeed, but may have knock-on effects.
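The two forms can be sketched as follows, using the floorSpace column as the example. They are alternatives: only one would be run against a given column.

```sql
-- Sketch only: fails if removing the column would have knock-on effects
-- (RESTRICT is the default, as described in the text).
ALTER TABLE Room
  DROP COLUMN floorSpace;

-- Alternative: removes the column together with anything that depends on it.
ALTER TABLE Room
  DROP COLUMN floorSpace CASCADE;
```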
SQL is still the best basis for learning the language as most DBMS support most of the
features discussed above.
Chapter summary
■ This chapter has covered the basic features of the SQL data language. The SQL DML
select, insert, update and delete statements were introduced, with examples. The
where clause was discussed separately, including the problems caused by nulls and
three-valued logic. The SQL DDL create schema, create table, alter table and drop
statements were also introduced.
■ Although the ISO SQL standard forms a good basis for learning SQL, it’s important
to always check the manual for a particular DBMS. Few DBMS implement the whole
ISO SQL standard, and all include extensions and additional features that may
prove useful. In particular, support for the Boolean data type is very patchy. (This is
odd as it is the only data type actually required by the relational data model.)
Further reading
The main reference work on the ISO SQL standard is Melton and Simon (2002), although
this covers the earlier SQL:1999 standard.
The actual ISO SQL standard can be purchased from the ISO website (www.iso.org),
though it is unlikely to be much use to beginners.
Date (2004) covers SQL, but is not very enthusiastic about the language.
Websites include the always excellent W3Schools (www.w3schools.com) and the sites
of the major DBMS vendors.
A copy of the BNF grammar for the ISO SQL:2003 standard is available online from
Savage (2005).
Review questions
5.1 Explain the following terms used in the definition of the SQL data language:
(a) SQL schema
(b) catalog
(c) information schema
(d) SQL DDL
(e) SQL DML.
5.2 Name the four main statements of the SQL DML and explain their purpose.
5.3 Which parts of the select statement must always be included? Why?
5.4 Explain the purpose of the order by clause of the select statement.
5.5 In the context of the select list, what is an alias? Why might you use one?
5.6 Name three of the SQL DDL statements discussed in this chapter and explain their
purpose.
5.7 List the four main characteristics of a column definition in a create table statement
and explain them.
5.8 Explain the purpose of the RESTRICT and CASCADE keywords in the drop statement.
5.9 Explain the purpose of the SQL check constraint and unique constraint.
5.10 Summarize the functionality of the SQL alter table statement discussed in this
chapter.
Exercises
These exercises use the Equipment and Room tables to provide practice in writing SQL
statements. Sample instances of these are shown in Figure 5.19.
5.11 The following questions provide practice in using the most basic form of the select
statement, where data from all the rows in the table are included in the result set. All
solutions have the same basic structure – an instruction to the DBMS to ‘SELECT the
data in these columns FROM this table’. There are no where clauses needed in any
of the select statements for this question.
(a) Write an SQL statement to display the data in the assetNo and assetType
columns from all rows of the Equipment table. Do not display data from the
other columns.
(b) Write an SQL statement to display the data in the assetNo, assetType and
description columns from all rows of the Equipment table. Do not display
data from the other columns.
(c) Write two different SQL statements to display the data in the building, roomNo
and capacity columns from all rows of the Room table. One statement should
use column names and one not.
(d) Display the data in the building and room columns from all rows of
the Equipment table. Ensure that if two rows have the same data in the
building and room columns, then that data is shown only once.
(e) Display the data in the building and roomNo columns from all rows of the Room
table. Again, ensure that if two rows have the same data in the building and
roomNo columns, then that data is shown only once.
5.12 The following questions provide practice in writing select statements that have a
where clause. The where clause specifies which rows of the table should be includ-
ed in the result set. All solutions will have the same basic structure – an instruction to
the DBMS to ‘SELECT the data in these columns FROM this table WHERE the data
on the row satisfies this Boolean expression’.
(a) Display the data in the assetNo, assetType and description columns for
those rows of the Equipment table where the value of the building column is
Wilson.
(b) Display the data in the assetNo, assetType and description columns for
those rows of the Equipment table where the value of the building column is
Wilson and the value of the room column is 205.
(c) Display the data in the assetNo, assetType and description columns for
those rows from the Equipment table where the value of the building column
is Wilson or the value of the building column is Locke.
(Note: Usually the above instructions would be phrased like this: ‘List the asset
number, type and description for equipment in the Wilson and Locke buildings.’
Note that, although the informal description uses ‘and’ logically this is an ‘or’ as
one piece of equipment cannot be in both buildings at the same time, so the
query requires a list of equipment that is either in the Wilson building or in the
Locke building. This is a style of writing database queries that you will need to get
used to. Both forms are included for the rest of this question, then only the less
formal version.)
(d) List the asset number, type and description for all equipment where the date
it was acquired is unknown. Phrasing this more formally, display the data in
the assetNo, assetType and description columns for those rows from the
Equipment table where the acquired column is NULL.
(e) List the asset number, description and cost of all equipment costing £500 or
more. List the equipment alphabetically by description. Phrasing this more for-
mally, display the data in the assetNo, description and cost columns for
those rows from the Equipment table where the cost is greater than or equal to
£500. Order the results alphabetically by description.
(f) List the asset number, description and cost of all equipment costing between
£500 and £1500, inclusive. Phrasing this more formally, display the data in the
assetNo, description and cost columns for those rows from the Equipment
table where the cost is greater than or equal to £500 and less than or equal
to £1500.
5.13 The following questions provide practice in using expressions and column aliases in
the select list and the LIKE comparison operator in the where clause.
(a) Display a list of equipment showing the asset number, type, description and loca-
tion for each piece of equipment. The location is defined as being the combina-
tion of the building and room number, such as ‘Wilson 205’. Ensure that the
columns of the result set have suitable names.
(b) The equipment budget for 2005 was £10,000. Display a list of equipment pur-
chased in 2005 showing the asset number, type, description and the percentage
of the budget spent on each item. Ensure that the columns of the result set have
suitable names.
(c) List the asset number, type and description for all equipment where the descrip-
tion begins with the string ‘HP’. (This should include all equipment manufactured
by Hewlett Packard.)
(d) List the asset number, type and description for all equipment where the descrip-
tion includes the string ‘HP’.
(e) List the asset number, type and description for all equipment where the asset
type ends with the string ‘HP’.
(f) List the asset number, type and description for all equipment where the asset
type includes the string ‘projector’ and the description begins with the string ‘HP’.
(g) List the asset number, type and description for all equipment where the asset
type includes the string ‘projector’ or the description begins with the string ‘HP’.
5.14 This question gives you some practice in modifying the table instance. In each case,
write an appropriate SQL DML statement to complete the task.
(a) Add the following row of data to the Room table:
Leave the remaining columns null. Why doesn’t the foreign key constraint prevent
this row, which does not match any row on the Room table, being added?
(c) Now set the building and room for the piece of equipment added in part (b) to
Priestley 114.
(d) Increase the capacity of every room by 5.
(e) The Priestley building has burnt down, destroying all the equipment stored there.
The insurance claim has been paid and a new building is under construction. Re-
move all data associated with this building from the Web Timetable database.
5.15 This question provides practice in writing SQL create table and alter table state-
ments. For each table schema, write a suitable create table statement to implement
the table schema shown below. This statement should implement all the column de-
finitions and the primary key constraint. Do not include constraint definitions to im-
plement candidate keys, foreign keys or business rules in the create table statements.
Add each of these constraints using an alter table statement.
(Note: There is no foreign key linking the Course and Module tables as each
module may be used by many courses and each course delivers many modules.
Chapters 8 and 9 explain how to deal with such ‘many-to-many’ relationships
between tables.)
Investigations
5.17 Investigate the differences between the ISO SQL standard and the version of the
SQL language implemented by the DBMS you use. You may choose to investigate
differences in syntax or look at the different conformance packages of the ISO SQL
standard itself and the level of conformance your DBMS claims. How do other DBMS
compare to yours in terms of compliance with the standard? (Melton and Simon
(2002) is a useful source on the ISO SQL standard.)
5.18 What tools exist to help write, and debug, SQL statements? To what extent is the
data language query by example (QBE) used as a basis for such tools? You should
look into query builders and administration tools. Critically evaluate the usefulness of
such tools for a novice web database developer.
5.19 The ISO SQL specification states that every SQL catalog must contain at least one
SQL schema – the INFORMATION_SCHEMA. This is the data dictionary for all the SQL
schema in the catalog. Investigate the INFORMATION_SCHEMA and the extent to
which your DBMS implements the approach to managing meta-data taken in the ISO
SQL standard. (Melton and Simon (2002) is a useful source for the ISO SQL stan-
dard.)
5.20 SQL is the most widely used data language for the relational data model, but the for-
mal data languages defined for the relational data model are the relational algebra
and the relational calculus. Date (2004) has a chapter on each of these. Investigate
these languages, comparing them to SQL.
Chapter objectives
Chapter outline
This chapter continues the discussion of the SQL data language. It introduces some
more advanced SQL features for manipulating both the database instance and database
schema. Until recently, not all commercial DBMS implemented all the features dis-
cussed here. In particular, the MySQL DBMS only implemented subqueries in Ver-
sion 4.1 and views in Version 5.0. Prior to these releases, web database developers
using the MySQL DBMS had to use alternative approaches to achieve the same effect.
Often this meant coding a table join instead of a subquery or using procedural code
(such as PHP) to manipulate result sets rather than using a view. Now that the MySQL
DBMS supports these features it makes sense to use them, but be aware that older code
will still use the alternative approaches. It may simply not be feasible to recode all the
existing web pages to make use of these recently supported features.
■ Section 6.1 discusses join queries. A join query allows data to be retrieved from
two or more tables at once, so join queries are quite common in database
applications.
■ Section 6.2 discusses some of the built-in SQL value functions. There are useful SQL
value functions to manipulate string values and convert string literals to dates and
back. It is often better to use these than pull the raw data back to the PHP script
and manipulate the values there.
■ Section 6.3 looks at the SQL aggregate functions, which allow queries to retrieve
summary information from tables.
■ Section 6.4 explores subqueries, which are now supported by all the major com-
mercial DBMS. A subquery allows data from one table to be manipulated in ways
determined by data values from another. Subqueries can be used in both data ma-
nipulation and data definition, and examples of both uses are explained.
■ Finally, having introduced a range of more complex queries, Section 6.5 shows
how to hide this complexity using SQL views.
Relational databases organize data into tables. Typically, there will be relationships
between the tables in a database. For example, the Staff and SupportSession
tables of the Staff Directory database are related as each support session is offered by a
particular member of staff (Figure 6.1). This means that the data only really make
sense when they are joined together, to show both the data held about the member of
staff and the data held about their support sessions. It is very easy to retrieve data from
two tables using SQL – simply name both the tables in the from clause. Code listing
6.1.1 shows how. Note that the select list is rather long, so has been split over three
lines (SQL allows line breaks in an SQL statement). A select statement that retrieves
data from two (or more) tables is called a join.
Code listing 6.1.1 Retrieving data from two tables – a cross join
SELECT Staff.staffNo, fName, lName, phone,
email, SupportSession.staffNo,
dayOfWeek, startTime, endTime
FROM Staff, SupportSession;
Code listing 6.1.1 illustrates the SQL qualified identifier – any column name can be
prefixed with the name of the table that it belongs to. The format is
tableName.columnName, so, for example, Staff.staffNo is the staffNo
column of the Staff table, whereas SupportSession.staffNo is the staffNo
column of the SupportSession table. Without the qualified identifier, there would
be no way for the DBMS to distinguish between these two different columns in the
select list and it would report an error. The qualified identifier syntax can be extended
to include the database, too:
databaseName.tableName.columnName
For example, StaffDirectory.Staff.staffNo. This allows SQL queries to
gather data from tables in different databases (as long as the two databases are man-
aged by the same DBMS). Notice that the column names in the result set are based
solely on the underlying column name, not on the qualified identifier (Figure 6.2).
Staff
staffNo:Varchar  fName:Varchar  lName:Varchar  email:Varchar      phone:Varchar  photo:BLOB
10780            John           Smith          [email protected]  NULL           [BLOB]
14443            Helen          Abbot          [email protected]  8032           [BLOB]
23257            Freya          Stark          [email protected]  8660           [BLOB]
25447            John           Smith          [email protected]  5104           [BLOB]
25448            Judith Anne    Smith          [email protected]  7709           [BLOB]
31210            Paul           Smith          [email protected]  NULL           [BLOB]
33509            Helen          Timms          [email protected]  8661           [BLOB]
33935            Padma          Brar           [email protected]  6641           [BLOB]
35054            Selma          Hutchins       [email protected]  8706           [BLOB]
35155            Helene         Chirac         [email protected]  NULL           [BLOB]
45965            Mikhail        Sudbin         [email protected]  5553           [BLOB]
55776            Gurpreet       Choudhury      [email protected]  5454           [BLOB]
56673            Joshua         Bittaye        [email protected]  7782           [BLOB]
56893            Ruth           Bapetsi        [email protected]  8022           [BLOB]
77712            Frank          Rose           [email protected]  8871           [BLOB]
78893            Jo Karen       O’Connor       [email protected]  8871           [BLOB]
89987            Dan            Lin            [email protected]  8514           [BLOB]

[Figure annotation: Matching rows]

SupportSession
staffNo:Varchar  dayOfWeek:Varchar  startTime:Time  endTime:Time
56673            Friday             15:30:00        16:30:00
56673            Thursday           15:30:00        16:30:00
35054            Monday             09:00:00        10:00:00
45965            Wednesday          11:00:00        12:00:00
31210            Wednesday          11:00:00        13:00:00
35054            Monday             15:00:00        16:00:00
56893            Thursday           09:00:00        10:00:00
45965            Monday             11:00:00        12:00:00
23257            Monday             15:00:00        16:00:00
55776            Monday             14:00:00        16:00:00
56893            Tuesday            14:00:00        15:00:00
56673            Friday             10:00:00        11:00:00
35054            Tuesday            11:00:00        12:00:00
Figure 6.2 The first few rows of the result set for Code listing 6.1.1.
The select statement in Code listing 6.1.1 does not do what you might expect. The
first few rows of the result set are shown in Figure 6.2. There are actually 204 rows in
the result set, which is why the final row in Figure 6.2 is shown cut off.
Look carefully at the first 17 rows in Figure 6.2 and compare them to the 17 rows
in the Staff table and the first row in the SupportSession table, shown in
Figure 6.1. It is clear that the DBMS has taken each row of the Staff table in turn and
joined on to this data the data from the first row of the SupportSession table.
Then, it has taken each row of the Staff table in turn and joined on to this data the
data from the second row of the SupportSession table. Just to be clear, the select
statement in Code listing 6.1.1 joins each row of the SupportSession table to every
row of the Staff table. This is called a cross join.
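The size of a cross join is easy to check on a small scale. The sketch below is an illustration only: it uses Python's built-in sqlite3 module simply as a convenient way to run the SQL, and the cut-down Staff and SupportSession tables (three staff, two sessions, fewer columns) are assumptions made for brevity rather than the instance in Figure 6.1. With no where clause, the result set should contain one row for every possible pairing.

```python
import sqlite3

# Hypothetical cut-down tables: three staff members, two support sessions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, lName TEXT);
    CREATE TABLE SupportSession (staffNo TEXT, dayOfWeek TEXT);
    INSERT INTO Staff VALUES ('35054', 'Hutchins');
    INSERT INTO Staff VALUES ('56673', 'Bittaye');
    INSERT INTO Staff VALUES ('89987', 'Lin');
    INSERT INTO SupportSession VALUES ('35054', 'Monday');
    INSERT INTO SupportSession VALUES ('56673', 'Friday');
""")

# No where clause, so every Staff row is paired with every SupportSession row.
cross_rows = conn.execute("""
    SELECT Staff.staffNo, lName, SupportSession.staffNo, dayOfWeek
    FROM Staff, SupportSession
""").fetchall()

n_staff = conn.execute("SELECT COUNT(*) FROM Staff").fetchone()[0]
n_sessions = conn.execute("SELECT COUNT(*) FROM SupportSession").fetchone()[0]

for row in cross_rows:
    print(row)
print(len(cross_rows), "rows from", n_staff, "staff x", n_sessions, "sessions")
```

Note that Dan Lin (89987) is paired with both sessions even though he offers neither, which is exactly the misleading effect described in the text.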
In itself, the cross join is not very useful. In fact, the result set in Figure 6.2 is
misleading as it suggests, for example, that every member of staff is available for the
support session offered by the staff member with staff number 35054 on a Monday
between 9 and 10 o’clock. In fact, the staff member with staff number 35054 is Selma
Hutchins and only Selma Hutchins offers this support session. The cross join is,
however, the basis for a more useful approach to joining data from related tables.
The obvious way to produce meaningful information out of the result set in Figure 6.2
is to instruct the DBMS to only include those rows of the cross join where the values in
the Staff.staffNo and SupportSession.staffNo columns are the same. Of
the first 17 rows, this is only true for the ninth row – the row with Selma Hutchins’ data.
To see how to do this, it helps to understand how the DBMS processes a select statement.
Chapter 5 explained that the order of the different parts of the select statement is fixed:
1 select list
2 from clause
3 where clause
4 order by clause.
However, the DBMS does not process the instructions in this order. The very first thing
the DBMS does with a select statement is process the from clause. If there is a single
table, this simply means that it retrieves all the rows from that table. When there are
two (or more) tables, the DBMS creates the cross join. Only after the cross join has been
produced does the DBMS look at the where clause. It uses the instructions in the where
clause to decide which rows of the cross join will actually go into the final result set. It
then uses the select list to decide which columns to include and the order by clause to
sort the rows of the result set as required. To sum up, the order of processing is:
1 from clause
2 where clause
3 select list
4 order by clause
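The processing order can be mimicked in a few lines of ordinary Python, treating each table as a list of tuples. This is a conceptual sketch only: the data values are made up for illustration, and a real DBMS is free to optimize the work as long as the result is the same as if it had followed these four steps.

```python
from itertools import product

# Hypothetical table instances: (staffNo, lName) and (staffNo, dayOfWeek).
staff = [("35054", "Hutchins"), ("56673", "Bittaye"), ("89987", "Lin")]
sessions = [("56673", "Friday"), ("35054", "Monday")]

# 1 from clause: build the cross join of the two tables.
cross = [s + ss for s, ss in product(staff, sessions)]

# 2 where clause: keep only rows where the two staffNo values match.
matched = [row for row in cross if row[0] == row[2]]

# 3 select list: keep only the lName and dayOfWeek columns.
projected = [(row[1], row[3]) for row in matched]

# 4 order by clause: sort the remaining rows by lName.
result = sorted(projected, key=lambda row: row[0])

print(result)
```

Each step works only on the output of the step before it, which is why the where clause can refer to any column of the cross join, including columns that never appear in the select list.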
As the DBMS creates the cross join before processing the where clause, the where clause
can be used to extract meaningful data from the cross join. Typically, this means match-
ing the foreign key columns of one table with the primary key columns of the other.
Code listing 6.1.2 produces the result set shown in Figure 6.3. This is the correct
way to join the Staff and SupportSession tables as it clearly identifies who
offers which support session. Notice that those staff who do not offer any support
sessions (Dan Lin and Helene Chirac) do not appear in the result set.
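A join of this kind can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Code listing 6.1.2: the statement is written from the description in the text (match Staff.staffNo with SupportSession.staffNo in the where clause), the tables are cut down to a few hypothetical rows, and Python's built-in sqlite3 module is used only to run the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, lName TEXT);
    CREATE TABLE SupportSession (staffNo TEXT, dayOfWeek TEXT, startTime TEXT);
    INSERT INTO Staff VALUES ('35054', 'Selma', 'Hutchins');
    INSERT INTO Staff VALUES ('56673', 'Joshua', 'Bittaye');
    INSERT INTO Staff VALUES ('89987', 'Dan', 'Lin');
    INSERT INTO SupportSession VALUES ('35054', 'Monday', '09:00:00');
    INSERT INTO SupportSession VALUES ('56673', 'Friday', '15:30:00');
    INSERT INTO SupportSession VALUES ('56673', 'Thursday', '15:30:00');
""")

# The join condition keeps only those rows of the cross join where the
# foreign key value matches the primary key value.
joined = conn.execute("""
    SELECT Staff.staffNo, fName, lName, dayOfWeek, startTime
    FROM Staff, SupportSession
    WHERE Staff.staffNo = SupportSession.staffNo
    ORDER BY Staff.staffNo, dayOfWeek
""").fetchall()

for row in joined:
    print(row)

# Staff with no support sessions have no matching row, so they drop out.
names_in_result = {row[2] for row in joined}
print("Lin" in names_in_result)
```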
Code listing 6.1.3 Listing the days on which members of staff offer support
sessions
SELECT DISTINCT staffNo, dayOfWeek
FROM SupportSession
ORDER BY staffNo;
SELECT DISTINCT fName || ' ' || lName AS name, dayOfWeek
FROM Staff, SupportSession
WHERE Staff.staffNo = SupportSession.staffNo
ORDER BY lName, fName;
The where clause can also include other conditions, not just the join condition. For
example, the select statement in Code listing 6.1.4 lists details of all support sessions
that run on a Monday morning and the staff members who offer them.
Code listing 6.1.4 An equi-join with additional conditions in the where clause
SELECT Staff.staffNo, fName, lName, phone,
email, SupportSession.staffNo,
dayOfWeek, startTime, endTime
FROM Staff, SupportSession
WHERE Staff.staffNo = SupportSession.staffNo
AND dayOfWeek = 'Monday'
AND startTime < '12:00';
The foreign key on the SupportSession table has only one column, so there is
only one join condition needed to identify the correct rows from the cross join of the
Staff and SupportSession tables. The foreign key on the Equipment table has
two columns – {building, room}. These match the primary key columns of the
Room table – {building, roomNo} (see Figure 6.4).
Room
building:varchar  roomNo:varchar  capacity:numeric
Wilson            205             25
Wilson            113             25
Wilson            205             100
Priestley         G12             150
Priestley         113A            20
Locke             24              30
Locke             27              25
Locke             14              30

[Figure annotation: Matching rows]

Equipment
assetNo:varchar  assetType:varchar  description:varchar   building:varchar  room:varchar  portable:boolean  cost:numeric  acquired:date
1570131          Data projector     HP vp6100             Wilson            205           FALSE             1300.00       2004-07-14
1799131          Data projector     HP vp6100             NULL              NULL          TRUE              1300.00       2005-01-12
4560293          OHP                3M OHP 1608           Locke             24            FALSE             NULL          NULL
4503993          PC                 Dell OptiPlex™ GX280  Wilson            205           FALSE             625.00        2004-04-30
5010009          OHP                3M OHP 2000           NULL              NULL          TRUE              185.00        2001-08-20
2992220          Data projector     Toshiba TDP-SW20      Priestley         G12           FALSE             1100.00       2005-01-12
2892112          Laptop             SONY X505             NULL              NULL          TRUE              NULL          NULL
To identify the correct rows for the equi-join of the Room and Equipment tables,
the values in the two building columns and the values in the room and roomNo
columns must match. This requires two join conditions. Code listing 6.1.5 shows
how to write the equi-join to join these two tables. The same principle applies
when a foreign key has three or more columns – the where clause has one join con-
dition for each column in the foreign key. Notice that the room and roomNo
columns are not qualified by their table name. This is because there is only one room
column and only one roomNo column in the Room and Equipment tables. As
there is no possible ambiguity about which column is meant, there is no need to use
qualified identifiers.
Code listing 6.1.5 An equi-join for tables with composite foreign and
primary keys
SELECT Room.building, roomNo, capacity, assetType,
description
FROM Equipment, Room
WHERE Equipment.building = Room.building
AND room = roomNo;
For those who find it a chore to type qualified identifiers in full, SQL offers a short-
cut. Any table in the from clause can be given an alias – called a correlation name –
and that name used in the qualified identifier instead of the full table name. Code
listing 6.1.6 illustrates table aliasing using correlation names, with the Equipment
table being given the correlation name e, and the Room table the correlation name
r. Otherwise, this select statement is the same as that in Code listing 6.1.5. With a
short correlation name for each table, it is feasible to use the qualified identifier for
every column in the select list, which makes it absolutely clear which table each col-
umn belongs to.
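A join over a composite key, written with correlation names, might look like the sketch below. It is an illustration rather than a copy of any listing in this chapter: the statement follows the description in the text (Equipment aliased e, Room aliased r, one join condition per foreign key column), the table instances are a hypothetical subset of Figure 6.4, and Python's built-in sqlite3 module is used only to run the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Room (
        building TEXT, roomNo TEXT, capacity INTEGER,
        PRIMARY KEY (building, roomNo)
    );
    CREATE TABLE Equipment (
        assetNo TEXT PRIMARY KEY, assetType TEXT, description TEXT,
        building TEXT, room TEXT
    );
    INSERT INTO Room VALUES ('Wilson', '205', 25);
    INSERT INTO Room VALUES ('Locke', '24', 30);
    INSERT INTO Room VALUES ('Locke', '14', 30);
    INSERT INTO Equipment VALUES ('1570131', 'Data projector', 'HP vp6100', 'Wilson', '205');
    INSERT INTO Equipment VALUES ('4560293', 'OHP', '3M OHP 1608', 'Locke', '24');
    INSERT INTO Equipment VALUES ('1799131', 'Data projector', 'HP vp6100', NULL, NULL);
""")

# Two join conditions, one for each column of the composite foreign key.
# Every column is qualified with the short correlation name of its table.
rows = conn.execute("""
    SELECT r.building, r.roomNo, r.capacity, e.assetType, e.description
    FROM Equipment e, Room r
    WHERE e.building = r.building
      AND e.room = r.roomNo
    ORDER BY r.building, r.roomNo
""").fetchall()

for row in rows:
    print(row)
```

Dropping either join condition would pair equipment with every room in the same building (or every room with the same number), so both are needed.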
columns). The DBMS automatically identifies the columns with matching names and
uses them to generate a join condition. The third select statement is a condition join,
the syntax of which is actually more complicated than the original equi-join, so
condition joins aren’t used in this book. The final select statement is called a column
name join. The column name join is similar to the natural join, except that the DBMS
does not have to identify the columns that have the same name. Instead, it is told
which columns appear in both tables as the shared column names are listed (in
parentheses) after the keyword USING. One useful feature of the column name join
is that there is no need to use qualified identifiers for the columns listed after the
keyword USING.
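To see that the natural join and the column name join give the same rows as an equi-join on the shared column, the three forms can be run side by side. The sketch below is an illustration under stated assumptions: it uses hypothetical cut-down Staff and SupportSession tables whose only shared column name is staffNo, and Python's built-in sqlite3 module (SQLite accepts both NATURAL JOIN and JOIN ... USING) purely to execute the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, lName TEXT);
    CREATE TABLE SupportSession (staffNo TEXT, dayOfWeek TEXT);
    INSERT INTO Staff VALUES ('35054', 'Hutchins');
    INSERT INTO Staff VALUES ('56673', 'Bittaye');
    INSERT INTO Staff VALUES ('89987', 'Lin');
    INSERT INTO SupportSession VALUES ('35054', 'Monday');
    INSERT INTO SupportSession VALUES ('56673', 'Friday');
""")

# Equi-join: the join condition is written out in the where clause.
equi = conn.execute("""
    SELECT Staff.staffNo, lName, dayOfWeek
    FROM Staff, SupportSession
    WHERE Staff.staffNo = SupportSession.staffNo
    ORDER BY Staff.staffNo, dayOfWeek
""").fetchall()

# Natural join: the DBMS finds the shared column name (staffNo) itself.
natural = conn.execute("""
    SELECT staffNo, lName, dayOfWeek
    FROM Staff NATURAL JOIN SupportSession
    ORDER BY staffNo, dayOfWeek
""").fetchall()

# Column name join: the shared column is listed after the keyword USING.
using = conn.execute("""
    SELECT staffNo, lName, dayOfWeek
    FROM Staff JOIN SupportSession USING (staffNo)
    ORDER BY staffNo, dayOfWeek
""").fetchall()

print(equi == natural == using)
```

In the second and third statements staffNo needs no qualifier, because the join merges the two same-named columns into one.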
provide a list of all rooms and the equipment stored in those rooms.
The inner join in Code listing 6.1.5 produces the result set shown in Figure 6.5. This
only lists those rooms that have equipment in them, but this is not what was
requested. The business query requests a list of all rooms, together with any equip-
ment stored in those rooms. Even if a room has no equipment in it, the room should
still appear in the list. This might be needed to decide which rooms need a new digi-
tal projector, for example, or to distribute equipment more evenly among teaching
rooms.
The solution is to use an outer join. An outer join is one in which data for un-
matched rows from one, or both, tables is included in the result set. For such rows, the
column values from the other table are set to null in the result set. This is exactly what
is required – unmatched rows from the Room table should be included, but the
assetType and description columns for these rows of the result set should be
null as there is no equipment in those rooms (Figure 6.6).
Code listing 6.1.8 shows the ISO standard SQL syntax for the outer join statement.
Consider the first of the two select statements. The from clause includes the new SQL
keywords RIGHT OUTER JOIN as well as the two table names, with their correlation
name aliases. The keywords OUTER JOIN indicate that unmatched rows from one or
more of the tables should be included in the result set. The keyword RIGHT indicates
that it is the unmatched rows of the table to the right of the OUTER JOIN keywords
that need to be included – the Room table in Code listing 6.1.8. Unmatched rows in
the table to the left of the OUTER JOIN keywords (the Equipment table) will not be
included. The on clause plays the role of the join condition as it indicates how to join
rows that do match. Outer joins can also include a where clause, but it should not
repeat the join condition.
Code listing 6.1.8 Two equivalent outer joins, listing all rooms and any
equipment in those rooms
SELECT r.building, roomNo, capacity, assetType,
description
FROM Equipment e RIGHT OUTER JOIN Room r
ON (e.building = r.building) AND (room = roomNo)
ORDER BY r.building, roomNo;
SELECT r.building, roomNo, capacity, assetType,
description
FROM Room r LEFT OUTER JOIN Equipment e
ON (e.building = r.building) AND (room = roomNo)
ORDER BY r.building, roomNo;
Figure 6.6 shows the result set for this code listing. Notice how rooms that have no
equipment in them are still included in the list, with nulls in the assetType and
description columns. Comparing this result set to the data in the base tables
(Figure 6.4), it is clear that all rows in the Room table have been included, but
only those rows in the Equipment table that match one of the rows in the Room table
have been added (that is, none of the portable equipment is shown in Figure 6.6).
The order by clause is optional and has been included here to present the informa-
tion in a sensible order. The second of the two statements is equivalent to the first
because, in this statement, the Room table is to the left of the keywords OUTER
JOIN. SQL provides both left and right outer joins for ease of use, not because
they behave differently – either form can be rewritten as the other by swapping the
order of the two tables.
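The effect of an outer join can be tried out on a small instance. The sketch below is illustrative only: it uses a hypothetical subset of the Room and Equipment data, and runs the left outer join form through Python's built-in sqlite3 module (chosen here because older SQLite releases do not accept RIGHT OUTER JOIN; that limitation is a property of SQLite, not of the ISO syntax).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Room (building TEXT, roomNo TEXT, capacity INTEGER);
    CREATE TABLE Equipment (
        assetNo TEXT, assetType TEXT, description TEXT,
        building TEXT, room TEXT
    );
    INSERT INTO Room VALUES ('Wilson', '205', 25);
    INSERT INTO Room VALUES ('Locke', '24', 30);
    INSERT INTO Room VALUES ('Locke', '14', 30);
    INSERT INTO Equipment VALUES ('1570131', 'Data projector', 'HP vp6100', 'Wilson', '205');
    INSERT INTO Equipment VALUES ('4560293', 'OHP', '3M OHP 1608', 'Locke', '24');
    INSERT INTO Equipment VALUES ('1799131', 'Data projector', 'HP vp6100', NULL, NULL);
""")

# Room is on the left of the join, so unmatched Room rows are kept and
# the Equipment columns for those rows are filled with nulls.
rows = conn.execute("""
    SELECT r.building, r.roomNo, r.capacity, e.assetType, e.description
    FROM Room r LEFT OUTER JOIN Equipment e
        ON (e.building = r.building) AND (e.room = r.roomNo)
    ORDER BY r.building, r.roomNo
""").fetchall()

for row in rows:
    print(row)
```

Room Locke 14 appears with None (null) in the equipment columns, while the portable projector, which has no room, is left out.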
With outer joins, it is important to be careful which of the matching primary and
foreign key columns are included in the select list. Consider the outer join in Code list-
ing 6.1.9. This includes the building column from the Equipment table rather
than from the Room table. Although this seems a minor change (both columns hold
the same data) it changes the result set significantly (Figure 6.7). The problem this
causes occurs in those rows of the result set that hold data from unmatched rows in
the Room table, such as the first row. Data from the Room table’s roomNo and
capacity columns is included, but these are the only columns from the Room table
– the building column, like the assetType and description columns, belongs
to the Equipment table, so is set to null. This leaves the user puzzling over which
building room number 14, with a capacity of 30, is in, making the list in
Figure 6.7 useless.
Code listing 6.1.9 An outer join that does not retrieve the data required
SELECT e.building, roomNo, capacity, assetType, description
FROM Equipment e RIGHT OUTER JOIN Room r
ON (e.building = r.building) AND (room = roomNo)
ORDER BY r.building, roomNo;
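The difference between the two building columns shows up clearly when both are selected side by side. The following sketch is an illustration with hypothetical two-row tables, using the equivalent left outer join form and Python's built-in sqlite3 module to run it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Room (building TEXT, roomNo TEXT, capacity INTEGER);
    CREATE TABLE Equipment (description TEXT, building TEXT, room TEXT);
    INSERT INTO Room VALUES ('Locke', '14', 30);
    INSERT INTO Room VALUES ('Wilson', '205', 25);
    INSERT INTO Equipment VALUES ('HP vp6100', 'Wilson', '205');
""")

# Select the building column from BOTH tables to compare them.
rows = conn.execute("""
    SELECT e.building, r.building, r.roomNo, e.description
    FROM Room r LEFT OUTER JOIN Equipment e
        ON (e.building = r.building) AND (e.room = r.roomNo)
    ORDER BY r.building, r.roomNo
""").fetchall()

for e_building, r_building, room_no, description in rows:
    print(e_building, r_building, room_no, description)
```

For the unmatched room, e.building is null while r.building still holds 'Locke', so a select list that needs every room identified should take the column from the Room table.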
Figure 6.7 Result set for Code listing 6.1.9 – getting the select list wrong.
It is also possible to write an outer join statement that includes unmatched rows from
both tables, which is called a full outer join. A business query that requires a full outer
join is:
provide a list of all equipment, where it is located and which rooms currently
have no equipment in them.
Clearly this list needs to join data from the Room and Equipment tables to show
which room the fixed equipment is located in. It must also include unmatched rows
from both tables because the query specifically requests information on all equipment
and rooms that have no equipment in them.
The full outer join for this query is shown in Code listing 6.1.10 and the result set in
Figure 6.8. Notice that the assetNo column of the Equipment table is included in
this statement. This makes it possible to distinguish between identical items of equip-
ment (such as the two HP vp6100 data projectors).
by the MySQL DBMS, the order by clause regards null as ‘less than’ any data value –
hence, the result set shown in Figure 6.8. This behaviour does not affect the treatment
of null in a comparison expression as null < capacity always evaluates to UNKNOWN.
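The distinction between sorting nulls and comparing with nulls can be checked directly. The sketch below is an illustration using SQLite through Python's built-in sqlite3 module; SQLite also happens to sort nulls before any data value in ascending order, but that is an assumption about this particular engine, and other DBMS may place nulls last. The three-row Room table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Room (roomNo TEXT, capacity INTEGER);
    INSERT INTO Room VALUES ('14', 30);
    INSERT INTO Room VALUES ('G12', NULL);
    INSERT INTO Room VALUES ('205', 25);
""")

# Sorting: in SQLite the null capacity comes first in ascending order.
ordered = [row[0] for row in
           conn.execute("SELECT capacity FROM Room ORDER BY capacity")]

# Comparison: null < 30 is UNKNOWN (returned as None), not true.
comparison = conn.execute("SELECT NULL < 30").fetchone()[0]

# So a where clause never selects the row with the null capacity.
below_30 = [row[0] for row in
            conn.execute("SELECT roomNo FROM Room WHERE capacity < 30")]

print(ordered, comparison, below_30)
```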
Unlike the inner join, an outer join is not a subset of the cross join. The cross join
matches each row of one table with every row of the other, so there are no un-
matched rows in a cross join.
Support for the ISO SQL outer join syntax is a little patchy. Several DBMS had imple-
mented outer joins before they were included in the ISO SQL:1992 standard. They all
used different, usually incompatible, syntax, so the ISO SQL standard adopted the
new syntax described above. Since then, some DBMS have implemented support for
the ISO SQL syntax, but some have not. Even those that support the ISO SQL syntax
retain their old, non-standard syntax. The MySQL DBMS supports the ISO SQL syn-
tax for left and right outer joins, but not the full outer join.
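Where a DBMS lacks the full outer join, one widely used workaround (not taken from this chapter, and offered here only as a sketch) is to combine two left outer joins with UNION ALL: the first keeps every room, the second adds only the equipment that matched no room. The example runs on a hypothetical subset of the Room and Equipment data, using Python's built-in sqlite3 module to execute the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Room (building TEXT, roomNo TEXT, capacity INTEGER);
    CREATE TABLE Equipment (
        assetNo TEXT, description TEXT, building TEXT, room TEXT
    );
    INSERT INTO Room VALUES ('Wilson', '205', 25);
    INSERT INTO Room VALUES ('Locke', '24', 30);
    INSERT INTO Room VALUES ('Locke', '14', 30);
    INSERT INTO Equipment VALUES ('1570131', 'HP vp6100', 'Wilson', '205');
    INSERT INTO Equipment VALUES ('4560293', '3M OHP 1608', 'Locke', '24');
    INSERT INTO Equipment VALUES ('1799131', 'HP vp6100', NULL, NULL);
""")

full_rows = conn.execute("""
    -- Part 1: every room, with its equipment where there is any.
    SELECT r.building, r.roomNo, e.assetNo, e.description
    FROM Room r LEFT OUTER JOIN Equipment e
        ON (e.building = r.building) AND (e.room = r.roomNo)
    UNION ALL
    -- Part 2: only the equipment that matched no room at all.
    SELECT r.building, r.roomNo, e.assetNo, e.description
    FROM Equipment e LEFT OUTER JOIN Room r
        ON (e.building = r.building) AND (e.room = r.roomNo)
    WHERE r.building IS NULL
""").fetchall()

for row in full_rows:
    print(row)
```

The test r.building IS NULL in the second part relies on building being part of the Room primary key, so it can only be null in the joined row when no room matched.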
Occasionally, data in one row of a database table is related to data in another row of the
same table by a foreign key relationship. An example occurs in the Module table of the
Web Timetable database. Each module may (or may not) have a prerequisite – that is, a
module at a lower level that must be taken first. For example, before students can study
the second year module ‘Application development’, they must have passed the first
year module ‘Introduction to programming’.
Module
code:varchar  title:varchar                level:char  preReq:varchar
CCFC0108      Introduction to programming  F           NULL
CIFC0084      Information systems          F           NULL
BAFC0178      Introduction to business     F           NULL
BMFC0107      Introduction to management   F           NULL
CIIC0053      Application development      I           CCFC0108
CCIC7009      Computing mathematics        I           NULL
CCHO0418      Formal specification         H           NULL
BAIC3427      Business audit               I           BAFC0178
CCH09668      Advanced programming         H           CCIC7009
CIHO6008      Soft systems                 H           NULL

[Figure annotations: code – Matching primary key; preReq – Foreign key]

A sample instance of the Module table appears in Figure 6.9, with this foreign
key relationship highlighted. Code listing 6.1.11 shows the create and alter table
statements for this table. The column preReq is the foreign key column – its values
match the value of the primary key column code in some other row. The referential
actions are the obvious ones. These are that, if the module code changes in the
referenced row (the prerequisite module), then change the value of the preReq column in
the referencing row and, if the referenced row is deleted, then set the value of the
preReq column in the referencing row to null (remove the prerequisite requirement).
produce a list of modules and their prerequisites and include the module code,
title and level for the module and its prerequisite.
Code listing 6.1.12 shows two select statements that answer this query in different
ways. The first is an inner join, so only lists modules that have prerequisites, which is
one possible interpretation of the business query. The second is an outer join, so lists
all modules and the prerequisites for those modules that have them, which provides
the fullest possible information in answer to the query.
This illustrates an important point about users’ requirements: they aren’t always
clear. Which query – and, hence, which information – is actually required would need
to be clarified with the end users.
Both the select statements are self joins, which are select statements that join two
copies of the same table. Because the from clause includes the same table name twice,
both occurrences must be given a correlation name. The correlation names are used to
remove the inevitable ambiguity in the select list.
The copy of Module aliased main provides data for the main module, while that
aliased prereq provides data on prerequisites. As the roles played by the two tables
are different, it is important to get the join condition right – the foreign key column of
main should match the primary key column of prereq, not the other way round.
Finally, the columns in the select list from prereq are also aliased, to indicate
that these columns hold data on prerequisite modules. The result sets are shown in
Figures 6.10 (a) and (b).
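A self join of this shape can be sketched as follows. This is an illustration written from the description in the text rather than a copy of Code listing 6.1.12: the Module table is cut down to three hypothetical rows without the level column, the correlation names are shortened to m (the main module) and p (the prerequisite), and Python's built-in sqlite3 module is used only to run the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Module (code TEXT PRIMARY KEY, title TEXT, preReq TEXT);
    INSERT INTO Module VALUES ('CCFC0108', 'Introduction to programming', NULL);
    INSERT INTO Module VALUES ('CIIC0053', 'Application development', 'CCFC0108');
    INSERT INTO Module VALUES ('CCIC7009', 'Computing mathematics', NULL);
""")

# Inner self join: only modules that have a prerequisite are listed.
# The foreign key of m is matched to the primary key of p.
inner_rows = conn.execute("""
    SELECT m.code, m.title, p.code AS preReqCode, p.title AS preReqTitle
    FROM Module m, Module p
    WHERE m.preReq = p.code
    ORDER BY m.code
""").fetchall()

# Outer self join: every module is listed, with nulls where there is
# no prerequisite.
outer_rows = conn.execute("""
    SELECT m.code, m.title, p.code AS preReqCode, p.title AS preReqTitle
    FROM Module m LEFT OUTER JOIN Module p
        ON m.preReq = p.code
    ORDER BY m.code
""").fetchall()

print(inner_rows)
print(outer_rows)
```

Reversing the join condition to p.preReq = m.code would list each module alongside the modules that depend on it, which answers a different question.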
The select lists of the SQL select statements used so far have included columns and
expressions that use operators to combine column values, such as fName || ' ' ||
lName, or endTime-startTime. The ISO SQL standard defines a number of
value functions. Like the built-in functions of PHP, value functions are simply
predefined functions that take zero or more values (the parameters, or arguments),
each of a given data type, and return a single value. For example, the value fun-
ction CHARACTER_LENGTH(myString) takes a single character string value and
returns a numeric value, which is the number of characters in the string. So,
CHARACTER_LENGTH('Jo Karen O''Connor') returns the value 17 as there are
14 letters, 2 spaces and a single apostrophe character. Remember, in SQL, an apostrophe
character inside a string literal is represented by two single quote marks. The single
quote marks that delimit the string literal are not part of the literal, so don’t count
towards its length.
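The character count is easy to confirm. The sketch below is an illustration using SQLite through Python's built-in sqlite3 module; note that SQLite spells this function LENGTH rather than the ISO name CHARACTER_LENGTH, which is an assumption about this engine and not a feature of the standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The doubled single quote inside the literal stands for one apostrophe.
stored, name_length = conn.execute(
    "SELECT 'Jo Karen O''Connor', LENGTH('Jo Karen O''Connor')"
).fetchone()

print(stored)       # the value as the DBMS holds it
print(name_length)  # 14 letters + 2 spaces + 1 apostrophe
```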
The ISO SQL standard classifies value functions according to the data type of their
return value, so value functions that return a numeric value are called numeric value
functions, those that return a character string are called string value functions and so
on. The following examples look first at a range of numeric and string value functions
to manipulate character string data, then value functions that help deal with dates and
times. The examples cover the most commonly used value functions.
One common problem when manipulating character string data is the problem of
case sensitivity – is ‘Smith’ the same as ‘smith’? In a web database application, there is
no guarantee that end users will always use the same case when typing in data values.
When the DBMS is case-sensitive, this means that some comparison operations won’t
work as expected – ‘Smith’ is not equal to ‘smith’. To overcome this problem, SQL
includes two string value functions to manipulate the case of a character string – namely,
LOWER(myString) and UPPER(myString). Examples of these two value functions
in action are shown in Code listing 6.2.1. Figure 6.11 shows the result set for the
first select statement (based on the data in Figure 6.1). Note that the first names
are no longer capitalized. The second select statement in Code listing 6.2.1 shows how
the UPPER() function helps to overcome the problem of case sensitivity, by changing
both the literal entered by the user and the column value to upper case.
Figure 6.11 Result set for the first select statement in Code listing 6.2.1.
LOWER(fName)   UPPER(lName)
john           SMITH
john           SMITH
judith anne    SMITH
paul           SMITH
frank          ROSE
helen          ABBOT
freya          STARK
padma          BRAR
selma          HUTCHINS
mikhail        SUDBIN
helene         CHIRAC
gurpreet       CHOUDHURY
ruth           BAPETSI
joshua         BITTAYE
dan            LIN
jo karen       O’CONNOR
helen          TIMMS
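Code listing 6.2.1 Using the value functions LOWER() and UPPER()
-- Show first names in lower case and surnames in upper case
SELECT LOWER(fName), UPPER(lName)
FROM Staff;
-- Case-insensitive search: convert both sides to upper case before comparing
SELECT staffNo, fName, lName, phone, email
FROM Staff
WHERE UPPER(lName) = UPPER('Smith');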
Whether the DBMS is case-sensitive regarding character string data or not depends
partly on the underlying operating system and partly on the settings chosen by the DBA
when a particular database was created. This means that the database developer may
have little choice in the matter.
The MySQL DBMS offers the functions UCASE and LCASE as synonyms for UPPER and
LOWER. Differences of this kind between the ISO standard and the value functions of a
particular DBMS are typical. Consult the manual for your particular DBMS for information
on the actual value functions available to you and their particular syntax. The
developers of the MySQL DBMS are working towards full compliance with the ISO
standard, so may eventually support the ISO SQL syntax for value functions.
Some applications need to manipulate character strings in more complex ways. One
common requirement is to trim leading and/or trailing spaces from a string literal.
Suppose a user entered the string literal ' Smith ' into the surname search web page
discussed in Chapter 4. To the DBMS, ' Smith ' and ‘Smith’ are different character
strings – the spaces are significant. The SQL value function TRIM() can remove lead-
ing and trailing spaces:
● TRIM(LEADING ' ' FROM ' Smith ') will return the string 'Smith '
● TRIM(TRAILING ' ' FROM ' Smith ') will return the string ' Smith'
● TRIM(BOTH ' ' FROM ' Smith ') will return the string 'Smith'.
The TRIM() function can actually trim any character, not just spaces, so, for example,
TRIM(LEADING 'x' FROM 'xxSMITHxx') will return the string 'SMITHxx'.
The PHP in Code listing 6.2.2 shows one way to deal with unwanted spaces at the
start, and end, of a character string passed from an HTML form. Of course, trimming
off unwanted spaces can also be done using PHP (by, for example, adapting the
cleanse_data() user-defined function described in Chapter 4).
Code listing 6.2.2 Using the value function TRIM() in an SQL select
statement passed from a PHP script
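<?php
// $connection is assumed to be an open MySQL connection (see Chapter 4)
// Escape quote marks in the user's input before embedding it in the SQL
$surname = mysql_real_escape_string($_GET["surname"], $connection);
// TRIM() removes leading and trailing spaces from the search value
$sql = "SELECT fName, lName, email
FROM Staff
WHERE lName = TRIM(BOTH ' ' FROM '$surname')";
$result = mysql_query($sql, $connection);
?>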
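The case and space problems often occur together, so the two fixes can be combined in a single comparison. The statement below is a minimal sketch, assuming the Staff table used throughout this chapter; the literal '  smith  ' is a made-up stand-in for whatever the user typed.

```sql
-- Trim the search value, then compare both sides in upper case,
-- so that '  smith  ' still matches the stored surname 'Smith'
SELECT fName, lName, email
FROM Staff
WHERE UPPER(lName) = UPPER(TRIM(BOTH ' ' FROM '  smith  '));
```

Function names and syntax may differ on your DBMS, so check its manual before relying on this form.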
The select statements in Code listing 6.2.3, and their result sets in Figure 6.12, illustrate
how the value functions CHARACTER_LENGTH(), POSITION() and SUBSTRING() work. The first
statement uses the CHARACTER_LENGTH() value function to find the number of characters
in the lName column value for each row of the Staff table. Notice that the apostrophe
in ‘O’Connor’ counts as one character. CHARACTER_LENGTH() always takes a single
parameter, which must be a character string.
The second select statement shows how to use the POSITION() value function.
POSITION() takes two parameters:
POSITION(stringToFind IN stringToSearch)
and finds the starting position of the string of characters given by the parameter
stringToFind within the character string value in the stringToSearch
parameter. In the example, stringToFind is the character string literal ‘ch’ and
stringToSearch is the value of the lName column on each row of the Staff
table. For most rows, ‘ch’ does not appear in the lName column value, so POSITION()
returns zero (0). For the row where lName has the value ‘Chirac’, POSITION() returns
one (1) as ‘ch’ starts at the very first character of ‘Chirac’ (note that this assumes the
DBMS is case-insensitive).
The third select statement demonstrates the SUBSTRING() function. This function
takes three parameters:
SUBSTRING(stringValue FROM start FOR numCharacter)
The stringValue parameter is a character string and the other two are integers. The
idea is to take the character string value stringValue and ‘cut out’ numCharacter
characters, beginning at position start. This example ‘cuts out’ five characters from the
value of the fName column, starting from the fourth character. The result set is shown
in Figure 6.12.
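Select statements along the following lines would produce the three result sets in Figure 6.12. This is a sketch based on the column headings of that figure and the descriptions above, so treat the exact wording as an assumption.

```sql
-- Number of characters in each surname
SELECT lName, CHARACTER_LENGTH(lName)
FROM Staff;

-- Starting position of 'ch' in each surname (0 when it does not appear)
SELECT lName, POSITION('ch' IN lName)
FROM Staff;

-- Up to five characters of each first name, starting at the fourth character
SELECT fName, SUBSTRING(fName FROM 4 FOR 5)
FROM Staff;
```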
Figure 6.12 The result sets for the select statements in Code listing 6.2.3.
lName      CHARACTER_LENGTH(lName)   lName      POSITION('ch' IN lName)   fName        SUBSTRING(fName FROM 4 FOR 5)
Smith      5                         Smith      0                         John         n
Smith      5                         Smith      0                         John         n
Smith      5                         Smith      0                         Judith Anne  ith A
Smith      5                         Smith      0                         Paul         l
Rose       4                         Rose       0                         Frank        nk
Abbot      5                         Abbot      0                         Helen        en
Stark      5                         Stark      0                         Freya        ya
Brar       4                         Brar       0                         Padma        ma
Hutchins   8                         Hutchins   4                         Selma        ma
Sudbin     6                         Sudbin     0                         Mikhail      hail
Chirac     6                         Chirac     1                         Helene       ene
Choudhury  9                         Choudhury  1                         Gurpreet     preet
Bapetsi    7                         Bapetsi    0                         Ruth         h
Bittaye    7                         Bittaye    0                         Joshua       hua
Lin        3                         Lin        0                         Dan
O'Connor   8                         O'Connor   0                         Jo Karen     Karen
Timms      5                         Timms      0                         Helen        en
The examples in Code listing 6.2.3 illustrate the basic operation of these string ma-
nipulation functions, but aren’t very realistic. For an example of a realistic use of these
value functions, consider the business query:
produce a list showing staff names and their phone numbers – the staff name
should be split into the full first name, middle initial (if any) and surname.
The tricky bit is extracting the first name and middle initial from the fName column.
First and middle names will be separated by a space character – ‘Judith Anne’ and ‘Jo
Karen’ and so on. The obvious approach is to find the first space in the fName column
value, with everything up to the space being the first name, and the first character
after the space being the middle initial. A first attempt at using the substring function
to extract the first name might be:
SUBSTRING(fName FROM 1 FOR POSITION(' ' IN fName))
The POSITION(' ' IN fName) finds the position of the first space character, so the
SUBSTRING() function then extracts everything from the first character up to and in-
cluding the first space. Unfortunately, this only works for those fName column values
that actually have a space character in them. For those that don’t, POSITION(' ' IN
fName) returns zero, so the SUBSTRING() function extracts zero characters.
The way around this problem is to add a space character at the end of every
fName value, using the concatenation operator fName||' '. For example, ‘John’
becomes ‘John ’ and ‘Judith Anne’ becomes ‘Judith Anne ’. In both cases,
POSITION(' ' IN fName||' ') will always find the first space character at the
end of the person’s first name. The full expression is:
SUBSTRING(fName||' ' FROM 1
FOR POSITION(' ' IN fName||' '))
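A similar problem occurs when extracting the middle initial. If the value of fName
does not have a space character, then:
SUBSTRING(fName FROM POSITION(' ' IN fName)+1 FOR 1)
returns the first character of the first name, rather than the first character of the middle
name. Adding a space character to the end of fName solves this problem too: the starting
position then lies beyond the end of the string, so SUBSTRING() extracts nothing.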
The select statement that implements the business query is shown in Code listing
6.2.4 and the result set in Figure 6.13. The values of the lName and phone columns
are not manipulated, although lName, like the two expressions, is aliased.
Code listing 6.2.4 Retrieving the first name, middle initial and surname and
combining them into a single value
SELECT SUBSTRING(fName||' ' FROM 1
       FOR POSITION(' ' IN fName||' ')) AS forename,
       SUBSTRING(fName||' ' FROM POSITION(' ' IN fName||' ')+1
       FOR 1) AS middleInitial,
       lName AS surname,
       phone
FROM Staff;
This example shows how tricky it can be to manipulate character string values using
SQL. Things are a little easier using PHP as an if ... else ... statement can dis-
tinguish between different cases. For example ‘If there is a space in the string then
extract everything up to the space, else use the whole string.’ A better solution for ap-
plications that need to manipulate middle names separately from the first name is to
include them as separate columns in the Staff table. This is closer to the ideal of the
relational model, which sees each column value as a single value of the given data
type, with no subparts. An image of the Staff table with separate fName and mName
(middle name) columns is shown in Figure 6.14. Notice that, in this instance, the
mName column is null in almost all the rows. This isn’t ideal, either. The database de-
signer would need to choose which solution best suited the needs of the application.
Figure 6.14 The Staff table with separate columns for first and middle names.
Staff
staffNo:Varchar  fName:Varchar  mName:Varchar  lName:Varchar  email:Varchar      phone:Varchar  photo:BLOB
10780            John           NULL           Smith          [email protected]  NULL           [BLOB]
The ISO SQL standard includes three useful functions for generating date and time
values:
● CURRENT_DATE
● CURRENT_TIME
● CURRENT_TIMESTAMP.
These three functions are commonly used in INSERT and UPDATE statements. Code
listing 6.2.5 shows how to add a new piece of equipment to the Equipment table and
give the current date as the value for the acquired column. All three functions return
values of the appropriate data type – date, time or time stamp. None of them returns a
string literal. Note that the values returned by CURRENT_TIME and CURRENT_TIMESTAMP
actually include information on the time zone. This is useful for distributed
database applications that are used by many people in different countries, but is
beyond the scope of this book.
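As an illustration of CURRENT_DATE in an insert statement, the sketch below adds one row to the Equipment table. The column names follow the Equipment table shown in this chapter; the asset number, description and cost are invented for the example.

```sql
-- CURRENT_DATE supplies a date value, so no date literal is needed
INSERT INTO Equipment (assetNo, assetType, description,
                       building, room, portable, cost, acquired)
VALUES ('9990001', 'Laptop', 'Example laptop',
        NULL, NULL, TRUE, 800.00, CURRENT_DATE);
```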
The format of the date literal is specified using special character combinations. For
example, the MySQL DBMS includes the STR_TO_DATE(dateLiteral, format) function, to
convert a date literal into a date value, and the DATE_FORMAT(dateValue, format)
function, to format a date value as a date literal.
In both cases, the format parameter is a string that defines the date format to use. For
example, the format string ‘%d-%m-%Y’ instructs the MySQL DBMS to interpret the
date in the standard European format of day of the month, then month and, finally,
the four-digit year, all separated by dashes.
Code listing 6.2.7 demonstrates the DATE_FORMAT() function in action, and Figure
6.15 shows the results. The first column of the result set is presented in the MySQL DBMS
preferred date format. The others are all presented in the format specified. Because
acquired is a date, there is no time associated with the column value. Even so, the
fourth column includes a time in the literal because the format specifically requested
one. The best the DBMS can do in this situation is return the time literal ‘00:00’.
Code listing 6.2.7 Using the DATE_FORMAT() function on the MySQL DBMS
SELECT acquired,
DATE_FORMAT(acquired, '%d-%m-%Y') AS European,
DATE_FORMAT(acquired, '%m-%d-%Y') AS American,
DATE_FORMAT(acquired, '%d-%m-%Y %H:%i') AS WithTime,
DATE_FORMAT(acquired, '%M %D, %Y') AS Elegant
FROM Equipment
WHERE acquired IS NOT NULL;
The DATE_FORMAT() function is usually used when retrieving data from the database
to format the date values into a date literal that the end user will recognize. In con-
trast, STR_TO_DATE() is usually used to take a date literal entered by the end user
and ensure that the DBMS can recognize the date value it represents. In most web
database applications, users never actually type in date literals – they select valid dates
from list boxes or calendar controls. These controls will themselves use a particular
date format. The STR_TO_DATE() function can convert this format into a date value
for the MySQL DBMS. For example, Code listing 6.2.8 takes a date literal in the format
‘January 12th, 2005’ and puts the appropriate date value into the acquired column
of the Equipment table.
Code listing 6.2.8 Using the STR_TO_DATE() function in the MySQL DBMS
UPDATE Equipment
SET acquired = STR_TO_DATE('January 12th, 2005',
'%M %D, %Y')
WHERE acquired IS NULL;
The SQL in Code listings 6.2.7 and 6.2.8 is specific to the dialect of SQL implemented
by the MySQL DBMS. Other DBMS have their own versions of the STR_TO_DATE()
and DATE_FORMAT() functions. Most DBMS will include a range of other functions for
manipulating dates, times and time stamps – check your manual for details.
The queries discussed so far allow rows of data to be retrieved from tables or modified
in some way. Sometimes the user wants a summary of the information held on a
database. Common business queries that require summary information often ask the
following basic questions:
● ‘How many . . . ?’
● ‘What is the largest . . . ?’
● ‘What is the smallest . . . ?’
● ‘What is the average . . . ?’
● ‘What is the total . . . ?’
The ISO SQL standard includes a number of aggregate functions (called set functions
in the ISO SQL standard) that can formulate such business queries as SQL select state-
ments, called aggregate queries. Each aggregate function summarizes the data held in
a particular column across a set of rows. Business queries that ask ‘How many?’ can be
answered using the aggregate function COUNT(). For example, consider the business
query:
how many members of staff have a phone number?
The aggregate function COUNT() counts how many rows have a valid data value in a
particular column. It does not count rows where the column is null.
The select statement in Code listing 6.3.1 uses the aggregate function COUNT() to
answer the business query asked above. The only new feature is the COUNT() aggre-
gate function itself. The DBMS deals with this select statement as follows. First, it eval-
uates the from clause, retrieving all the rows from the Staff table. Next, it identifies
which of these rows satisfy the where clause. Here, as there is no where clause, that
means all of them. It works through these rows, counting those for which the phone
column is not null. Finally, it puts this value in the result set and returns it to the user.
The result set for Code listing 6.3.1 is very simple – a single row, with a single column,
holding the value 14 (the number of staff who have a phone number).
Code listing 6.3.1 Using an aggregate query to count how many staff have a
phone number
SELECT COUNT(phone)
FROM Staff;
Figure 6.16 Result set for Code listing 6.3.2 (note the column names).
An aggregate query can ask for more than one piece of summary information. Code
listing 6.3.2 asks for the number of members of staff, the number of them who have a
phone and the number of them who do not have a phone. The result set is shown in
Figure 6.16. There are 17 members of staff, 14 of whom have a phone and 3 of whom
don’t. Notice that the column names for the first two columns of the result set are just
the expressions from the select list. The third element of the select list has a column
alias and so has a more meaningful name as its column heading in the result set.
Code listing 6.3.2 Using an aggregate query to count how many staff have a
phone number
SELECT COUNT(staffNo), COUNT(phone),
COUNT(staffNo) - COUNT(phone) AS noPhone
FROM Staff;
Using the COUNT() aggregate function with a column name will return the number of
rows for which this column is not null. To count the total number of rows, use
COUNT(*) in the select list. For example, Code listing 6.3.3 is equivalent to 6.3.2 as
COUNT(staffNo) will always return the same number as COUNT(*) because the
staffNo column is never null (it is the primary key). COUNT(*) can be useful as it
counts all the rows, even when the value of a particular column is null.
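A statement of the following form counts rows with COUNT(*). It is a sketch of how Code listing 6.3.2 might be rewritten, not necessarily the exact wording of Code listing 6.3.3.

```sql
-- COUNT(*) counts every row; COUNT(phone) skips rows where phone is null
SELECT COUNT(*), COUNT(phone),
       COUNT(*) - COUNT(phone) AS noPhone
FROM Staff;
```

With the staff data described above, this should again give 17, 14 and 3.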
Notice that finding when the oldest piece of equipment was purchased means finding
the earliest purchase date – that is the minimum value of the acquired column. (See
Figure 6.4 for a snapshot of the Equipment table instance.) Always think carefully
when using the MAX() and MIN() aggregate functions with date columns.
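A minimal sketch of such a query, assuming the Equipment table and its acquired column; the column aliases are illustrative names only.

```sql
-- The oldest item has the earliest (minimum) date,
-- the newest item has the latest (maximum) date
SELECT MIN(acquired) AS oldestPurchase,
       MAX(acquired) AS newestPurchase
FROM Equipment;
```

Both functions ignore rows where acquired is null.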
TRUE. The select statement in Code listing 6.3.8 does this. It calculates the average
cost of portable equipment to be £742.50 and the total spent to be £1485. Note that
both values are lower than the corresponding values for all equipment.
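A select statement of roughly this shape would produce those figures. It is a sketch that assumes the Equipment table with its boolean portable column; the aliases are illustrative.

```sql
-- The where clause keeps only portable equipment;
-- AVG() and SUM() then ignore rows where cost is null
SELECT AVG(cost) AS averageCost,
       SUM(cost) AS totalCost
FROM Equipment
WHERE portable = TRUE;
```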
Another difference between the ISO SQL standard and the commercial implementations
is that most DBMS do not allow the + and - operators for dates, times and intervals.
Instead, they use special functions to manipulate date and time values. Code
listing 6.3.10 shows how to write the select statement in Code listing 6.3.9 using the
version of SQL implemented by the MySQL 5.0 DBMS. This DBMS includes the function
TIMEDIFF() to calculate the difference between two times. Again, you need to
check the manual for your particular DBMS for information on its date and time
manipulation functions.
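As a rough illustration of TIMEDIFF() on the MySQL DBMS, the sketch below calculates the length of each support session. It assumes the SupportSession table with its startTime and endTime columns; the alias duration is an illustrative name, and this is not necessarily the statement in Code listing 6.3.10.

```sql
-- MySQL-specific: TIMEDIFF(a, b) returns a minus b as a time value
SELECT staffNo, dayOfWeek,
       TIMEDIFF(endTime, startTime) AS duration
FROM SupportSession;
```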
One restriction on the aggregate queries seen so far is that it is not possible to include
columns in the select list. This should be obvious. The aggregate function returns a
single value, regardless of how many rows satisfy the where clause, so there is only
ever one row in the result set. A column returns one value for every row that satisfies
the where clause, so there may be many rows in the result set. It is impossible for the
result set of an aggregate query to include data for all these rows. However, there are
business queries that want rows to be grouped together and summary information
calculated for each group, rather than the whole table. For example:
list all staff members and the number of support sessions they offer.
Answering a business query like this takes two steps:
1 group together rows with identical values in the column that needs to be displayed
2 for each group of rows, use an aggregate function to calculate the required summary
data.
The result set for code listing 6.3.11 (Figure 6.17) includes one row for each group –
that is, one row for every member of staff whose staff number appears in the staffNo
column of some row of the SupportSession table. This means every member of
staff who offers support sessions.
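A select statement that groups rows in this way might look like the sketch below. It assumes the SupportSession table and matches the column headings of Figure 6.17.

```sql
-- One group per staffNo value; COUNT(*) counts the rows in each group
SELECT staffNo, COUNT(*)
FROM SupportSession
GROUP BY staffNo;
```

Staff who offer no support sessions have no rows in SupportSession, so they do not appear in the result set.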
Figure 6.17 Result set for Code listing 6.3.11.
staffNo  COUNT(*)
23257    1
31210    1
35054    2
45965    2
55776    1
56673    3
56893    2
Some business queries only require data for some of the groups. For example:
produce a list showing those staff who offer more than one support session and
include a count of the number of support sessions offered.
This list would identify staff who are ‘student-friendly’, offering a choice of times when
they will give individual support to students. There are four such in Figure 6.17. SQL
includes the having clause to specify which groups of data have their summary infor-
mation included in the result set. This is a different purpose from that of the where
clause. The where clause restricts which rows are grouped together, but the having
clause restricts which groups have summary data calculated for them. Note that the
having clause, like the where clause, uses a Boolean expression to identify which
groups should have their data summarized. This Boolean expression can involve any
column that appears in the group by clause or any suitable aggregate function. Usually,
restrictions on column values should be made using the where clause. So, if no data is
required for a whole group of rows based on a shared column value, then the individ-
ual rows can be eliminated in the where clause, using that same shared column value.
When possible, use the where clause to cut down the number of rows that need to be
grouped. This means that the having clause usually restricts the groups based on an ex-
pression involving an aggregate function.
Code listing 6.3.12 shows the select statement to answer the business query above,
and Figure 6.18 the result set. Comparing Figure 6.18 with Figure 6.17, it’s clear that
the having clause has correctly identified which groups to include summary data for.
Code listing 6.3.12 Controlling which groups of data have their summary
information included in the result set
SELECT staffNo, COUNT(*)
FROM SupportSession
GROUP BY staffNo
HAVING COUNT(*) > 1;
Figure 6.18 Result set for Code listing 6.3.12.
staffNo  COUNT(*)
35054    2
45965    2
56673    3
56893    2
The having clause is the sixth, and final, part of the select statement. As discussed in
the section on table joins, the order in which these different parts appear in the select
statement is fixed. The group by and having clauses come immediately after the where
clause, so the order is:
1 select list
2 from clause
3 where clause
4 group by clause
5 having clause
6 order by clause.
The order of processing for the DBMS is different:
1 from clause
2 where clause
3 group by clause
4 having clause
5 select list
6 order by clause.
So, first the DBMS identifies which tables it is dealing with, then which rows of those
tables. Once the rows are known, it groups them according to the group by clause and
uses the having clause to determine which groups need to be summarized. The DBMS
then looks at the select list to see what summary information is needed. It adds one row
to the result set for each group of rows that satisfies the having clause. Finally, it orders the
result set according to the instructions in the order by clause.
The whole process can be illustrated by a business query that requires all six parts of
the select statement:
produce a list showing those staff who offer more than one support session,
excluding support sessions that run on a Friday, and include a count of the num-
ber of support sessions offered and order the list, first, by the number of support
sessions and then by staff number.
Figure 6.19 shows how the DBMS processes the select statement in Code listing 6.3.13.
The from clause mentions only one table, so, in step 1, the DBMS identifies that it’s
dealing with the current instance of the base table SupportSession. If there were
more than one table, then, in step 1, the DBMS would create the cross join.
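A select statement that uses all six parts to answer this business query could be written along the following lines. This is a sketch based on the wording of the query and the column names in Figure 6.19; the alias numSessions and the descending sort order are assumptions.

```sql
SELECT staffNo, COUNT(*) AS numSessions   -- 5: summary data for each group
FROM SupportSession                       -- 1: identify the table
WHERE dayOfWeek <> 'Friday'               -- 2: discard Friday sessions
GROUP BY staffNo                          -- 3: group the remaining rows
HAVING COUNT(*) > 1                       -- 4: keep groups with more than one row
ORDER BY numSessions DESC, staffNo;       -- 6: order the result set
```

The numbered comments show the order in which the DBMS processes each part.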
Figure 6.19 Processing the select statement in Code listing 6.3.13.
Step 1 The base table SupportSession (staffNo:Varchar, dayOfWeek:Varchar, startTime:Time, endTime:Time)
Steps 2 and 3 The rows that satisfy the where clause, grouped by staffNo
staffNo  dayOfWeek  startTime  endTime
56673    Thursday   15:30:00   16:30:00
31210    Wednesday  11:00:00   13:00:00
56893    Thursday   09:00:00   10:00:00
56893    Tuesday    14:00:00   15:00:00
35054    Monday     09:00:00   10:00:00
35054    Tuesday    11:00:00   12:00:00
35054    Monday     15:00:00   16:00:00
45965    Wednesday  11:00:00   12:00:00
45965    Monday     11:00:00   12:00:00
23257    Monday     15:00:00   16:00:00
55776    Monday     14:00:00   16:00:00
Step 4 The groups that satisfy the having clause
staffNo  dayOfWeek  startTime  endTime
56893    Thursday   09:00:00   10:00:00
56893    Tuesday    14:00:00   15:00:00
35054    Monday     09:00:00   10:00:00
35054    Tuesday    11:00:00   12:00:00
35054    Monday     15:00:00   16:00:00
45965    Wednesday  11:00:00   12:00:00
45965    Monday     11:00:00   12:00:00
Step 5 Summary data – one row for each group
staffNo  COUNT(*)
56893    2
35054    3
45965    2
Step 6 Ordering the result set
staffNo  COUNT(*)
35054    3
45965    2
56893    2
6.4 Subqueries
The select statement can appear as part of another SQL statement and such a select
statement is called a subquery. A variation of the create table statement uses a
subquery to specify the table’s structure.
Code listing 6.4.1 shows such a create table statement. It begins as usual with the
keywords CREATE TABLE, followed by the name of the new table. Then, rather than
a list of column and constraint definitions, comes the keyword AS. This is followed by
a subquery. The subquery can be any valid select statement, but cannot include an
order by clause. The subquery’s result set is a derived table and it is this derived table
that forms the basis for creating the new base table. The new base table inherits the
column definitions, not-null constraints and table instance of the subquery’s result
set. However, no other constraints are copied, so there will be no primary or foreign
key constraints, nor unique or check constraints.
The create table statement in Code listing 6.4.1 thus creates a new base table called
StaffBackup in the same database as the Staff table. It has identical column def-
initions and the same table instance – both the basic structure and the data are copied.
For any column in the Staff table that has a not-null constraint, the corresponding
column on the StaffBackup table also has a not-null constraint. There are no other
constraints on the StaffBackup table.
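A create table statement of this kind might look like the sketch below. The exact syntax varies between DBMS (some expect the subquery in parentheses, or extra keywords after it), so treat this form as an assumption to check against your manual.

```sql
-- Copy the column definitions and all rows of Staff into a new base table
CREATE TABLE StaffBackup AS
SELECT *
FROM Staff;
```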
them. Rather than simply copying the existing data, it reformats it. Thus, the fName
and lName columns are concatenated together and the endTime column is replaced
with a duration column, showing how long the session lasts.
Equipment
assetNo:varchar  assetType:varchar  description:varchar   building:varchar  room:varchar  portable:boolean  cost:numeric  acquired:date
1570131          Data projector     HP vp6100             Wilson            205           FALSE             1300.00       2004-07-14
1799131          Data projector     HP vp6100             NULL              NULL          TRUE              1300.00       2005-01-12
4560293          OHP                3M OHP 1608           Locke             24            FALSE             NULL          NULL
4503993          PC                 Dell OptiPlex™ GX280  Wilson            205           FALSE             625.00        2004-04-30
5010009          OHP                3M OHP 2000           NULL              NULL          TRUE              185.00        2001-08-20
2992220          Data projector     Toshiba TDP-SW20      Priestley         G12           FALSE             1100.00       2005-01-12
2892112          Laptop             SONY X505             NULL              NULL          TRUE              NULL          NULL
3144578          Data projector     HP vp6100             Locke             14            FALSE             NULL          NULL
How many support sessions are offered by Mikhail Sudbin, whose e-mail address
is [email protected]?
To answer this query, the DBMS must count those rows of the SupportSession
table that are offered by Mikhail Sudbin. The obvious approach is to use the value of
the staffNo column for Mikhail Sudbin, but his staff number isn’t known. The value
of the candidate key email is known, though.
The first select statement in Code listing 6.4.6 uses the value of email to retrieve
the value of the staffNo column from the Staff table, which is done in the subquery.
Then, this value of the staffNo column is used to restrict the rows of the outer
select statement. The second select statement shows how the same result can be
achieved using an inner join. Where either approach will work, it is a matter of personal
preference as to which is used.
Code listing 6.4.6 A subquery in the where clause and the equivalent join
statement
SELECT COUNT(*)
FROM SupportSession
WHERE staffNo = (SELECT staffNo
FROM Staff
WHERE email = '[email protected]');
SELECT COUNT(*)
FROM SupportSession, Staff
WHERE SupportSession.staffNo = Staff.staffNo
AND email = '[email protected]';
Strictly speaking, the where clause in the first select statement of Code listing 6.4.6
should result in a data type error. It asks whether a column value, the value of
staffNo, is equal to a derived table, the result set of the subquery. Obviously it isn’t as
the derived table has a non-scalar data type (the table type), whereas the staffNo is
a scalar (a varchar). The syntax is acceptable, though, because the result set of the sub-
query will always consist of a single row (email is a candidate key) with a single col-
umn. So, because the subquery returns a derived table with one row and one column,
the DBMS can treat it as though it were a single data value. That is why the equality
comparison works. If the subquery returns more than one column, or more than one
row, the select statement cannot use an equality comparison.
Business queries sometimes want to retrieve information based on the value of an
aggregate function. For example:
display a list of equipment that is more expensive than the average cost of
equipment.
A business query such as this is best answered using a subquery (there are other
approaches, but the subquery is by far the simplest).
Code listing 6.4.7 shows the select statement for the above query. First, the sub-
query calculates the average cost of all the equipment, then the outer select statement
retrieves those rows of the Equipment table where the value of the cost column is
greater than this average cost. This select statement cannot be replaced by a join as it
is not possible to use an aggregate function in the where clause.
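A select statement matching this description could be written as in the sketch below, assuming the Equipment table; the choice of columns in the select list is illustrative.

```sql
-- The subquery returns a single value: the average cost of all equipment
SELECT assetNo, description, cost
FROM Equipment
WHERE cost > (SELECT AVG(cost)
              FROM Equipment);
```

Rows where cost is null are ignored by AVG() and never satisfy the comparison, so they are not retrieved.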
four: 35054, 45965, 23257 and 55776. Next, retrieve data from those rows of the
Staff table where the value of the staffNo column appears in this list. The SQL
comparison operator IN checks whether or not a given value appears in a list of values.
Code listing 6.4.8 shows how to use IN with a subquery. Again, the subquery can
only retrieve data from a single column, but there may be more than one row in the
result set. The DBMS treats the subquery’s result set as a list of single data values rather
than a derived table with several rows.
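The sketch below shows the general shape of a select statement that uses IN with a subquery. It assumes the business query concerns staff who offer support sessions on a Monday, which is a guess consistent with the four staff numbers listed above.

```sql
-- The subquery returns a list of staffNo values;
-- IN is true when the outer row's staffNo appears in that list
SELECT staffNo, fName, lName
FROM Staff
WHERE staffNo IN (SELECT staffNo
                  FROM SupportSession
                  WHERE dayOfWeek = 'Monday');
```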
The Asset Tracking database is separate from the Web Timetable database (the SQL
schema WebTimetable), though both hold data on equipment owned by the univer-
sity. The university uses the AssetType table as the main list of the different sorts of
equipment it owns. It must ensure that the values in the Equipment table’s assetType
column all exist in the AssetType table’s type column. A foreign key constraint won’t
work because the tables are in different databases, but a check constraint will.
Code listing 6.4.10 shows how this is achieved. Notice the use of the qualified identifier
"AssetTracking.AssetType". Without the name of the SQL schema, the DBMS
would look for the AssetType table in the current database (the WebTimetable SQL
schema). As it isn’t there, the alter table statement would not work.
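An alter table statement of the kind described might look like the sketch below. The constraint name chkAssetType is an illustrative choice, and many DBMS do not accept a subquery inside a check constraint, so check your manual before trying this.

```sql
-- Every assetType value must appear in the type column of the
-- AssetType table held in the AssetTracking SQL schema
ALTER TABLE Equipment
  ADD CONSTRAINT chkAssetType
  CHECK (assetType IN (SELECT type
                       FROM AssetTracking.AssetType));
```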
6.5 Views
In SQL, a view is a derived table, the definition of which is stored in the data dictionary
(see Chapter 3 for a discussion of the view mechanism). Once defined, a view can be
queried much like a base table, although, as discussed in Chapter 3, not all views are
updateable. Until the release of Version 5, views had not been available on the MySQL
DBMS. As many web database applications used the MySQL DBMS, this meant that web
database developers tended not to use views. This should change, however, because
views are useful in two ways.
First, they can improve the performance of database queries. Whenever the DBMS is
passed a select statement in human-readable SQL, it needs to build an execution plan.
Essentially, the DBMS translates the SQL instructions into a series of instructions that
can be executed by the system software. This takes time. A view can be thought of as a
select statement for which the DBMS has already prepared an execution plan.
A second advantage of views is that they can simplify database queries. Rather than
repeatedly coding a join query to retrieve data from two (or more) tables, the web data-
base developer can store the join query as a view and retrieve data through the view. This
makes coding SQL select statements simpler, which is particularly useful on projects
where some developers aren’t familiar with SQL. In fact, for a web database application,
it makes sense to build enough views so that a PHP script need only ever query a single
base table or view. In this way, there are no join queries in the PHP code at all, making it
much simpler to maintain.
Code listing 6.5.1 illustrates how to create a view using SQL. It creates a view based
on the Equipment table that will show only portable equipment.
Code listing 6.5.2 Querying a view compared to querying its underlying base
table
(a) A database query against the view PortableEquipment
SELECT assetNo, description
FROM PortableEquipment;
(b) The equivalent database query against the underlying base table
Equipment
SELECT assetNo, description
FROM Equipment
WHERE portable = TRUE;
Sometimes the column names in the view must be different from those in the under-
lying base table. To specify the view’s column names, use the format illustrated in
Code listing 6.5.3.
Code listing 6.5.3 A view with column names that are different from those in
its underlying base table
CREATE VIEW PortEquip (ID, atype, adesc, cost, acq)
AS SELECT assetNo, assetType, description, cost, acquired
FROM Equipment
WHERE portable = TRUE;
This creates a view with the same basic structure, but its columns are named ID,
atype, adesc, cost and acq. A database query using this view must use these
column names, not the column names of the underlying base table. Compare the
query in Code listing 6.5.2 (a) with the corresponding query in Code listing 6.5.4.
Code listing 6.5.4 Querying a view the columns of which have been renamed
SELECT ID, adesc
FROM PortEquip;
Views are often used to present data in different formats for different users. In most of
the examples used so far, the structure of the Staff table has been:
Code listing 6.5.5 Two relational views based on the same underlying table
(a) A vertical view: some columns, all rows
CREATE VIEW Staff
AS SELECT staffNo, fName, lName, phone, email, photo
FROM FullStaff;
1 each column of the underlying base table appears at most once in the select list of
the database query defining the relational view
2 each row in the relational view instance can be traced back to a unique row in the
underlying base table – the underlying row
3 in each row of the relational view instance, the value of each column is taken
directly from a single column of the underlying row, with no operators or functions
used.
The ComputingStaff relational view of Code listing 6.5.5 satisfies the first
condition – no column name appears more than once in the select list. It also satisfies
the second as each row in the view is derived from a single row in the Staff table.
It does not satisfy the third, however, as the name column is assembled from
the values of three different columns of the underlying row. This means that the
ComputingStaff relational view is not updateable. The Staff view is updateable
as it satisfies all three conditions. However, it is not possible to add a new member of
staff through this view. In the underlying base table FullStaff, the department
column has a not-null constraint. The Staff view doesn’t include this column, so an
insert statement using the Staff view can’t include data for a mandatory column,
and rows cannot be added through the Staff view. Existing rows can have the data
in their staffNo, fName, lName, phone, email and photo columns
modified, though.
The relational view PortableEquipment, created in Code listing 6.5.1, is also
updateable and illustrates another problem when updating the underlying base table
through a view. Consider what happens when the SQL INSERT statement shown in
Code listing 6.5.7 is executed.
Code listing 6.5.7 Adding a row into a base table via a view
INSERT INTO PortableEquipment (assetNo, assetType,
description)
VALUES (15733, 'OHP', 'Basic, two-bulb OHP');
The relational view PortableEquipment only has three columns, so only values for
these columns can be supplied with the INSERT statement. A new row is added to the
underlying base table Equipment with assetNo set to 15733, assetType to ‘OHP’
and description to ‘Basic, two-bulb OHP’. The remaining columns of the Equipment
table are set to their default values: building is set to NULL, room is set to NULL and
portable is set to FALSE. (See Code listing 5.4.2 in Chapter 5 for the relevant create
table statement.) This creates a problem. The row just added to the Equipment table
via the view PortableEquipment has its portable column set to FALSE. The
definition of the PortableEquipment view states that such rows cannot be retrieved
through the view. To the end user, seeing only those rows that can be retrieved through
the PortableEquipment view, it will appear as though the INSERT statement failed.
Allowing users to add rows via a view that won’t appear in the view instance is
certain to cause confusion. SQL includes the WITH CHECK OPTION to prevent this.
When this is included in the view definition (Code listing 6.5.8), only rows that can be
retrieved through the view instance can be added to the underlying base table via the
view. In the case of the PortableEquipment view, this stops users adding any rows
to the underlying base table via the view, because the portable column is not part of
the view and so always takes its default value of FALSE.
Code listing 6.5.8 An updateable view with the WITH CHECK OPTION
CREATE VIEW PortableEquipment
AS SELECT assetNo, assetType, description
FROM Equipment
WHERE portable = TRUE
WITH CHECK OPTION;
Chapter summary
■ This chapter has examined a range of more advanced applications of the data lan-
guage SQL. Joins, subqueries and aggregate queries were discussed in some depth.
These kinds of queries are very common, so a web database developer needs to
understand how to implement them. Some examples of SQL’s value functions were
discussed, with an emphasis on manipulating dates and character strings. The
chapter has also looked at more advanced examples of SQL DDL, including using a
subquery in a check constraint and creating SQL views.
Further reading
Review questions
6.1 Explain the following terms used in relation to the SQL select statement:
(a) qualified identifier
(b) correlation name
(c) join
(d) join condition
(e) equi-join.
6.2 Explain the differences between a cross join, inner join and outer join (assume that
the join involves only two tables).
6.3 In SQL, what is a value function? Name three ISO SQL value functions and explain
the purpose of each.
6.4 In SQL, what is an aggregate (or set) function? Name three ISO SQL aggregate
(or set) functions and explain the purpose of each.
6.6 How does the SQL having clause differ from the SQL where clause?
6.7 Describe the order in which a DBMS processes the various clauses of the select
statement.
6.8 What is a subquery? Suggest two situations for which a subquery could be used.
6.10 What facilities does ISO SQL provide for creating and manipulating views?
Exercises
These exercises use the Room and Equipment tables from this chapter and the Course,
Module and Staff tables created in the exercises for Chapter 5 as practice for writing
SQL statements. The table schema are:
description VARCHAR(50),
building VARCHAR(15),
room VARCHAR(4),
portable BOOLEAN DEFAULT (FALSE) NOT NULL,
cost NUMERIC(7,2),
acquired DATE
)
PRIMARY KEY (assetNo)
FOREIGN KEY (building, room) REFERENCES Room (building, roomNo)
6.13 The following questions provide practice with SQL value functions.
(a) Display a list of staff showing their title, first initial and surname.
(b) List all staff whose surname starts with the same character as their first name.
(c) List all modules with titles longer than 20 characters.
(d) Managers have decided that course titles should be stored in upper case. Write
an update statement to implement this requirement.
(e) Managers have now decided that having course titles in upper case is too ag-
gressive and want them in sentence case, where only the first letter is upper
case. Write an update statement to implement this requirement.
(f) What problems can making such changes to the stored data produce? (Hint:
What happens to the course with the title ICT?) How could you meet these re-
quirements without changing the data in the base table?
6.14 The following questions provide practice with SQL aggregate functions and queries.
For each, write a suitable SQL statement.
(a) How many courses have a second (minor) department involved in their delivery?
(b) Produce a list of departments and the numbers of courses they deliver as lead
departments.
(c) Produce a list of departments and the numbers of courses they deliver as lead
departments, excluding courses that lead to an HND qualification.
(d) Produce a list of departments and the numbers of courses they deliver as lead de-
partments. Only show information for departments with three or fewer courses.
(e) Produce a list showing how many modules each department offers at each level.
(f) Produce a list of staff showing how many modules they lead.
6.15 The following questions provide practice with SQL subqueries. For each, write a
suitable SQL statement.
(a) Using a subquery, display the staff details for the module leader of the module
with the code CCHO0418.
(b) Write two subqueries to display the staff details for all module leaders. One
should be a correlated subquery, the other not.
(c) Using a subquery, display the module details for all modules except those led by
staff who work in the computing department.
6.16 Give an example, other than the ones given in this chapter, of a situation for which
you would use a self join. Provide a suitable table schema and a rationale for an or-
ganization wanting to use the result set of the self join.
Investigations
6.17 Investigate the support your DBMS provides for storing and manipulating dates,
times and intervals. Using a suitable example, write SQL queries to demonstrate the
date and time manipulation functions.
6.18 Investigate the SQL view mechanism and its implementation in various DBMS. There
are two important issues. First, what does the DBMS actually do when a user queries
a view? Does the DBMS physically assemble the view as a derived table and then
query it or does it use the view definition to translate the query into one that only
mentions base tables? The second issue is whether or not the user can update the
base tables through the view.
6.19 The ISO SQL:2003 standard’s DML includes a number of statements not covered in
this textbook. For example UNION, INTERSECT, EXCEPT and MERGE are all wide-
ly supported SQL DML statements, though less widely used. Investigate these state-
ments, identifying scenarios for which they could prove useful.
6.20 The ISO SQL standard, and many of the commercial DBMS, include support for mul-
tiple character sets. This is particularly important in the context of web databases as
the Web is a global medium and databases may need to store information in many
different languages. Investigate the extent to which your DBMS supports internation-
alization. Does it follow the ISO SQL standard? If not, critically assess its deviations
from the standard.
Chapter objectives
Chapter outline
This chapter looks at some common issues that arise when implementing web data-
base applications. It builds on the material covered in Chapters 4, 5 and 6. The focus is
on the issues rather than production-quality solutions. Most of the code developed in
this chapter is definitely not suitable for deploying as part of a live web database appli-
cation and it is important to be clear about this. There are plenty of sources of produc-
tion-quality code to solve particular problems, but, unless application developers
understand the issues, they will struggle to use those resources effectively.
Many of the issues discussed involve the problem of maintaining application state.
Application state is the term used to refer to information about the current and previ-
ous state of an application. Typical examples of application state include the identity
of the end user, what he or she has done and any data gathered from him or her. Ex-
amples of aspects of application behaviour that need to keep track of application
state include presenting the results of a database query across multiple web pages,
keeping track of decisions made as the user browses a website and implementing a sin-
gle login process to allow access to many different web resources.
As HTTP is a stateless protocol, application state does not persist between HTTP
requests. It is worth looking in a little more detail at how this affects web database
applications. Each HTTP request–response cycle has the same general structure
(Figure 7.1). First, a web browser sends an HTTP request to the web server. This request
can include application data, either in the URL query string or the HTTP request
message’s entity body (such as data from a form sent using POST). When the request
includes application data, then the requested resource is usually a server-side script (or
other dynamic resource). On the web server, the HTTP server receives the request and
decides how to deal with it. For a PHP script, the HTTP server passes the application data to
the PHP application server, which uses it to build an HTTP response message. Once the
PHP application server has built the HTTP response message, it passes this to the HTTP
server and tidies up. ‘Tidying up’ involves clearing out any application state stored by
the PHP application server for this particular HTTP request. The HTTP server forwards
the HTTP response to the web browser, then it also tidies up. The consequence of this is
that there is no record of application state on the web server. With simple applications
this is not a problem. In the Staff Directory, the web page that gathers the search criteria
submits this data to a PHP script for processing. The script uses the data to search the
database, then builds an HTTP response (a web page) to deliver the search results. Once the
web server sends the HTTP response, the application state – that is, the search criteria – is
lost. This is not a problem because everything the end user wanted done has been done.
Now consider what happens if the Pennine University decides to password protect its
Staff Directory. Before end users can submit a search they must first log in. When users
request the search form from the Staff Directory, the application asks them for their
username and password. Only when the end users have supplied a valid username and
password do they see the search form. They can then enter the search criteria, but must
log in again before they see the web page with the results. In fact, every time they request
a protected resource, they need to log in. This is clearly not very user-friendly. End users
expect to log in once, not every time they want to do anything.
There are three basic solutions to the problem of maintaining application state be-
tween two different HTTP requests. The simplest solution is to get the web browser to
remember any state information and send it along with every HTTP request. One way
to achieve this is to get the web browser to add the state information as name-value
pairs to the query string of the URL for the requested resource. This is called URL
rewriting and is a simple and effective approach if there is only a small amount of ap-
plication state data. If there is a lot of application state data, then it is possible to place
it inside an HTML form as hidden form controls. When the user submits the form, the
application state data is submitted, too.
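The hidden form control approach can be sketched in PHP as follows. This is only an illustration, not code from the Staff Directory application: the function name hiddenControls() and the state values are assumptions made for the example. Each item of application state becomes one hidden control, so it is sent back to the server when the form is submitted.

```php
<?php
// Hypothetical helper (not from the book's scripts): turns an array of
// application state into hidden form controls for the next form submission.
function hiddenControls(array $state): string
{
    $html = '';
    foreach ($state as $name => $value) {
        // Escape names and values so that state data cannot break the markup.
        $html .= sprintf(
            '<input type="hidden" name="%s" value="%s">' . "\n",
            htmlspecialchars((string) $name, ENT_QUOTES),
            htmlspecialchars((string) $value, ENT_QUOTES)
        );
    }
    return $html;
}

// Example state: the search criterion and the current position in the results.
echo hiddenControls(['surname' => 's', 'start' => 4]);
```

Hidden controls are not secret. End users can read and alter them, so a script receiving them should validate the values just like any other user input.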
The HTTP protocol itself includes a way to maintain login data between requests.
HTTP authentication causes the web browser to store login details for a particular web
server and send them with each request to that web server.
Sessions generalize this idea. The web server gives each end user a unique session
identifier, and the web browser sends this unique session identifier with each request.
The web server can associate the application state for that user with the session identi-
fier so that it is always available.
■ Section 7.1 begins the chapter by looking at PHP’s built-in error-reporting facilities.
Understanding these can save a lot of time and frustration when trying to debug a
PHP script.
■ Section 7.2 uses URL rewriting to implement a script that can deliver the result set
of a database query across multiple web pages. This is a common technique, used
on most e-commerce retail sites.
■ Section 7.3 delves deeper into data validation and shows how to use a single script
to both gather and process data entered by end users. Single script solutions are
useful as they allow the script to display messages about validation failures next to
the relevant HTML form control.
■ Section 7.4 looks at the reload problem. This problem affects many of the scripts
used so far. The solution uses PHP to write HTTP headers – an important technique
that is used in later sections, too.
■ Section 7.5 discusses hidden form controls in the context of ‘wizards’, which guide
end users through long or complex processes.
■ Section 7.6 discusses sessions – the technique that provides the most robust approach
to maintaining application state. Any application that uses URL rewriting or hid-
den form fields could be rewritten to use sessions.
■ Section 7.7 concludes the chapter with a discussion of HTTP authentication.
The error reporting discussed in Chapter 4 mainly dealt with errors that arise while the
server-side script is communicating with the DBMS. This is a small part of the larger
problem of error handling.
There are different kinds of errors that can occur. Some of the hardest to detect are
those that arise from incorrect program logic, such as an incorrect condition on an if
statement or a misspelt variable name. These don’t cause the script to stop, but do
mean that it produces unexpected, often incorrect, results.
Another kind of error – called a runtime error – occurs when a valid script encoun-
ters a situation that it just cannot deal with. Errors communicating with the DBMS are
runtime errors. With these, the script will continue to execute, but won’t work properly.
Chapter 4 showed how to check for such errors and deal with them. Trying to use an
unknown function, perhaps because the function name is misspelt, will also cause a
runtime error. This one will stop the script executing, though.
In the early stages of developing a PHP script, incorrect syntax is a common source
of errors. Some syntax errors can prevent the script from executing – a missing semi-
colon, for example. Others may allow it to execute, but could stop it working properly.
For example, failing to enclose the key of an associative array in quotes can cause
problems. It’s common to see $_GET[staffNo] instead of $_GET["staffNo"]. PHP
will first interpret $_GET[staffNo] as referring to the element of $_GET whose key is
the value of the staffNo constant. Only if there is no such constant does it
interpret it as the element the key of which is the string literal "staffNo". (This
fallback was removed in PHP 8, where using an undefined constant raises an error instead.)
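A short sketch makes the difference visible. It assumes, purely for illustration, that some other part of the application has defined a constant called staffNo. The array $params stands in for $_GET, and its values are invented.

```php
<?php
// Assumption for illustration: a constant that happens to share its name
// with the intended array key.
define('staffNo', 'surname');

// Stand-in for $_GET, with made-up values.
$params = ['staffNo' => '31210', 'surname' => 'Smith'];

// Unquoted: PHP resolves the constant first, so this reads $params['surname'].
$unquoted = $params[staffNo];

// Quoted: always the element whose key is the string "staffNo".
$quoted = $params['staffNo'];

echo "unquoted: $unquoted, quoted: $quoted\n";
```

The script runs without any error message, yet the unquoted form silently returns the wrong element. This is why quoting string keys is the safer habit.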
PHP, like most programming languages, provides facilities to automatically detect
and report as many errors as practicable. This section examines PHP’s built-in error-
reporting facilities and how to use them to debug a PHP script. PHP classifies errors
into several different levels, according to when they occur and how serious they are.
As mentioned in Chapter 4, the instructions in the human-readable PHP code have to
be transformed into a format that the PHP application server can understand. The first
step in this process is parsing, which means checking that the code in the PHP script
is grammatically correct PHP. If the code is not grammatically correct, then the PHP
parser will report an error. The fragment of PHP code in Figure 7.2 (a) has an example
of an error that the parser will pick up. There should be a semicolon at the end of line
27. The PHP parser generates the error message shown in Figure 7.2 (b).
The format of PHP error messages follows this standard pattern. First comes the date
and time the error occurred (this is only included in log file error messages, not display
messages – see below), then a brief description of the kind of error. As this error was
identified by the PHP parser, it is called a PHP Parse error. The error message it-
self begins with a description of the problem:
only works if the HTML version of the PHP manual is installed with the PHP software.
The warning message itself is failed to open stream: No such file or
directory ..., which indicates that the file called DataCleansing.inc could
not be found. The warning message ends, as usual, with the script’s pathname and the
line number where the problem was first recognized.
One common use of include files is to share functions between different scripts.
Indeed, this is the purpose of the include file DataCleansing.inc. If the include file can-
not be found, then any subsequent attempts to use its functions will cause a fatal error.
In this case, PHP immediately stops running the script and issues an error message like
this:
Line 55 refers to an array element by its index number, in this case $myArray[2].
Unfortunately, PHP numbers array elements from zero, so the only valid elements
are $myArray[0] and $myArray[1]. Thus, when it reaches line 55, PHP
issues the notice:
where the value of the parameter level is one of the constants from Table 7.1. In fact,
these constants can be combined to give a mixture of different levels. For example:
error_reporting(E_ALL | E_STRICT);
Constant             Description
E_ERROR              Fatal runtime errors. These indicate errors that cannot be recovered from, such as a memory allocation problem. Execution of the script is halted.
E_WARNING            Runtime warnings (non-fatal errors). Execution of the script is not halted.
E_PARSE              Compile time parse errors. Parse errors should only be generated by the parser.
E_NOTICE             Runtime notices. Indicate that the script encountered something that could indicate an error, but could also happen in the normal course of running a script.
E_CORE_ERROR         Fatal errors that occur during PHP’s initial startup. This is like an E_ERROR, except that it is generated by the core of PHP.
E_CORE_WARNING       Warnings (non-fatal errors) that occur during PHP’s initial startup. This is like an E_WARNING, except that it is generated by the core of PHP.
E_COMPILE_ERROR      Fatal compile time errors. This is like an E_ERROR, except that it is generated by the Zend Scripting Engine.
E_COMPILE_WARNING    Compile time warnings (non-fatal errors). This is like an E_WARNING, except that it is generated by the Zend Scripting Engine.
E_USER_ERROR         User-generated error message. This is like an E_ERROR, except that it is generated in PHP code by using the PHP function trigger_error().
E_USER_WARNING       User-generated warning message. This is like an E_WARNING, except that it is generated in PHP code by using the PHP function trigger_error().
E_USER_NOTICE        User-generated notice message. This is like an E_NOTICE, except that it is generated in PHP code by using the PHP function trigger_error().
E_ALL                All errors and warnings, as supported, except level E_STRICT.
E_STRICT             Runtime notices. Enable you to have PHP suggest changes to your code that will ensure the best interoperability and forward compatibility of the code.
instructs PHP to report all errors and all coding hints. Application developers who
don’t want coding hints or notices can use:
This will ensure that only warnings and error messages are displayed. Calls to the
error_reporting() function should come before any other PHP code, for obvious
reasons.
The symbols ‘|’, ‘&’ and ‘~’ are the PHP bitwise operators. There are explanations of
these in the PHP manual (PHP Documentation Group, 2005).
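As an illustration of how the bitwise operators combine error levels, consider the sketch below. The particular masks are assumptions chosen for the example, not necessarily the calls used in the book's scripts. Each level constant is an integer with its own bit set, so ‘|’ adds levels to a mask, while ‘&’ together with ‘~’ removes them.

```php
<?php
// Each E_* constant has distinct bits set, so levels can be combined
// or removed with the bitwise operators.
$allButNotices = E_ALL & ~E_NOTICE;   // every level in E_ALL except E_NOTICE
$onlyTwo       = E_ERROR | E_WARNING; // just these two levels

// error_reporting() returns the previous level, so it can be restored later.
$previous = error_reporting($allButNotices);

// A level is switched on when its bit survives a bitwise AND with the mask.
$noticesOn  = (error_reporting() & E_NOTICE) !== 0;
$warningsOn = (error_reporting() & E_WARNING) !== 0;

var_dump($noticesOn, $warningsOn);

// Put the original setting back.
error_reporting($previous);
```

With this mask, runtime notices are suppressed while warnings are still reported. Note that, from PHP 5.4 onwards, E_ALL also includes E_STRICT.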
Sorting out which errors PHP reports and how it reports them is only the first step in
error handling. PHP allows application developers to write their own error-handling
functions. These could automatically distinguish between development and produc-
tion environments and issue different styles of error message. They can even be set up
to e-mail errors to the application support team. Since PHP version 5, object-orientated
programming has been incorporated into PHP, so it now supports Java-style exception
handling with ‘try’ and ‘catch’ blocks. Good sources to consult on these more advanced
issues are mentioned in the Further reading section at the end of this chapter.
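A minimal sketch of a custom error handler is shown below, assuming PHP 7 or later. It is one common pattern rather than the approach of any particular source: the handler converts PHP errors into ErrorException objects so that they can be dealt with in a ‘catch’ block. The error message text is invented for the example.

```php
<?php
// A sketch of a custom error handler that turns PHP errors into exceptions,
// so that they can be handled with try and catch blocks.
set_error_handler(function (int $level, string $message, string $file, int $line): bool {
    // Respect the current error_reporting() setting.
    if (!(error_reporting() & $level)) {
        return false; // let PHP's built-in handling take over
    }
    throw new ErrorException($message, 0, $level, $file, $line);
});

$caught = null;
try {
    // trigger_error() raises a user-level error (see E_USER_WARNING in Table 7.1).
    trigger_error('Staff photo is missing', E_USER_WARNING);
} catch (ErrorException $e) {
    $caught = $e->getMessage();
}

echo "Caught: $caught\n";

// Reinstate whichever handler was in place before.
restore_error_handler();
```

A production handler would typically log the error or notify the support team rather than simply echo it.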
A common feature of web database applications is the search results page. The Staff
Directory search page is typical of this feature. Data is drawn from the database and
delivered to the end user as a list of results, with hyperlinks to further details. Where
the number of results is likely to be small, delivering them all on a single web page is
an acceptable solution. Where the number of results is large, however, they need to
be delivered across multiple pages, with each page delivering a fixed-sized chunk of
the result set. So, for example, the Staff Directory search could deliver four results per
page. If a particular search returned ten results, then the first four would be delivered
on page one, the next four on page two and the final two on page three. This presents
a problem for web database applications. HTTP is a stateless protocol, so the web server
does not remember which page the user was looking at and so cannot automatically
deliver the correct chunk of the result set.
Two mechanisms are required to deliver a result set across multiple pages. The first
must allow the application to retrieve a particular chunk of the result set – the first
5 rows only, then rows 6–10 and, finally, rows 11–12, for example. Second, the appli-
cation needs to implement a mechanism for navigating from page to page that main-
tains the application state between HTTP requests. The application state keeps track of
which chunk of the result set to retrieve for the next page.
The SQL:2003 standard does not provide a mechanism to allow the application to specify
which chunk of the result set it wants (one was added in SQL:2008). Instead, the application would need to retrieve
the whole result set and navigate through to the chunk it required. This is not difficult,
but can involve the web server doing a lot of work as it trawls through a large result set.
However, the MySQL DBMS has an extension to the ISO SQL standard that does allow
the application to tell the DBMS which chunk of the result set is required. This is a very
handy feature for web database applications, shifting the work of navigating through
the result set to the required chunk from the web server to the database server.
Figure 7.4 shows the general form of the limit clause of a MySQL DBMS select statement.
It comes after the order by clause and begins with the keyword LIMIT. This is
followed by two values. The value offset is the number of the row to start from (rows
are numbered from zero, like PHP arrays). The other value, rowCount, is the number
of rows to retrieve. So, for example, the limit clause LIMIT 8,4 tells the DBMS to
start from the ninth row of the result set and return four rows in total.
Figure 7.4 The format of the MySQL DBMS limit clause of the select statement.
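The arithmetic linking a page number to the offset can be sketched in PHP. The function names pageOffset() and limitClause() are hypothetical, introduced only for this example. Pages are numbered from one, whereas rows are numbered from zero.

```php
<?php
// Hypothetical helper: offset of the first row on a given page,
// with pages numbered from 1 and rows numbered from 0.
function pageOffset(int $page, int $rowCount): int
{
    return max(0, ($page - 1) * $rowCount);
}

// Hypothetical helper: the text of the limit clause for that page.
function limitClause(int $page, int $rowCount): string
{
    return sprintf('LIMIT %d, %d', pageOffset($page, $rowCount), $rowCount);
}

echo limitClause(1, 4), "\n"; // first page of four rows
echo limitClause(3, 4), "\n"; // third page: starts from the ninth row
```

With four rows per page, the third page corresponds to the LIMIT 8,4 example above.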
Code listing 7.2.1 shows three equivalent MySQL DBMS select statements to re-
trieve the first four rows of the result set. Code listing 7.2.1 (a) shows the standard for-
mat. Code listing 7.2.1 (b) illustrates that when only one number is supplied for the
limit clause, it is the rowCount rather than the offset. Finally, Code listing 7.2.1 (c)
shows a longer form of the limit clause, which uses the keyword OFFSET to indicate
the offset. It is important to understand that the DBMS will retrieve all rows that satisfy
the where clause and put them in order before it uses the limit clause to decide which
rows to deliver to the end user.
Code listing 7.2.1 Three equivalent statements using the MySQL DBMS
limit clause
(a) Specifying the offset
SELECT fName, lName, email FROM Staff
WHERE lName LIKE '%sm%'
ORDER BY lName, fName
LIMIT 0, 4;
The PostgreSQL DBMS provides a similar extension to ISO SQL. It uses the format
shown in Code listing 7.2.1 (c). See www.postgresql.org for details.
As usual, the DBMS may not actually have to retrieve all rows that satisfy the where
clause before applying the limit clause, though conceptually that is how it works.
The DBMS manual will include guidance on any optimization features associated
with the limit clause.
The limit clause provides the mechanism for delivering a chunk of the result set that
greatly simplifies the process of delivering result sets over multiple pages. All the database
client (the web server) need do is keep track of where in the result set it is. To request the
next page of data, it simply reruns the original database query, but with a different offset.
The database query should include an order by clause to ensure that each time it is run
the rows of the result set appear in the same order. This helps to ensure that, as the end
user navigates back and forth through the set of web pages, the same rows are displayed
each time a page is viewed. Although an order by clause helps keep things consistent, it
can’t be guaranteed that the data won’t change. Another user could modify the database
instance while the first page is being viewed. This may alter which rows appear on the
second page. In the worst case, some rows may end up being missed out.
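Rerunning the query with a different offset might be sketched as follows. This is an illustration under stated assumptions, not the book's script: the names ROWS_PER_PAGE, cleanStart() and pagedQuery() are hypothetical. The search term is left as a placeholder to be bound by whichever database API is in use. The offset arrives from the web browser, so it is validated before being written into the SQL.

```php
<?php
const ROWS_PER_PAGE = 4; // hypothetical constant: rows delivered per web page

// Hypothetical helper: coerce the offset received from the browser into a
// non-negative integer; anything else falls back to the first page.
function cleanStart($raw): int
{
    $start = filter_var($raw, FILTER_VALIDATE_INT);
    return ($start === false || $start < 0) ? 0 : $start;
}

// Builds the paged query. Only validated integers are written into the SQL
// text; the search term stays a placeholder (?) for the database API to bind.
function pagedQuery(int $start): string
{
    return 'SELECT fName, lName, email FROM Staff '
         . 'WHERE lName LIKE ? '
         . 'ORDER BY lName, fName, staffNo '
         . sprintf('LIMIT %d, %d', $start, ROWS_PER_PAGE);
}

echo pagedQuery(cleanStart('4')), "\n";   // second page
echo pagedQuery(cleanStart('abc')), "\n"; // invalid input: back to offset 0
```

Only the limit clause differs between the two statements, so the same order by clause applies to every page.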
The mechanism for page navigation is typically implemented using a ‘previous
page’ and ‘next page’ metaphor. The different web pages are generated by the same
PHP script. Each HTTP request for this PHP script supplies the appropriate offset, and
the script delivers the required chunk of data. The offset is often coded as a name-
value pair into the URL of the ‘previous page’ and ‘next page’ hyperlinks. Using the
URL to keep track of application state is called URL rewriting. It is one of a range of
techniques for overcoming the stateless nature of HTTP.
Figure 7.5 shows the first page generated by Version 1.5 of the SurnameSearch.php
script. This version differs from Version 1.4 in two significant ways.
First, it allows the end user to enter any string and returns all members of staff
whose surnames include this string. The URL in the address box at the top of the
browser window has a single name-value pair in the query string, indicating that the
end user entered the character ‘s’ as the surname to search for. The script uses the SQL
LIKE comparison operator to retrieve all staff members whose surnames include this
character.
The second difference is the ‘Next’ hyperlink at the bottom of the web page.
When the screenshot was taken, the mouse was hovering over this hyperlink, so its
URL appears in the status bar at the foot of the browser window. The URL path is the
same as in the address box, so the hyperlink will request this PHP script again. How-
ever, the query string includes two name-value pairs:
● surname=s ensures that the same search criteria are passed to the script second
time round
● start=4 tells the script that it should start from row number 4 (the fifth row of
the result set).
Figure 7.6 shows the key lines of code from the PHP script that generated the web page
in Figure 7.5. Lines 76 to 80 show the actual SQL select statement. The where clause
IDWA_C07.QXD 10/5/07 11:24 AM Page 287
Figure 7.5 Delivering the results of a database search using multiple web
pages.
Figure 7.6 Version 1.5 of the SurnameSearch.php script, showing how to deliver a result set
over multiple pages.
has a single like comparison that matches any surname that includes the string entered
by the end user. For example, if the end user entered ‘Smith’, he or she would see
everyone whose surname was ‘Smith’, ‘Shoesmith’, ‘Smithson’ and so on.
The order by clause first orders staff by surname, then by first name and finally by staff
number. This means that the several members of staff called ‘John Smith’ will appear in
staff number order. It is always sensible to include a candidate key in the order by clause
when delivering a result set over multiple web pages. A candidate key is unique to each
row, so ensures that the rows of the result set always come out in the same order.
The limit clause uses the PHP variable $start and the PHP constant MAX_ROWS
to define the offset and the number of rows to deliver. MAX_ROWS is defined in line
74, with the value 4. If, later on, it’s decided to deliver six rows on each web page, then
it is easy to change the definition of this constant and the script will work with no
further changes.
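As a rough sketch of how such a statement might be assembled (the actual code is the one in Figure 7.6), the function below builds a paged select statement from a search string and an offset. Only $start and MAX_ROWS come from the text. The function name buildPagedQuery() and the table and column names (Staff, staffNo, fName, lName) are assumptions made for illustration.

```php
<?php
// MAX_ROWS is the number of rows per web page (4 in the text's example).
const MAX_ROWS = 4;

// Hypothetical helper: builds the paged select statement as a string.
// Staff, staffNo, fName and lName are assumed names, not taken from Figure 7.6.
// $surname is assumed to have been cleansed and escaped before this call.
function buildPagedQuery(string $surname, int $start): string
{
    return "SELECT staffNo, fName, lName FROM Staff"
         . " WHERE lName LIKE '%" . $surname . "%'"
         . " ORDER BY lName, fName, staffNo"   // candidate key last keeps the order stable
         . " LIMIT " . $start . ", " . MAX_ROWS;
}

echo buildPagedQuery("s", 0), "\n"; // first page: ends with LIMIT 0, 4
echo buildPagedQuery("s", 4), "\n"; // second page: ends with LIMIT 4, 4
```

Changing the value of MAX_ROWS is the only edit needed to deliver a different number of rows per page.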
Setting the value of $start is a little trickier. When the script is first called by the
SurnameSearch.html form, there is a single name-value pair in the URL (check the
address box in Figure 7.5) and so a single GET parameter – surname. When it is called
using the ‘Next’ hyperlink, there are two name-value pairs in the URL – surname and
start (check the status bar of Figure 7.5), so two GET parameters. Lines 68–73 use
this difference to identify whether the script is being called for the first time or not. If
$_GET["start"] is empty, this must be the first time through, so the script sets
$start to 0 (line 70). This means that the select statement’s limit clause will be
LIMIT 0, 4. This is just what is needed, as it will retrieve just the first four rows of
the result set. If $_GET["start"] is not empty, then this is not the first time
through, so $_GET["start"] holds the value of the offset for the limit clause. Line
73 uses the cleanse_data() function to guard against any hacking attempts
and stores the offset in the local variable $start. Clicking on the ‘Next’ hyperlink in
Figure 7.5 means $_GET["start"]=4 and so the limit clause will be LIMIT 4, 4.
Again, this is just what is needed on the second page – start from the fifth row and
deliver four rows in total.
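The logic for setting the offset can be sketched as a small function. This is an illustration only, not the code from Figure 7.6: the function name pageOffset() is hypothetical, and is_numeric() with an integer cast stands in for the book's user-defined cleanse_data() function, which is not reproduced here.

```php
<?php
// Hypothetical helper: works out the offset for the limit clause.
// $get plays the role of the $_GET superglobal array.
function pageOffset(array $get): int
{
    // No 'start' parameter: this is the first call, so begin at row 0.
    if (empty($get["start"])) {
        return 0;
    }
    // Stand-in for cleanse_data(): accept only a non-negative number.
    if (!is_numeric($get["start"]) || (int) $get["start"] < 0) {
        return 0;
    }
    return (int) $get["start"];
}

echo pageOffset(["surname" => "s"]), "\n";                 // first page
echo pageOffset(["surname" => "s", "start" => "4"]), "\n"; // second page
```

Falling back to offset 0 for unusable values is a design choice of this sketch. A script could equally report an error.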
The rest of the script is unchanged until line 100. Line 100 resets $start to the
offset required for the next page in the sequence by adding the constant MAX_ROWS to
the current offset. Remember, $start=$start+MAX_ROWS means ‘set the new
value of $start to be the old value of $start plus the value of MAX_ROWS’.
Line 101 uses this new offset to set up the URL for the ‘Next’ hyperlink on the
page being generated and line 102 actually writes out the hyperlink. Note that the
rawurlencode() function is used to ensure that the values in the query string do
not cause problems (see Chapter 4). It is important to use rawurlencode() only on
the values. If the whole URL is encoded, then the ‘?’, ‘=’ and ‘&’ characters used to
separate the parts of the query string will also be replaced by their hexadecimal character
codes, and the URL will be unusable.
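A minimal sketch of building the ‘Next’ URL follows. The function name nextPageUrl() is hypothetical. rawurlencode() is the built-in PHP function named in the text, and it is applied to each value separately.

```php
<?php
const MAX_ROWS = 4;

// Hypothetical helper: builds the URL for the 'Next' hyperlink.
function nextPageUrl(string $script, string $surname, int $start): string
{
    $next = $start + MAX_ROWS;   // offset required for the next page
    // Encode only the values, never the '?', '=' and '&' separators.
    return $script
         . "?surname=" . rawurlencode($surname)
         . "&start=" . rawurlencode((string) $next);
}

echo nextPageUrl("SurnameSearch.php", "s", 0), "\n";
// SurnameSearch.php?surname=s&start=4

// Encoding the whole URL also encodes the separators, so it is unusable.
echo rawurlencode("SurnameSearch.php?surname=s&start=4"), "\n";
// SurnameSearch.php%3Fsurname%3Ds%26start%3D4
```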
Version 1.5 of the SurnameSearch.php script is about as simple as page navigation
gets. It only implements a ‘next page’ mechanism – end users can only navigate
forwards through the result set. There is no attempt to stop when all the rows have been
displayed – a ‘Next’ hyperlink is always created, even when the value of the offset is
greater than the number of rows in the result set. Finally, a more sophisticated
navigation mechanism would include numbered links to pages, as well as the ‘previous page’
and ‘next page’ links. Adding these features is left as an exercise.
Before leaving Version 1.5 of SurnameSearch.php, it’s worth pointing out a bug in the
code. When users search for ‘O’Connor’, they see the web page in Figure 7.7. The text
on this web page is generated by the debugging code in lines 84–86 (see Figure 7.6) and
it tells us that the syntax of the SQL select statement is incorrect. The debugging code
shows the full select statement in the second line (repeated in Code listing 7.2.2) and
the MySQL DBMS error message on the next two lines. The problem is in the where
clause. There are three single quote marks in '%O'Connor%' – the two single quote
marks that delimit the string literal, plus the apostrophe character that is part of the
string literal. The MySQL DBMS interprets the first single quote as the start of a string
literal and the second as the end of the string literal, even though the second is
actually an apostrophe, not a delimiter. So, the MySQL DBMS thinks that the string literal is
'%O'. The rest of the string literal – Connor%' – is assumed to be a syntax error.
A similar problem can sometimes arise with double quote marks – often used around
the house name in an address, such as ‘Mr P. Smith, “Sunnydale Barn”, Upper
Cumberworth, West Yorkshire’. Whether or not double quote marks also cause a problem
depends on the particular DBMS being used. If there is a problem, it can usually be
solved in the same way.
Code listing 7.2.3 A user-defined PHP function to solve the SQL apostrophe problem
function sqlApostrophe($value)
{
    /*-- Begin function header
    -- ***************************************************
    -- Comment  Replaces a single apostrophe with two apostrophes.
    --          Needed to avoid problems with SQL string literals
    -- Version  Date        Author       Comment
    -- 0.1      2005-11-05  M. Prigmore  First draft
    -- ***************************************************
    -- End function header */
    if (empty($value))
        return "";
    else
        return str_replace("'", "''", $value);
}
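To see the effect of the function, the sketch below builds the where clause for ‘O’Connor’ with and without it. The function body is repeated from Code listing 7.2.3 so that the sketch runs on its own. The column name lName is an assumption made for illustration.

```php
<?php
// Function repeated from Code listing 7.2.3 so that the sketch runs on its own.
function sqlApostrophe($value)
{
    if (empty($value))
        return "";
    else
        return str_replace("'", "''", $value);
}

$surname = "O'Connor";

// lName is an assumed column name.
$broken = "WHERE lName LIKE '%" . $surname . "%'";
$fixed  = "WHERE lName LIKE '%" . sqlApostrophe($surname) . "%'";

echo $broken, "\n"; // WHERE lName LIKE '%O'Connor%'  (three quote marks: syntax error)
echo $fixed, "\n";  // WHERE lName LIKE '%O''Connor%' (doubled apostrophe is read as one)
```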
Database modifications involve gathering data from the end user with an HTML
form, then, in a server-side script, using this data to modify the database. Chapter 4
showed how to add a new support session to the database. The solution used two separate
web resources – a static HTML form to gather the data and a server-side PHP
script to write this data to the database. It is more common for web applications to
use a single server-side script to achieve both tasks. When the user first requests the
script, it delivers the HTML form to gather the data. The action attribute of the form
points to the same server-side script. So, when the user clicks the submit button, the
script is called for a second time. This time the script uses the form data to modify the
database instance.
Delivering different content depending on whether or not the server-side script has
been run before is a common approach. It was used above, with one script delivering
several different pages of content, and will be used again. There are two reasons for
taking this approach. First, all processing for a particular application behaviour is
contained in a single script, so it is easier to maintain. Second, reporting data
validation errors can be more user-friendly, with error messages appearing next to the
HTML form control with the error. Figure 7.8 (a) shows a portion of Version 0.2 of the
SupportSessionInsert.php server-side script.
The server-side script begins with the usual HTML, then slips into PHP in line 37.
The if statement in line 38 checks if any data has been entered into the HTML form
controls (notice that the script assumes data is passed from the HTML form using the
GET method). If any of the expected values are missing, then the end user has not
entered the required data, so the script delivers the HTML form. Rather than using an
echo statement to write each line of HTML to the HTTP response message, the script
drops out of PHP in line 41. The PHP application server writes lines 42 and 43 directly
to the HTTP response message. It also writes out the first part of line 44, but part way
along the line it encounters the PHP start tag ‘<?php’, so, once again, begins to
interpret the PHP statements. There is just one:
echo $_SERVER["PHP_SELF"];
$_SERVER[] is another of the PHP superglobal arrays. It works just like the $_GET
and $_POST superglobal arrays, but holds information passed to PHP from the HTTP
server. $_SERVER["PHP_SELF"] holds the (relative) URL of the PHP script itself.
This URL is written out as the value of the action attribute of the HTML form, ensuring
that the HTML form submits its data back to this same script. This is clearly visible
in the HTML source shown in Figure 7.8 (b). The PHP end tag ‘?>’ drops the script
back out of PHP. The last two characters of line 44, ‘">’, are written straight to the
HTTP response message, as are lines 45–51.
This approach to coding PHP, dropping in and out of PHP as and when it suits,
makes it very easy to code server-side scripts and is one reason for the popularity of
PHP as a server-side scripting language. Much of the server-side script looks, and behaves,
like plain HTML. However, it does look rather odd. The non-PHP code in lines 42–51 is
actually inside the if block of the PHP if statement that starts in line 38. The if block
only ends in line 53, with the closing curly bracket (notice, too, that there is an else
block to come). If there were more lines of HTML, it would be easy to miss this and
include some code that should not be part of the if block. Some PHP developers prefer
to stay in PHP once they start an if statement and use echo statements to write out all
the HTML. (This is the approach to server-side scripting used in CGI scripts.)
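The ‘stay in PHP’ style might look like the following sketch, which writes out a cut-down form without ever dropping out of PHP. It is not the code from Figure 7.8: the function name supportSessionForm() and the form's contents are assumptions, and the call to htmlspecialchars() is an extra precaution added here.

```php
<?php
// Hypothetical helper: returns the HTML form as a string, written entirely from PHP.
function supportSessionForm(string $action): string
{
    // htmlspecialchars() is an extra precaution added in this sketch.
    $safeAction = htmlspecialchars($action, ENT_QUOTES);
    return '<form method="get" action="' . $safeAction . '">' . "\n"
         . '  Staff number: <input type="text" name="staffNo">' . "\n"
         . '  <input type="submit" value="Add support session">' . "\n"
         . '</form>' . "\n";
}

// PHP_SELF makes the form submit its data back to this same script.
echo supportSessionForm($_SERVER["PHP_SELF"]);
```

Because the HTML is produced by ordinary PHP statements, it is obvious which block of code each line belongs to, at the cost of less readable HTML.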
When the end user fills out the HTML form and clicks on the submit button, the
script is called for a second time. Assuming all the required data was entered, the if
condition in line 38 is false, so PHP skips over the if block to the else block (Figure 7.9).
The code here is the same as that in Version 0.1 (except that Version 0.1 used POST
rather than GET). The script writes the data gathered from end users to the database
and delivers a success (or error) message.
The main problem with Version 0.2 of the script is that if the user forgets to enter
data for any one of the four controls, then the script shows the form again, with no
explanation. This will be very confusing. The solution is to add some more robust data
validation and tell end users what they are doing wrong.
Adding data validation, and reporting validation failures, is an important step towards
delivering a professional web application. Chapter 4 showed how to check whether or
not a control has a value using the PHP function empty(). Whenever the data
gathered by a control is required, then this sort of validation should be done. When end
users add new support sessions, they must supply four required pieces of data: their
staff number, the day the support session runs and the start and end times. Each of
these data items is required.
Figure 7.9 Dealing with the form data in SupportSessionInsert.php Version 0.2.
Leaving a text box empty is not the only way end users could enter invalid data. Each
text box requires additional data validation:
In a traditional GUI application, each control’s value would be validated as the end
user moved the cursor to the next control. For example, as the end user moved from
‘Staff number’ to ‘Day’, the application would check that the value entered in ‘Staff
number’ was valid. If not, it would move the cursor back to the ‘Staff number’ and
warn the end user that the data entered was invalid. To use this model of data
validation in a server-side script, the HTML form would need to submit its data to the web
server every time the user moved off a control. This is not how HTML forms work, so
another approach is needed.
For the ‘Day’ control, there are only five valid values. A very effective way to
prevent invalid data is simply to list these values in an HTML select control. The HTML
select control shows end users a list of values and allows them to select one. This means
that the only way end users can submit invalid data is if they forget to select one of the
available values. A similar approach works for the time, provided the hours and
minutes are selected separately. Web applications make extensive use of this technique to
avoid the problem of data validation. If end users can only select from a list of valid
values, then the server-side script only needs to check that they did actually select one.
When it isn’t possible to list all the valid data values, as with the staff number, then
the actual values entered must be validated. Some developers prefer to check time
values themselves anyway. To illustrate data validation, the ‘Add new support session’
form uses dropdown menus for the day and start time, but text boxes for the staff
number and end time.
Figure 7.10 (a) shows the form itself and Figure 7.10 (b) the HTML source. Notice that
the dropdown menu for the day shows the text ‘Choose a day . . . ’. HTML select
controls display the first of the options by default. If this first option is a valid data value,
users may forget to change it, and submit incorrect data. For example, they might have
meant to add a support session starting at 10:30 on Tuesday, but accidentally add one
starting at 09:00 on Monday. One common solution to this problem is to make the
first option something like ‘Choose a day . . . ’. If end users click on the submit button
too soon, the PHP script realizes that there has been some kind of mistake and
redisplays the form. A similar approach is taken with the start time, which has dropdown
Figure 7.10 The HTML form used to gather data from end users to add a
new support session.
select controls for both the hours and minutes. The first option for each select control
is empty – <option></option>. Again, if users click on the submit button before
choosing a valid start time, the PHP script recognizes this and redisplays the form.
Remember that the HTML source code in Figure 7.10 (b) is generated by the PHP
script SupportSessionInsert.php, but does not contain any PHP. It is important to keep
this in mind. When SupportSessionInsert.php is requested for the first time, it
generates the form in Figure 7.10 (a). This form gathers data from end users and submits the
It is also possible to write validation tests that ask ‘Is the data valid?’ In this case,
satisfying the condition means passing the validation test. For example, the valid values
for day are the string literals ‘Monday’ to ‘Friday’. The if statement:
if ($_GET["day"]=="Monday" or
$_GET["day"]=="Tuesday" or
$_GET["day"]=="Wednesday" or
$_GET["day"]=="Thursday" or
$_GET["day"]=="Friday")
asks if the value of $_GET["day"] is one of these valid values. If the value satisfies
the condition, then it passes the validation test.
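The same ‘Is the data valid?’ test can be written more compactly by keeping the valid values in an array. This is an alternative sketch rather than the book's code: the constant VALID_DAYS and the function isValidDay() are hypothetical names, while in_array() is a built-in PHP function.

```php
<?php
// Hypothetical constant listing the five valid values for the day control.
const VALID_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"];

// Hypothetical helper: TRUE means the value passes the validation test.
function isValidDay($day): bool
{
    // The third argument makes in_array() compare value and data type.
    return is_string($day) && in_array($day, VALID_DAYS, true);
}

var_dump(isValidDay("Tuesday"));            // bool(true)
var_dump(isValidDay("Choose a day . . .")); // bool(false)
var_dump(isValidDay(""));                   // bool(false)
```

A test of this kind rejects the default option, the empty string and any hand-typed value in one step.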
When the data satisfies the condition in line 60 (that is, fails the validation test), PHP
executes the code block in lines 61–64. This code sets $validationFail to TRUE
(line 62) and adds an element to the $validationMessage array (line 63). The
element key is startHour, which indicates that this validation failure relates to the
startHour control, and the element value is Please choose the start hour,
which explains what users have done wrong. In contrast, when users choose a valid
hour from the select control, the data does not satisfy the condition and so passes
validation. In this case, because there is no else block, PHP moves on to the next statement
in line 65. This is correct – if the data passes validation then there is nothing to do.
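The pattern just described can be sketched as follows. The variable names $validationFail and $validationMessage and the start hour message come from the text. The function name validateStartTime() and the wording of the start minute message are assumptions.

```php
<?php
// Hypothetical helper that mirrors the tests described for the start time.
// $get plays the role of the $_GET superglobal array.
function validateStartTime(array $get): array
{
    $validationFail = false;
    $validationMessage = [];

    // Satisfying the condition means FAILING the validation test.
    if (empty($get["startHour"])) {
        $validationFail = true;
        $validationMessage["startHour"] = "Please choose the start hour";
    }
    // No else block: when the data passes there is nothing to do.
    if (empty($get["startMinute"])) {
        $validationFail = true;
        $validationMessage["startMinute"] = "Please choose the start minute"; // assumed wording
    }
    return [$validationFail, $validationMessage];
}

[$fail, $messages] = validateStartTime(["startHour" => "", "startMinute" => "15"]);
var_dump($fail);     // bool(true)
print_r($messages);  // one element, with the key startHour
```

Note that empty() treats the string "0" as empty but not "00", so this sketch relies on two-digit option values such as "09" and "00".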
The same sort of validation test is applied to startMinute (lines 65–69). In both
these cases, it makes sense to check that there is actually some data, because the first
option in the HTML select controls is empty – it is <option></option>. Users
could easily choose this empty option by mistake. The validation for the day select
control is different. None of the options for this select control is empty, so it seems
reasonable to assume that there is no need to test for the empty string (it turns out that
this is not so reasonable). Instead, the validation test in line 55 is whether or not the
value submitted is the default – Choose a day . . . . If it is, then there has been a
mistake and a validation message is added to $validationMessage.
One problem with this is that the form uses the HTTP GET method, so the controls’
values are passed as name-value pairs in the URL query string. There is nothing to stop
end users typing values directly into the address bar. For example, they could type the
query string:
?staffNo=12345&day=&startHour=09&startMinute=15&end=10:15
The web server would happily use these name-value pairs to initialize PHP’s $_GET
superglobal array and the value of $_GET["day"] will be empty! Similarly, there is
nothing to stop end users giving a start hour of 57. In both cases, the data is invalid,
but the validation tests in Figure 7.11 do not pick up the mistake. This means that the
data validation for the select controls is not sufficiently robust. The exercises at
the end of this chapter tighten the validation so that even hand-typed data values are
validated effectively.
Another option is to use the HTTP POST method as it does not send the data via the
URL, so end users can’t easily circumvent the PHP script’s validation tests.
Lines 45–54 validate the staffNo control using a PHP if . . . elseif statement
(Figure 7.12) rather than a plain if statement. The if . . . elseif statement allows for a
sequence of tests, rather than just one. When the if condition is not satisfied, PHP
skips to the first elseif keyword and checks the condition there. If this condition is
also not satisfied, it skips to the next elseif keyword and so on.
There can also be a final else keyword, followed by a block of code telling PHP
what to do when none of the conditions has been satisfied. However, once a condition
is satisfied, the code block immediately following it is executed and the remainder of
the if . . . elseif statement is ignored.
To see how this works in practice, consider what happens when end users do not
submit a value for the staffNo control. The condition in line 45 of Figure 7.11 is
satisfied, so PHP executes the code block in lines 46–49. It sets $validationFail to
TRUE (line 47) and adds an element to the $validationMessage array (line 48). It
then ignores the rest of the if . . . elseif statement, skipping straight to line 55.
A second possibility is that end users submitted a value such as Peter. The
condition in line 45 would not be satisfied, so PHP would then look for an elseif (or,
failing that, just an else) keyword. It would find one in line 50 and check whether this
condition has been satisfied. The value ‘Peter’ is first passed to the user-defined
function cleanse_data() to remove any potentially harmful characters. The value
returned by cleanse_data() is passed straight to the PHP built-in function
is_numeric(), which will check whether or not a string of characters is a numeric
literal (see Chapter 4, Section 4.3). As the string of characters Peter is definitely not a
numeric literal, the is_numeric() function returns the value FALSE. The Boolean
not operator, ‘!’, turns this value to TRUE and the condition is satisfied. PHP then
executes the block of code on lines 51–54, indicating that the data submitted by an
end user failed validation because it is not a numeric literal.
The third possible scenario is that an end user submits a value such as 12345 for
the staffNo. Again, this value does not satisfy the condition in line 45. Nor does it
satisfy the condition in line 50 as 12345 is a numeric literal. As there are no more
elseif or else keywords, PHP skips to line 55 without doing anything.
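The three scenarios can be traced with a short sketch of the if . . . elseif statement. It is not the code from the figure: the function name validateStaffNo() and both messages are assumptions, and trim() stands in for the user-defined cleanse_data() function, which is not reproduced here.

```php
<?php
// Hypothetical helper: returns a validation message, or null when staffNo is valid.
// trim() stands in for the book's user-defined cleanse_data() function.
function validateStaffNo(array $get): ?string
{
    $message = null;
    if (empty($get["staffNo"])) {
        // First test: no value was submitted.
        $message = "Please enter your staff number";     // assumed wording
    } elseif (!is_string($get["staffNo"]) || !is_numeric(trim($get["staffNo"]))) {
        // Second test: only reached when the first condition is not satisfied.
        $message = "The staff number must be a number";  // assumed wording
    }
    // No else block: a numeric value passes both tests and nothing is done.
    return $message;
}

var_dump(validateStaffNo([]));                     // message from the first test
var_dump(validateStaffNo(["staffNo" => "Peter"])); // message from the second test
var_dump(validateStaffNo(["staffNo" => "12345"])); // NULL
```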
The end control is also a text box, so it is possible that end users could fail to
submit data for this control, too. The first validation test in this section, in lines 70–74 in
Figure 7.13, is very similar to the preceding examples. When end users enter a value
into the end control, the else block in lines 75–120 is executed.
Validating the end time is a lot more complex than any of the validation seen so far.
When users submit data for the end control, it must go through the following
sequence of validation tests:
1 check that the data value is in the format hh:mm, a one- or two-digit hour, followed
by the colon character, followed by a two-digit minute.
2 check that the hour is a numeric literal (note that PHP will treat ‘09’ as a numeric
literal).
3 check that the numeric hour value lies between 9 and 18.
4 check that the minute is a numeric literal.
5 check that the numeric minute value lies between 0 and 59.
Position: 0 1 2 3 4 5 6 7 8 9
Character: a b c d e f g h i j
1 $separator is the Boolean value FALSE – in this case $end does not include a
colon, so can’t possibly be in the format hh:mm
2 $separator is the first character – in this case there is no hour.
3 $separator is the last character – in this case there is no minute.
Any other value of $separator indicates that the end times submitted by end users
are in the correct format.
At this point, PHP’s relaxed attitude to data types causes a problem. The Boolean value
FALSE is equal to the integer 0. This means that the condition $separator==FALSE
is true for cases 1 and 2. Similarly, Boolean TRUE is equal to the integer 1, so it is
possible that cases 1 and 3 also overlap. To make it absolutely clear what the comparisons
are saying, line 80 uses the comparison operator '===', which checks that the value
and the data type are the same. With this comparison operator, Boolean FALSE and
the integer 0 are different as they are different data types. The first condition,
$separator===FALSE, is true in case 1. The second, $separator===0, is true
in case 2. The third condition, $separator===strlen($end)-1, needs a little
explanation.
The built-in PHP function strlen() calculates the number of characters in a
string. The position of the last character will be one less than the number of
characters, so, when this condition is true, $separator is the last character – this is the
third invalid case. Any value of $end that is not in the format hh:mm will fail the
validation test in line 80.
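The three invalid cases can be sketched as below. The function name isTimeFormat() is hypothetical, and the sketch assumes that $separator is set with the built-in function strpos(), which returns the position of the first colon or FALSE when there is none.

```php
<?php
// Hypothetical helper: TRUE when $end has a colon with characters on both sides.
// strpos() is assumed to be how $separator is set.
function isTimeFormat(string $end): bool
{
    $separator = strpos($end, ":");
    if ($separator === false                  // case 1: no colon at all
        || $separator === 0                   // case 2: colon first, so no hour
        || $separator === strlen($end) - 1) { // case 3: colon last, so no minute
        return false;
    }
    return true;
}

var_dump(0 == false);   // bool(true): '==' cannot tell case 2 from case 1
var_dump(0 === false);  // bool(false): '===' also compares the data types

var_dump(isTimeFormat("11:35")); // bool(true)
var_dump(isTimeFormat(":35"));   // bool(false)
```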
Figure 7.14 shows the ten-character string ‘abcdefghij’, with the positions of each
character indicated above them. The last of the ten characters is at position 9. It
should be clear from this that, for any string, the position of the last character is one
less than the length of the string.
When the value of $end is in the format hh:mm, it does not satisfy the condition in
line 80 and so PHP skips to the else block (lines 86–119). The else block must validate
the hour and minute parts of $end separately. Line 89 uses the built-in PHP function
substr() to pull the hour part from the front of $end. The function description is:
string substr(string str, integer start [,integer length])
The substr() function extracts a sequence of characters from a string. The
first parameter is the string itself, the second gives the position of the character
to start from and the (optional) third parameter states how many characters to
extract. The function returns the extracted characters as a string. So, for example,
substr("abcdefghij", 4, 3) returns the string "efg" – e is the character in
position 4 (check this in Figure 7.14) and the extracted string is three characters
long. To see how line 89 extracts the hour, suppose that $end has the value
"11:35". In this case, the value of $separator, set in line 79, is 2 – the colon is
the third character, which is position number 2. Hence, line 89 reads:
$hour=substr("11:35", 0, 2);
It’s clear that $hour is set to the string value "11". When $end has the value
"9:35", $separator is 1 and the colon is the second character. The code in line 89
will once again correctly extract the hour portion of $end.
Manipulating strings in this way is something most web database applications have
to do. It’s important to remember that PHP numbers the character positions from
zero and to carefully test which characters a substr() function actually extracts as
it is easy to be one position out. Tracking down bugs like that can be very tricky.
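A quick way to check which characters are extracted is to run a small test such as the sketch below. The function name splitTime() is hypothetical, and the sketch assumes the colon is located with the built-in function strpos().

```php
<?php
// Hypothetical helper: splits a time in the format hh:mm into hour and minute strings.
function splitTime(string $end): array
{
    $separator = strpos($end, ":");   // position of the colon, counted from zero
    if ($separator === false) {
        throw new InvalidArgumentException("No colon in time value");
    }
    $hour   = substr($end, 0, $separator);   // $separator characters from the front
    $minute = substr($end, $separator + 1);  // everything after the colon
    return [$hour, $minute];
}

print_r(splitTime("11:35"));  // hour "11", minute "35"
print_r(splitTime("9:35"));   // hour "9", minute "35"
echo substr("abcdefghij", 4, 3), "\n"; // efg
```

Using the position of the colon as the length of the hour works because positions are numbered from zero.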
The remainder of the validation is pretty straightforward. The condition in line 90 asks
if the hour is numeric and, in line 95, whether or not it falls in the valid range. Line
101 extracts the minute portion of the end time and subsequent lines validate it.
The condition in line 114 is a bit picky. It means that a time of ‘12:5’ is considered
invalid as it should be ‘12:05’.
One important point about the validation messages for the minutes is that they are
stored in the same element of the array $validationMessage as those for the
hour. This means that any messages about the minutes must be concatenated on to
this element to avoid overwriting messages about the hour.
If the hour passes validation, then, in line 107, the $validationMessage array
does not have an element with the key "end". If the minute fails validation, then
PHP will generate a notice message for line 107, stating that there is no "end"
element in the array $validationMessage (the same is true of lines 112 and 117). To
avoid these notices, use separate "endHour" and "endMinute" elements.
Once the data submitted by users has been validated, the script must decide what to
do next. If any of the data failed validation, then the script should display the form
again, but this time showing the validation messages. Line 122 (Figure 7.15) checks
the value of the Boolean variable $validationFail. If any of the data submitted
failed validation, this will be TRUE and the PHP script will display the HTML form
with appropriate validation messages. Otherwise, it writes the valid support session
data to the database (not shown in Figure 7.15).
Consider the staffNo form control, created by the HTML in line 133. If the data
submitted for this form control failed validation, there will be an element of the
$validationMessage array with key "staffNo". The function call:
array_key_exists("staffNo", $validationMessage)
in line 130 returns TRUE when the $validationMessage array includes an element
with the key "staffNo" and FALSE when it does not. Hence, the validation message in
line 131 is only written to the HTTP response when the data submitted for the staffNo
form control failed validation. This same approach is used to deliver validation messages
for all the form controls. For controls that pass validation, there is no message.
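This check can be wrapped in a small function, as in the sketch below. The function name messageFor(), the span element and its class name are assumptions. array_key_exists() is the built-in function named in the text, and htmlspecialchars() is a precaution added here.

```php
<?php
// Hypothetical helper: returns the HTML for a control's validation message,
// or an empty string when the control passed validation.
function messageFor(string $control, array $validationMessage): string
{
    if (array_key_exists($control, $validationMessage)) {
        // The span element and class name are assumed for illustration.
        return '<span class="error">'
             . htmlspecialchars($validationMessage[$control])
             . '</span>';
    }
    return "";
}

$validationMessage = ["staffNo" => "Please enter your staff number"]; // assumed wording
echo messageFor("staffNo", $validationMessage), "\n"; // writes the message
echo messageFor("day", $validationMessage), "\n";     // writes nothing
```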
Figure 7.16 shows the Add a new support session HTML form with validation
messages. Figure 7.10 (a) shows the original form and it seems clear that this approach to
validation is an effective way to improve the usability of the form. However, there are
a few problems with the PHP script as it stands.
The first problem is that the validation only checks the values submitted for
individual form controls. This is necessary, but often not sufficient as the value of one
control may affect the validity of values for another. For example, if the start time is 13:00
it makes no sense to have an end time of 12:45.
Another problem is that the web page shown in Figure 7.16 is actually what is
delivered by the first call to the script SupportSessionInsert.php. In other words, some
validation messages appear before the user has had a chance to submit any data. This
happens because, on the first call of any PHP script, the $_GET (and $_POST) array
has no elements. Consequently, any control that is tested to see if a value was
submitted will fail this validation. Only the day control is not tested for missing values, so it
is the only control in Figure 7.16 with no validation message.
One way around this is to use the fact that the $_GET (and $_POST) array has no
elements when the script is first called, but will have elements when the user clicks on
the submit button. Even if the controls are all empty when the end user clicks on the
submit button, the $_GET (and $_POST) array will have an element with the
control’s name as the key and the control’s value (possibly empty) as the value. Enclosing
the validation tests within an if statement with the condition:
if (count($_GET)<>0)
ensures that the validation doesn’t happen the first time the script is called.
The third significant problem with the solution as it stands is that, when there are
validation failures, the original data submitted by end users is lost (it is not included in
the form controls when the form is redelivered). This will annoy end users, who will
have to rekey all the data, not just the invalid data. It is easy to solve this problem. For
example, to redisplay the data entered in the staffNo control, use PHP to echo back
the current value of $_GET["staffNo"]:
As it stands, this line will cause PHP to generate a notice the first time the script runs
as there will be no element with the key staffNo. To avoid this, use an if statement
so that the echo statement is only executed if $_GET["staffNo"] has a value.
Similar techniques work with other form controls.
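One way to redisplay the staff number without triggering the notice is sketched below. It is an illustration under stated assumptions, not the book's code: the function name staffNoInput() is hypothetical, and htmlspecialchars() is added so that characters such as double quote marks in the value cannot break the HTML.

```php
<?php
// Hypothetical helper: writes the staffNo text box, refilled with the submitted value.
// $get plays the role of the $_GET superglobal array.
function staffNoInput(array $get): string
{
    $value = "";
    // Only use the element if it exists, so the first call raises no notice.
    if (isset($get["staffNo"]) && is_string($get["staffNo"])) {
        $value = htmlspecialchars($get["staffNo"], ENT_QUOTES);
    }
    return '<input type="text" name="staffNo" value="' . $value . '">';
}

echo staffNoInput([]), "\n";                     // first call: empty text box
echo staffNoInput(["staffNo" => "31213"]), "\n"; // redisplay after a validation failure
```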
The reload problem can affect any web page that processes data gathered from end
users, but is a particular problem when the HTTP GET method is used to send the data
to the web server. When an end user fills out the HTML form in Figure 7.10 (a) with valid
data and clicks on the submit button, the web browser uses the HTML form data to
create an HTTP request. The URL in this HTTP request is:
http://www.pennine.ac.uk/StaffDirectory/SupportSessionInsert.php?staffNo=31213&day=Wednesday&startHour=09&startMinute=00&end=10%3A00
This URL identifies the web server – www.pennine.ac.uk – and the requested
resource – the script file located at StaffDirectory/SupportSessionInsert.php. It also
includes five name-value pairs in the query string to pass data from the HTML form
to the PHP script. The PHP script uses this data to add a row to the SupportSession
table, then generates the ‘success page’ in Figure 7.17 (a). Look at the address
bar on this success page, which has been scrolled to show the query string. The
address bar contains the URL that the web browser used in the HTTP request and it
has all the values entered into the form as name-value pairs in its query string. If
end users click on the refresh button, then the web browser uses the success
page URL to create another HTTP request. This second HTTP request is exactly the
same as the one created when the user clicked on the form’s submit button. The
(b) What users see when they click on the refresh button on the web page in (a).
upshot is that the data is submitted twice and the PHP script tries to add it to the
database for a second time.
In Figure 7.17 (a) the colon character ‘:’ in the end time part of the name-value pair
in the URL has been replaced by its percent-encoded form, ‘%3A’. The web browser
encodes reserved characters in this way in all data submitted from a form, but the HTTP
server translates them back before passing the value to the PHP application server.
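The same encoding can be reproduced in PHP itself. The sketch below uses the built-in functions urlencode(), urldecode() and http_build_query(); the values are those from the URL shown earlier.

```php
<?php
// Encode a single form value the way it appears in a query string.
$encoded = urlencode('10:00');      // the colon becomes %3A
$decoded = urldecode($encoded);     // back to the original value

// Build a complete query string from name-value pairs. The separator is
// given explicitly so that the result does not depend on php.ini settings.
$query = http_build_query([
    'staffNo'     => '31213',
    'day'         => 'Wednesday',
    'startHour'   => '09',
    'startMinute' => '00',
    'end'         => '10:00',
], '', '&');

echo $encoded, "\n", $decoded, "\n", $query, "\n";
```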
Because the SupportSession table has a primary key, the DBMS reports an error.
The PHP code for handling DBMS communication errors comes into play and, this
time, SupportSessionInsert.php generates the web page shown in Figure 7.17 (b). Notice
that the actual SQL insert statement is perfectly valid – it has only been rejected be-
cause there is already a row in the SupportSession table with this primary key and
duplicate primary key values are not allowed. The same problem can occur if users
bookmark the success page. That is because the query string is kept as part of the book-
mark and SupportSessionInsert.php tries to add this data to the database every time
the bookmark is used.
It is worth reflecting on how the approach to gathering data from end users has
evolved. Initially, the HTML form to gather data was part of the static file
SupportSessionInsert.html. The form submitted its data to a PHP script, Version 0.1 of
SupportSessionInsert.php. This script both validated the data, reporting any validation
failures, and generated a success page. Its success page was also vulnerable to the
reload problem as the form data was submitted using the HTTP GET method. The
combined script developed in the previous section could generate the HTML form to
gather the data, report validation failures on the HTML form itself and generate a
success page. Again, the success page was vulnerable to the reload problem. In fact,
the reload problem arises because the same script that receives the data also generates
the success page. This suggests an obvious solution: have one script to receive and val-
idate the data and do the database modification, but use a second script to generate
the success page. As this second script does not perform the database modification,
users can click on refresh as often as they like and nothing untoward will happen.
Notice that the web page in Figure 7.17 (b), which shows the DBMS error message,
will also suffer from the reload problem. Each time the user hits the refresh button,
SupportSessionInsert.php tries to add the same new support session to the database
and, each time, the DBMS reports the same error. The simplest solution is to have the
script that delivers the success page report any DBMS communication errors, too.
The HTTP POST method is less susceptible to the reload problem as, if users refresh
the form, the web browser will warn them that they are about to resend data that has
already been sent. Also, a bookmark would not include the data, so only refresh
causes the problem. However, using the HTTP POST method is not a complete solu-
tion to the reload problem, so it is better to create a separate success page.
In a traditional GUI application, this sort of thing is simple. The code that did the
database modification would simply load up the form that delivered the success mes-
sage (or reported the DBMS communication error). This is not possible on the Web. In
a web application, the web browser requests a particular resource from the web server.
The web server responds to this request by supplying the requested resource. The web
server cannot respond by supplying a different resource from the one requested. How-
ever, the web server can ask the web browser to request a different resource. It does this
by sending a location header in the HTTP response. Location headers are used quite a
bit when a website is reorganized and important resources (such as a home page) are
moved or renamed. Any end user with a bookmark to the resource will be using the
old URL to request it. The web server recognizes the old URL and uses a location
header in the HTTP response to tell the web browser where the requested resource has
moved to. It is up to the web browser to send a second HTTP request, this time request-
ing the resource from its new location.
The structure of the header section, and, in fact, each HTTP header, is shown in Fig-
ure 7.18 (see Chapter 2 for a discussion of the structure of HTTP messages). The header
name is followed by a colon and a space, then the header value. A typical location
header looks like this:
Location: SupportSessionInsertSuccess.html
[Figure 7.18: the header section. Each header is written as ‘Header name: Header value CRLF’, and a CRLF on its own ends the section.]
It begins with the header name ‘Location’, then gives a URL – in this case, a relative
URL. Location headers are usually combined with a response line of the form:
HTTP/1.1 301 Moved permanently
The status of 301 tells the web browser that the requested resource has been perma-
nently moved to a new location. The web browser will expect to find a location header
giving the URL for this new location.
As discussed in Chapter 4, PHP writes the output of echo and print statements
and any data outside of a <?php ... ?> tag to the entity body. To allow the appli-
cation developer to write data to the header section, PHP provides the header()
function. The description of this function is:
void header ( string header [, bool replace [, int http_response_code]] )
The return type void means that the function does not actually return a value (so,
strictly speaking, isn’t a function, but we’ll let that pass). The only required parameter
is header, which contains the actual HTTP header text. The parameter replace in-
dicates whether or not this header replaces any previously written headers of this type
or adds to them. http_response_code is the status code to write in the HTTP re-
sponse line. For location headers, the header() function will automatically write an
HTTP response line with status code 302, meaning that the resource has been moved
temporarily. This is the best status to use when redirecting the web browser to a suc-
cess (or DBMS error) page, so it’s best to omit http_response_code when the
header is a location header.
Figure 7.19 shows a first attempt at solving the reload problem using the header()
function. Only step 4 of the process is affected (lines 225–237). In earlier versions of the
PHP script, step 4 either confirmed that the database modification had succeeded or re-
ported a DBMS communication error. In Version 1.1, step 4 writes location headers to
the HTTP response, asking the web browser to request a resource from a different loca-
tion. If the database modification succeeds, the location header points to a success
page, but, if not, it points to a page that reports DBMS communication errors.
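Step 4 might be sketched as follows. This is not the code from Figure 7.19; the function names redirectTarget() and redirectTo() are assumptions made for illustration, while the two file names are those used in this section.

```php
<?php
// Decide where to send the web browser once the insert has been attempted.
// redirectTarget() is a hypothetical helper.
function redirectTarget(bool $insertSucceeded): string
{
    return $insertSucceeded
        ? 'SupportSessionInsertSuccess.html'
        : 'ReportDbmsError.php';
}

// Write the location header and stop, so that nothing is written to the
// entity body of this response. redirectTo() is a hypothetical helper.
function redirectTo(string $url): void
{
    header('Location: ' . $url);   // PHP supplies status 302 for location headers
    exit;
}
```

A script would call redirectTo(redirectTarget($ok)), where $ok records whether or not the DBMS accepted the SQL statement. Stopping the script straight after the header is written means that no further processing takes place while the web browser makes its second request.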
Figure 7.19 Using the location header to redirect a web browser to a different PHP script.
The success page is a static HTML page that informs end users that their database
modification has succeeded. It is a lot less informative than previous success messages,
which also told them what data had been added. However, line 229 shows how easy it is
to use the header() function. Reporting a DBMS communication error is a little more
complex. Lines 235–236 use the functions mysql_errno() and mysql_error() to
create a URL for the location header that includes a query string. When there is a DBMS
communication error, the web browser is sent a location header like:
Location: ReportDbmsError.php?errno="1062"&error="Duplicate
entry '1-Monday-17:00:00' for key 1"
This location header gives a relative URL, so the web browser assumes that the web
resource ReportDbmsError.php is on the same path as the original resource. The
web browser uses this to create a URL for a new HTTP request and includes the query
string. This ensures that the MySQL error number and error description are passed to
ReportDbmsError.php.
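The error description contains spaces, quotes and colons, none of which can appear in a URL as they stand. One way to deal with this, sketched below, is to let http_build_query() percent-encode the values. The function name dbmsErrorUrl() is an assumption made for illustration.

```php
<?php
// Build the URL for the error page, percent-encoding the error number and
// error description. dbmsErrorUrl() is a hypothetical helper.
function dbmsErrorUrl(int $errno, string $error): string
{
    return 'ReportDbmsError.php?' . http_build_query(
        ['errno' => $errno, 'error' => $error],
        '',
        '&'
    );
}

echo dbmsErrorUrl(1062, "Duplicate entry '1-Monday-17:00:00' for key 1"), "\n";
```

The receiving script does not need to decode anything, as PHP decodes the values before placing them in the $_GET superglobal array.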
Be careful when passing string data to the header() function. If the header string
includes new line characters, then PHP will generate a warning as HTTP headers
cannot include new line characters.
This technique could also be used to pass data to a more sophisticated success page.
However, there is a problem with that. Although the HTTP protocol doesn’t set a limit
on the length of a URL, many web browsers do, though it isn’t easy to find out what that
length is. Rather than use the query string to pass all the data entered by users to the suc-
cess page, it is better to pass just the primary key values. The success page can then query
the database to retrieve the record that was just added and display it to the end user.
Manually writing HTTP headers can cause a very annoying problem. When PHP cre-
ates the HTTP response message, it needs to write the HTTP headers before it writes
anything to the entity body, for obvious reasons. So, when a PHP script writes some-
thing to the entity body – either because there is HTML code outside the <?php ... ?>
tag or because of an echo or print statement – the PHP application server assumes
that there are no more headers to come. It writes out a response line and any headers
it thinks are required, then starts writing the entity body, line by line. If there is a sub-
sequent call to the header() function, PHP generates a warning message:
Warning: Cannot modify header information - headers already
sent by (output started at SupportSessionInsert.php:226) in
SupportSessionInsert.php on line 237
The key information in this message is that, because the PHP application server has
already written headers to the HTTP response message, it cannot write any more. It
also indicates at which line it first wrote the headers – line 226 in this message – and
which line there was a subsequent attempt to write headers – here, line 237.
The simplest solution to this problem is to avoid writing anything to the HTTP re-
sponse before using the header() function. Careful use of if statements can achieve
this. However, even a single blank line before the <?php ... ?> tag can lead to
headers being sent. PHP provides special output buffering functions to allow greater
control over when the PHP application server writes data to the HTTP response mes-
sage. Output buffering is an advanced feature of PHP programming, so is not covered
in this book.
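For readers who want a taste of it, the following is a minimal sketch using the built-in functions ob_start() and ob_get_clean(). It shows only the basic idea, which is that buffered output is held back rather than written to the HTTP response.

```php
<?php
ob_start();    // from here on, output is held in a buffer
echo "<p>Adding the support session...</p>\n";

// Nothing has reached the HTTP response yet, so a header could still be
// written at this point, for example:
// header('Location: SupportSessionInsertSuccess.html');

$body = ob_get_clean();    // collect the buffered text and stop buffering
```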
Adding data to a single table is a relatively simple process. Some web database applica-
tions need to gather quite complex data from end users and use this data to modify a
number of different tables. The simplest solution is to use a long web page, with lots of
form controls. This is not very user-friendly and may lead to more validation failures
and, hence, users becoming frustrated.
The alternative is to have a number of related web pages, each gathering a small
portion of the data required. When users have completed one page, they submit the
data and move on to the next web page, which gathers the next portion of the data.
Once all the data has been submitted, the application can modify the database. Appli-
cations that gently guide end users through a complex process in this way are often
called wizards. Wizards are sometimes used for user registration and often used for
opening Internet bank accounts or similar financial transactions.
At the Pennine University, there is a wizard to help students enrol on their chosen
course at the start of the year. The first step asks end users to enter their student ID
(Figure 7.20 (a)). This is passed to the next web page (Figure 7.20 (b)), which uses it to
retrieve their contact details from the database. The contact details are delivered as an
HTML form, so users can modify them. When users submit the form data, a third wiz-
ard page (Figure 7.20 (c)) shows the modules students will take in the coming year. If
the course includes optional modules, then students can choose which options they
wish to take. When users submit their module choices, all the data gathered from
these three web pages is used to modify the database instance.
Strictly speaking, modifying contact details and module choices could be done inde-
pendently, but the enrolment process provides a good example of a wizard in a web
database application.
The main problem with wizards is ensuring that data gathered on one wizard page
is available to all subsequent wizard pages. The self-enrolment wizard illustrates why.
Changes made to the contact details are submitted to the third wizard page, but this
page does not write these changes to the database. Instead, it simply gathers more data
from end users – their module choices. It must then pass all the data – module choices
and contact details – to a final script and it is that which performs all the database
modifications.
This is the old problem of maintaining state across multiple stateless HTTP interac-
tions. One solution has already been discussed: URL rewriting. URL rewriting is effec-
tive for keeping track of which part of a large result set should be displayed (Section 7.2)
or for passing data to an error-reporting web page (Section 7.4). It is less effective for
wizards, however, as there is usually far too much data.
Another problem is that data sent via the URL query string is not secure. The best so-
lution is to store the data in the database at each stage and have subsequent wizard
pages retrieve it as required. All that needs to be passed from wizard page to wizard page
is some form of unique identifier. Sessions provide probably the best way to implement
this approach and they are discussed in the next section. One problem with this
approach is that repeatedly retrieving the data from the database can degrade perfor-
mance. If there isn’t too much data (just too much for a URL) and security isn’t a major
problem, then hidden form controls offer an alternative mechanism to sessions and
URL rewriting for maintaining state.
Hidden form controls are just that – controls on an HTML form that have a value,
but are hidden from the end users. They are defined using the HTML input element.
For example:
<input type="hidden" name="studentNo" value="2222222222" />
defines an input control of type hidden (that is, a hidden form control). Its name is
studentNo and its current value is 2222222222. When the web browser displays
the web page with this input element, there is nothing visible on the form. However,
the control is part of the form, so, when end users submit the form data, the web
browser also submits the data stored in the hidden form control. A PHP script can
rewrite the value of a hidden form control just as easily as it can rewrite the query
string of a URL and this is how a wizard can use hidden form controls to pass data
from page to page.
The self-enrolment wizard queries four database tables. The Student table holds
data about each student – their names and contact details, the courses they are study-
ing and their year of study. The Course and Module tables hold data on courses and
modules respectively. Finally, the table CourseModule holds data about which mod-
ule is used on which course and whether it is a core or optional module on that course.
The CourseModule table is known as a link table.
Link tables allow a relational database to deal with situations where many rows of one
table are linked to many rows of another. This is the case with the Course and Module
tables as one course will have many modules, but each module may be taken on many
courses. (Link tables are discussed in detail in Chapter 9.) The table schema are:
Figure 7.21 The HTML code for the first page of the self-enrolment wizard.
The next task is to retrieve students’ contact details, together with which courses they
are on and which stage (year of study). All this information is held in the Student
table, so the SQL select statement in lines 63–66 in Figure 7.22 does the job.
The studentNo column is the primary key for the Student table. Because the select
statement uses an equality match on this primary key, there will only be one row in the
result set (this first draft of the script does not attempt to deal with an incorrect student
ID number). This simplifies the script as there is no need to loop through the result set.
Line 75 in Figure 7.23 fetches the single row from the result set into the associative
array $row. After a couple of lines that write information, line 81 starts the form. Line
82 writes out the student ID number – $row["studentNo"] – as plain text (end users
cannot change this value on this web page). Line 83 writes the student ID number out
again, but this time as the value of a hidden form control.
Figure 7.22 Retrieving the data for the second wizard page.
Figure 7.23 Writing out the form to allow end users to amend their contact details.
When end users click on the submit button, the web browser sends the name and
value of this hidden form control to the third wizard page. Without this hidden form
control, the third wizard page would not know the value of the student ID number en-
tered on the first wizard page. Lines 84 and 85 also use hidden form controls to pass
the course code and year of study to the third wizard page.
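The three hidden form controls could be written by a small helper such as the one sketched below. This is not the code from Figure 7.23. The function name hiddenControls() is an assumption made for illustration and the stage value of 1 is invented for the example.

```php
<?php
// Write one hidden form control for each name-value pair.
// hiddenControls() is a hypothetical helper.
function hiddenControls(array $values): string
{
    $html = '';
    foreach ($values as $name => $value) {
        $html .= '<input type="hidden" name="'
            . htmlspecialchars((string) $name, ENT_QUOTES)
            . '" value="'
            . htmlspecialchars((string) $value, ENT_QUOTES)
            . '" />' . "\n";
    }
    return $html;
}

// Example data, standing in for the row fetched from the Student table.
$row = ['studentNo' => '2222222222', 'course' => 'B2Y8', 'stage' => '1'];
echo hiddenControls($row);
```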
It is worth noting that all this data is enclosed in an HTML paragraph element,
starting in line 82 and ending in line 86. The form element cannot appear inside a
paragraph element, but paragraph elements can appear inside the form element. (This
is one of the many nesting rules of the XHTML 1.1 standard.) The elements that define
the form controls must appear inside a paragraph element (or some similar element).
So, the standard approach to creating a form is that the form element contains a para-
graph element, which contains form control elements.
Similar nesting rules apply to HTML tables, so are applied in lines 88–93. These
write out an HTML table with four row elements – <tr> . . . </tr> – each of
which has two table data elements – <td> . . . </td>. Table data elements define
the columns for each row in an HTML table. (There are also table header elements –
<th> . . . </th> – that define the column headings. These are used on the wizard
page.)
The HTML table is used to improve the presentation of the student contact details.
The first column contains a label for a data item. The second column contains the cur-
rent value of that data item in a text box control (note that there is no need for a para-
graph element here as an input element can appear inside a table data element as well
as inside a paragraph element). For example, in line 89, the label is ‘Given name’ and
the text box contains the value $row["fName"].
Using tables is a standard way to achieve a tidier web page. Note how the first four
text boxes in Figure 7.20 (b) line up nicely. A second HTML table makes the seven text
boxes used for the address line up nicely, too.
An HTML table is a structure within the HTML document. It is nothing to do with the
database.
The address has been split into seven separate data items. There is a reason for this.
Many applications now use special software to retrieve the correct address based on
the house name, or number, and postcode. It is easier to get end users to split the ad-
dress up than to try and use the string manipulation functions to extract the house
name or number and the postcode from one long address value. It also facilitates data
mining to find out, for example, how many students come from EU countries. If the
country is buried inside a single string value for the address, such questions cannot
easily be answered.
The third wizard page – EnrolmentWizardStep3.php – receives 14 name-value pairs
via the HTTP POST method – 3 from hidden form controls and 11 from visible con-
trols. It first validates the data, ensuring that the required data has been submitted.
Then the values of $_POST["course"] and $_POST["stage"] (from hidden
form controls) are used to retrieve a list of core and optional modules. Figure 7.20 (c)
shows how these are delivered. The core modules are shown as text and the optional
modules in an HTML select element.
The script structure for this page follows the usual six steps to query a database from a
server-side script. Figure 7.24 shows the SQL select statement used in step 3. It is a join
query, joining the tables Module and CourseModule. The first condition in the
where clause is the join condition. The second condition uses the value of
$_POST["course"], copied to the local variable $course. This value identifies the
course being taken by the student. So, this condition ensures that only modules taken
on that course are retrieved. The third condition uses the same approach to ensure
that only modules relevant to the student’s current year of study are retrieved. The
order by clause gets the DBMS to order the rows so that the core modules (where the
value of type is 'Core') come before the optional modules (type is 'Option')
and that core and optional modules are then ordered alphabetically by title.
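A query of this shape is sketched below. It is not the code from Figure 7.24. The column names code, title, module, course and stage are assumptions (only the type column is named in the text). The sketch also swaps in PDO with a prepared statement in place of the mysql_ functions used elsewhere in this chapter, as those functions were removed in PHP 7.

```php
<?php
// Retrieve the modules for one course and one year of study, with core
// modules first and each group in alphabetical order by title.
// fetchModules() and all column names except type are hypothetical.
function fetchModules(PDO $db, string $course, string $stage): array
{
    $sql = 'SELECT Module.code, Module.title, CourseModule.type
              FROM Module, CourseModule
             WHERE Module.code = CourseModule.module
               AND CourseModule.course = :course
               AND CourseModule.stage = :stage
             ORDER BY CourseModule.type, Module.title';
    $statement = $db->prepare($sql);
    $statement->execute([':course' => $course, ':stage' => $stage]);
    return $statement->fetchAll(PDO::FETCH_ASSOC);
}
```

Ordering by type puts the core modules first simply because 'Core' comes before 'Option' alphabetically.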
In Figure 7.20 (c), the course is identified only by its code, passed from the previous
wizard page. It would be better to retrieve the full course details from the database
and deliver these as text. The student should not be able to modify course details
during enrolment. This means running two separate SQL queries in the one script,
which is easy enough – simply repeat steps 3, 4 and 5 for each database query in
turn.
Once the SQL statement has been run successfully, the script sets up the HTML form,
creating hidden form controls for each of the 14 items of data that were passed to it.
This ensures that the data is passed on to the next wizard page. It then writes the core
modules to the web page, using an HTML table to line up the data, and sets up the op-
tional modules as items in a list that end users can select from.
Figure 7.25 shows the PHP code to achieve both of these tasks. Lines 125 and 126
set up the table to display the core modules. Notice that the first row in the table, in
line 125, has columns defined using the table heading element – <th> ... </th> –
not the table data element.
Line 129 fetches the first row of the result set and line 130 starts a while loop to
loop through the rows one at a time. The condition on this while loop has two
Boolean expressions. The first – $row <> FALSE – tests that there is a row to process.
Figure 7.25 Delivering the module data (core modules are listed as text, optional modules as choices in a list box).
The second – $row['type']=='Core' – tests whether the value of the type col-
umn for this row is 'Core'. When both Boolean expressions are TRUE, PHP executes
the echo statement in line 133. This writes out the HTML to create a new row in the
table, with the module’s code and title in the table data elements. Line 134 fetches the
next row and PHP loops back to line 130. This process continues until the value for
the type column in the current row is no longer 'Core' (which means that it will be
'Option'). When this happens, all the core modules must have been dealt with be-
cause the order by clause in line 92 (Figure 7.24) ensures that core modules appear first
in the result set.
As all core modules will now have been dealt with, the script writes the end table
tag (line 136) and moves on to deal with the optional modules. End users need to be
able to choose which options to take, so they are displayed as HTML option elements
within an HTML select element. Line 143 writes the select element’s start tag. On the
web page generated by this script, the start tag will be:
<select name="module[]" multiple="multiple" size="5">
The size attribute tells the web browser how many list items to display at one time. If
there are more than five, it will automatically add a scroll bar to the list. The multiple
attribute allows users to select more than one item from the list. In older versions of
HTML, it was sufficient just to give the attribute name, but, in XHTML, all attributes
must have a value, hence the rather clumsy name-value pair multiple="multiple".
The name attribute is the interesting one. Because the select element allows multiple
selections, it needs to pass multiple data values to the next PHP script. This PHP script –
for the fourth wizard page – expects the form controls to come via the $_POST super-
global array. As superglobals are associative arrays, each element has a key and a value.
The key of the $_POST array element is the name of the HTML form control element.
The value of the $_POST array element is the value of the HTML form control element
and, by default, this is a string. If the name attribute is defined as name="module",
then the value of $_POST["module"] is a string that holds information on only one
of the selected items. Adding the square brackets to the HTML select element’s name
attribute informs PHP that the value of $_POST["module"] is an array, not a string.
Because PHP now expects the value of $_POST["module"] to be an array, it correctly
adds every item chosen by end users to this array.
This is quite a complex situation – $_POST is an array. Most of its elements are
strings, but the element $_POST["module"] is itself an array. The elements of the
array $_POST["module"] are strings – the list items chosen by end users. Figure 7.26
presents a picture of the $_POST superglobal array. The key-value pairs are as they
would be when end users click on the ‘Confirm’ button shown in Figure 7.20 (c) and
the fourth wizard page is called. The first 14 elements – $_POST["studentNo"],
$_POST["course"], . . . , $_POST["postcode"], all have character string val-
ues. The fifteenth – $_POST["module"] – has an indexed array as its value. In the
image shown, the indexed array has two elements and indicates that this end user se-
lected both optional modules shown in the list in Figure 7.20 (c).
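The effect of the square brackets can be seen without a web form by using the built-in function parse_str(), which parses a string as if it were a query string. In the sketch below the module codes CS101 and CS102 are invented for the example.

```php
<?php
// Two selected options submitted under a name that ends in [].
parse_str('studentNo=2222222222&module[]=CS101&module[]=CS102', $withBrackets);

// The same two selections submitted under a plain name.
parse_str('studentNo=2222222222&module=CS101&module=CS102', $withoutBrackets);

print_r($withBrackets);      // module is an indexed array holding both codes
print_r($withoutBrackets);   // module is a string holding only the last code
```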
Returning to the code in Figure 7.25, line 144 starts another while loop. This while
loop picks up where the previous one left off. At this point, $row must hold the row of
data for the first optional module (the previous while loop dealt with all the core mod-
ules). Because of the order by clause (line 92, Figure 7.24), the rest of the rows in the
result set will also be optional modules (‘Core’ and ‘Option’ are the only possible values
of the column CourseModule.type). So, this while loop simply runs through the
remaining rows of the result set, writing each out as an HTML option element.
Figure 7.26 A picture of the $_POST array, showing its key-value pairs for the web page in Figure 7.20 (c).
Key Value
studentNo 2222222222
course B2Y8
...
Lines 150–153 (Figure 7.25) close the select element, write out a submit button and
close the form element. Line 155 closes the connection to the DBMS.
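The two while loops can be sketched as follows. This is not the code from Figure 7.25. To keep the example self-contained, the rows come from an ordinary PHP array rather than a DBMS result set, and the function name renderModules() and the keys code and title are assumptions made for illustration.

```php
<?php
// Write the core modules as rows of an HTML table, then the optional
// modules as items in a list box. The rows must already be ordered so that
// core modules come first. renderModules() is a hypothetical helper.
function renderModules(array $rows): string
{
    $html = "<table>\n<tr><th>Code</th><th>Title</th></tr>\n";

    $row = array_shift($rows);    // first row, or null if there are none
    while ($row !== null && $row['type'] === 'Core') {
        $html .= '<tr><td>' . htmlspecialchars($row['code']) . '</td><td>'
            . htmlspecialchars($row['title']) . "</td></tr>\n";
        $row = array_shift($rows);    // fetch the next row
    }
    $html .= "</table>\n";

    // $row now holds the first optional module, if there is one.
    $html .= '<select name="module[]" multiple="multiple" size="5">' . "\n";
    while ($row !== null) {
        $html .= '<option value="' . htmlspecialchars($row['code']) . '">'
            . htmlspecialchars($row['title']) . "</option>\n";
        $row = array_shift($rows);
    }
    $html .= "</select>\n";

    return $html;
}
```

The second loop needs no test on the type column because it simply carries on from the row at which the first loop stopped.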
The final wizard page (not shown) receives all the data gathered from each end user
and writes it to the database. This means updating the Student table with the new
contact details and updating the timetable database with the list of optional modules
chosen by the student.
This example has demonstrated how to use hidden form controls to pass data along
a chain of PHP scripts, ensuring that data gathered on the web page generated by the
first PHP script is available to all subsequent scripts. Without hidden form controls,
the data would only be available to the next script in the chain.
There are some problems with the self-enrolment wizard as it is currently coded.
One significant improvement would be to write the contact details to the database on
the third wizard page before retrieving the module details. From a business perspec-
tive, changing contact details and choosing optional modules are both part of the
same process – enrolment. From the database perspective, they are quite independent
actions, so there is no need to insist that both changes are made at the same time. This
would mean that only the student ID number, course code and year of study would
need to be passed along using a hidden form control, simplifying the code.
It’s also possible to generate all the wizard pages from a single PHP script. This would
make the validation more user-friendly. At the moment, any validation failures mean
going back to the first wizard page and starting the self-enrolment process from scratch.
[Figure: the session life cycle. When processing is requested with no session ID, a session is initialized; when processing is requested with a session ID, the session data is retrieved.]
continues until the end user requests that the session ends or until there has been no
HTTP request with this session ID for a fixed length of time. Once ended, the applica-
tion server destroys the session ID and session variables and gets the web browser to
destroy its copy of the session ID.
Sessions can be a much simpler mechanism for the application developer to use
than URL rewriting and hidden form fields. There is no need for them to write code to
embed complex URL query strings or hidden form controls into the web page. The ap-
plication server itself maintains the application state, and the HTTP server and web
browser automatically send the session ID to each other in their HTTP messages. Even
if many different end users request the same script at the same time, the web server
just creates an instance of the script for each end user and ensures that each instance
receives the correct application state for that user. All the application developer need
do is create and use session variables, knowing that the right ones will be available
each time the script is called by a particular end user.
As a mechanism, sessions are also more secure than URL rewriting and hidden form
fields because, as application state data never leaves the application server, it cannot
be intercepted. Thus, common uses of sessions include storing shopping basket
data, and managing user logins (though HTTP authentication provides an alternative
login mechanism).
When sessions are used to personalize websites, the session variables are usually
stored in a database, so data values can persist even after a particular session closes.
Once a user logs in, he or she can see information about past interactions with the
application and perhaps suggestions as to future actions, such as buying certain products
or signing up for a newsletter.
PHP makes using sessions very easy. Any script that needs to access session data
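should begin by calling the built-in PHP function session_start(). In the PHP 5 releases described here, the function
has no parameters and always returns true (from PHP 5.3 it returns false if the session cannot be
started and from PHP 7 it accepts an optional array of options). When a script calls session_start(),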
the PHP application server checks to see whether or not the HTTP request message came
with a session ID. If it did not, then it initializes a new session by creating a unique ses-
sion ID. This session ID is sent to the web browser in the HTTP response, typically as
a cookie (see note below), though, as some end users block all cookies, it can also be
sent using URL rewriting (see PHP Documentation Group (2005) for details). Once
session_start() has initialized a session, the script can add session variables. In
PHP version 5, the session variables are all elements of the $_SESSION superglobal
array. To create session variables, simply add an element, key and value, to this array.
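A minimal sketch of creating and using session variables follows. The function name countRequest() and the key requests are assumptions made for illustration. The function takes the array by reference so that it can be tried out with an ordinary array as well as with $_SESSION.

```php
<?php
// Record one more page request in the session data and return the new total.
// countRequest() is a hypothetical helper.
function countRequest(array &$session): int
{
    if (!isset($session['requests'])) {
        $session['requests'] = 0;    // first request: create the session variable
    }
    $session['requests']++;
    return $session['requests'];
}

// Only start a session when the script runs under a web server.
if (PHP_SAPI !== 'cli') {
    session_start();                          // call before any output is written
    $_SESSION['studentNo'] = '2222222222';    // create (or overwrite) a session variable
    echo 'Requests in this session: ', countRequest($_SESSION), "\n";
}
```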
When PHP receives an HTTP request that does include a session ID, session_
start() behaves differently. Instead of initializing a new session, it retrieves any
existing session variables from the session data store and makes them available via the
$_SESSION superglobal array. The script can add new session variables, change the
value of existing session variables or remove them. The session effectively ends
when the user closes the browser, because the browser then discards the cookie that
holds the session ID. To end a session manually takes some doing,
though. First, the $_SESSION superglobal array needs to be cleared out, then the web
browser needs to be told to remove the cookie from the client computer, then finally
the built-in PHP function session_destroy() removes all trace of the session
from the web server. The PHP user manual (PHP Documentation Group, 2005) has
some useful user-contributed comments on how to end sessions.
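The behaviour described above can be sketched in a few lines. This is a minimal illustration, not code from the Library Catalogue: the function count_visit() and the session variable name 'visits' are assumptions made for the example. The function takes the array by reference so that it can be tried with an ordinary array as well as with $_SESSION.

```php
<?php
// Count page views for the current visitor.
// $session is passed by reference, so the same function works with a plain
// array (for experimenting) and with the real $_SESSION superglobal.
function count_visit(array &$session)
{
    if (!isset($session['visits'])) {
        $session['visits'] = 0;   // first request in this session
    }
    $session['visits']++;         // changing an element is all it takes
    return $session['visits'];
}

// Under a web server: start (or re-establish) the session before using it.
if (PHP_SAPI !== 'cli') {
    session_start();
    echo 'Pages viewed in this session: ' . count_visit($_SESSION);
}
```

On the first request session_start() creates the session, so the counter starts at 1. On later requests the stored value is retrieved and incremented.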
A cookie is a set of data stored by a web browser on its client computer. Each cookie
is associated with a particular Internet domain. Cookies set by the Pennine University’s
website, for example, could be associated with the domain ‘www.pennine.ac.uk’
or with ‘pennine.ac.uk’. (The second is broader as it includes the mail server
and other Internet-accessible servers.) The web server specifies which domain a
cookie is associated with, along with other cookie information, and sends this infor-
mation to the web browser using special HTTP headers.
When the web browser creates an HTTP request for a resource in a given domain
(because the end user clicked on a hyperlink, for example), it automatically includes
in the HTTP request any cookie data for cookies associated with that domain. Again,
the cookie data is sent using HTTP headers.
Persistent cookies are stored in a file on the computer’s hard disk and persist until
their expiry date, when the web browser will remove them. Session cookies are
removed when the web browser closes.
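To make the difference between session and persistent cookies concrete, the sketch below builds the HTTP header a web server might send to set a cookie. It is an illustration only: the function set_cookie_header() and the cookie name 'theme' are hypothetical, and in a real script PHP's built-in setcookie() function writes this header for you.

```php
<?php
// Build a Set-Cookie response header.
// $expires is a Unix timestamp; null means a session cookie.
function set_cookie_header($name, $value, $domain, $expires = null)
{
    $header = 'Set-Cookie: ' . $name . '=' . rawurlencode($value)
            . '; Domain=' . $domain . '; Path=/';
    if ($expires !== null) {
        // An expiry date is what makes a cookie persistent.
        $header .= '; Expires=' . gmdate('D, d M Y H:i:s', $expires) . ' GMT';
    }
    return $header;
}

// A session cookie, then a persistent cookie that lasts one day.
echo set_cookie_header('theme', 'dark', 'pennine.ac.uk'), "\n";
echo set_cookie_header('theme', 'dark', 'pennine.ac.uk', time() + 86400), "\n";
```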
IDWA_C07.QXD 10/5/07 11:24 AM Page 322
A nice, simple example of a web database application that can only really be imple-
mented using sessions is the ‘My list’ feature of the Pennine University’s Library Cata-
logue. The library has a web-based catalogue, allowing people to search for books they
want to take out on loan. Figure 7.28 (a) shows the result of searching the catalogue with
the title keyword ‘database’ (generated by the TitleSearch.php script). The search results
give the title of the book and the class mark, which is enough to find it on the shelves.
Notice that this search application uses the same multiple page mechanism as the
Staff Directory search application discussed above. In fact, the code for the two
search applications is almost identical. It is quite common that two applications are
similar enough to share a lot of code, but different enough to be separate applica-
tions. Good application developers turn the common code into functions, such as the
cleanse_data() function discussed in Chapter 4, and reuse it whenever they can.
This reduces the likelihood of faults as many of the functions used in an application
will already have been used, and so tested, in existing systems.
Object-orientated software engineering takes this approach much further and is ar-
guably a more effective approach to application development than procedural pro-
gramming with functions. Those who are familiar with the object-orientated approach
could recast some of the example web database applications into object-orientated
ones as an exercise as PHP 5 supports this approach.
Figure 7.28 (a) The search results page, with URL for ‘Details’ hyperlink in status bar.
Figure 7.28 (b) The book detail page, with URL for ‘View my list’ hyperlink in status bar.
This example uses the ISBN as the primary key for the Book table. Obviously this
isn’t realistic. A library will hold many copies of a book and each copy has the same
ISBN. However, for the purposes of this example, ISBN will do as a primary key for
the Book table as two different books will have different ISBNs and we can assume
that the borrower won’t care which copy of any particular book is picked.
Look closely at the list of authors on the Book detail web page in Figure 7.28 (b).
There is a superfluous comma after the second author’s name. This appears because
each author is listed on a separate line in the result set for the database query. The
BookDetail.php script writes out the authors one at a time and doesn’t know how
many there are altogether. Adding a comma after each name sets the line up for the
next author. However, the last author in the list always has a superfluous comma after
their name. The script should really trim this off.
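One way to avoid the superfluous comma is to collect the names first and join them afterwards. The sketch below assumes the author names have already been read from the result set into an array. The function format_authors() and the sample names are hypothetical, not part of BookDetail.php.

```php
<?php
// Join author names with ", " so that no separator follows the last name.
function format_authors(array $authors)
{
    return implode(', ', $authors);
}

// Hypothetical names, standing in for the rows of the authors result set.
$authors = array('A. Author', 'B. Writer');
echo format_authors($authors);
```

Because implode() only places the separator between elements, the script no longer needs to know how many authors there are.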
Figure 7.29 Adding a book to the booklist using Version 0.1 of List.php.
So far, the application has not needed to maintain data about users’ choices on the
web server. However, the librarians were always being asked for pens and paper, so that
people could write down details of books they wanted. This led to lots of lost pens and
wasted time. To help out, the ‘My list’ feature was added to the catalogue. The ‘Add
this book to my list’ hyperlink calls the List.php script, passing it details of the book
being displayed. The effect of this is to add the book’s ISBN, class mark and title to the
session data store and display an end user’s current book list (Figure 7.29). End users
can do further searches and add other books to their lists. When they have all the
books they want on their lists, they can print them out and carry the printout with
them as they search the shelves for these books. Because each book list can be as long
as end users want, it is impractical to use URL rewriting or hidden form controls to main-
tain this data. Storing the data in the session data store is clearly the best solution.
Figure 7.30 shows the code for Version 0.1 of List.php. The script calls
session_start() in line 22. This establishes, or re-establishes, the session. Any
data stored in the session store can be accessed via the $_SESSION superglobal array
and new data can be added to this array simply by adding a new element.
From line 25 to line 33 there is the usual start to a web page at the Pennine University.
Line 35 tests whether or not any data was passed to the script via the HTTP GET method.
(This is rather weak data validation and should be improved before the script is deployed.)
If there is some data, then it will be details of a book to add to the list – specifically,
the ISBN, class mark (the library catalogue number) and title. Lines 38–40 add this
data to the $_SESSION superglobal array as a new element. First, line 38 cleanses the
ISBN and stores it in the variable $key, then, in line 39, the ISBN becomes the key to
a new element in the $_SESSION array. The value of this new element is formed by
concatenating the (cleansed) values of the book’s title and class mark. This is all that
needs to be done to add the data to the session data store. PHP will ensure that the
data is available when this session is re-established. Note that if the same book is
added for a second time, then there is already an element of $_SESSION with its ISBN
as the element key. In this situation, what happens in line 39 is that the value of the
existing element is overwritten (with exactly the same information, so nothing actually
changes). This means that no book ever appears twice in the book list. This rather elegant
feature avoids the script having to check $_SESSION for identical data each time a
book is added to the list.
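The overwrite behaviour is easy to try out. The following is a sketch under stated assumptions rather than a copy of List.php: the function add_book(), the GET field names and the way title and class mark are combined are all hypothetical. It also differs from the script in one respect, keeping the list in a single 'booklist' element of $_SESSION rather than at the top level. PHP turns array keys that look like decimal integers (an ISBN with no leading zero, say) into integers, and with PHP's default settings integer keys at the top level of $_SESSION are not saved, so nesting the list is the safer choice.

```php
<?php
// Add a book to a list keyed by ISBN. Adding the same ISBN again simply
// overwrites the existing element, so no book can appear twice.
function add_book(array &$list, $isbn, $title, $classMark)
{
    $list[$isbn] = $title . ' (' . $classMark . ')';
}

if (PHP_SAPI !== 'cli') {
    session_start();
    if (!isset($_SESSION['booklist'])) {
        $_SESSION['booklist'] = array();
    }
    // Hypothetical field names; real input must be cleansed first.
    if (isset($_GET['isbn'], $_GET['title'], $_GET['classmark'])) {
        add_book($_SESSION['booklist'], $_GET['isbn'],
                 $_GET['title'], $_GET['classmark']);
    }
}
```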
Lines 43–53 display the book list itself, using a standard PHP foreach loop to run
through the elements of the $_SESSION superglobal array. Note that both the key
(the ISBN) and the value (the title and class mark) of each element are displayed on the
web page. This happens each time a book is added to the list and also when the end
user clicks on a ‘View my list’ hyperlink (see, for example, Figure 7.28 (b)). Figure 7.29
shows a web page generated by Version 0.1 of List.php, with just one book on the list.
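A foreach loop of this kind might look as follows. The function render_book_list() and the HTML it produces are assumptions for illustration, since the actual markup used by List.php is in Figure 7.30. Values are passed through htmlspecialchars() so that characters such as & in a title cannot break the page.

```php
<?php
// Build an HTML list from the book list, showing both the key (ISBN)
// and the value (title and class mark) of each element.
function render_book_list(array $list)
{
    $html = "<ul>\n";
    foreach ($list as $isbn => $details) {
        $html .= '<li>' . htmlspecialchars((string) $isbn) . ': '
               . htmlspecialchars($details) . "</li>\n";
    }
    return $html . "</ul>\n";
}

echo render_book_list(array('0321189566' => 'Example Title (005.74 EXA)'));
```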
One interesting point about sessions is that only scripts that need to access the ses-
sion data store have to start a session. Neither the TitleSearch.php script, which gener-
ated the list of books in Figure 7.28 (a), nor the BookDetail.php script, which generated
Figure 7.28 (b), uses session data, so neither script calls the session_start()
function. Even so, when end users request either of these resources from the Library
Catalogue, the web browser will send the session ID in the HTTP request.
Figure 7.31 Clearing the book list by destroying all session data: Version 0.1
of ClearList.php.
If end users browse off to another website, then the web browser will not send the ses-
sion ID. This is because, by default, PHP sets the domain of the cookie that stores the
session ID to the web server’s domain – www.pennine.ac.uk. Any resource requested from
this domain is sent the Library Catalogue session ID.
If other applications on the same web server, such as the Web Timetable, also use
sessions, then the application developer needs to keep their session cookies apart.
A cookie’s domain cannot include a directory, so this is done by giving each
application’s cookie a different path – for example, ‘/LibraryCatalogue’ for the
Library Catalogue session ID and ‘/WebTimetable’ for the Web Timetable session ID.
The PHP user manual (PHP Documentation Group, 2005) describes how to do this.
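One way to keep the session cookies of two applications apart is to give each its own session name and cookie path before the session starts. The sketch below uses the built-in functions session_name() and session_set_cookie_params(). The session names, the paths and both helper functions are hypothetical choices for this example.

```php
<?php
// Hypothetical per-application settings: a session name and a cookie path.
function session_settings($application)
{
    $settings = array(
        'LibraryCatalogue' => array('name' => 'LIBCATSESSID',
                                    'path' => '/LibraryCatalogue/'),
        'WebTimetable'     => array('name' => 'TIMETABLESESSID',
                                    'path' => '/WebTimetable/'),
    );
    return $settings[$application];
}

// Start a session whose cookie is only sent for URLs below the given path.
// Both settings must be made before session_start() is called.
function start_application_session($application)
{
    $s = session_settings($application);
    session_name($s['name']);
    session_set_cookie_params(0, $s['path']);   // lifetime 0 = session cookie
    session_start();
}

if (PHP_SAPI !== 'cli') {
    start_application_session('LibraryCatalogue');
}
```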
It is common to allow users to end a session, clearing out any session data on the
web server and removing the cookie from the client machine. Version 0.1 of the script
ClearList.php (Figure 7.31) does this. Add a ‘Clear list’ hyperlink to the book list web
page to allow users to clear their lists (Figure 7.32). The ClearList.php script includes all
its PHP code before it outputs any HTML. This is because most of the statements cause
PHP to write an HTTP header to the HTTP response message. Remember, all HTTP head-
ers need to be written before any of the HTTP entity body – that is, before any HTML.
At present, the only way to remove data from the list is to clear it out entirely. It would
be nice to include a ‘Remove this book’ feature so that end users could remove indi-
vidual entries from their lists.
To do this, the script would have to remove an element from the $_SESSION array.
PHP provides the built-in function unset(), which achieves this. For example, to
remove the first book in the list in Figure 7.32, use the PHP statement:
unset($_SESSION["0321189566"]).
This works because the book’s ISBN – 0321189566 – was used as the key of the
element.
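A ‘Remove this book’ feature could be built around a small helper such as the one sketched here. The function remove_book() and the sample data are hypothetical. Like the earlier examples, it works on any array keyed by ISBN, including $_SESSION.

```php
<?php
// Remove one book from a list keyed by ISBN.
function remove_book(array &$list, $isbn)
{
    unset($list[$isbn]);   // no error is raised if the element does not exist
}

// Hypothetical list with two entries.
$list = array('0321189566' => 'First title', '0000000000' => 'Second title');
remove_book($list, '0321189566');
echo count($list);   // one book left
```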
Figure 7.32 The booklist web page with a ‘Clear list’ hyperlink.
Line 20 (Figure 7.31) starts the session. This is necessary as the script will alter the ses-
sion data (by removing it all). Line 21 sets the value of the $_SESSION superglobal
array to the empty array. This is the line that gets rid of all the session data from the
web server.
Lines 22–25 deal with the cookie on the client machine. It is possible that the cookie
hasn’t yet been set – the end user may have bookmarked the ClearList.php page or a
page that contains a hyperlink to it and got to this point before the web server sends
the session ID to the client machine. When PHP first sets the cookie that stores the ses-
sion ID, it gives it a default name. That name is always accessible, using the built-in
PHP function session_name(). When a session ID has been sent, there will be an
element in the $_COOKIE superglobal array and the key for it is the name of the cur-
rent session. The built-in PHP function isset() checks whether or not a variable or
array element has been given a value, so isset($_COOKIE[session_name()])
returns true when there is a cookie with the name of the current session. This confirms
that a cookie has been set on the client machine. In this case, line 24 uses another
built-in PHP function to set an HTTP header that instructs the web browser to destroy
this cookie.
It’s worth examining this function:
destroy the cookie. (There are other parameters to the setcookie() function, but
they aren’t required when destroying a cookie.)
Erasing and destroying the cookie may seem like overkill, but remember that this
function doesn’t actually do either of these things. It merely sends an HTTP header
that asks the web browser to do them. By asking the web browser to erase the cookie
data first, then destroy the cookie completely, there is a better chance that the infor-
mation stored in the cookie will actually be removed – if one action fails, the other
may well succeed.
Line 26 calls the built-in PHP function session_destroy() which destroys the
remaining session information on the web server (the session variables themselves
were unset in line 21).
Finally, line 27 sends an HTTP header redirecting the web browser to the Title-
Search.html web page. This is the page that gathers the catalogue search criteria. The
rest of the script is plain HTML, describing a simple web page that informs end users
that they are being redirected. Strictly speaking this is not necessary as the web
browser should immediately request the TitleSearch.html web page. End users see the
web page defined by the code in lines 29–43 only if it takes a long time for the web
server to deliver TitleSearch.html.
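The steps walked through above can be put together as follows. This is a sketch modelled on the description, not the code of Figure 7.31: the function names are hypothetical, and the expiry offset and the cookie path '/' are assumptions based on the example in the PHP manual.

```php
<?php
// True when the browser sent a cookie carrying the current session's name.
function session_cookie_was_sent(array $cookies, $sessionName)
{
    return isset($cookies[$sessionName]);
}

// End the current session, then redirect to the search page.
function end_session_and_redirect()
{
    session_start();                        // re-establish the session first
    $_SESSION = array();                    // discard all session variables
    if (session_cookie_was_sent($_COOKIE, session_name())) {
        // An empty value erases the data; an expiry time in the past asks
        // the browser to delete the cookie. Offset and path are assumptions.
        setcookie(session_name(), '', time() - 42000, '/');
    }
    session_destroy();                      // remove remaining server-side data
    header('Location: TitleSearch.html');   // headers go out before any HTML
    exit;
}

if (PHP_SAPI !== 'cli') {
    end_session_and_redirect();
}
```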
The session created in the ‘My list’ example allowed anonymous browsing – there was
no need for users to identify themselves before adding books to their lists. Many applica-
tions, however, require users to identify themselves before they can use the application.
First, each end user must be registered with the application. Registration involves
issuing each end user with a username and password. Some web database applications
allow users to register themselves. They usually provide some basic information, such
as their names and e-mail addresses, and choose their own usernames and passwords.
These applications are, in effect, open access as anyone can register. However, once
registered, end users’ behaviour can be monitored.
There are many reasons for an application monitoring the behaviour of end users.
For example, an organization might want to identify which parts of a website are most
popular, prevent misuse of the application, gather information about users for direct
marketing purposes or facilitate personalization of a website. When a web database ap-
plication is not open access, then the application or database administrator will issue
usernames and passwords. These act as a barrier to entry as well as allowing the admin-
istrator to monitor end users.
Once registered, end users can log in each time they return to the website. To log in,
they identify themselves with their usernames and authenticate their identity with
their passwords. Once logged in, the application must keep track of each end user’s
identity from one HTTP request to the next. This is where sessions can be useful.
When users first come to the website, a simple HTML form allows them to log in. For
each user, the username is stored in a session variable, so is available to the application
for subsequent HTTP requests. Each script in the application must check that the
‘username’ session variable exists and, if not, display the login page (for example, by
using redirection). There should also be a ‘logout’ button or hyperlink on each web
page. This calls a script to destroy the session.
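The check that each script must make could look like the sketch below. The function is_logged_in() and the page name Login.html are hypothetical. The session variable name 'username' follows the description above.

```php
<?php
// True when the session holds a non-empty 'username' variable.
function is_logged_in(array $session)
{
    return isset($session['username']) && $session['username'] !== '';
}

if (PHP_SAPI !== 'cli') {
    session_start();
    if (!is_logged_in($_SESSION)) {
        header('Location: Login.html');   // hypothetical login page
        exit;
    }
    echo 'Logged in as ' . htmlspecialchars($_SESSION['username']);
}
```

Placing this check in an include file would let every script in the application share it.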
HTTP authentication provides an alternative way to keep track of the identity of
end users. HTTP authentication is built in to the HTTP protocol and it is possible to set
up the web server so that it checks that an end user is logged in before dealing with
the HTTP request. The HTTP server automatically makes the login details (username
and password) available to PHP.
A good example of an open access application that wants to track its end users is the
Pennine University’s Course Catalogue. Prospective students can browse the online
catalogue anonymously, but must register to receive a printed copy. Once registered,
they can log in and the application will keep track of those pages they have visited. It
can use this information to suggest other courses that they might be interested in.
Many web applications currently use similar techniques to personalize web pages for
registered users. Logging in also gives the prospective students access to parts of the
website that are not open access. For example, they can track the progress of their ap-
plication and view some of the online learning materials used in the courses that they
are interested in.
This section covers gathering registration data and managing HTTP authentica-
tion from within PHP scripts. Once these techniques are understood they can be
combined with sessions and storing information on the database to develop scripts
that implement much more sophisticated functionality than the examples dis-
cussed here.
● "virtual" means that the text wraps on the web page, but only new line charac-
ters actually typed by end users are sent as part of the control’s data
● "physical" means that the web browser actually adds new line characters to the
data wherever it needs to wrap the text
● "off" means that the text is not wrapped, so end users must type new line charac-
ters to start a new line.
Notice that the label ‘Address’ is not aligned with the top of the address textarea. Web
applications now use cascading stylesheets to alter the web browser’s default presenta-
tion, but, as this technology has not been covered, the examples stick with the default
presentation.
When the web browser submits the data from the textarea form control, it correctly
sends the new line character, encoded as %0D%0A (this is actually two characters –
‘carriage return’ followed by ‘linefeed’). When PHP uses this data as part of an insert
statement, the new line characters are included as part of the address. When the data
is inserted into a TEXT or VARCHAR column of a MyISAM table, the MySQL DBMS is
quite happy with this, correctly storing the new line characters in the address. When the
data is later retrieved from the table, it will again correctly include the new line
characters. This may not be the case with every MySQL storage engine, nor with other DBMSs.
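The encoding of the new line characters can be checked directly in PHP. In this short sketch the address text is sample data. The built-in function nl2br() is one way, not necessarily the one used in the book's scripts, to show the stored line breaks when the address is later written to a web page.

```php
<?php
// A two-line address as a textarea control would submit it:
// carriage return followed by linefeed between the lines.
$address = "6 Winnow Close\r\nColleywood Park";

// In the submitted form data the line break travels as %0D%0A.
echo urlencode($address), "\n";

// When displaying the address, nl2br() inserts an HTML line break
// before each new line so that the web browser shows two lines.
echo nl2br(htmlspecialchars($address)), "\n";
```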
The other new form control is the password control. The form controls labelled ‘Pass-
word:’ and ‘Confirm password:’ in Figure 7.33 actually have the word ‘password’ typed
into them, but password form controls never display the actual data they hold, only a se-
quence of ‘*’ or ‘•’ characters (depending on the web browser). This means that users can
enter their passwords without worrying that someone could read over their shoulder what
they have typed. The password form control is defined using the HTML input element:
<input type="password" name="password1" size="20"
maxlength="20" />
Of course, hiding the text from end users also presents a problem. For example, if Heidi
had meant to type ‘password’ but accidentally mistyped it as ‘passeord’, then she cannot
see her mistake. That is why there is a ‘Confirm password:’ control. It is unlikely that
end users will make the same typing error twice, so asking them to type their passwords
twice and comparing the two values is a good way of preventing this sort of mistake.
Code listing 7.7.1.1 shows how to cross-validate two password controls. (See Sec-
tion 7.3 for a fuller discussion of validation.)
{
$validationFail = true;
$validationMessage["password"]="There was a problem with
the password. Please try again.";
}
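A self-contained version of this cross-validation might look like the sketch below. It is an assumption-laden illustration rather than the missing part of the listing: the function validate_passwords() is hypothetical, as is the name password2 for the second control (only password1 appears in the HTML above).

```php
<?php
// Cross-validate the two password controls.
// Returns an array of validation messages; an empty array means success.
function validate_passwords($password1, $password2)
{
    $messages = array();
    if ($password1 === '' || $password1 !== $password2) {
        $messages['password'] =
            'There was a problem with the password. Please try again.';
    }
    return $messages;
}

// Typical use with submitted form data (field names are assumptions).
if (isset($_POST['password1'], $_POST['password2'])) {
    $validationMessage = validate_passwords($_POST['password1'],
                                            $_POST['password2']);
    $validationFail = count($validationMessage) > 0;
}
```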
Once the user registration data has been gathered, it is usually stored in the database. It is
good practice to encrypt password data before storing it. There are a number of encryp-
tion algorithms available, but the simplest approach is to use a hash function. Hash func-
tions take a string of characters and return another string, usually of hexadecimal digits.
Given a particular input string, the hash function always returns the same output string.
For example, consider the SHA-1 hash function. Given the clear text value password as
input, the SHA-1 hash function always returns the hashed value 5baa61e4c9b93f3
f0682250b6cf8331b7ee68fd8 as output. Hash functions are designed so that it is
not feasible to retrieve the original clear text value from the hashed value. Given the
hashed value 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8, there is no easy
way to work out that this represents the clear text value password.
Hash functions provide a useful mechanism for encrypting passwords. When a user
chooses a password during registration, it is the hashed value, not the clear text value,
that is stored on the database. This means that it is not possible for anyone to discover
another user’s password simply by looking in the database. When an end user comes
to login, he or she types the clear text password into the login form, the application
uses this to generate the hashed value and compares this hashed value to the one stored
in the database. If they match, then the user has entered the correct password.
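The comparison described above can be sketched with PHP's built-in sha1() function. The function password_matches() is hypothetical, and the stored value is held in a variable here rather than fetched from a database.

```php
<?php
// Compare a clear text password typed at login with a stored hashed value.
function password_matches($clearText, $storedHash)
{
    return sha1($clearText) === $storedHash;
}

// Value stored at registration time: the hash, never the clear text.
$stored = sha1('password');

var_dump(password_matches('password', $stored));   // correct password
var_dump(password_matches('passeord', $stored));   // mistyped password
```

SHA-1 is used here because it is the function the chapter discusses. Current versions of PHP also provide password_hash() and password_verify(), which are generally preferred for new applications.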
The MySQL DBMS implements the SHA-1 hash function as the value function
SHA1(). This function expects a single string value as input and outputs another
string value:
string SHA1(string str)
So, for example:
SHA1('password')='5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8'
Note that the single quote marks around the clear text and hashed values are not part of
the string values; they are the usual SQL string delimiters.
Adding an encrypted password to a database is relatively straightforward. Figure 7.34
shows the code from Register.php, which deals with valid registration data. First, the
data is cleansed and stored in local variables, then the script establishes a connection to
the DBMS.
Lines 140 and 141 are the interesting ones. These set up an SQL insert statement to
add the registration data to a table called User. The values added are simply those en-
tered into the registration form by the end user, apart from the value for the password
column. The value for this column is SHA1('$pwd'). This applies the MySQL func-
tion to the value of the PHP variable $pwd.
It is worth looking a little more closely at these two lines as it is important to under-
stand what the SQL insert statement actually looks like.
Code listing 7.7.1.2 (a) shows the PHP code that builds the SQL insert statement.
When the PHP application server processes this statement, it will replace each of the
variable names inside the string literal with their values. In particular, it replaces $pwd
with the value password, entered by the end user into the password control on the
registration form. The result is the SQL insert statement shown in Code listing 7.7.1.2
(b). Note that this SQL statement includes the MySQL function SHA1(). It is the
MySQL DBMS that encrypts the password, not PHP.
Code listing 7.7.1.2 Using the MySQL DBMS function SHA1() to encrypt the
password
(a) The PHP code
$sqlStatement = "INSERT INTO User (name, address, email,
username, password)
VALUES ('$name', '$address', '$email',
'$username', SHA1('$pwd'))";
(b) The SQL insert statement built by the PHP code given the data shown in
Figure 7.33
INSERT INTO User (name, address, email, username, password)
VALUES ('Heidi Schultz',
'6 Winnow Close
Colleywood Park
Stenborough Gloucestershire',
'[email protected]',
'heidiz',
SHA1('password'))
PHP also includes a function called sha1(). This behaves in exactly the same way as
the MySQL function SHA1(). Using the PHP sha1() function means encrypting the
password before sending it to the DBMS.
Code listing 7.7.1.3 (a) shows one way to use the PHP sha1() function, while (b)
shows the SQL insert statement. The difference between Code listings 7.7.1.2 (b) and
7.7.1.3 (b) should be obvious.
Code listing 7.7.1.3 Using the PHP function sha1() to encrypt the password
(a) The PHP code
//First encrypt the password
$pwd=sha1($pwd);
(b) The SQL insert statement built by the PHP code given the data shown in
Figure 7.33
INSERT INTO User (name, address, email, username, password)
VALUES ('Heidi Schultz',
'6 Winnow Close
Colleywood Park
Stenborough Gloucestershire',
'[email protected]',
'heidiz',
'5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8')
In the examples given of functions implementing the SHA-1 algorithm, the MySQL
function is always written in upper case and the PHP function in lower case. In fact,
both function names can be written in either case. It does make sense, though, to
consider always writing MySQL functions in upper case and PHP functions in lower
case. As with everything, this is a matter of personal style.
Lines 144–157 use relocation to avoid the reload problem. In this case, when the data
has been successfully added to the database, it makes sense to send the user to the
login page. If there is a problem, an error page is displayed instead.
Figure 7.35 An HTTP authentication prompt from the Firefox web browser.
user for this particular security realm. If the HTTP request does not have an authoriza-
tion header, the web server responds with a challenge.
A challenge is simply an HTTP response message with status code 401, meaning
‘You need to login to get this resource’, and the WWW-Authenticate HTTP response
header set appropriately. The WWW-Authenticate HTTP header tells the web browser
how to login. This response message also includes an entity body. Typically, the entity
body is a simple web page stating that the requested resource can only be accessed by
authorized users. When the web browser receives the challenge, it displays a prompt
like the one shown in Figure 7.35. The user can either click on the ‘Cancel’ button or
enter his or her login details and then click on ‘OK’. When an end user clicks on ‘Can-
cel’, the web browser displays the web page that came as the entity body of the chal-
lenge (Figure 7.36).
When the end user enters their login details and clicks on ‘OK’, two things happen
(Figure 7.37). First, the login details are stored by the web browser in its authentica-
tion cache. Second, they are sent to the web server in the authorization header of a
second HTTP request for the same resource. This time, the web server should check the
details in the authorization header. If the login details are not valid, then the web
server sends another challenge. If the login details are valid, however, the web server
responds by supplying the requested resource. Note that, in this case, the entity body
of the challenge is never displayed to the end user.
Once the end user has logged in, the web browser sends the login details, as an authorization header,
with all future HTTP requests for web resources from the same domain (just as it does
with the session ID). Thus, HTTP authentication uses the same basic mechanism as
sessions to overcome the statelessness of the HTTP protocol and track authenticated
users from one request to the next. However, HTTP authentication is much more lim-
ited in the application state information that it holds. Only the username and pass-
word are maintained, nothing else.
The web browser stores login details for each domain, not for each security realm. It
will send the login details with every request for a resource from that domain. It will
not send the login details with requests for resources from a different domain.
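To see what travels in the authorization header, the sketch below builds and then unpacks one. It assumes the Basic authentication scheme, in which the username and password are joined with a colon and Base64 encoded. Both function names are hypothetical. Note that Base64 is an encoding, not encryption, so anyone who intercepts the header can recover the password.

```php
<?php
// Build the Authorization request header a web browser would send
// under the Basic scheme.
function basic_authorization_header($username, $password)
{
    return 'Authorization: Basic '
         . base64_encode($username . ':' . $password);
}

// Recover array(username, password) from such a header, or null if the
// header is not a well-formed Basic authorization header.
function parse_basic_authorization($header)
{
    $prefix = 'Authorization: Basic ';
    if (strpos($header, $prefix) !== 0) {
        return null;
    }
    $decoded = base64_decode(substr($header, strlen($prefix)), true);
    if ($decoded === false || strpos($decoded, ':') === false) {
        return null;
    }
    return explode(':', $decoded, 2);   // split at the first colon only
}

$header = basic_authorization_header('heidiz', 'password');
echo $header, "\n";
print_r(parse_basic_authorization($header));
```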
The HTTP authentication process described in Figures 7.36 and 7.37 is relatively
straightforward, and so is managing HTTP authentication with PHP. The HTTP authen-
tication process provides the PHP script with a username and password. The PHP script
uses the username and password to apply a security policy, asking, ‘Is this user authorized
to view this web page?’
Figure 7.38 shows Version 0.1 of the PHP script Login.php. In this case, the security
policy is very simple: if the end user has provided a username and password, any user-
name and password, then he or she is authorized. This is not a realistic security policy,
but is a good starting point as it keeps things as simple as possible. The script generates
different HTTP responses for authorized and unauthorized users, just as it would with
a more realistic security policy.
The username and password typed into the HTTP authentication prompt (Figure 7.35)
are sent to the HTTP server, which makes them available to the PHP script in the
Figure 7.37 Retrieving a protected web resource by supplying valid authorization details.
and HTTP servers, so is widely used. The second piece of information is the name of
the security realm – realm="Course Catalogue", in line 28. Setting the response
line and the WWW-Authenticate header will ensure that the HTTP response chal-
lenges the web browser to provide authentication details. When the web browser re-
ceives an HTTP response with a status of 401, it clears out its authentication cache for
the realm named in the WWW-Authenticate header and displays the authentication
prompt to the end user. Notice that this prompt includes the name of the security
realm (Figure 7.35).
The rest of this code block (lines 29–33) describes the entity body of the HTTP
response. This is the web page that will be displayed when end users click on ‘Cancel’
on the authentication prompt.
Line 29 is interesting. It is an include statement, instructing the PHP application
server to insert the contents of DocHeader.inc into the script at this point. Figure 7.39
shows the contents of DocHeader.inc. They are the first eight lines written to the entity
Figure 7.40 The web page delivered with the HTTP challenge – it is only
seen if end users click on Cancel on the authentication prompt.
body of the HTTP response by the PHP script. When the PHP application server en-
counters the include statement, it assumes that the include file will begin with plain
HTML rather than PHP. (That is why previous include files, which actually contained
PHP, enclosed all their content in the PHP tag – <?php ... ?>.) Thus, the PHP
application server will write the whole of DocHeader.inc to the entity body of the
HTTP response message. When it returns to line 30 of Login.php, it assumes that it is
back to processing PHP, so correctly interprets the PHP echo statement.
Lines 30–33 (Figure 7.38) write out a helpful message. The message is delivered as
part of the entity body of the HTTP challenge.
Lines 45–50 finish off the entity body. Remember, this web page (Figure 7.40) is
only ever seen when end users click on ‘Cancel’ on the authentication prompt.
If end users enter both a username and a password, then they have met the security
policy enforced by this PHP script. In this case, the else block is executed instead (lines
36–43). This cleanses the data entered into the username and password text boxes and
stores them in local variables. It then includes the DocHeader.inc file, just as the if
block did. This HTML needs to be written to the entity body regardless of whether or
not end users meet the security policy. It cannot be written before line 25 as the if
block writes HTTP headers, which must be written before anything is written to the
entity body. So, it must be written independently in both blocks. Extracting the HTML
into an include file is a more elegant solution than copying it out twice.
Lines 40–42 write a message telling end users what username and password they
typed. Again, this isn’t realistic behaviour, but does demonstrate that the script has
successfully gathered the authentication details.
Finally, lines 45–50 again finish off the entity body. A web page generated by the script
after one user has successfully entered her username and password is shown in Figure 7.41.
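The overall shape of such a script can be sketched as follows. This is a simplified illustration of the same ‘any username and password’ policy, not the code of Figure 7.38: the function has_credentials() and the message text are hypothetical, the Basic scheme is assumed, and the include file is left out to keep the example self-contained.

```php
<?php
// The very simple security policy: any non-empty username and password.
function has_credentials(array $server)
{
    return isset($server['PHP_AUTH_USER'], $server['PHP_AUTH_PW'])
        && $server['PHP_AUTH_USER'] !== ''
        && $server['PHP_AUTH_PW'] !== '';
}

if (PHP_SAPI !== 'cli') {
    if (!has_credentials($_SERVER)) {
        // Challenge: status 401 plus the WWW-Authenticate header.
        // Both headers must be written before any of the entity body.
        header('HTTP/1.0 401 Unauthorized');
        header('WWW-Authenticate: Basic realm="Course Catalogue"');
        // Entity body, seen only if the end user clicks on 'Cancel'.
        echo '<p>This page can only be viewed by registered users.</p>';
    } else {
        echo '<p>You are logged in as '
           . htmlspecialchars($_SERVER['PHP_AUTH_USER']) . '.</p>';
    }
}
```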
A more realistic security policy would be to compare usernames and passwords to
those stored on the database. In the Course Catalogue database, these details are stored
Figure 7.41 The web page delivered when an end user provides valid
authentication details.
in the username and password columns of the User table. Passwords are encrypted
by the DBMS before storing, using the SHA-1 algorithm.
It is reasonably simple to use the techniques developed over the last four chapters to
write a PHP script that implements this security policy. The script must check that the
username and password supplied are not empty and that they match a registered user
on the database.
Figure 7.42 shows the relevant portion of Version 0.2 of the Login.php script.
Lines 26–53 implement the security policy by validating the values of $_SERVER
["PHP_AUTH_USER"] and $_SERVER["PHP_AUTH_PW"].
Line 27 checks whether or not an end user entered a username and password. If not,
line 28 sets the variable $validationFail to TRUE. If the end user did enter a user-
name and password, lines 31–50 validate this data against the list of registered users
stored on the database.
The SQL statement (lines 40–43) retrieves those rows of the User table where
the username column has the same value as $_SERVER["PHP_AUTH_USER"]
and the password column has the value output by the MySQL function SHA1()
when the input is the value of $_SERVER["PHP_AUTH_PW"]. (The $_SERVER ele-
ment values are copied to local variables before being used.) Remember, the registra-
tion process encrypts the password before storing it on the database, but $_SERVER
["PHP_AUTH_PW"] is the clear text password. If line 43 read:
AND password='$password'";
then no user would be able to log in because the encrypted data stored in the password
column will never match the clear text data stored in the local PHP variable
$password.
Line 44 retrieves the data from the database, and line 49 validates it. If there are no
matching rows, then there is no registered user with the username and password sup-
plied, so line 50 sets $validationFail to TRUE.
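The comparison that the SQL statement performs can be sketched without a database. The following is a minimal sketch only, not the code from Figure 7.42: the array $registeredUsers stands in for the User table, the function name isRegisteredUser() and the sample usernames and passwords are invented for illustration, and PHP's built-in sha1() function plays the role of the MySQL SHA1() function (both return the digest as a 40-character hexadecimal string).

```php
<?php
// Stand-in for the User table: each username maps to the SHA-1 digest of
// the password, as the registration process would have stored it.
// (Hypothetical sample data, for illustration only.)
$registeredUsers = array(
    'p.smith' => sha1('secret'),
    'h.jones' => sha1('pa55word'),
);

// Returns TRUE when the username exists and the digest of the supplied
// clear text password matches the stored digest.
function isRegisteredUser($users, $username, $password) {
    if ($username === '' || $password === '') {
        return FALSE;                       // nothing entered: fail validation
    }
    if (!array_key_exists($username, $users)) {
        return FALSE;                       // no such registered user
    }
    return $users[$username] === sha1($password);
}

// Comparing the stored digest with the clear text password never matches.
$naiveMatch = ($registeredUsers['p.smith'] === 'secret');

$validationFail = !isRegisteredUser($registeredUsers, 'p.smith', 'secret');
echo $validationFail ? "Challenge the user again\n" : "Show the requested resource\n";
```

The variable $naiveMatch is always FALSE here, which mirrors the point made above about comparing the password column with the clear text value of $password.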
Figure 7.42 Validating authentication details against the database: Version 0.2 of Login.php.
Line 55 checks the value of $validationFail. If it is true, then the user is chal-
lenged to supply a valid username and password. If it is false, then the username and
password supplied have passed validation and the user gets to see the requested resource,
which is generated by the else block in lines 65–71.
There are some difficulties with HTTP authentication. One that many developers
have struggled with is implementing a ‘logout’ process. As the authentication
details are stored on the web browser, not the web server, the only way to log a user out
is to force the web browser to clear its authentication cache. As mentioned above, sending
an HTTP response with the status code 401 does this. It is relatively easy to create a
logout script that sets this in motion. It should simply rechallenge the web browser. Lines
57 and 58 in Figure 7.42 show how to set up the HTTP response message to do that.
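A logout script along these lines might look like the following. This is a hedged sketch rather than the script from the figure: the function name challengeHeaders() and the realm name ‘Course Catalogue’ are assumptions, and the header lines are built as a list first so that they can be inspected separately from being sent.

```php
<?php
// Builds the two header lines that make up a Basic authentication challenge.
// The realm name passed in below is a hypothetical example.
function challengeHeaders($realm) {
    return array(
        'HTTP/1.1 401 Unauthorized',
        'WWW-Authenticate: Basic realm="' . $realm . '"',
    );
}

// Headers must be sent before anything is written to the entity body.
if (!headers_sent()) {
    foreach (challengeHeaders('Course Catalogue') as $line) {
        header($line);
    }
}

// The entity body is only seen if the end user cancels the new prompt.
echo "You have been logged out.\n";
```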
A more serious problem is that, with the Basic HTTP authentication scheme, the au-
thentication details are transmitted across the network as clear text (though see the
note below). Anyone with suitable software can intercept these HTTP messages and
discover the usernames and passwords of registered users. One option is to use the more
secure digest authentication scheme, but even this is not considered particularly secure. Web
database applications that need robust security should use a version of secure HTTP,
which encrypts the whole HTTP message (see, for example, Gourley and Totty (2002)
for an overview).
The basic authentication scheme sends data as clear text, but it is encoded, using
what is called Base-64 encoding. The goal of Base-64 encoding is to ensure that the
data can be transmitted correctly, without any characters being mangled on the way.
This is a little like the URL encoding discussed earlier. Base-64-encoded data is very
easy to decode, so Base-64 encoding is not an encryption mechanism.
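To see how little protection Base-64 encoding offers, the sketch below decodes the value of an Authorization header using PHP's built-in base64_decode() function. The function name decodeBasicCredentials() is an assumption made for illustration. The sample header value carries the username ‘Aladdin’ and the password ‘open sesame’, the example used in the HTTP authentication specification.

```php
<?php
// Recovers the username and password from a Basic Authorization header value.
// Returns array(username, password), or NULL if the value is not well formed.
function decodeBasicCredentials($headerValue) {
    if (strpos($headerValue, 'Basic ') !== 0) {
        return NULL;                          // not the Basic scheme
    }
    $decoded = base64_decode(substr($headerValue, 6), TRUE);
    if ($decoded === FALSE || strpos($decoded, ':') === FALSE) {
        return NULL;                          // not valid Base-64, or no separator
    }
    // The username ends at the first colon; the password may contain colons.
    return explode(':', $decoded, 2);
}

list($username, $password) = decodeBasicCredentials('Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==');
echo "Username: $username, password: $password\n";
```

No key or secret is needed at any point, which is why anyone who intercepts the message can read the authentication details.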
Chapter summary
■ This chapter has examined a number of issues and techniques commonly used in
web database applications. The focus has been on how to maintain application state
despite the statelessness of the HTTP protocol. Three general techniques have been
discussed:
● URL rewriting
● hidden form controls
● sessions.
■ In addition, the special case of authentication details, which can be managed using
HTTP authentication, has been discussed.
■ The examples used to illustrate these techniques also demonstrated:
● multiple page result sets
● server-side data validation
● single script solutions to gathering data
● how to solve the reload problem
● wizards
● how to use a DBMS function to encrypt a password.
■ This is quite a lot to cover in one chapter and was only achieved by focusing on the
general concepts rather than production-quality solutions. The ‘Further reading’
section suggests a few places to look for such solutions. Understanding the under-
lying concepts will help you to understand these.
Further reading
There are several good books on PHP and MySQL, though most of them give pretty
sparse coverage of database issues. Bulger et al. (2004) is a good practitioners’ book and
includes several fully worked case studies. Williams and Lane (2004) is an excellent
introduction to web application development. Welling and Thompson (2005) has a gen-
tler pace, but covers very similar ground.
If none of these appeals, then simply browse the shelves at your library or any good
bookshop. Online resources are equally plentiful.
Review questions
7.1 What is application state? Summarize the three main approaches to storing applica-
tion state in a web database application.
7.2 Using suitable examples, explain what will cause each of the following kinds of PHP
errors:
(a) parse error
(b) warning
(c) fatal error
(d) notice.
7.3 Explain why setting the default error reporting level so that the PHP application server
reports only parse errors on the web page is a sensible approach for a shared PHP
application server.
7.4 Explain the purpose of the LIMIT clause – a MySQL extension to the select statement.
7.8 What is an HTTP location header and how is it used to solve the reload problem?
Exercises
7.11 Using Version 1.5 of SurnameSearch.php as a basis, complete the code to display
the result set across multiple pages. In particular, include the following features.
(a) Validate string and numeric data using the functions you wrote in Exercise 4.13 (d).
Deal with search criteria that include single apostrophes in the string literal.
(b) Display a count of the number of matching rows at the start of the page. (Hint:
The PHP built-in function mysql_num_rows() counts the number of rows in the
result set. If there is a limit clause, then this will not be the total number of match-
ing rows. Try using an aggregate query.)
(c) Display the rows in an HTML table element so that, for each row, the forename, surname,
e-mail and ‘Full details’ hyperlink appear in separate columns. Alternatively,
if you are familiar with CSS, use CSS to achieve this layout. (Hint: See the
solution to 4.14 (d).) Consider writing a suitable function to do this for any result set.
(d) Add a ‘Prev’ hyperlink, allowing users to navigate to the previous page of the result
set. This link should not be enabled on the first page, nor should the ‘Next’ link be
enabled on the last page.
(e) Make the forename column heading a hyperlink. When this hyperlink is clicked
on, redisplay the search results ordered by forename. Allow end users to cycle
through three states – ascending order, descending order and unordered. Do the
same for the surname and e-mail columns.
(f) The web page now includes up to six hyperlinks, each encoding different items of
application state. The page navigation and ordering hyperlinks must be properly
synchronized. Critically evaluate how effective URL rewriting is as a mechanism
for maintaining application state as the number of items increases.
7.12 Suppose Paul and Helene are both using the Staff Directory application. Paul is
searching the directory and receives a two-page result set. Helene is modifying the
database instance. Describe a scenario in which Helene’s modifications mean that
Paul does not see all the rows that match his search criteria. How could you guard
against such problems?
7.13 Tidy up Version 0.3 of SupportSessionInsert.php, dealing with the following issues:
(a) ensure that the form displays without validation error messages when first requested
(b) tighten the validation to deal with the situation in which the user types name-
value pairs directly into the URL query string
(c) when there are validation errors, ensure that any valid data is redisplayed in the
relevant form control
(d) when both the start and end time are valid times, check that the duration of the
support session is at least one hour
(e) ensure that the script does not suffer from the reload problem.
7.15 Make the following improvements to the MyList application discussed in Section 7.6.
(a) Each web page in the Library Catalogue application should include a message
telling end users how many books are currently on their lists. Amend the
TitleSearch.php, BookDetail.php and List.php scripts to include this information.
(b) Allow users to remove a single item from their lists.
very simple security policy, such as allowing any username and password. For a
more realistic security policy, compare each username and password supplied by
users to suitable user details stored on the database.)
(b) Allow users to log out. Remember to clear out their session data.
Investigations
7.17 Investigate the more advanced facilities now offered by PHP to handle errors in PHP
code. Areas to investigate include the ‘try . . . catch’ mechanism for handling excep-
tions and custom-built error-handling functions.
7.18 Investigate output buffering, which allows greater control over when HTTP headers
are written.
7.19 In the examples discussed in this chapter, the session data is stored in data files on
the PHP server. Investigate approaches to storing session data on the database.
7.20 Write a PHP script that will check the syntax of an SQL statement. Things you could
check include (but aren’t limited to) that keywords are spelt correctly, required parts
of the statement are present, clauses appear in the correct order and parentheses
nest properly. Try extending your script to check for basic semantic errors, such as
misspelt table or column names, type mismatches between literals and columns and
so on. (Note: To do this, you will need to query the data dictionary.)
Chapter objectives
➔ To discuss the three stages of database design – conceptual, logical and physical.
➔ To introduce the ER diagram as a means of documenting the conceptual database
design.
➔ To explain some problems with ER diagrams and discuss solutions.
Chapter outline
Web database applications deliver web pages the content of which is generated
dynamically by merging data drawn from a database (the dynamic content) with stat-
ic content. A simple web database application, such as the Pennine University’s Staff
Directory, would usually be written without any formal designs. The application devel-
oper would work out a solution in his or her head and validate it by talking through
ideas with the end user. Even in this situation, though, the application developer
would still produce a design; it just wouldn’t be documented. This approach can be
fast, and effective, but does assume that the end user is always available and the appli-
cation developer is an expert.
When the end user is too busy to be closely involved, or for novice developers and
those using unfamiliar technologies, documenting the design is part of the process of
developing a solution. Alternative solutions can be worked out without the need to
code anything and a decision made about how best to proceed.
For large web database applications with many different developers it is crucial that
all the design decisions are documented. If someone leaves the development team and
hasn’t documented his or her work, then that knowledge leaves the team. A fully docu-
mented design is both a record of the decisions taken and a means of evaluating these
decisions against users’ requirements. This makes a clear understanding of effective
design techniques a must for all serious application developers.
Designing a web database application involves two distinct tasks. First, the design
must describe a database that meets the information requirements of end users. Sec-
ond, it must describe a collection of web pages that allow users to access the database.
This may be to deliver data from the database instance to end users or allow end users
to modify the database instance.
Traditionally, database designers were not involved in such user interface design.
Databases often served several different applications, so could not be tailored to any
particular one. Similarly, web designers often knew little about database design, focus-
ing on the important issues of website usability and good looks. The distance between
what database designers and website designers do is shrinking and both need to under-
stand something of the other’s territory. The UML helps by providing a common
language for both database and web design.
Database design is split into three distinct phases. Conceptual database design
produces a model showing the business artefacts and behaviour needed to realize
users’ requirements. The description of the data requirements is not based on any par-
ticular data model but simply documents the different kinds of things that the data-
base will hold data about and the connections between them. This demonstrates that
the application will meet the users’ requirements without going into the technicalities
of how. Because it avoids technical issues, end users can usually easily understand, and
comment on, a conceptual database design.
Logical database design begins to deal with how the database will be organized. It
takes the data requirements identified in the conceptual design and represents them
using the data structures available in a particular data model. Separating conceptual and
logical database design means that end users don’t need to understand the particular
data model used. As even the relational data model can be hard to grasp, this is a good
thing.
Physical database design adapts the logical design to use particular technologies –
the MySQL DBMS rather than the Oracle database (remember, Oracle calls its DBMS a
‘database’), for example. Although each DBMS organizes its data structures according
to a particular data model, there are often differences in how they do this.
The process of website design is less mature than that of database design and
there is no single approach that commands wide acceptance. Even so, the conceptual–
logical–physical split makes sense here, too. Conceptually, there is little difference be-
tween a user interface using web technologies and one using a more traditional graph-
ical user interface (such as Microsoft Visual Basic forms or Java Swing classes).
Logically, however, these two approaches have significant differences – in particular
whether the connection between user interface and database is stateful or stateless.
The physical design of the website will identify the particular web technologies used.
Chapter 11 uses the conceptual–logical–physical split to introduce an approach to
website design for web database applications.
■ Section 8.1 outlines the database design process and sets it within the context of
the software development lifecycle introduced in Chapter 1.
■ Section 8.2 uses the Staff Directory to introduce the basic features of an ER diagram
and discusses how to use the requirements analysis to identify what data needs to
be held on the database.
■ Sections 8.3–8.5 elaborate on the material in Section 8.2, discussing more advanced
notations for ER diagrams.
■ Section 8.6 discusses association multiplicity constraints. It is an important topic
that needs to be understood before ER diagrams can be used effectively. The section
considers the UML outer multiplicity constraint first, then contrasts it with the inner
multiplicity constraint, which is probably more widely used in database design.
■ Section 8.7 discusses generalization relationships.
■ Section 8.8 concludes the chapter by considering ways to validate an ER diagram.
This chapter, and the following two, focus on database design. There are three stages
in database design:
● conceptual
● logical
● physical.
The conceptual database design produces a high-level design for the database that is
not tied to any formal data model, yet organizes the data required by the application
efficiently and effectively. The fragmentary data requirements documented in the
use cases are collected and represented in a single, coherent, high-level overview of
requirements.
The conceptual database design describes what data structures are required, the con-
nections between them and any constraints imposed on the data by business rules. The
focus is very much on what rather than how. It forms the basis of discussions between
the database designer and end users about the data requirements. To serve this purpose, the
conceptual database design must be intelligible to end users.
The entity-relationship (ER) diagram is a popular choice for conceptual design. It
organizes data into structures called entities. Each entity is something that the applica-
tion must hold data about. The particular data held is represented as attributes of an
entity. Connections between entities are represented by relationships (hence the
name ‘entity-relationship diagram’).
Logical database design begins to deal with how the application will meet the
users’ requirements. It takes the data requirements identified in the conceptual design
and represents them using the data structures available in a particular data model. Set-
tling on a particular data model means reconsidering the way data is organized.
Logical database design: Shows how to use a particular data model to meet the
end users’ data requirements.
Separating conceptual and logical database design means the end users don’t need
to understand the particular data model used. As even the relational data model can
be hard to grasp (foreign keys are a particular sticking point), this is a good thing.
Chapter 9 discusses logical database design for the relational data model and Chapter
12 examines logical database design for a database using XML. In both cases the con-
ceptual ER diagram can be adapted to represent only data structures appropriate to
the chosen data model. Another design technique, called normalization, helps the
database designer to organize the data into structures that minimize redundant data.
Problems caused by redundant data are one motivation for using databases in the
first place (see Chapter 1), so minimizing redundant data within a database is a good
idea.
Physical database design considers how the requirements identified in the logical
database design can be met using a particular DBMS. Issues in physical database design
include what data types are actually available, how business rules can be implemented
and what data storage structures are available. This makes physical database design a
much more diverse process than either conceptual or logical database design.
Physical database design: Shows how to implement the logical database design
using particular software and hardware technologies.
Although each DBMS organizes its data structures according to a particular data
model, there are often differences in how they do this. Non-standard extensions to
SQL are common and a DBMS usually offers control over the data storage structures
or provides other facilities not defined in the data model. For example, the Oracle
database allows the DBA to specify which disk a table should be stored on or even to
split a single table across more than one disk. The MySQL DBMS offers a choice of dif-
ferent storage structures (called storage engines), each with its own benefits and
drawbacks. There are also some general techniques that allow the DBA to optimize
the performance of most databases. One is denormalization, which reintroduces re-
dundant data to the database in a controlled way. The idea is that, while redundant
data causes problems, it is sometimes better to allow some redundancy to improve
performance. The trade-off is that the situation must be carefully managed to avoid
the problems such redundant data can bring. Chapter 10 discusses physical database
design.
All three stages of database design lie within the ‘design solution’ phase of the infor-
mation systems lifecycle discussed in Chapter 1. There is no intent at this stage to ac-
tually build anything – no code is written, no application software deployed. The tools
and techniques discussed in this and the three following chapters can all be applied
using pencil and paper, though software is often used to support the designer. This
may be a simple drawing package or a fully functional, UML-compliant software engi-
neering tool. However, one of the best approaches for the novice is to take a blank
sheet of paper, a pencil and an eraser and sketch out solutions to some sample prob-
lems. When the basics of the actual techniques have been grasped, it is easier to tackle
the occasionally baroque complexity of the software engineering tools.
In this book, ER diagrams are drawn using the UML notation. This notation was itself
influenced by earlier notations for ER diagrams (particularly the IDEF1X notation) and
incorporates some of the best features of these earlier notations. This makes it a good
choice for anyone learning to use ER diagrams for the first time.
For those familiar with UML, an ER diagram uses the class diagram notation, but with-
out the strictly object-orientated features such as attribute and operation visibility. It
also adds some features, such as candidate keys, which are not always used in
object-orientated approaches to database design.
The ER diagram is the main technique used in conceptual database design. The Staff
Directory web database application makes a good introductory case study for ER dia-
grams. Although it does not include all the features that can be represented in an ER
diagram, it has the main ones. This version of the Staff Directory database can also
support the requirements of the School of Computing’s Staff List application, dis-
cussed in Chapters 1 and 2.
In an ER diagram, an entity represents a set of things of interest in the scenario
being modelled. A particular one of these things is called an occurrence of the entity.
To qualify as an entity, there must be more than one occurrence of it, at least in prin-
ciple. So, in the Staff Directory, there could be an entity to represent members of staff
as there are lots of these, but there will not be an entity to represent the directory itself.
There is only one staff directory, so it is not represented as an entity.
All occurrences of an entity share the same properties. These shared properties are
represented as attributes of the entity. From the database perspective, an entity repre-
sents something that the database holds data about, while the attributes represent the
actual data held. A database also holds information about connections between different
entities. In an ER diagram, these are represented as relationships. In some cases, a
relationship will have attributes of its own.
These three simple abstractions – entity, attribute and relationship – allow the ER
diagram to model the data requirements of any web database application.
Entity: A set of things about which the information system must hold, or process,
data. Each entity has a name that is unique within the ER diagram.
In the Staff Directory database, there will be an entity to represent members of staff as
the set of members of staff is clearly of interest. The simplest representation of an
entity in an ER diagram is as a rectangle enclosing the entity name and the stereotype
«persistent», as shown in Figure 8.1. Note that the curved arrows are not part of the ER
diagram – they simply point to the interesting features.
Stereotypes distinguish between the different kinds of application component. The
stereotype «persistent» indicates that the entity’s data persists in the database even
after the computer is switched off. The entity name can be a noun, as in Figure 8.1, or
a noun phrase. Staff is a good name for an entity representing members of staff.
SupportSession is a suitable name for the entity representing support sessions offered
by members of staff. An entity name can include letters, digits and underscores and is
written with no spaces and each word capitalized, so we have SupportSession rather
than support_session or Support session. It is usually singular, too, so SupportSession
rather than SupportSessions.
Each relationship also has a name. It has the same format as an entity name, but is
usually a verb or verb phrase. For example, there is a meaningful connection between
the entities Staff and SupportSession – each support session is offered by a particular
member of staff. This is represented as the relationship Offer, drawn as a straight line
between the two entities (Figure 8.2).
The relationship name is written close to the line (above, below or beside the line
are all fine) and includes a name direction indicator, which is the arrowhead in Fig-
ure 8.2. The name direction only indicates how to read the relationship name. In this
case, the relationship name is to be read as ‘Staff offer support sessions’ rather than ‘Sup-
port sessions offer staff’. The relationship itself can be traversed in either direction –
from a particular occurrence of Staff to an occurrence of SupportSession that he or she
will Offer or from a particular occurrence of SupportSession to the occurrence of Staff
who will Offer the SupportSession.
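The idea that a relationship can be traversed in either direction can be illustrated with a few lines of PHP. This is only a sketch with invented sample data: the arrays $staff, $supportSessions and $offer, the staff numbers and session details, and the two function names are all assumptions, and a conceptual design would not normally be expressed as code at all.

```php
<?php
// Occurrences of the Staff and SupportSession entities (hypothetical data).
$staff = array(101 => 'Paul Smith', 102 => 'Helene Jones');
$supportSessions = array(
    1 => 'Monday 09:00-10:00',
    2 => 'Thursday 14:00-15:00',
    3 => 'Friday 11:00-12:00',
);

// Occurrences of the Offer relationship: each pairs a staff number
// with a support session number.
$offer = array(array(101, 1), array(101, 2), array(102, 3));

// Traverse Offer from an occurrence of Staff to the sessions he or she offers.
function sessionsOfferedBy($offer, $staffNo) {
    $result = array();
    foreach ($offer as $pair) {
        if ($pair[0] === $staffNo) {
            $result[] = $pair[1];
        }
    }
    return $result;
}

// Traverse Offer from an occurrence of SupportSession to the member of staff.
function staffOffering($offer, $sessionNo) {
    foreach ($offer as $pair) {
        if ($pair[1] === $sessionNo) {
            return $pair[0];
        }
    }
    return NULL;                              // no one offers this session
}

foreach (sessionsOfferedBy($offer, 101) as $sessionNo) {
    echo $staff[101] . ' offers ' . $supportSessions[$sessionNo] . "\n";
}
echo $supportSessions[3] . ' is offered by ' . $staff[staffOffering($offer, 3)] . "\n";
```

Both functions read the same list of Offer occurrences. The name direction indicator affects only how the relationship name is read, not which way it can be followed.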
Relationships are best drawn as horizontal or vertical lines or else lines that turn
90-degree corners (see examples in the figures later in this chapter). It is a good idea
to avoid representing relationships with diagonal lines, reserving these for other uses
within the ER diagram.
An entity is something that the database holds data about, and that data is represented
as the entity’s attributes. This means that an entity has at least one attribute, which
can also be included in the ER diagram. The attributes of the Staff entity represent such
things as the staff number, parts of each person’s name, e-mail address and so on. The
SupportSession entity has attributes to represent the day a support session runs and its
start and end times. It does not have an attribute to represent which member of staff
offers a particular support session – that is, there is no ‘foreign key’. This job is done by
the Offer relationship. Some database designers do include foreign keys in the concep-
tual database design, but the information is redundant as the relationships indicate
meaningful connections between entities, so foreign keys are not required.
The attribute name is a noun or noun phrase and can include letters, digits and un-
derscores, but not spaces. The first letter is not capitalized, but second and subsequent
words are, so we have startTime rather than StartTime or start_time. This helps to distin-
guish attribute names from entity names.
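The naming conventions for entities and attributes are simple enough to check mechanically. The sketch below is one possible check, not part of the UML notation: the function names isEntityName() and isAttributeName() are assumptions, and the regular expressions test only the permitted characters and the case of the first letter, so a name such as start_time passes the attribute test even though startTime is the preferred style.

```php
<?php
// Entity (and relationship) names: letters, digits and underscores only,
// no spaces, starting with a capital letter.
function isEntityName($name) {
    return preg_match('/^[A-Z][A-Za-z0-9_]*$/', $name) === 1;
}

// Attribute names: the same characters, but starting with a lower-case letter.
function isAttributeName($name) {
    return preg_match('/^[a-z][A-Za-z0-9_]*$/', $name) === 1;
}

foreach (array('SupportSession', 'Support session', 'support_session') as $name) {
    echo $name . ': ' . (isEntityName($name) ? 'acceptable' : 'not acceptable')
        . " as an entity name\n";
}
foreach (array('startTime', 'StartTime') as $name) {
    echo $name . ': ' . (isAttributeName($name) ? 'acceptable' : 'not acceptable')
        . " as an attribute name\n";
}
```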
Attributes of an entity are listed in the attribute compartment. This compartment
can be left off an ER diagram (as in Figures 8.1 and 8.2), but, when it is included it sits
directly below the entity’s name compartment (which holds the name and the stereotype
«persistent»). Both compartments are shown in Figure 8.3, which shows a simple
ER diagram with two entities, each having attributes, and a relationship between the
entities. Some relationships can have attributes of their own. This is not shown in
Figure 8.3, though is discussed below.
Figure 8.3 An ER diagram showing two entities with their attributes and a
relationship between them.
Within an entity (or relationship) each attribute must have a unique name. How-
ever, two attributes of different entities (or relationships) can have the same name.
The database designer must decide whether each thing mentioned in the users’ re-
quirements is an entity or an attribute. Consider the e-mail address of a member of
staff. Each e-mail address has two parts – the username and mail server address. In the
e-mail address ‘p.smith@pennine.ac.uk’, ‘p.smith’ is the username and ‘pennine.ac.uk’
is the mail server address. As it has identifiable properties, the e-mail address could be
modelled as an entity. However, an e-mail address has no interesting properties (at least,
not in the scenario being modelled). End users need both parts of the e-mail address to
use it – they have no use for the username or the mail server address alone. This means
that the e-mail address is not something the database holds data about. Rather, it is
data about members of staff, so is modelled as an attribute of the Staff entity.
Using these three simple abstractions – entity, attribute and relationship – it is pos-
sible to produce a conceptual database design that organizes the required data effec-
tively and efficiently. The diagrams themselves are simple enough to explain, so can
form the basis for discussions between the database designer and end users. This
allows end users to verify that their requirements are actually being met by the pro-
posed database design. However, simple ER diagrams such as the one in Figure 8.3
provide only basic information about the database. The notation for ER diagrams is
much richer than that. An ER diagram can also include information on an attribute’s
data type, business rules and database constraints, and different kinds of relationship.
This richer notation is explored in the following sections. Table 8.1 summarizes the
definitions of the three basic structures and their characteristics.
ER diagram data structure | Definition | Characteristics
Entity | A set of things about which the information system must hold, or process, data. | Name is unique within the ER diagram. Must have at least one attribute. The name compartment is always shown, the attribute compartment may be hidden.
Relationship | A meaningful connection between entities. | Name is unique within the ER diagram. May have attributes, but most do not.
Attribute | A property of an entity or relationship. | Name is unique within the entity or relationship.
The user will have accessed the full contact details for an academic member of staff
and will request a list of that person’s support sessions. The day, start time and end
time of all the support sessions are displayed. If there are no support sessions for the
chosen academic, a message stating this is displayed.
To identify possible entities and attributes, simply scan through this use case and highlight
any nouns or noun phrases. Next, consider whether each is an entity or,
if not, an attribute of an entity. If the noun indicates an entity, then check that it is not
already covered by an identified entity – for example, an ‘academic’ is also a ‘member
of staff’. Table 8.2 lists every noun and noun phrase in the use case specification, together
with a comment on whether it is an entity or an attribute. Where possible, an attribute
is assigned to one of the entities. Lists like these aren’t usually kept, but can be useful for
documenting decisions. Here are some important points to note from Table 8.2.
● The application itself is not an entity, nor would ‘staff directory’, ‘database’ or other
such nouns indicate entities. There will only ever be one occurrence of each of these
things (within the scenario being modelled) so they cannot be entities.
● ‘Academic’ is not modelled as a separate entity as an academic is simply a particular
kind of member of staff. It is possible to model different kinds of thing, using a gener-
alization relationship. If this approach were taken, then there would be an Academic
entity. This is discussed below.
● ‘User’ is not an entity. In this use case, there is no need to store information about
users as the Staff Directory application is open access.
● ‘Contact details’ is regarded as a synonym for ‘member of staff’. These appear to rep-
resent different things as a person is not usually equated with his or her phone num-
ber and e-mail address. However, in the context of the Staff Directory, a ‘member of
staff’ is the same as his or her ‘contact details’.
Table 8.2 Nouns from the ‘Get support sessions’ use case specification.
Once the entities and attributes have been listed, suitable names can be chosen. The
names should be short and meaningful. They need not be any of the nouns identified
from the use case specification. None of the synonyms ‘member of staff’, ‘academic’
and ‘contact details’ is as clear and meaningful as ‘staff’, so the entity that represents
the contact details of members of staff (including academic staff) is called Staff. In con-
trast, ‘support session’ is perfectly clear, so the entity that represents support sessions is
called SupportSession.
Entity and attribute names are often abbreviated as many application developers
prefer short names. Many database designers would abbreviate SupportSession to
SupSes. There is no technical reason for doing this as the days when identifiers could
be only eight characters long are slowly passing into IT legend. There may also be a
good business reason for keeping a long name as the abbreviation could be unclear.
For example, does Sup stand for ‘support’ or ‘supplementary’ or ‘supply’? In contrast,
the No of ‘staffNo’ is a recognized abbreviation for ‘number’. The attribute names
fName and lName are a bit dubious – why not just use forename and surname? The ab-
breviations were used in this book for a very pragmatic reason – they kept the width of
the table images small enough to fit on to a page. In a real system, the longer names
would be preferable. The best guidance is to use abbreviations that improve the clarity
of the design and avoid those that don’t.
Relationships are identified by a second trawl through the use case specifications.
This time, the aim is to identify verb phrases that indicate some meaningful connections
between entities. There will often be several such phrases indicating each relationship.
Occasionally, there may be two different relationships between the same two entities and
these will both need to be identified and included in the ER diagram. Table 8.3 lists the
verb phrases found in the Get support session use case specification. Again, the name of
each relationship should be short and meaningful. For example, Offer is shorter than
OfferedBy, so the relationship name in Figure 8.3 is read Staff Offer SupportSession
rather than SupportSession OfferedBy Staff. Remember that this does not alter the meaning
of the ER diagram as, however the name is read, the relationship still works in both
directions.
Table 8.3 Verb phrases from the ‘Get support session’ use case specification.
ER diagrams provide a concise overview of the data requirements of a particular ap-
plication. Entities represent things that the application holds data about, attributes the
data held and relationships the connections between entities. This overview can lack
detail. For example, the simple ER diagram in Figure 8.3 does not indicate whether
support sessions are offered by individual staff or groups of staff.
One way to provide more information about a scenario is to use an object diagram.
This shows particular occurrences of entities and the relationship occurrences that link
them. For an entity occurrence, each attribute has a particular value. Figure 8.4 shows
an object diagram for the ER diagram in Figure 8.3. In this example, one occurrence of
each of the entities and relationships is included in the object diagram. Entity occur-
rences are distinguished from entities by underlining the entity name and preceding it
with a colon. The occurrence of the Staff entity has been given an occurrence name,
while the occurrence of the SupportSession entity is anonymous. Naming entity
occurrences is not necessary, but may be useful in some situations. Each attribute is
shown with a particular value. Where an entity occurrence has no value for an
attribute, this is shown as an empty literal – ' '.
Object diagrams are a useful tool for analysing a scenario more closely. By showing
sample data, they can illustrate which attributes are required and which are optional.
They can also provide examples of how many relationship occurrences a particular en-
tity occurrence participates in. For example, is it possible for a particular member of
staff to offer no, one or more than one support session? Once these issues have been
investigated, they can be noted on the ER diagram using the more advanced notation
discussed in the following sections.
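The idea of an object diagram can be mirrored in code by creating sample occurrences. The sketch below is illustrative only: the attribute names staffNo, fName and lName come from the text, while the email attribute, the offers list used to hold Offer relationship occurrences, and every data value are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SupportSession:
    day: str
    startTime: str
    endTime: str


@dataclass
class Staff:
    staffNo: str
    fName: str
    lName: str
    # Assumed attribute; the empty literal '' stands for "no value".
    email: str = ""
    # Occurrences of the Offer relationship that involve this member of staff.
    offers: List[SupportSession] = field(default_factory=list)


# A named occurrence (paulSmith:Staff) linked by one occurrence of Offer
# to an anonymous occurrence (:SupportSession). All values are invented.
paulSmith = Staff(staffNo="31210", fName="Paul", lName="Smith")
paulSmith.offers.append(
    SupportSession(day="Monday", startTime="09:00", endTime="10:00")
)
```

Building a few occurrences like this is a quick way to ask the questions raised above, such as whether a member of staff with an empty offers list should be allowed.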
8.3 Attributes
[Figure: the Student entity drawn twice. On the left, multiplicities of [1..1] are omitted;
on the right, every multiplicity is shown.]
«persistent» Student
  studentNo [1..1]
  fName [1..*]
  lName [1..1]
  email [1..2]
  phone [0..1]

[Figure 8.6: the attribute data types for the Student entity.]
«persistent» Student
  studentNo : Char(10)
  fName [1..*] : Varchar
  lName : Varchar
  email [1..2] : Varchar
  phone [0..1] : Varchar
Different information systems will need different data types. For example, a geograph-
ical information system will need data types for points, lines, polygons and so on,
while an accounting information system probably won’t. The data types used should
be documented by the database designer, either in a UML package or listed in a table.
In this book, the scalar data types discussed in Chapter 3 are the ones used during con-
ceptual database design.
The most common data type is the character string. Character strings can be of a
fixed or varying length – abbreviated to Char and Varchar respectively. In database
design, it is common to state a maximum length for the data values of character string
attributes. For example, on the Student entity, the attribute studentNo is a fixed-length
character string, which always has ten characters. This is written studentNo : Char(10).
It is not necessary to include a length, so, for example, the lName attribute is a varying
length character string with no maximum length, written lName : Varchar. Figure 8.6
shows all the attribute data types for the Student entity. For numeric attributes, the
scale and precision can be specified.
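As a rough illustration of what the multiplicities and the Char(10) data type on the Student entity mean for sample data, the following sketch checks one entity occurrence against them. The dictionary layout (each attribute mapped to a list of values) and the function name are assumptions made for this example, not notation from the UML.

```python
from typing import Dict, List, Optional, Tuple

# (lower, upper) bounds for each attribute; an upper bound of None means '*'.
STUDENT_MULTIPLICITY: Dict[str, Tuple[int, Optional[int]]] = {
    "studentNo": (1, 1),
    "fName": (1, None),
    "lName": (1, 1),
    "email": (1, 2),
    "phone": (0, 1),
}


def multiplicity_errors(occurrence: Dict[str, List[str]]) -> List[str]:
    """Return the names of attributes whose values break a bound."""
    errors = []
    for name, (lower, upper) in STUDENT_MULTIPLICITY.items():
        count = len(occurrence.get(name, []))
        if count < lower or (upper is not None and count > upper):
            errors.append(name)
    # studentNo : Char(10) is fixed length, so it must have ten characters.
    for value in occurrence.get("studentNo", []):
        if len(value) != 10:
            errors.append("studentNo length")
    return errors


# Invented sample occurrence: two forenames, one e-mail address, no phone.
sample = {
    "studentNo": ["0123456789"],
    "fName": ["Anna", "Maria"],
    "lName": ["Jones"],
    "email": ["a.jones@example.org"],
}
```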
Attributes may have a default value (sometimes called an initial value). The default
value for an attribute is the value used in an entity occurrence when end users don’t
specify a value for the attribute. The duration attribute of the TeachingSession entity
has a data type of interval and a default value of ‘1h’ as, at the Pennine University,
most teaching sessions last for one hour. In the ER diagram, the attribute will be
written duration : Interval=1h. This default value is only used when end users don’t
supply a value for a particular occurrence of TeachingSession. So, in an occurrence of
TeachingSession that represents a two-hour tutorial, duration will have the value 2h,
even though the default is 1h.
Few DBMS will accept 1h as a valid interval literal, but it is fine during conceptual
design. It is easy to explain to the end users and easy to convert to a valid interval
literal during physical design.
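A minimal sketch of both ideas, a default value and the later conversion of the conceptual literal 1h into a real interval value, might look like this. The helper parse_interval and the choice of Python's timedelta as the interval type are assumptions for illustration; an actual DBMS will have its own interval literal syntax.

```python
import re
from dataclasses import dataclass
from datetime import timedelta


def parse_interval(literal: str) -> timedelta:
    """Convert a conceptual-design literal such as '1h' into a timedelta."""
    match = re.fullmatch(r"(\d+)h", literal)
    if match is None:
        raise ValueError(f"unsupported interval literal: {literal!r}")
    return timedelta(hours=int(match.group(1)))


@dataclass
class TeachingSession:
    day: str
    # duration : Interval=1h, used only when no value is supplied.
    duration: timedelta = parse_interval("1h")


lecture = TeachingSession(day="Monday")                          # default applies
tutorial = TeachingSession(day="Tuesday", duration=parse_interval("2h"))
```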
It is common to have attributes with fairly small sets of allowed values. The day
attribute of the TeachingSession entity has only seven possible values. To model this, in-
clude an attribute constraint. Adding constraints to an attribute is one use of the prop-
erty list. The property list for an attribute is a list of constraints, tagged values (discussed
below) and comments that apply to the attribute. It appears in braces – {. . .} – after the
default value. The attribute constraint can also be written in a note and attached to the
attribute by a dashed line. Figure 8.7 shows attributes with default values and attribute
constraints. The constraint on the day attribute is documented in a note. The duration
attribute has both a default value and a constraint, though in this case the constraint is
documented in the attribute’s property list. Note that this constraint uses mathematical
notation – 1h <= duration <= 3h. This expresses the business rule that every teach-
ing session lasts at least one hour, but may not last more than three hours. The
TeachingSession entity is unusual because most of its attributes have constraints. The
Student entity in Figure 8.6 is more typical as none of its attributes has a constraint.
The dashed line joining a note to the part of the ER diagram it relates to can be drawn
at a diagonal. This helps to distinguish it from the relationships in the diagram, which
are not drawn as diagonal lines.
Some entities have derived attributes. For each entity occurrence, the value of its
derived attributes can be derived (or calculated) from the values of the other attributes.
In an ER diagram, a derived attribute has a forward slash – / – before its name.
The TeachingSession entity has two derived attributes – /endTime and /runsFor (Fig-
ure 8.7). For an occurrence of TeachingSession, the value of the derived attribute
/endTime will be the result of adding the value of duration to the value of startTime. The
way to calculate a derived attribute is noted as a comment in the attribute’s property
list or in a note attached to the attribute.
[Figure 8.7: the TeachingSession entity with default values and attribute constraints.]
«persistent» TeachingSession
  day : Varchar
  startTime : Time {09:00 <= startTime <= 19:30}
  duration : Interval=1h {1h <= duration <= 3h}
  /endTime : Time {endTime=startTime+duration, endTime<=21:00}
  startDate : Date
  endDate [0..1] : Date {endDate<=startDate+12weeks}
  /runsFor [0..1] : Interval {runsFor=endDate-startDate}
  type : Varchar
  sessionCode [0..1] : Char {Null iff type='Lecture'}
Note attached to day: {Value can be any of Monday, Tuesday, Wednesday, Thursday, Friday,
  Saturday, Sunday}
Note attached to type: {Value one of 'Practical', 'Tutorial', 'Workshop', 'Lecture',
  'Seminar'}
For /endTime, the property list includes two pieces of information. The first –
endTime=startTime+duration – indicates how to derive the value of this attribute from
other attribute values. The second – endTime<=21:00 – places a constraint on the pos-
sible values of the derived attribute. Note that a constraint on a derived attribute is
really a constraint on the attributes its value is derived from. In this case, the con-
straint endTime<=21:00 on /endTime really says the value of startTime+duration must not
be greater than 21:00. It is not possible to write separate constraints on the startTime
and duration attributes that capture this.
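The relationship between stored attributes, a derived attribute and their constraints can be sketched in code. This is only an illustration under stated assumptions: the class layout, the violations method and the use of a Python property for /endTime are choices made for this example, and only the day, startTime, duration and /endTime attributes are included.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from typing import List

DAYS = {"Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"}


@dataclass
class TeachingSession:
    day: str
    startTime: time
    duration: timedelta = timedelta(hours=1)

    @property
    def endTime(self) -> time:
        # Derived attribute: endTime = startTime + duration.
        # The date used is arbitrary; only the time of day is kept.
        start = datetime.combine(date(2000, 1, 1), self.startTime)
        return (start + self.duration).time()

    def violations(self) -> List[str]:
        """Names of attributes whose constraint is broken."""
        found = []
        if self.day not in DAYS:
            found.append("day")
        if not time(9, 0) <= self.startTime <= time(19, 30):
            found.append("startTime")
        if not timedelta(hours=1) <= self.duration <= timedelta(hours=3):
            found.append("duration")
        # A constraint on the derived attribute, which is really a joint
        # constraint on startTime and duration.
        if self.endTime > time(21, 0):
            found.append("endTime")
        return found
```

A session starting at 19:00 and lasting three hours satisfies the separate constraints on startTime and duration, yet still breaks the constraint on /endTime, which is the point made above about the constraint not being expressible on either attribute alone.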
The multiplicity of a derived attribute depends on that of the attributes it is derived
from. If all the attributes involved always have a value (that is, they have multiplicity
[1..n] for some integer n), then the derived attribute will always have a value, too. The
particular derivation will determine the upper bound of the multiplicity. So, for exam-
ple, endTime has a multiplicity [1..1] as its value can always be derived and a teaching
session can have at most one end time. In contrast, the value of runsFor depends on
the value of endDate. This has a multiplicity of [0..1], so on some entity occurrences
there is no value for endDate. On these entity occurrences there cannot be any value
for runsFor either, so its multiplicity is [0..1].
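The way an optional attribute makes a derived attribute optional can be shown in a few lines. The function below is a sketch; its name and its use of None to stand for "no value" are assumptions for the example.

```python
from datetime import date, timedelta
from typing import Optional


def runs_for(start_date: date, end_date: Optional[date]) -> Optional[timedelta]:
    """Derive runsFor = endDate - startDate, or no value if endDate has none."""
    if end_date is None:
        # endDate has multiplicity [0..1], so runsFor must allow "no value" too.
        return None
    return end_date - start_date
```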
An alternative to using an attribute constraint to restrict what values an attribute
can hold is to declare a user-defined type. This is commonly done when many differ-
ent attributes will share the same constraint. Figure 8.8 (a) uses a UML class diagram to
declare the user-defined type SessionType. This user-defined type is the set of all the
different kinds of teaching session at the Pennine University.
The stereotype «enumeration» indicates that this UML class diagram models a user-
defined type, not a database entity. The name compartment contains the stereotype
and the user-defined type’s name.
The lower rectangle is not a list of attributes. Instead, it lists all of the literals that
represent allowed data values for the data type. Figure 8.8 (b) shows how to use this as
the data type for the attribute type in the TeachingSession entity.
User-defined types are a more advanced feature of the UML, and are not often used
in relational database design. However, when several attributes share a common set of
allowed values, and the list is long, then it’s worth considering using this feature.
Some DBMS already allow the DBA to define enumeration data types of this sort (the
MySQL DBMS is one) and ISO SQL:2003 includes support for user-defined types.
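Many programming languages offer a comparable enumeration construct. As a sketch, the SessionType user-defined type could be mirrored in Python as follows; the member names and the helper parse_session_type are assumptions, while the five literals are those listed for the type attribute.

```python
from enum import Enum


class SessionType(Enum):
    PRACTICAL = "Practical"
    TUTORIAL = "Tutorial"
    WORKSHOP = "Workshop"
    LECTURE = "Lecture"
    SEMINAR = "Seminar"


def parse_session_type(value: str) -> SessionType:
    """Look up a literal; a value outside the list raises ValueError."""
    return SessionType(value)
```

Any attribute declared with this type then shares the same set of allowed values without repeating the constraint.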
Once all the attributes have been identified, it is useful to have a list of them. This
forms part of the application glossary, which lists all the entities, attributes and
relationships.
A key purpose of an application glossary is to provide a central repository of infor-
mation about the design. It includes a description of each item and any synonyms
(different names for the same thing). A good modelling tool will be able to generate an
application glossary from the ER diagrams, but, if necessary, a separate document can
be maintained. In this case, it is important to be clear whether it is the application
glossary that is definitive or the ER diagrams as it is likely that they will get out of step
at some point in the development cycle.
It is common to keep an application glossary in table format. Table 8.4 shows part of
the application glossary for the Web Timetable application. It lists the attributes of the
TeachingSession entity and includes all the information about each attribute shown in
Figure 8.7.
8.4 Entities
An entity is a set of things of interest to the organization that share the same proper-
ties. To the database designer, they are those things the database must hold informa-
tion about. Each occurrence of an entity is distinct from all other occurrences. Even
two entity occurrences with the same attribute values are distinct from one another.
When an entity can have two distinct occurrences with exactly the same attribute values,
it is called a weak entity. In contrast, every occurrence of a strong entity differs from
all other occurrences in at least one attribute value. Figure 8.9 uses an object diagram
to show that, in the Staff Directory database, SupportSession is a weak entity. Two
different support sessions, offered by different members of staff, occur on the same day and at the same
time, so have exactly the same attribute values. The Staff entity, though, is a strong
entity – for example, the value of staffNo is different for each entity occurrence (all
members of staff have a different staff number). This leads to the following definition
of strong and weak entities.
Strong entity An entity that can be uniquely identified by its attribute values.
Weak entity An entity that cannot be uniquely identified by its attribute values.
Table 8.4 Extract from the application glossary showing information on attributes.

[...] begins.

duration
  Description: The length of the session.
  Synonym: length
  Multiplicity: 1..1
  Data type: Interval
  Default: 1 hour
  Constraint: Between 1 and 3 hours.

/endTime
  Description: The time at which the teaching session ends.
  Synonym: end
  Multiplicity: 1..1
  Data type: Time
  Constraint: No later than 21.00.
  Notes: Derived as startTime + duration.

startDate
  Description: First day the session runs.
  Multiplicity: 1..1
  Data type: Date

endDate
  Description: Date of final session.
  Multiplicity: 0..1
  Data type: Date
  Constraint: No more than 12 weeks after startDate.

/runsFor
  Description: Number of weeks the session runs for.
  Multiplicity: 0..1
  Data type: Interval
  Notes: Derived as difference between startDate and endDate.

type
  Description: What type of session this is.
  Multiplicity: 1..1
  Data type: Varchar
  Constraint: One of practical, tutorial, workshop, lecture or seminar.

sessionCode
  Description: Used to distinguish between different sessions for the same module.
  Synonym: tutor group
  Multiplicity: 0..1
  Data type: Char
  Constraint: Null whenever type is lecture, not null for other types.
Figure 8.9 Strong and weak entities in the Staff Directory database.
Drawing an analogy between the relational data model and ER diagrams, an entity corre-
sponds to a table – both hold data about some ‘thing’ that the organization is interested
in. Attributes correspond to columns and each occurrence of an entity corresponds to a
row in the table – both hold the actual data for a particular instance of the ‘thing’ in
question. In the relational data model, a set of columns the values of which are unique to
a particular row of the table is called a candidate key. The candidate key acts as a unique
identifier for a row. So, a set of attributes that uniquely identifies an entity occurrence
can also be thought of as a candidate key. It follows from the definition that every strong
entity has at least one candidate key and that a weak entity has no candidate keys.
This illustrates an important difference between the relational data model and ER
diagrams. In the relational data model, every table has a primary key, as there can be no
duplicate rows. In an ER diagram some entities may not have a primary key as there
can be duplicate entity occurrences.
This analogy forms the basis for the process of deriving a relational database design
from an ER diagram.
A conceptual ER diagram does not include foreign keys because the relationships
indicate connections between entities.
Figure 8.10 Using tagged values in the property list to indicate candidate
keys.
«persistent»
Module
code : Char(7) {PK}
title : Varchar {CK=1}
department : Varchar {CK=1}
level : Varchar
In the Web Timetable database, Module is a strong entity as each occurrence can be
uniquely identified by the value of code, so {code} is a candidate key for the Module en-
tity. Entity occurrences can also be uniquely identified by combinations of values of the
title and department attributes as, within a department, there will be only one module
with a particular title. So, {department, title} is also a candidate key for the Module entity.
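One way to probe a proposed candidate key is to test it against sample entity occurrences. The sketch below is an assumption-laden illustration: the function name and the sample Module data are invented. Note that sample data can only refute a candidate key, by exposing a duplicate; it cannot prove that the business rule always holds.

```python
from typing import Dict, Iterable, Sequence


def is_unique_in_sample(rows: Sequence[Dict[str, str]],
                        attributes: Iterable[str]) -> bool:
    """True if no two sample occurrences share values for all the attributes."""
    attrs = list(attributes)
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True


# Invented occurrences of Module: two departments use the same title.
modules = [
    {"code": "CIS1001", "title": "Databases", "department": "Computing"},
    {"code": "BUS1004", "title": "Databases", "department": "Business"},
    {"code": "CIS2003", "title": "Networks", "department": "Computing"},
]
```

Here {code} and {department, title} both pass the check, while {title} on its own fails because of the repeated title.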
It can be useful to indicate candidate keys on the conceptual ER diagram and distin-
guish the primary key. The way to do this is to include a tagged value in the attribute
property list. Tagged values are written as a name-value pair – name=value. Each candi-
date key may have several attributes in it and there may be several candidate keys. For
each candidate key, place a tagged value – CK=n – in the property list of each attribute
that is part of the candidate key. The value of n indicates which candidate key the at-
tribute is part of.
In Figure 8.10, the candidate key {department, title} is indicated by including the
tagged value CK=1 in the property list for the attributes department and title. The tagged
value PK is included in the property list of the attribute code to show that this is part of
(in fact, the whole of) the primary key. The tagged value for the primary key only has a
name – there is only one primary key, so there is no need to use the value. Where a pri-
mary key has two or more attributes, each has the tagged value PK in its property list.
Tagged values are a standard part of the UML but the particular tagged values CK=n
and PK are not part of the UML standard. They are introduced specifically to use in
ER diagrams.
There is an alternative notion of strong and weak entities. This approach defines a
weak entity as one the existence of which depends on another entity. For example, in
the Staff Directory database, the SupportSession entity depends on the Staff entity – an
occurrence of SupportSession cannot exist except when linked to an occurrence of
Staff. However, the Staff entity is not existence dependent on SupportSession – there
may be occurrences of Staff that are not linked to any occurrences of SupportSession.
Basing the definition of strong and weak entities on existence dependence is not
equivalent to the definition based on candidate keys. For example, in the Web Timetable
database, every module has a module leader (Figure 8.11). Thus, an occurrence of the
Module entity cannot exist unless it is associated with an occurrence of the Staff
entity. This suggests that Module is a weak entity, yet both the Module and Staff entities
have primary keys. It is possible, on the existence dependence definition, to argue that
Module is a weak entity. The module leader writes the module specification, delivers
the lectures and marks the assignments. Without the leader the module cannot run, so
Module is existence dependent on Staff. It is also possible to argue that Module is not a
weak entity. If a module leader resigns unexpectedly, the module does not cease to
exist simply because there is no module leader. Whether Module is, or is not, a weak
entity seems to depend on the way the Pennine University deals with the unexpected
loss of a module leader. If a new module leader is appointed immediately, then it is
arguable that Module is existence dependent on Staff, but otherwise it is not.
Figure 8.11 Every module has a module leader, but is Module existence dependent on Staff?
If the distinction between strong and weak entities is to be useful, then it should be
clear, so the definition in terms of candidate keys is preferred. This does not mean that
existence dependence is not an important concept. The UML has an alternative way of
modelling existence dependence, using special kinds of relationship, and these are dis-
cussed below.
The evolution of the concepts ‘weak entity’ and ‘existence dependence’ illustrates
how the discipline of computing tends to proceed. A concept is articulated in one
form, discussed by the community and refined. As it is refined, it may be split into
several simpler, though related, concepts. Eventually these may lead to new ap-
proaches to doing computing or be entirely superseded by new concepts. Those
studying, or working in, the discipline must develop strategies to cope with this
process of continual change. One is standardization. The ISO standard for SQL and
the OMG standard for the UML, for example, both seek to control the rate of change.
The existence of such standards ensures some continuity, so learning them is not a
waste of time.
Business rules are important in database design. Many can be modelled using attribute
constraints or tagged values. This is the best approach when a business rule constrains
the value of one attribute (possibly based on the values of others). The attribute con-
straints on TeachingSession (Figure 8.7) all implement business rules that constrain
the value of one attribute. Some business rules are more complex. Consider the follow-
ing business rule at the Pennine University:
When a teaching session runs on a Saturday or Sunday, it must not start before
10:00 and must finish by 16:00, but otherwise, a teaching session must start at
09:00 or later and must finish by 18:00.
This business rule involves a conditional constraint – that is, if the value of the day
attribute is a weekday, then apply one constraint to the value of the startTime and
duration attributes, but otherwise apply a different constraint. Conditional statements
like this cannot easily be implemented using database integrity constraints, but should
be noted in the conceptual ER diagram. A good way to record such complex con-
straints is in a note attached to the entity, rather than against a particular attribute.
Figure 8.12 shows the TeachingSession entity with three constraints written in notes –
two attached to attributes (as they were in Figure 8.7) and one to the entity itself.
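Written out as application logic, the conditional business rule quoted above might look like the sketch below. The function name and the choice to pass the derived end time as a parameter are assumptions made for the example.

```python
from datetime import time

WEEKEND = {"Saturday", "Sunday"}


def satisfies_time_rule(day: str, start: time, end: time) -> bool:
    """Apply the weekend or the weekday constraint, depending on the day."""
    if day in WEEKEND:
        # Must not start before 10:00 and must finish by 16:00.
        return start >= time(10, 0) and end <= time(16, 0)
    # Otherwise: must start at 09:00 or later and must finish by 18:00.
    return start >= time(9, 0) and end <= time(18, 0)
```

The branch on the value of day is what makes this a constraint on the entity occurrence as a whole rather than on any single attribute.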
Other business rules place constraints on what entity occurrences are valid, rather
than attribute values. Candidate and primary key constraints (Figure 8.10) are one
example of such a business rule as no two occurrences of the entity can have the same
attribute values for a candidate key. It would be possible to document candidate keys
using notes attached to the entity, but the resulting ER diagram would be rather
cluttered, so tagged values are used instead. Another example would be:
every module must have at least two hours of contact time each week.
This constraint involves both the Module and TeachingSession entities as it insists that
for any occurrence of the Module entity, there are linked occurrences of TeachingSession,
such that the sum of their duration attribute values is at least two hours. (This is a rather
naïve constraint – what if there are three practicals, lasting an hour each, but students
can only attend one?) To indicate that the constraint involves both entities, simply
attach the note to them both. If there are lots of these business rules, then, again, the ER
diagram can become cluttered. If this happens, it may be better to draw a separate ER
diagram to document such constraints and hide the attribute compartments of the
entities, as has been done in Figure 8.13.
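A constraint that spans two entities has to be checked over linked occurrences. The following sketch checks the contact time rule in its naive form, exactly as stated above; the function name, the dictionary layout and the module codes are assumptions for illustration.

```python
from datetime import timedelta
from typing import Dict, List


def modules_short_of_contact_time(
    sessions_by_module: Dict[str, List[timedelta]],
    minimum: timedelta = timedelta(hours=2),
) -> List[str]:
    """Module codes whose linked teaching sessions total less than the minimum."""
    return [
        code
        for code, durations in sessions_by_module.items()
        if sum(durations, timedelta()) < minimum
    ]


# Invented data: durations of the TeachingSession occurrences linked to
# each Module occurrence.
weekly_sessions = {
    "CIS1001": [timedelta(hours=1), timedelta(hours=1)],
    "CIS2003": [timedelta(hours=1)],
    "BUS1004": [],
}
```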
The UML was developed to help design object-orientated applications, so includes
an ability to record the behaviour associated with an entity. Traditional database design
techniques have not considered behaviour – what the application does with the
data – only how best to organize the data. With the advent of stored procedures, which
allow the DBMS to manage behaviour as well as data organization (see Chapter 11), it
can be useful to include some application behaviour in the ER diagram.

[Figure 8.13: the «persistent» entities TeachingSession and Module, drawn with their
attribute compartments hidden.]

Figure 8.14 An entity with an operation and the operation’s method note.
Association: Delivers, between TeachingSession and Module
«persistent» TeachingSession (name compartment)
  Attribute compartment:
    day : Varchar {Value any of Monday, ..., Sunday}
    startTime : Time {09:00 <= startTime <= 19:30}
    duration : Interval=1h {1h <= duration <= 3h}
    /endTime : Time {endTime=startTime+duration, endTime<=21:00}
    startDate : Date
    endDate [0..1] : Date {endDate<=startDate+12weeks}
    /runsFor [0..1] : Interval {runsFor=endDate-startDate}
    type : Varchar {Value one of 'Practical', 'Tutorial', 'Workshop', 'Lecture', 'Seminar'}
    sessionCode [0..1] : Char {Null iff type='Lecture'}
«persistent» Module (name compartment)
  Attribute compartment:
    code : Char(7) {PK}
    title : Varchar {CK=1}
    department : Varchar {CK=1}
    level : Varchar
  Operation compartment:
    onDelete()
  Operation's method note: {When a Module occurrence is deleted, also delete any
    associated TeachingSession occurrences.}
Behaviour is modelled using operations, with each operation belonging to an entity.
Operations are listed in the operations compartment, which, if it is shown, comes
immediately below the attribute compartment.
Another possible use of operations is to document referential actions. Referential
actions were discussed as part of the relational data model and again as part of the SQL
DDL create table statement (Chapters 3 and 5 respectively). They can be applied more
widely however, as whenever there is a relationship between two data structures it is
useful to know what to do when an occurrence of one of them is removed. Figure 8.14
shows the Module entity with an operation onDelete( ). Whenever an occurrence of the
Module entity is removed, it seems reasonable to insist that every associated occur-
rence of TeachingSession is also removed. Otherwise there will be teaching sessions
scheduled for a module that isn’t running.
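At the physical level, one common way to realize the onDelete( ) behaviour is a cascading referential action. The sketch below uses Python's built-in sqlite3 module purely for illustration; the column names id and moduleCode, and the sample data, are assumptions and not part of the Web Timetable design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Module (code TEXT PRIMARY KEY, title TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE TeachingSession ("
    " id INTEGER PRIMARY KEY,"
    " moduleCode TEXT NOT NULL REFERENCES Module(code) ON DELETE CASCADE,"
    " day TEXT NOT NULL)"
)

# Invented sample data: one module with two teaching sessions.
conn.execute("INSERT INTO Module VALUES ('CIS1001', 'Databases')")
conn.executemany(
    "INSERT INTO TeachingSession (moduleCode, day) VALUES (?, ?)",
    [("CIS1001", "Monday"), ("CIS1001", "Thursday")],
)

# Removing the Module occurrence also removes its TeachingSession occurrences.
conn.execute("DELETE FROM Module WHERE code = 'CIS1001'")
remaining = conn.execute("SELECT COUNT(*) FROM TeachingSession").fetchone()[0]
```

After the delete, no teaching sessions remain for the module that is no longer running.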
Operation names have the same format as attribute names, but are followed by
parentheses – ‘( )’. The parentheses can include a list of parameters, which are values
passed to the operation each time it is executed. There is no need to specify how the op-
eration will actually work, but it is useful to include a method note for each operation
that explains its purpose.
8.5 Associations
Figure 8.16 Binary and unary associations using the same role name.
[Diagram labels: binary association; role name; unary association Represent.]
One situation in which role names become useful is when an association connects an
entity with itself – a unary association. The association Represent in Figure 8.16 is an
example of a unary association – it connects the Student entity with itself.
Note that, like a binary association, a unary association has two ends. Each occur-
rence of a unary association links two occurrences of the entity (possibly, though not
necessarily, the same entity occurrence).
The ER diagram models how students make their views about courses known. Each
year, the students on each course elect a student representative. Student representa-
tives are themselves students, but they attend student panels to raise any concerns
expressed by the students they represent. The role name StudentRep at one end of the
unary association Represent emphasizes that the occurrences of Student at the differ-
ent ends play different roles – a student, acting in the role of student representative,
represents all the students on the course.
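A unary association can be pictured as a reference from an entity occurrence to another occurrence of the same entity. In the sketch below, the field studentRep plays the part of the role name; the class layout and all data values are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


# eq=False keeps identity-based equality: two occurrences are distinct
# even if their attribute values happen to match.
@dataclass(eq=False)
class Student:
    studentNo: str
    lName: str
    # Role name StudentRep: the occurrence of Student that represents this one.
    studentRep: Optional["Student"] = None


rep = Student("0000000001", "Khan")
other = Student("0000000002", "Lee")

# Each occurrence of Represent links two occurrences of Student,
# possibly the same one: the representative is also a student on the course.
rep.studentRep = rep
other.studentRep = rep

represented = [s for s in (rep, other) if s.studentRep is rep]
```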
Unlike entity names and association names, role names need not be unique. In fact,
it makes sense to reuse them as an entity may play the same role in more than one
association. Student representatives also attend student panels, so there is a binary
association, Attend, connecting Student with StudentPanel. An occurrence of Student that
participates in this association is also playing the role of StudentRep and this is
indicated by including the same role name at the Student end of the Attend association
(Figure 8.16). This is read, ‘Student, acting in the role StudentRep, Attend
StudentPanel’. When two different entities play similar roles, though, they should be
given different role names to avoid possible confusion.
An association can connect three or more entities – an n-ary association. This situa-
tion is less common than a unary association, but most organizations will have exam-
ples of associations between three entities.
An n-ary association is shown as a diamond with lines from the diamond to
the entities involved. The association name is written inside the diamond. With
n-ary associations, there is no name direction indicator. Figure 8.17 shows the
BookEquipment association. This association connects the three entities Staff,
Equipment and TeachingSession and models the fact that staff may book additional
equipment to use in a particular teaching session. Notice that BookEquipment has three
association ends – one for each entity. Each of these association ends could be given
a role name, though, in this example, there is no need.
[Figure 8.17: the n-ary association BookEquipment. The association name is enclosed in a
diamond shape, and the association has three association ends, one for each of the
«persistent» entities Staff, TeachingSession and Equipment.]
N-ary associations like that shown for BookEquipment model situations that cannot be
modelled using binary associations alone. Consider the ER diagram in Figure 8.18 (a).
This attempts to use binary associations to capture the same scenario modelled by Fig-
ure 8.17. The binary association Book is intended to indicate that a member of staff
books additional equipment, but it is not clear which teaching session this equipment is
needed for. The binary association For can show that a member of staff wants additional
equipment in a particular teaching session, but it cannot show which equipment is re-
quired. Similarly, UsedIn can show which equipment is required for a teaching session,
but not who booked it.
The problem is that there is no way to tie together the information captured by
these three separate binary associations. It is possible, for example, that an occurrence
of Book exists even though there are no occurrences of the other two associations.
These ambiguities cannot be resolved using binary associations alone.
They can be resolved, however, by introducing a new entity, EquipmentBooking,
and three binary associations (Figure 8.18 (b)). The new entity represents a particular
equipment booking. In business terms this could be a paper booking form, filled out
by staff who want to book additional equipment for their teaching sessions. The idea
is that each occurrence of EquipmentBooking is linked to exactly one occurrence of each
of the other three entities – that is, one member of staff books one piece of equipment
for one teaching session. This is exactly the same situation as Figure 8.17.
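The loss of information can be illustrated with a minimal Python sketch. The booking data, the session labels and the helper names `project` and `binary_view` are all invented for illustration. The sketch shows two different sets of three-way bookings that produce identical Book, For and UsedIn pairs, so the three binary associations alone cannot tell them apart, whereas records in the style of EquipmentBooking can.

```python
# Hypothetical bookings as (staff, equipment, teaching_session) triples.
# Each triple plays the part of one occurrence of EquipmentBooking.
bookings_a = {
    ("Hutchins", "projector", "Mon 9am"),
    ("Hutchins", "laptop", "Tue 2pm"),
    ("Stark", "projector", "Tue 2pm"),
    ("Stark", "laptop", "Mon 9am"),
}
bookings_b = {
    ("Hutchins", "projector", "Tue 2pm"),
    ("Hutchins", "laptop", "Mon 9am"),
    ("Stark", "projector", "Mon 9am"),
    ("Stark", "laptop", "Tue 2pm"),
}


def project(bookings, i, j):
    """Keep only positions i and j of each booking, as a binary association would."""
    return {(booking[i], booking[j]) for booking in bookings}


def binary_view(bookings):
    """The three binary associations of Figure 8.18 (a), derived from the bookings."""
    return {
        "Book": project(bookings, 0, 1),    # Staff - Equipment
        "For": project(bookings, 0, 2),     # Staff - TeachingSession
        "UsedIn": project(bookings, 1, 2),  # Equipment - TeachingSession
    }


print(binary_view(bookings_a) == binary_view(bookings_b))  # True
print(bookings_a == bookings_b)                            # False
```

The two sets of bookings describe different situations, yet every pair found in one binary view is also found in the other. Only the triples record who booked which item for which session.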
When developing a conceptual ER diagram, the database designer can choose
whether or not to use n-ary associations. If they are not used, then there will be enti-
ties like EquipmentBooking, the purpose of which is to link together other entities. A
special feature of link entities is that they might not have any attributes. This breaks
the rules, but is generally tolerated.
[Figure 8.18: (a) Incorrect use of binary associations to represent a connection between
three entities: the «persistent» entities Staff, Equipment and TeachingSession are
connected in pairs by the binary associations Book, For and UsedIn. (b) Correct use of
link entity and binary association to represent connections between three entities: the
«persistent» link entity EquipmentBooking is connected to Staff, TeachingSession and
Equipment by the binary associations Place, For and Book.]
The terms unary, binary and n-ary are useful guides to the complexity of the
situation modelled by an association, but they are not clear cut. It is possible to
conceive of an association in which only two entities participate, but which is really
closer to an n-ary association. For example, at the Pennine University, a member of staff,
acting as module leader, leads a team of module tutors (each of whom is also a member of
staff) to deliver a particular module.
The ER diagram representing this situation is shown in Figure 8.19. Only two
entities participate in the association, so it conforms to the definition of a binary asso-
ciation. However, it has three association ends, so is much more like an n-ary associa-
tion. The clinching argument is that, during logical design, this association will
have to be treated as an n-ary association. So, it seems that there should be a tighter
definition of these terms.
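[Figure 8.19: The Deliver association between the «persistent» entities Staff and Module.
Two association ends are attached to Staff, with the role names ModuleLeader and
ModuleTutor, and one to Module.]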
The ER diagram in Figure 8.19 is not the only way to model this situation. A better
way might be to have two separate binary associations between the entities Staff and
Module – one called Leads and one called Teaches. However, the point is that asso-
ciations like the one shown are possible, but do not fit into the traditional definitions
of unary, binary and n-ary associations.
[Figure 8.21: An association between Staff and Module with the association attribute
academicYear[1..*] : Varchar.]
Association attributes are listed in an entity-like box
connected to the association by a dashed line. Notice that association attributes can
have all the features of entity attributes. In Figure 8.21 academicYear is a multivalued
attribute as a particular member of staff may lead a particular module in more than
one academic year. For an n-ary association, the attribute box is connected to the dia-
mond, again by a dashed line. There is no rule about where to place the attribute box
on the ER diagram – above, below or beside the association are all fine.
Aggregation and composition are particularly useful during logical database design
when the target data model is XML. Composition indicates that the element represent-
ing the part should be included within the element representing the whole. Aggrega-
tion indicates that element inclusion is a feasible approach to implementing the
association. A plain association could be implemented using keys. Logical database
design for XML is discussed in Chapter 12.
Most organizations have business rules of the form ‘one occurrence of this entity may
be linked to this many occurrences of this other entity’. Examples from the Web
Timetable application include:
Figure 8.24 Object diagram illustrating the multiplicity constraint for the end of the Leads
association attached to the Module entity.
[Object diagram: an occurrence of Staff linked to many occurrences of Module.
:Staff (staffNo = '35054', fName = 'Selma', lName = 'Hutchins', phone = '8706',
email = '[email protected]', photo = '', department = 'Computing', title = 'Ms',
jobType = 'Academic', jobTitle = 'Senior Lecturer') is linked to
:Module (code = 'CCFC0108', title = 'Introduction to programming',
department = 'Computing', level = 'Foundation') and to
:Module (code = 'CIIC0053', title = 'Application development',
department = 'Computing', level = 'Intermediate').
A second :Staff occurrence is also shown.]
For an association with N ends, choose any N-1 ends and associate specific instances with
those ends. Then the collection of links of the association that refer to these specific in-
stances will identify a collection of instances at the other end. The multiplicity of the asso-
ciation end constrains the size of this collection. (OMG, 2003, p. 113)
This details an algorithm for identifying the multiplicity constraint. In the final release
of the UML 2.0 standard, the definition has changed to:
Traversal of an n-ary association towards a navigable end requires that objects first be
identified for the remaining n-1 ends. The result of traversal is a collection of objects for the
navigable end derived from links in which the other n-1 objects participate. For binary
associations, n=2, in which case traversal proceeds from one object at the other end to a
collection of objects at the navigable end. The multiplicity of the association end constrains
the size of this collection. If the end is marked as ordered, this collection will be ordered.
(OMG, 2005, p. 113)
Although it seems more complicated, the final definition describes what is essentially
the same process. To calculate the multiplicity constraint for an association end, ask
the following two questions.
1 What is the lower bound on the number of occurrences of the entity at the chosen
association end that could be linked, by occurrences of the association, to a given
set of occurrences of entities from each of the other ends of the association?
2 What is the upper bound on the number of occurrences of the entity at the chosen
association end that could be linked, by occurrences of the association, to a given
set of occurrences of entities from each of the other ends of the association?
The answers to these questions establish a lower bound and an upper bound for the
multiplicity constraint and will depend on the meaning of the association. This form
of multiplicity constraint on an association end is sometimes called the outer multi-
plicity constraint, though in UML (and this subsection), it is simply referred to as the
multiplicity constraint on an association end.
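The two questions can be tried out on sample data with a short Python sketch. The function name `outer_bounds` and the way links are stored as tuples are assumptions made for illustration, not part of the UML. The sketch reports only the bounds observed in the sample. The actual constraint comes from the meaning of the association, so an observed upper bound of two may well stand for ‘many’.

```python
from collections import Counter
from itertools import product


def outer_bounds(links, populations, end):
    """Observed lower and upper bound at association end `end`.

    links       -- set of tuples, each one occurrence of the association
    populations -- list of lists of entity occurrences, one list per association end
    end         -- index of the association end being examined
    """
    others = [i for i in range(len(populations)) if i != end]
    # Count the links that share each combination of occurrences at the other ends.
    counts = Counter(tuple(link[i] for i in others) for link in set(links))
    # Combinations that appear in no link are counted as zero.
    sizes = [counts[combo] for combo in product(*(populations[i] for i in others))]
    return min(sizes), max(sizes)


# Sample data: Selma Hutchins leads two modules, Freya Stark leads none.
staff = ["Selma Hutchins", "Freya Stark"]
modules = ["CCFC0108", "CIIC0053"]
leads = {("Selma Hutchins", "CCFC0108"), ("Selma Hutchins", "CIIC0053")}

print(outer_bounds(leads, [staff, modules], end=1))  # Module end: (0, 2)
print(outer_bounds(leads, [staff, modules], end=0))  # Staff end: (1, 1)
```

The same function accepts links with three or more positions, so it can also be applied to an n-ary association.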
1 What is the lower bound on the number of occurrences of Module that could be
linked, by occurrences of the association, to a given occurrence of Staff?
The object diagram in Figure 8.24 can help here. Some staff don’t lead modules – Freya
Stark, for example. An occurrence of Staff representing someone who is not a module
leader will not be linked to any occurrences of Module. This gives a lower bound on the
multiplicity constraint of zero.
Now consider the upper bound.
2 What is the upper bound on the number of occurrences of Module that could be
linked, by occurrences of the association, to a given occurrence of Staff?
Again, this is easy enough to answer. Some members of staff are module leaders for
several modules. An occurrence of Staff representing someone who is a module leader
for several modules will be linked to many occurrences of Module. For example, in Fig-
ure 8.24 Selma Hutchins leads two modules. Other staff may lead more. This gives an
upper bound on the multiplicity constraint of many, written ‘*’. The complete multi-
plicity constraint, therefore, is written ‘0..*’ and placed at the Module end of the Leads
association, as shown in Figure 8.25. Note that there are no square brackets around the
multiplicity constraint on an association end.
The ER diagram in Figure 8.25 is read:
a particular member of staff may not lead any modules, but they may lead many.
This is, bar minor rewording, the first of the business rules listed at the start of this
section.
Calculating the multiplicity constraint at the Staff end of the Leads association in-
volves following the same process. First, consider the lower bound.
1 What is the lower bound on the number of occurrences of Staff that could be linked
to a given occurrence of Module?
This is easy. Every module has a module leader and a module leader is a member of staff.
Thus, there is always an occurrence of Staff linked to a given occurrence of Module. This
gives a lower bound on the multiplicity constraint of one.
Now consider the upper bound.
2 What is the upper bound on the number of occurrences of Staff that could be linked
to a given occurrence of Module?
The Leads association models the connection between a module and the member of
staff leading that module. There is only one module leader for any one module, so, at
most, one member of staff is linked to any given occurrence of Module by occurrences
of the Leads association. This is despite the fact that several members of staff might
teach the module. The Leads association does not model this ‘staff teaches module’
relationship.
The final ER diagram showing both the multiplicity constraints on the Leads
association is shown in Figure 8.26.
Figure 8.25 The multiplicity constraint for the end of the Leads association
attached to the Module entity.
Figure 8.26 The Leads association with both its multiplicity constraints.
As a second example, consider the Sits association between Student and Module (Figure
8.27). This association represents students sitting modules in a particular academic
year. First, calculate the multiplicity constraint at the Student end of the association.
1 What is the lower bound on the number of occurrences of Student that could be
linked to a given occurrence of Module?
Once teaching starts, all modules will have some students taking them or the module
will not be on the timetable. However, the timetable is created before the start of the
academic year, so only data about students in the second or later years is available. First
year modules will not have any students taking them at that point, but must still be
timetabled. This means that an occurrence of Module that represents a first year module
may not have any occurrences of Student linked to it, so the lower bound will be zero.
2 What is the upper bound on the number of occurrences of Student that could be
linked to a given occurrence of Module?
A typical module will have many students, so an occurrence of Module may be linked
to many occurrences of Student. The upper bound is ‘many’.
At the Module end of the Sits association, the lower bound of the multiplicity con-
straint is calculated by asking this question.
1 What is the lower bound on the number of occurrences of Module that could be
linked to a given occurrence of Student?
At the Pennine University, every student must sit at least one module in each academic
year. This means that the lower bound is one.
To calculate the upper bound, ask the following question.
2 What is the upper bound on the number of occurrences of Module that could be
linked to a given occurrence of Student?
At the Pennine University no student can sit more than seven modules in any one
year. Consequently, the upper bound is seven. Figure 8.27 shows the ER diagram that
captures this information.
The multiplicity constraints on a unary association are calculated in the same way,
though the fact that there is only one entity involved can make it a little confusing.
Consider the Represent association in Figure 8.16 (shown again in Figure 8.28), which
associates the Student entity with itself. The role name is useful for distinguishing
between the two ends of the association.
To calculate the multiplicity constraint for the StudentRep end of the Represent
association, ask the two questions again.
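[Figure 8.28: The unary association Represent on the «persistent» entity Student, with the
role name StudentRep and the multiplicity constraints 1..2 and 0..* at its two ends.]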
2 What is the upper bound on the number of occurrences of Student at the Student-
Rep end of the association that could be linked to a given occurrence of Student at
the opposite end of the association?
1 What is the lower bound on the number of occurrences of Student that could be
linked to a given occurrence of Student at the StudentRep end of the association?
2 What is the upper bound on the number of occurrences of Student that could be
linked to a given occurrence of Student at the StudentRep end of the association?
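[Figure 8.29: The unary association Manage on the «persistent» entity Staff, with the
multiplicity constraint 1..1 at the LineManager end and 0..* at the other end.]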
To calculate the multiplicity constraint at the LineManager end of the association,
consider these questions.
1 What is the lower bound on the number of occurrences of Staff at the LineManager
end of the association that could be linked to a given occurrence of Staff at the op-
posite end of the association?
2 What is the upper bound on the number of occurrences of Staff at the LineManager
end of the association that could be linked to a given occurrence of Staff at the
opposite end of the association?
As stated above, every member of staff has a line manager, so the lower bound is one.
Any member of staff has at most one line manager, so the upper bound is also one.
This gives a multiplicity constraint of 1..1.
For the multiplicity constraint at the other end of the association, ask these
questions.
1 What is the lower bound on the number of occurrences of Staff that could be linked
to a given occurrence of Staff at the LineManager end of the association?
Don’t be misled into considering only occurrences of Staff acting in the role
LineManager. As some members of staff are not line managers and so do not manage
anyone, the answer to this question is zero.
2 What is the upper bound on the number of occurrences of Staff that could be linked
to a given occurrence of Staff at the LineManager end of the association?
This upper bound will be many as those members of staff who are line managers may
manage several other staff. The multiplicity constraints are shown in Figure 8.29.
Unlike attribute multiplicity, there is no default multiplicity constraint. When there
is no multiplicity constraint on an association end, it means that the multiplicity is
not known. The commonest multiplicity constraints, and their UML representations,
are shown in Table 8.5. Other ranges are possible, though less common. For example,
‘three to seven’ is written 3..7 and ‘three, or five, or ten or more’ is written 3, 5, 10..*.
On the ER diagram itself, the multiplicity constraint should be written close to the
appropriate association end. Writing both multiplicity constraints on the same side of
the line representing the association and on the opposite side to the association name
can improve the clarity of the ER diagram. Although the two commonest multiplicity
constraints – 0..* and 1..1 – have abbreviations, it is best not to use these in a concep-
tual ER diagram. End users are unlikely to be familiar with these abbreviations, so
using them could impede a key purpose of the ER diagramming technique – to help
end users evaluate the database design.
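The notation for ranges can be made concrete with a small Python sketch that reads a multiplicity constraint and checks whether a given number of linked occurrences satisfies it. The function names `parse_multiplicity` and `allows` are hypothetical, and the sketch assumes the textual forms used in this section (lower..upper, with ‘*’ for many and commas between alternatives).

```python
def parse_multiplicity(text):
    """Turn '0..*' or '3, 5, 10..*' into a list of (lower, upper) pairs.

    An upper bound of None stands for 'many'.
    """
    ranges = []
    for part in text.split(","):
        part = part.strip()
        if part == "*":          # abbreviation for 0..*
            part = "0..*"
        low, _, high = part.partition("..")
        if not high:             # a single value such as '3' means 3..3
            high = low
        lower = int(low)
        upper = None if high == "*" else int(high)
        ranges.append((lower, upper))
    return ranges


def allows(text, n):
    """True if n linked occurrences satisfy the multiplicity constraint."""
    return any(
        lower <= n and (upper is None or n <= upper)
        for lower, upper in parse_multiplicity(text)
    )


print(parse_multiplicity("3, 5, 10..*"))  # [(3, 3), (5, 5), (10, None)]
print(allows("3..7", 5))                  # True
print(allows("3, 5, 10..*", 4))           # False
print(allows("1..1", 2))                  # False
```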
Multiplicity constraints have been used to characterize unary and binary associa-
tions. Associations like Leads (Figure 8.26) are called one-to-many associations. A one-
to-many association is one where the upper bound on the multiplicity constraint at
one association end is 1, while the upper bound at the other end is greater than one (it
might be ‘*’ or an integer greater than 1). A typical object diagram has one occurrence
of Staff linked with many occurrences of Module, as shown in Figure 8.30 (a).
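[Figure 8.30: (a) Typical object diagram for a one-to-many association: one occurrence of
Staff linked with many occurrences of Module. (b) Typical object diagram for a
many-to-many association: many occurrences of Student linked with many occurrences of
Module. (c) ER diagram for a one-to-one association between Exam and Module.
(d) Typical object diagram for a one-to-one association: an occurrence of Exam linked
with an occurrence of Module.]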
Not all associations are one-to-many. The Sits association (Figure 8.27) is many-to-
many. A many-to-many association is one where the upper bound on the multiplicity
constraint at both association ends is greater than one. A typical object diagram has
many occurrences of Student linked with many occurrences of Module, as shown in
Figure 8.30 (b).
The only other possibility is that the upper bound on the multiplicity at both asso-
ciation ends is equal to one. Unsurprisingly, these are called one-to-one associations.
Here is an example.
At the Pennine University, a module may be assessed by an exam. Some modules
do not have an exam, but those that do only have one exam. Each exam examines
exactly one module. The ER diagram for this one-to-one association is shown in
Figure 8.30 (c), with a typical object diagram shown in Figure 8.30 (d).
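The three cases can be summarized in a few lines of Python. This is only a sketch of the classification rule stated above. The function name `classify` is hypothetical, and the upper bounds are passed in as they would be written on the ER diagram (an integer or ‘*’). The bounds for the Exam and Module example are read from the description in the text rather than from the figure.

```python
def classify(upper_a, upper_b):
    """Classify a unary or binary association from the upper bounds of the
    multiplicity constraints at its two ends ('*' or an integer)."""

    def is_one(upper):
        return upper != "*" and int(upper) == 1

    if is_one(upper_a) and is_one(upper_b):
        return "one-to-one"
    if is_one(upper_a) or is_one(upper_b):
        return "one-to-many"
    return "many-to-many"


print(classify(1, "*"))    # Leads: 1..1 at Staff, 0..* at Module -> one-to-many
print(classify("*", 7))    # Sits: 0..* at Student, 1..7 at Module -> many-to-many
print(classify(1, 1))      # Exam and Module: at most one at each end -> one-to-one
```

Only the upper bounds matter for this classification. The lower bounds say whether participation is optional, not how many occurrences may be linked.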
The multiplicity for each association end of an n-ary association is calculated in the
same way as multiplicity for binary and unary associations. The complicating factor is
that there are now two (or more) other ends to worry about.
Consider the n-ary association BookEquipment, which has three participating enti-
ties (Figure 8.17). At the Equipment end of this association, the lower bound on the
multiplicity constraint is obtained by answering this question.
1 What is the lower bound on the number of occurrences of Equipment that could be
linked, by occurrences of the association, to a given set of occurrences of Staff and
TeachingSession?
To answer it, consider the possible sets of occurrences of Staff and TeachingSession
(where it’s implied that there is one occurrence of each entity). Any occurrence of Staff
could be paired with any occurrence of TeachingSession. When the chosen member of
staff does not teach the chosen teaching session, there will be no additional equipment
booked by that person for the teaching session. This means that the lower bound will
be zero.
For the upper bound, consider the following question.
2 What is the upper bound on the number of occurrences of Equipment that could be
linked, by occurrences of the association, to the given set of occurrences of Staff and
TeachingSession?
Again, consider the possible sets of occurrences of Staff and TeachingSession. When
the chosen occurrence of Staff does teach the chosen occurrence of TeachingSession, it
is possible that he or she will have booked lots of additional equipment for this partic-
ular teaching session, so the upper bound is many. Figure 8.31 shows this multiplicity
constraint on the ER diagram.
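The multiplicity constraint at the TeachingSession end of the BookEquipment association
is calculated similarly. This time, consider the possible sets of occurrences of
Staff and Equipment. First, ask this question.
1 What is the lower bound on the number of occurrences of TeachingSession that could be
linked, by occurrences of the association, to a given set of occurrences of Staff and
Equipment?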
[Figure 8.31: The BookEquipment association between the «persistent» entities Staff,
Equipment and TeachingSession, with the multiplicity constraint 0..* at the Equipment end.]
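piece of equipment, so the lower bound is zero. For the upper bound, consider this
question.
2 What is the upper bound on the number of occurrences of TeachingSession that could be
linked, by occurrences of the association, to a given set of occurrences of Staff and
Equipment?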
It is conceivable that a particular member of staff books the same piece of equipment
for every single one of his or her teaching sessions, so the upper bound will be many.
The multiplicity constraint 0..* is therefore added to the TeachingSession end of the
BookEquipment association.
The third association end is that with the Staff entity attached. To calculate the
multiplicity at this association end, consider the possible sets of occurrences of
TeachingSession and Equipment. For the lower bound, ask the following question.
1 What is the lower bound on the number of occurrences of Staff that could be
linked, by occurrences of the association, to a given set of occurrences of
TeachingSession and Equipment?
2 What is the upper bound on the number of occurrences of Staff that could be
linked, by occurrences of the association, to a given set of occurrences of
TeachingSession and Equipment?
[Figure: The BookEquipment association with its three multiplicity constraints: 0..1 at
the Staff end, 0..* at the Equipment end and 0..* at the TeachingSession end.]
Figure 8.33 The n-ary association for the Ballroom Dancing Society.
[Diagram residue: the «persistent» entity Man with the multiplicity constraint 0..1.]
1 each man dances with each woman at least once during the evening
2 some couples have more than one dance together, but some only one
3 as there are always fewer men than women, men participate in every dance, but
some women have to sit some dances out.
The n-ary association Partners (with its multiplicity constraints) is shown in Figure 8.33.
To calculate the multiplicity constraints, ask the questions as before to establish the
lower and upper bounds for each one. The multiplicity constraint at the Dance end of
the Partners association captures information about how many dances a couple may
have together. The lower bound on the number of occurrences of Dance that could be
linked, by occurrences of the association, to a given set of occurrences of Man and
Woman is one – every possible couple has at least one dance (as per rule 1, above). The
upper bound on the number of occurrences of Dance that could be linked, by occur-
rences of the association, to a given set of occurrences of Man and Woman is many –
some couples may have several dances (rule 2).
The multiplicity constraint at the Woman end of the Partners association captures
information about how many dance partners a man may have in a particular dance.
The lower bound on the number of occurrences of Woman that could be linked, by oc-
currences of the association, to a given set of occurrences of Man and Dance is one as,
given a particular man and a particular dance, there will be a woman partnering that
man for the dance (rule 3). The upper bound on the number of occurrences of Woman
that could be linked, by occurrences of the association, to a given set of occurrences of
Man and Dance is also one – a man can only dance with one woman at a time (this is
ballroom dancing, so there is no swapping of partners).
The multiplicity constraint at the Man end of the Partners association captures in-
formation about how many dance partners a woman may have in a particular dance.
The lower bound on the number of occurrences of Man that could be linked, by occur-
rences of the association, to a given set of occurrences of Woman and Dance is zero –
some women must sit out some dances as they have no partner and so do not dance
(rule 3). The upper bound on the number of occurrences of Man that could be linked,
by occurrences of the association, to a given set of occurrences of Woman and Dance is
one as a woman can only dance with one man at a time (again, no partner swapping).
1 For a given occurrence of the entity at the chosen end of the association, what is the
lower bound on the number of occurrences of the association it could participate in?
2 For a given occurrence of the entity at the chosen end of the association, what is the
upper bound on the number of occurrences of the association it could participate in?
The answers to these questions establish a lower bound and an upper bound for the
inner multiplicity constraint and, just as for the outer multiplicity constraint, will de-
pend on the meaning of the association.
This definition is simpler than the definition of the outer multiplicity constraint.
Only the entity at the chosen association end, and the association itself, are men-
tioned. There is no need to worry about the other entities that participate in the
association.
Consider again the Leads association between the Staff and Module entities (Fig-
ure 8.15). To calculate the inner multiplicity constraint for the Module end of the
Leads association, simply ask the two questions.
For the lower bound, ask the following.
1 For a given occurrence of Module, what is the lower bound on the number of occur-
rences of Leads it could participate in?
Notice that there is no mention of the Staff entity, though obviously any occurrence of
Leads will have an occurrence of Staff at one end. To answer the question, note that
every module has a module leader, so any given occurrence of Module must participate
in at least one occurrence of Leads – the one linking it to the occurrence of Staff repre-
senting the module leader. The lower bound in the inner multiplicity constraint,
therefore, is one.
For the upper bound, ask the following question.
2 For a given occurrence of Module, what is the upper bound on the number of occur-
rences of Leads it could participate in?
Again, the meaning of the association provides the answer – there is only one module
leader for any module, so the upper bound is also one. Thus, the inner multiplicity
constraint for the Module end of the Leads association is 1..1.
To calculate the inner multiplicity constraint at the Staff end of the Leads associa-
tion, consider a particular occurrence of Staff. It’s possible that this particular member
of staff does not lead any modules, so the lower bound on the inner multiplicity con-
straint is zero. It is also possible that this particular member of staff leads several mod-
ules, so the upper bound is many. Thus, the inner multiplicity constraint for the Staff
end of the Leads association is 0..*.
There is no special notation in the UML to express inner multiplicity constraints.
The ER diagram in Figure 8.34 uses the ‘constraint in a note’ approach to document
the inner multiplicity constraints at each end of the Leads association. (This approach
was also used with entity and attribute constraints, for example, in Figures 8.12 and
8.13.)
Notice that in Figure 8.34, the inner multiplicity at one end of the binary associa-
tion is the same as the outer multiplicity at the other end. For binary (and unary) asso-
ciations, this is always the case. For n-ary associations, inner and outer multiplicity
constraints can be quite different, with no simple equivalence between them.
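For a binary association, the equivalence can be checked on sample data. The following Python sketch uses invented helper names (`inner_bounds`, `outer_bounds_binary`) and stores each occurrence of Leads as a (staff, module) pair. It reports observed bounds only, so an observed upper bound of two corresponds to ‘many’ in the constraint.

```python
def inner_bounds(links, population, end):
    """Fewest and most occurrences of the association that any one
    occurrence of the entity at `end` participates in."""
    sizes = [sum(1 for link in links if link[end] == occ) for occ in population]
    return min(sizes), max(sizes)


def outer_bounds_binary(links, other_population, end):
    """Fewest and most occurrences at `end` linked to any one occurrence
    of the entity at the other end (binary associations only)."""
    other = 1 - end
    sizes = [
        len({link[end] for link in links if link[other] == occ})
        for occ in other_population
    ]
    return min(sizes), max(sizes)


# Sample data: each link is (staff, module).
staff = ["Selma Hutchins", "Freya Stark"]
modules = ["CCFC0108", "CIIC0053"]
leads = {("Selma Hutchins", "CCFC0108"), ("Selma Hutchins", "CIIC0053")}

# Inner multiplicity at the Staff end equals outer multiplicity at the Module end.
print(inner_bounds(leads, staff, end=0))           # (0, 2)
print(outer_bounds_binary(leads, staff, end=1))    # (0, 2)

# Inner multiplicity at the Module end equals outer multiplicity at the Staff end.
print(inner_bounds(leads, modules, end=1))         # (1, 1)
print(outer_bounds_binary(leads, modules, end=0))  # (1, 1)
```

Both calculations start from a single occurrence at one end of the association, which is why they agree in the binary case. With three or more ends, the outer calculation starts from a set of occurrences instead, and the two results need no longer match.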
Consider the n-ary association BookEquipment. To calculate the inner multiplicity
constraint at the Equipment end of this association, consider a particular occurrence of
Equipment. It is possible that this particular piece of equipment has never been
booked, so the lower bound is zero. Equally, it is possible that it is booked regularly, so
the upper bound is many and the inner multiplicity constraint is 0..*. Similarly, a par-
ticular occurrence of Staff may not have booked any additional equipment or may
have made many bookings, so the inner multiplicity constraint on Staff is also 0..*. Fi-
nally, a particular occurrence of TeachingSession may not have any additional equip-
ment booked or may have several such items booked, so, again, the inner multiplicity
constraint is 0..*.
The inner multiplicity constraints on BookEquipment are shown, using standard
UML notation, in Figure 8.35(a). This standard UML notation is a little cumbersome
and leads to cluttered ER diagrams. Génova et al. (2002) suggest changes to the UML
notation to allow both inner and outer multiplicity constraints to be shown in an ER
diagram without using notes. Their proposal is that, for:
1 binary and unary associations, only show the outer multiplicity constraints (as the
inner multiplicity constraints add no new information)
2 n-ary associations, show the outer multiplicity constraints in their usual place
and locate the inner multiplicity constraints where the association end meets the
diamond.
This enhanced notation for n-ary associations is shown in Figure 8.35(b). It seems a
reasonable proposal, but, it must be emphasized, this suggestion has not been incorpo-
rated in the UML standard.
Now consider the inner multiplicity constraints on the Partners association. It is rel-
atively straightforward to calculate these. For the Man entity, every man partners a
woman in the first dance of the evening, so the lower bound is one. Every man then
partners a woman in every other dance, so the upper bound is many. Similarly, a
woman is guaranteed a dance with every man, so will participate in at least one occur-
rence of the Partners association and may participate in many. Finally, a given dance
will involve at least one couple (or it isn’t a dance) and possibly many couples. These
inner and outer multiplicity constraints are shown in Figure 8.36.
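[Figure 8.35: (a) The BookEquipment association in standard UML notation. At the Staff
end, the outer multiplicity constraint is 0..1 and a note documents the inner multiplicity
constraint: {Inner multiplicity is 0..*}. (b) The enhanced notation. At the Staff end, the
outer multiplicity constraint is 0..1 and the inner multiplicity constraint 0..* is
written where the association end meets the diamond.]
[Figure 8.36: The Partners association with inner and outer multiplicity constraints. At
the Man end, the outer multiplicity constraint is 0..1 and the inner multiplicity
constraint is 1..*.]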
The two examples given above illustrate a problem with inner multiplicity con-
straints similar to that with outer multiplicity. It is that the upper bound on the inner
multiplicity constraints of an n-ary association is almost always many. An upper
bound of one implies that the entity this inner multiplicity constraint applies to will
participate in at most one occurrence of the association. If the upper bound on the
inner multiplicity constraint on Equipment were one (Figure 8.35) then there would be
at most one pair of occurrences of Staff and TeachingSession associated with any occur-
rence of Equipment – that is, any piece of equipment could only be booked once. This
is clearly unrealistic.
The situation regarding the two kinds of multiplicity constraint can be summed up
as follows. The lower bound on the inner multiplicity indicates whether or not every
occurrence of a given entity must participate in the association, while the upper bound
limits how many different association occurrences a given entity occurrence can
participate in. The lower bound on the outer multiplicity constraint indicates whether or
not every possible set of occurrences of the other entities must participate in the
association. The upper bound limits how many different association occurrences this set
of entity occurrences can participate in.
For binary associations, there is an easy equivalence between inner and outer
multiplicity – the inner multiplicity constraint at one end of an association equals the
outer multiplicity constraint at the other end. For n-ary associations, the inner and
outer multiplicity constraints provide different information about the association. Fur-
thermore, for n-ary associations the lower bound on the inner multiplicity constraint
usually provides the most interesting information, as the upper bound is often many.
In contrast, the upper bound of the outer multiplicity constraint usually provides the
more interesting information for this type of association as the lower bound is often
zero.
Given the above, it seems sensible to include both inner and outer multiplicity con-
straints in the ER diagram for n-ary associations, and two viable notations (one stan-
dard UML, one extending the UML) have been suggested. The database designer can
choose whether to use one, or both, types of multiplicity constraint, but must be very
clear about which he or she is using.
IDWA_C08.QXD 10/5/07 11:38 AM Page 391
The idea is that all occurrences of an entity share some common attributes. However,
certain occurrences of the entity have some additional attributes, while others do not.
Those with the additional attributes are examples of the more specific, or specialized,
entity. All are examples of the more general entity.
An association can represent any kind of relationship. A generalization only repre-
sents the ‘is a kind of’ relationship. In this sense, generalization has a more specific mean-
ing than association. Generalization is used a lot in object-orientated design, but can be
useful in conceptual database design. Note that a generalization relationship always has
two distinct participating entities – the more general and the more specific entity.
Consider the Room entity from the Web Timetable application. All rooms have the
same attributes – building, roomNo and capacity. The building and roomNo identify its
location, while capacity indicates how many students the room can accommodate.
Regardless of whether the room is a general teaching room, suitable for seminars and
tutorials, or a lecture theatre, music room or laboratory, it has a location and a capac-
ity. Laboratories also have workstations. Each workstation is a place where one or
more students work. It may be a computer workstation, designed for one student to
use, or a lab bench in a science or engineering laboratory, some of which are de-
signed for two students to share. Hence, the number of workstations may be lower
than the room’s capacity. Rather than include a workstations attribute for the Room
entity, which is not relevant for most occurrences of Room, it makes sense to create a
new entity, Laboratory, and indicate that this is a specific type of Room, with an addi-
tional attribute – workstations.
This situation is shown in the ER diagram in Figure 8.37. The generalization rela-
tionship is distinguished from an association by placing a triangle at the more general
end of the relationship. The attributes common to all occurrences of Room are includ-
ed in the more general entity Room. The additional attribute relevant to the more spe-
cific entity occurrences is included in the Laboratory entity. It is important to realize
that an occurrence of Room does not have a workstations attribute. However, an occur-
rence of Laboratory does have a building, roomNo and capacity attribute as well as a
workstations attribute. The more specific entity inherits the attributes (and operations)
of the more general entity.
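Readers who know an object-orientated language may find it helpful to see the same idea as a pair of classes. This is only an analogy, sketched in Python under the assumption that each attribute maps to a field; the sample values are invented and nothing here is part of the database design itself.

```python
from dataclasses import dataclass


@dataclass
class Room:
    """The more general entity: attributes shared by every room."""
    building: str
    roomNo: str
    capacity: int


@dataclass
class Laboratory(Room):
    """The more specific entity: inherits Room's attributes and adds one."""
    workstations: int


# Invented sample occurrences
room = Room(building="Main", roomNo="G01", capacity=40)
lab = Laboratory(building="Main", roomNo="L12", capacity=30, workstations=20)

print(isinstance(lab, Room))          # a Laboratory 'is a kind of' Room
print(hasattr(room, "workstations"))  # a plain Room has no workstations
print(lab.capacity, lab.workstations)
```

The laboratory has fewer workstations than its capacity, as happens when some benches are shared by two students.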
In the Web Timetable, the Room entity actually has three more specific entities –
Laboratory, LectureTheatre and MusicRoom.
A music room may or may not have a piano, electronic keyboard or organ in it. Staff
and students need to know which rooms do and which don’t. They also want to
know what type of keyboard is in the room. This suggests the more specific entity
MusicRoom with two additional attributes – keyboard and description.
[Figure 8.37 (diagram not reproduced): «persistent» Room as the more general entity and «persistent» Laboratory as the more specific entity.]
[Diagram not reproduced: «persistent» Room with attributes building : Varchar {PK}, roomNo : Varchar {PK}, capacity : Integer.]
A lecture theatre is distinguished from an ordinary teaching room by its seating. Some
have fixed benches set in tiers, others have seats with fold-away tables on the arms and
still others have moveable tiers of benches that can be rolled back, turning the lecture
theatre into a large, open room, used for sports such as basketball or badminton.
Again, this suggests a more specific entity LectureTheatre with the additional attribute
seating.
[Diagram not reproduced: «persistent» Room (building : Varchar {PK}, roomNo : Varchar {PK}, capacity : Integer) with the more specific entities «persistent» ComputingLab (specialistSoftware : Set Varchar, notes [0..1] : Varchar) and «persistent» ScienceLab (safetyRating : Varchar).]
[Diagram not reproduced: generalization relationships annotated with the constraints {disjoint, optional} and {overlapping, mandatory}, involving ComputingLab and ScienceLab.]
The third area of concern is that the ER diagram may not allow users to make all the
database queries they require. Access path analysis can verify that the ER diagram
represents a database design that can support all required database queries.
Consider the ER diagram shown in Figure 8.41 (a). The ER diagram does not include an
association between Module and TeachingSession. Instead, it implies that the connection
can be made indirectly, going via Staff. Given an occurrence of TeachingSession,
navigate along the Teaches association to find which member of staff teaches this session,
then navigate along the WorksOn association to find which module that person
works on. Of course, as the object diagram in Figure 8.41 (b) illustrates, this is not possible.
Links from many occurrences of TeachingSession converge on a single occurrence
of Staff, from which links to many occurrences of Module emerge. There is no
way of deciding which occurrence of TeachingSession is linked with which occurrence
of Module. The characteristic fan shape gives this sort of connection trap its name – the
fan trap.
[Figure 8.41 (b), object diagram not reproduced: links among :TeachingSession, :Staff and :Module occurrences.]
[ER diagram not reproduced: «persistent» TeachingSession 1..* Teaches 1..1 «persistent» Staff 1..1 WorksOn 1..* «persistent» Module, with an additional pair of multiplicities, 1..* and 1..1, drawn above.]
[Caption fragment: (a) No connection trap – the ER diagram correctly represents the scenario being modelled.]
Fan traps may occur when two one-to-many associations fan out from the same en-
tity. While these two associations accurately reflect the connections between the par-
ticipating entities, they don’t capture the connection between the two outermost
entities. In the example, there was a connection between TeachingSession and Module
that could not be captured by the associations Teaches and WorksOn. The obvious so-
lution is to add this ‘missing’ association to the ER diagram, as shown in Figure 8.42.
This solution is used to resolve all connection traps, though it often introduces a new
problem – one of the three associations in Figure 8.42 is now unnecessary. Redundant
associations are dealt with in the next subsection.
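The fan trap is easy to reproduce with a few rows of data. The sketch below uses Python's sqlite3 module and assumes, purely for illustration, that Teaches and WorksOn are each implemented as a staffNo column; the identifiers sessionId and moduleCode and all the sample values are invented. It first tries to reach Module from TeachingSession via Staff, then adds a direct link of the kind the ‘missing’ association would provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (staffNo TEXT PRIMARY KEY);
-- WorksOn: each module is linked to one member of staff
CREATE TABLE Module (
    moduleCode TEXT PRIMARY KEY,
    staffNo    TEXT NOT NULL REFERENCES Staff(staffNo)
);
-- Teaches: each teaching session is linked to one member of staff
CREATE TABLE TeachingSession (
    sessionId TEXT PRIMARY KEY,
    staffNo   TEXT NOT NULL REFERENCES Staff(staffNo)
);
INSERT INTO Staff VALUES ('31210');
INSERT INTO Module VALUES ('DB1', '31210'), ('WEB1', '31210');
INSERT INTO TeachingSession VALUES ('T1', '31210'), ('T2', '31210');
""")

# Navigate TeachingSession -> Staff -> Module for session T1.
rows = conn.execute("""
    SELECT m.moduleCode
    FROM TeachingSession t
    JOIN Module m ON m.staffNo = t.staffNo
    WHERE t.sessionId = 'T1'
    ORDER BY m.moduleCode
""").fetchall()
fan_trap_modules = [row[0] for row in rows]
print(fan_trap_modules)  # both modules come back: the path cannot choose

# Add the direct connection between TeachingSession and Module.
conn.executescript("""
ALTER TABLE TeachingSession ADD COLUMN moduleCode TEXT REFERENCES Module(moduleCode);
UPDATE TeachingSession SET moduleCode = 'DB1'  WHERE sessionId = 'T1';
UPDATE TeachingSession SET moduleCode = 'WEB1' WHERE sessionId = 'T2';
""")
direct_modules = [row[0] for row in conn.execute(
    "SELECT moduleCode FROM TeachingSession WHERE sessionId = 'T1'")]
print(direct_modules)  # exactly one module for the session
```

The first query returns every module the member of staff works on, which is the ambiguity described above. Only the direct link identifies a single module for the session.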
The intended meaning of the ER diagram is crucial when deciding whether or not a
potential connection trap is a problem. The ER diagrams in Figures 8.43 (a) and (b)
involve two one-to-many associations fanning out from a single entity, but neither
represents a fan trap.
The ER diagram in Figure 8.43 (a) models the connections between members of staff
and their teaching and support sessions. Although the structure of the ER diagram is
exactly the same as that in Figure 8.41 (a), the meaning in Figure 8.43 (a) is quite different.
There is no implied connection between teaching and support sessions, other
than that they happen to involve the same member of staff – that is, there is no missing
association between TeachingSession and SupportSession.
[Figure 8.44 (diagram not reproduced): «persistent» Module 5..20 Uses 1..* «persistent» Course 1..1 EnrolledOn 0..* «persistent» Student, with the Sits association (multiplicities 1..7 and 0..*) drawn between Module and Student.]
The ER diagram in Figure 8.43 (b) is not a fan trap either. It would be a fan trap if it
suggested that it was possible to navigate from an occurrence of TeachingSession to the
occurrences of Equipment used in that teaching session. The lower bound on the
(outer) multiplicity constraint at the Equipment end of KeptIn makes it clear that this is
not always possible – some equipment is not kept in any room.
The ER diagram does suffer from a connection trap, though. When an entity partic-
ipates in two associations and one or more of the multiplicity constraints has a lower
bound of zero, then a second sort of connection trap – called a chasm trap – can occur.
The optional participation of Equipment in KeptIn means that it is not always possible
to find what equipment is needed for a teaching session using the two associations
shown. If this information is required, then a new association should be added be-
tween TeachingSession and Equipment.
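A chasm trap can be shown in the same way. The sketch below assumes that the entity between TeachingSession and Equipment is Room and that both associations are implemented as a roomNo column. These are assumptions made for the example, as are the identifiers sessionId and equipNo and the sample rows. The optional participation of Equipment in KeptIn is modelled by allowing roomNo to be null.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Room (roomNo TEXT PRIMARY KEY);
CREATE TABLE TeachingSession (
    sessionId TEXT PRIMARY KEY,
    roomNo    TEXT NOT NULL REFERENCES Room(roomNo)
);
-- KeptIn is optional for Equipment, so roomNo is left nullable
CREATE TABLE Equipment (
    equipNo TEXT PRIMARY KEY,
    roomNo  TEXT REFERENCES Room(roomNo)
);
INSERT INTO Room VALUES ('R1');
INSERT INTO TeachingSession VALUES ('T1', 'R1');
INSERT INTO Equipment VALUES ('E1', 'R1'), ('E2', NULL);
""")

# Equipment that can be reached from any teaching session via its room.
reachable = [row[0] for row in conn.execute("""
    SELECT e.equipNo
    FROM TeachingSession ts
    JOIN Equipment e ON e.roomNo = ts.roomNo
    ORDER BY e.equipNo
""")]
all_equipment = [row[0] for row in conn.execute(
    "SELECT equipNo FROM Equipment ORDER BY equipNo")]

print(reachable)      # only the equipment kept in a room
print(all_equipment)  # includes the item with no room
```

The item with no room never appears on the path from a teaching session, however the query is written. That gap is the chasm.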
Another example of a connection trap (which is neither a fan trap nor a chasm trap)
is shown in Figure 8.44 (a). Each Student enrols on exactly one Course, which Uses at
least five Module. The modules taken on a course will cover all years of the course, so a
student will only sit some of these modules in any one year. As the Web Timetable is
intended to model a single academic year, the ER diagram has a connection trap – it is
not possible to identify which modules any given student is taking this year. Notice
that this problem would arise even if the lower bound on the multiplicity constraint
at the Student end of EnrolledOn were one.
The problem is that the database designer has failed to notice that there is a direct
connection between Module and Student. The solution is to add the Sits association, as
shown in Figure 8.44 (b).
Notice that the existence of a connection trap in Figure 8.44 (a) is due to the intended
interpretation of the ER diagram – that it represents modules taken by students in a
particular academic year. If the ER diagram was instead intended to represent all the
modules a student takes during his or her entire course of study, then there would be
no connection trap as the link between a particular module and a particular student
can be made via the course. That is unless, for example, a course has optional modules.
Then, there would be a connection trap again as it isn’t possible to identify which stu-
dents take a particular optional module. It is worth thinking carefully before declaring
a connection trap and adding a new association.
[Figure 8.45 (diagrams not reproduced): «persistent» TeachingSession 1..* Teaches 1..1 «persistent» Staff 1..1 WorksOn 1..* «persistent» Module, with the Delivers association (multiplicities 1..* and 1..1) also shown.]
(c) The ER diagram redrawn with the redundant association WorksOn removed.
The last association is WorksOn. WorksOn captures information about which mem-
bers of staff work on which modules. Figure 8.45 (c) shows the ER diagram with this
association removed and it turns out that this ER diagram is equivalent to the original.
Given a particular occurrence of Module, navigate along the Delivers association to
find all the occurrences of TeachingSession that are linked to it. For each occurrence of
TeachingSession navigate along Teaches to find which occurrence of Staff is linked to
it. This provides a list of all staff working on the module as, if they work on the module,
they must teach at least one of its teaching sessions. Similarly, for a given occurrence of
Staff, it is possible to use Teaches and Delivers to find which modules he or she works
on. Hence, the WorksOn association in Figure 8.45 (a) is redundant.
Figure 8.46 Access path analysis for the query ‘List all staff who teach on a
particular course’.
(a) Extract from the Web Timetable ER diagram showing the relevant entities and
associations (diagram not reproduced): «persistent» Course 1..* Uses 5..20 «persistent» Module, with the Leads (0..*) and Delivers (1..1) associations attached to Module.
Access path (diagram not reproduced): «persistent» Course 1..* Uses 5..20 «persistent» Module 1..1 Delivers 1..* «persistent» TeachingSession 1..* Teaches 1..1 «persistent» Staff.
Identifying such queries can be difficult, but it is worth making the effort. Even if
only a few access paths are developed, they provide some reassurance that the ER dia-
gram represents a database design that will meet the core requirements of the end
users. Access paths also come in handy during physical database design (Chapter 10),
when they are used to estimate how efficient a particular database query will be.
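An access path translates fairly directly into a query once tables exist. The sketch below follows the path Course, Uses, Module, Delivers, TeachingSession, Teaches, Staff for the query ‘List all staff who teach on a particular course’. The table layout is an assumption made for the example: Uses is given a linking table of its own, Delivers and Teaches become columns of TeachingSession, and the identifiers and rows are invented. The sample data is deliberately tiny and ignores the 5..20 bound on Uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Course (courseCode TEXT PRIMARY KEY);
CREATE TABLE Module (moduleCode TEXT PRIMARY KEY);
CREATE TABLE Staff  (staffNo TEXT PRIMARY KEY);
-- Uses links many courses to many modules
CREATE TABLE Uses (
    courseCode TEXT NOT NULL REFERENCES Course(courseCode),
    moduleCode TEXT NOT NULL REFERENCES Module(moduleCode),
    PRIMARY KEY (courseCode, moduleCode)
);
-- Delivers and Teaches: each session has one module and one member of staff
CREATE TABLE TeachingSession (
    sessionId  TEXT PRIMARY KEY,
    moduleCode TEXT NOT NULL REFERENCES Module(moduleCode),
    staffNo    TEXT NOT NULL REFERENCES Staff(staffNo)
);
INSERT INTO Course VALUES ('C1'), ('C2');
INSERT INTO Module VALUES ('M1'), ('M2'), ('M3');
INSERT INTO Staff  VALUES ('31210'), ('35054'), ('45965');
INSERT INTO Uses   VALUES ('C1', 'M1'), ('C1', 'M2'), ('C2', 'M3');
INSERT INTO TeachingSession VALUES
    ('T1', 'M1', '31210'),
    ('T2', 'M2', '35054'),
    ('T3', 'M2', '31210'),
    ('T4', 'M3', '45965');
""")


def staff_on_course(course_code):
    """Follow Uses -> Delivers -> Teaches and return distinct staff numbers."""
    rows = conn.execute("""
        SELECT DISTINCT ts.staffNo
        FROM Uses u
        JOIN TeachingSession ts ON ts.moduleCode = u.moduleCode
        WHERE u.courseCode = ?
        ORDER BY ts.staffNo
    """, (course_code,)).fetchall()
    return [row[0] for row in rows]


print(staff_on_course("C1"))  # ['31210', '35054']
print(staff_on_course("C2"))  # ['45965']
```

Each join in the query corresponds to one step along the access path, which is why the path also gives a rough idea of how much work the query involves.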
Chapter summary
■ The main focus of this chapter has been on conceptual database design using ER
diagrams.
■ The ER diagram uses three simple abstractions – entity, relationship and attribute –
to model data requirements. The UML class diagram notation has been used to
draw the ER diagrams, along with some non-standard additions that are particularly
useful for database design.
■ Two different sorts of relationship were discussed – association and generalization
relationships. Associations represent meaningful connections between entities.
Aggregation and composition associations are special forms of association. Gener-
alization relationships are distinct from associations, representing ‘is a kind of’
relationships, which are most commonly met in object-orientated approaches to
application development.
■ The important topic of association multiplicity constraints has been covered in
depth. Two kinds of multiplicity constraints – inner and outer – were identified.
These were seen to be roughly equivalent for binary associations, but quite differ-
ent for n-ary associations.
■ The chapter concluded by suggesting approaches to validating an ER diagram
against the users’ requirements.
Further reading
Muller (1999) covers database design with the UML in depth and includes logical de-
sign for data models other than the relational data model (our Chapter 9) and XML
(our Chapter 12).
Connolly and Begg (2004) also use the UML notation for ER diagrams.
There are many books on UML itself and all cover class diagrams, which form the basis
for the ER diagrams in this chapter. Good introductory UML texts include Bennett
et al. (2005) and Holt (2001).
There are many existing database textbooks that cover the database design process, but
use other notations for ER diagrams. Kroenke (2006) and Hoffer et al. (2004) have
particularly good discussions of conceptual database design.
The research paper by Génova et al. (2002) forms the basis for the comparison of
inner and outer multiplicity in Section 8.6. It is a reasonably clear discussion, though,
as it is a research paper, it may be considered a little heavy going. The distinction
between inner and outer multiplicities is not often made in database textbooks, but it
is essential to understand it if the UML notation is used.
Review questions
8.3 Describe three of the additional features of attributes and give an example of each.
8.4 Explain the purpose of a derived attribute. How are these indicated in the ER
diagram?
8.7 Explain the differences between an association, aggregation association and com-
position association. How are aggregation and composition associations indicated in
the ER diagram?
Exercises
8.11 For each of the following descriptions, draw an ER diagram showing an entity and its
attributes.
(a) When they enrol at the Pennine University, students are each allocated a ten-
character student ID number, which acts as a unique identifier. The first name(s),
preferred name and family name are recorded for each student. Each is allocated
a university e-mail address, and optionally may provide an additional e-mail ad-
dress and phone number.
(b) Each course has a unique, four-character course code and a title. The level of the
course (access, undergraduate or postgraduate), the exit qualification and the
mode of study (full- or part-time) are all recorded. The name of the department(s)
delivering the course is also included (there are at most two departments in-
volved). The title and mode uniquely identify a course.
(c) Each teaching session runs on a particular day (Monday to Friday), between two
given times. No teaching session starts before 09.00 or after 18.00. Teaching ses-
sions last between one and three hours (the usual length is one hour). Each
teaching session is timetabled to run between two fixed dates and runs for be-
tween 1 and 12 weeks. The type of the session is lecture, tutorial, practical, work-
shop or seminar. For sessions that are not lectures, there is a session code, to
distinguish between classes that run in parallel.
(d) Write at least two similar descriptions of entities based on a business or organiza-
tion that you are familiar with and draw an appropriate ER diagram.
8.12 For each of the following descriptions, draw an ER diagram showing one or more en-
tities (without their attributes) joined by a single association (possibly with attributes)
or a generalization relationship. Include the (outer) multiplicity constraints in your di-
agram.
(a) A member of staff teaches one or more teaching sessions and each teaching ses-
sion is taught by exactly one member of staff.
(b) A student may act as student representative for, at most, one course and each
course has one or more student representatives. As there is supposed to be one
student representative for each year of study on a course, which student repre-
sents which year is also recorded.
(c) An exam consists of one or more questions, with each question appearing on
zero or more exams. (Hint: Use an aggregation association.)
(d) Questions may themselves be composed of many other questions, with each of
these subsidiary questions belonging to exactly one main question. (Hint: Use a
generalization relationship and a composition association.)
(e) An online chess tournament has 10 professionals and 100 amateurs taking part.
Each game is played between one amateur and one professional. Each profes-
sional will play every amateur. If an amateur wins a game, he or she can play a
second game against that professional. Any amateur winning two games against
a professional is awarded a prize.
(f) Write at least two similar descriptions of associations or generalizations based on
a business or organization that you are familiar with and draw an appropriate ER
diagram.
8.13 Analyse the ‘View timetable’ use case for the Web Timetable case study in Appen-
dix B, using the approach outlined in Section 8.2 and Appendix A. Draw up an appli-
cation glossary using the style of Appendix A. Draw an initial conceptual ER diagram
based on your analysis of the ‘View timetable’ use case.
8.14 Complete the conceptual database design for the Web Timetable case study by
analysing the remaining use cases.
8.15 Validate your solution to Exercise 8.14, identifying and removing connection traps
and redundant relationships.
8.16 Use transaction path analysis to verify that your conceptual database design for the
Web Timetable supports each of the following transactions.
(a) View a timetable for a particular course.
(b) View a timetable for a particular academic, including his or her support sessions.
(c) List all equipment stored in a particular room.
(d) List all portable equipment available for use in a particular teaching session on a
particular date.
(e) List all tutorial, practical and seminar group teaching sessions for a particular
course. Do not include teaching sessions that have four or fewer places left.
Investigations
8.18 Investigate the benefits, and the drawbacks, of including operations in the ER dia-
gram. What sorts of operations can be included? Why do traditional approaches to
conceptual database design avoid such behavioural modelling? (Muller (1999) is a
good starting point for this.)
8.19 Choose a business or organization that you know well and develop a conceptual
database design that meets its information requirements. If the organization is large,
focus on one aspect of its activities.
8.20 ER diagrams are one of a number of techniques for conceptual database design.
Object Role Modelling (ORM) (Halpin, 2001) is one alternative, the Semantic Object
Data Model (Kroenke, 2006) another. Investigate one of these alternative
approaches to conceptual database design, contrasting it with the ER diagram-
ming approach.
IDWA_C09.QXD 10/5/07 11:40 AM Page 406
Chapter objectives
Chapter outline
A conceptual database design demonstrates that an application will meet the users’
data requirements, without going into the details of how. Logical database design
begins to deal with how. It takes the data requirements identified in the conceptual
database design and represents them using the data structures available in a particular
data model, and this chapter focuses on the relational data model. Separating concep-
tual and logical database design means that the end users don’t need to understand
the particular data model used. Instead, the simple, and intuitive, abstractions of an
ER diagram form the basis for discussion between designer and users. As formal data
models (even the relatively simple relational data model) can be hard to grasp, this can
be a successful approach.
The problem is that settling on a particular data model means reconsidering the
way data is organized. An ER diagram uses three structures to organize data:
● entity
● attribute
● relationship.
The relational data model has only two:
● table
● column.
It is clear that an entity will become a table and each attribute of the entity will be
a column in this table (the rows of the table are the entity occurrences – the actual
data). This leaves the associations.
In an ER diagram, an association represents a connection between entities. In the
relational data model, a connection between tables is represented by a foreign key col-
umn. However, associations can have attributes and foreign key columns cannot. This
means that the translation from ER diagram to relational database design is not
straightforward. This is the price paid by the database designer for a freer discussion of
data requirements during conceptual database design.
■ Section 9.1 introduces the standard approach to logical database design, showing
how to translate the ER diagram for the Staff Directory into a set of table schema
(the relational database design). This section is intended to introduce the key ideas
of logical database design before discussing those features of ER diagrams that
don’t have direct counterparts in the relational model.
■ Section 9.2 shows how to remove these features from an ER diagram while main-
taining the original intentions of the designer. The result is called a logical ER dia-
gram, targeting the relational data model.
■ Section 9.3 introduces some refinements to logical database design. In particular, it
discusses how to derive default values and referential actions from the logical ER
diagram. It ends with a summary of the logical database design process.
■ Section 9.4 discusses an alternative database design technique called normaliza-
tion. This was developed specifically for the relational data model and is an impor-
tant tool for validating the relational database design.
During logical database design, the entities, attributes and relationships of the concep-
tual database design are translated into the data structures of a particular data model.
The simpler structures used in an ER diagram usually have direct counterparts in the
chosen data model, so the translation is pretty straightforward. For a relational data-
base, entities become tables, single-valued attributes become columns and one-to-
many associations with no attributes are represented by foreign keys. Provided an ER
diagram only uses these three simple structures, the database designer can write down
a set of table schema that meet the end users’ data requirements.
The conceptual design for the Staff Directory database, shown in Figure 9.1, is such
an ER diagram. It has two entities, each of which has a number of single-valued attrib-
utes. There is a single one-to-many association, which has no attributes of its own.
Figure 9.1 ER diagram representing the conceptual database design for the
Staff Directory database.
«persistent» Staff
    staffNo : Varchar {PK}
    fName : Varchar
    lName : Varchar
    phone [0..1] : Varchar
    email : Varchar {CK=1}
    photo [0..1] : Blob
    department : Varchar
    title : Varchar {Value one of Mr, Ms, Miss, Mrs, Dr, Prof}
    jobType : Varchar
    jobTitle : Varchar
«persistent» SupportSession
    dayOfWeek : Varchar {Value any of Monday, ..., Friday}
    startTime : Time
    endTime : Time
Offer (association): Staff 1..1, SupportSession 0..*
The earlier chapters on the relational data model and the relational data language
SQL suggest that the logical database design should include the characteristics of
tables and columns listed in Table 9.1. These are all documented in the conceptual
database design. The conceptual database design can also include a number of business
rules. Whether these business rules can be implemented or not depends on the
particular DBMS used, so they are considered during physical database design.
The first thing to do is write down an initial table schema for each entity. This
should include the table name and column descriptions that form the table heading.
The table name will usually be the same as the entity name. Columns will usually have
the same name and data type as the attribute they implement. Any default values can
be noted after the data type. An attribute with multiplicity [0..1] may not have a value
on some entity occurrences. This means that the column implementing this attribute
may be null on some rows of the table. All other columns have multiplicity [1..1] so will
be not null. The column description should clearly indicate whether a column can or
cannot be null.
Occasionally the database designer will want to choose a different name for the
table or column than that used for the corresponding entity or attribute. The main rea-
son for changing a table or column name is that the name used in the ER diagram is
not a valid SQL identifier. For example, it might be a reserved word in the dialect of
SQL used by the target DBMS. Another potential problem might be that the target
DBMS doesn’t support a particular data type. These issues are dealt with during physi-
cal database design. Another reason for changing a name is if the end users insisted on
an unnecessarily long or confusing name during conceptual design. Now they are out
of the picture, the database designer could change this to something shorter or more
meaningful. Code listing 9.1.1 shows the initial table schema for the Staff and
SupportSession tables.
Code listing 9.1.1 Initial table schema for the Staff and SupportSession
tables
Staff (staffNo VARCHAR NOT NULL,
fName VARCHAR NOT NULL,
lName VARCHAR NOT NULL,
phone VARCHAR NULL,
email VARCHAR NOT NULL,
photo BLOB NULL,
department VARCHAR NOT NULL,
title VARCHAR NOT NULL,
jobType VARCHAR NOT NULL,
jobTitle VARCHAR NOT NULL)
SupportSession (dayOfWeek VARCHAR NOT NULL,
startTime TIME NOT NULL,
endTime TIME NOT NULL)
In the relational data model, each table should have a primary key. The Staff entity is
a strong entity (Figure 9.1) and its primary key becomes the primary key of the Staff
table. The SupportSession entity is a weak entity so does not have a primary key. The
SupportSession table will have a primary key, but, at the moment, it cannot
be identified. Primary keys are noted below the table heading, as shown in Code
listing 9.1.2.
the table schema so that they can be easily turned into SQL create table statements.
However, some database designers feel that table schema should have a more polished
look to them.
An alternative is to set out the table schema in a way similar to that shown in Fig-
ure 9.2. This is another of those times when personal preference (or a manager’s
preference) will decide which approach to use.
The next step deals with the one-to-many associations. Each one-to-many associa-
tion generates a foreign key and its associated foreign key constraint. The question is,
which of the two tables gets the foreign key?
Consider the Offer association between Staff and SupportSession. This tells us that
each occurrence of the SupportSession entity is linked by a single occurrence of the
Offer association to a single occurrence of the Staff entity. In contrast, one occurrence
of the Staff entity may be linked by many different occurrences of the Offer association
to occurrences of the SupportSession entity. Translating this into the relational data
model, each row of the SupportSession table should be linked to a single row
of the Staff table, while a single row of Staff may be linked to many rows of
SupportSession. Clearly the SupportSession table is the one that needs
the foreign key – a foreign key always links back to a single row in the referenced table.
Figure 9.3 shows the ER diagram and relational database tables side by side and
illustrates the process of translating a one-to-many association into a foreign key
relationship between two tables.
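The effect of placing the foreign key at the ‘many’ end can be checked with a short experiment. The sketch below uses Python's sqlite3 module with a cut-down Staff table. SQLite and its TEXT type are choices made for the example, not something the design prescribes. The staff number 56673 is taken from the sample data, while '00000' is an invented value that matches no row of Staff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Staff (
    staffNo TEXT NOT NULL PRIMARY KEY,
    lName   TEXT NOT NULL
);
-- The table at the 'many' end carries the foreign key column
CREATE TABLE SupportSession (
    staffNo   TEXT NOT NULL REFERENCES Staff(staffNo),
    dayOfWeek TEXT NOT NULL,
    startTime TEXT NOT NULL,
    endTime   TEXT NOT NULL
);
INSERT INTO Staff VALUES ('56673', 'Bittaye');
INSERT INTO SupportSession VALUES ('56673', 'Friday',   '15:30:00', '16:30:00');
INSERT INTO SupportSession VALUES ('56673', 'Thursday', '15:30:00', '16:30:00');
""")

# One row of Staff may be linked to many rows of SupportSession.
sessions_for_staff = conn.execute(
    "SELECT COUNT(*) FROM SupportSession WHERE staffNo = '56673'"
).fetchone()[0]

# Each row of SupportSession must link back to an existing row of Staff.
try:
    conn.execute(
        "INSERT INTO SupportSession VALUES ('00000', 'Monday', '09:00:00', '10:00:00')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(sessions_for_staff, orphan_rejected)
```

Putting the foreign key in Staff instead would not work, as a single column value could refer to only one support session.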
One-to-many associations are always dealt with in this way – the table that imple-
ments the entity at the ‘many’ end of the one-to-many association gets the foreign
key. In the example, the matching candidate key in the Staff table is {staffNo},
410
[Figure 9.3 (diagram not reproduced): the ER diagram (Staff 1..1 Offer 0..* SupportSession) shown beside the Staff and SupportSession tables.]
Staff (annotation: matching candidate key column already exists in table at the ‘one’ end of the one-to-many association)
staffNo:Varchar | fName:Varchar | lName:Varchar | phone:Varchar | email:Varchar | photo:BLOB | department:Varchar | title:Varchar | jobType:Varchar | jobTitle:Varchar
10780 | John | Smith | NULL | [email protected] | [BLOB] | Catering | Mr | Support | Cheif
25447 | John | Smith | 5104 | [email protected] | [BLOB] | Music | Mr | Administration | Secretary
25448 | Judith Anne | Smith | 7709 | [email protected] | [BLOB] | Estates | Mrs | Support | Estates Manager
31210 | Paul | Smith | NULL | [email protected] | [BLOB] | Computing | Dr | Academic | Senior Lecturer
77712 | Frank | Rose | 8871 | [email protected] | [BLOB] | Computing | Mr | Technical | Technician
14443 | Helen | Abbot | 8032 | [email protected] | [BLOB] | Computing | Mrs | Administration | Secretary
23257 | Freya | Stark | 8660 | [email protected] | [BLOB] | Computing | Prof | Academic | Dean
33935 | Padma | Brar | 6641 | [email protected] | [BLOB] | Health | Ms | Administration | Administrator
35054 | Selma | Hutchins | 8706 | [email protected] | [BLOB] | Computing | Ms | Academic | Senior Lecturer
45965 | Mikhail | Sudbin | 5553 | [email protected] | [BLOB] | Music | Mr | Academic | Lecturer
35155 | Helene | Chirac | NULL | [email protected] | [BLOB] | Health | Miss | Technical | Technician
55776 | Gurpreet | Choudhury | 5454 | [email protected] | [BLOB] | Music | Dr | Academic | Senior Lecturer
56893 | Ruth | Bapetsi | 8022 | [email protected] | [BLOB] | Health | Mrs | Academic | Senior Lecturer
56673 | Joshua | Bittaye | 7782 | [email protected] | [BLOB] | Computing | Mr | Academic | Lecturer
89987 | Dan | Lin | 8514 | [email protected] | [BLOB] | Health | Dr | Administration | Senior Administrator
78893 | Jo Karen | O’Connor | 8871 | [email protected] | [BLOB] | Health | Miss | Administration | Administrator
33509 | Helen | Timms | 8661 | [email protected] | [BLOB] | Music | Mrs | Technical | Technician
SupportSession (annotation: foreign key column added to table at the ‘many’ end of the one-to-many association)
staffNo:Varchar | dayOfWeek:Varchar | startTime:Time | endTime:Time
56673 | Friday | 15:30:00 | 16:30:00
56673 | Thursday | 15:30:00 | 16:30:00
35054 | Monday | 09:00:00 | 10:00:00
45965 | Wednesday | 11:00:00 | 12:00:00
31210 | Wednesday | 11:00:00 | 13:00:00
35054 | Monday | 15:00:00 | 16:00:00
56893 | Thursday | 09:00:00 | 10:00:00
45965 | Monday | 11:00:00 | 12:00:00
23257 | Monday | 15:00:00 | 16:00:00
55776 | Monday | 14:00:00 | 16:00:00
56893 | Tuesday | 14:00:00 | 15:00:00
56673 | Friday | 10:00:00 | 11:00:00
35054 | Tuesday | 11:00:00 | 12:00:00
9.1 Introducing logical database design
At this point, most of the information about entities, their attributes and the
one-to-many associations has been included in the table schema. What’s missing is any
mention of the candidate keys and business rules. One option is simply to leave these
things out of the logical database design. They are fully documented in the conceptual
design and it is possible that they cannot be implemented on the target DBMS, so it
would make sense to consider them during physical design. On the other hand, this
means that the logical design (the set of table schema) is not a complete specification
of the end users’ data requirements. It will need to be read in conjunction with the
conceptual design. Rather than have the data requirements specified in two different
places, it is better to note any outstanding data requirements in the logical design
itself. Candidate keys and other constraints can be noted after the foreign key
constraints. Code listing 9.1.5 shows the final logical design for the Staff Directory
database.
Code listing 9.1.5 Final logical design for the StaffDirectory database
Staff (staffNo VARCHAR NOT NULL,
fName VARCHAR NOT NULL,
lName VARCHAR NOT NULL,
phone VARCHAR NULL,
email VARCHAR NOT NULL,
photo BLOB NULL,
department VARCHAR NOT NULL,
title VARCHAR NOT NULL,
jobType VARCHAR NOT NULL,
jobTitle VARCHAR NOT NULL)
PRIMARY KEY (staffNo)
CANDIDATE KEY (email)
BUSINESS RULE title must be one of {'Mr', 'Ms', 'Miss',
'Mrs', 'Dr', 'Prof'}
SupportSession (staffNo VARCHAR NOT NULL,
dayOfWeek VARCHAR NOT NULL,
startTime TIME NOT NULL,
endTime TIME NOT NULL)
PRIMARY KEY (staffNo, dayOfWeek, startTime)
FOREIGN KEY (staffNo) REFERENCES Staff(staffNo)
CANDIDATE KEY (staffNo, dayOfWeek, endTime)
BUSINESS RULE dayOfWeek must be one of {'Monday',
'Tuesday', 'Wednesday', 'Thursday', 'Friday'}
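As a rough illustration of where this logical design leads, the sketch below turns Code listing 9.1.5 into SQL and runs it on SQLite through Python's sqlite3 module. Several things here are assumptions made for the example rather than part of the design: the choice of SQLite, the use of UNIQUE for candidate keys and CHECK for business rules, the helper name rejected and the sample e-mail address. Physical design for another DBMS may well look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

conn.executescript("""
CREATE TABLE Staff (
    staffNo    VARCHAR NOT NULL,
    fName      VARCHAR NOT NULL,
    lName      VARCHAR NOT NULL,
    phone      VARCHAR NULL,
    email      VARCHAR NOT NULL,
    photo      BLOB    NULL,
    department VARCHAR NOT NULL,
    title      VARCHAR NOT NULL
        CHECK (title IN ('Mr', 'Ms', 'Miss', 'Mrs', 'Dr', 'Prof')),  -- business rule
    jobType    VARCHAR NOT NULL,
    jobTitle   VARCHAR NOT NULL,
    PRIMARY KEY (staffNo),
    UNIQUE (email)                                    -- candidate key
);
CREATE TABLE SupportSession (
    staffNo   VARCHAR NOT NULL,
    dayOfWeek VARCHAR NOT NULL
        CHECK (dayOfWeek IN ('Monday', 'Tuesday', 'Wednesday',
                             'Thursday', 'Friday')),  -- business rule
    startTime TIME NOT NULL,
    endTime   TIME NOT NULL,
    PRIMARY KEY (staffNo, dayOfWeek, startTime),
    FOREIGN KEY (staffNo) REFERENCES Staff(staffNo),
    UNIQUE (staffNo, dayOfWeek, endTime)              -- candidate key
);
""")


def rejected(sql, params):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        with conn:
            conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True


# Sample row; the e-mail address is invented for this sketch.
conn.execute(
    "INSERT INTO Staff VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("23257", "Freya", "Stark", "8660", "f.stark@example.org", None,
     "Computing", "Prof", "Academic", "Dean"),
)
conn.execute("INSERT INTO SupportSession VALUES (?, ?, ?, ?)",
             ("23257", "Monday", "15:00:00", "16:00:00"))
conn.commit()

# A weekend session breaks the business rule on dayOfWeek.
bad_day = rejected("INSERT INTO SupportSession VALUES (?, ?, ?, ?)",
                   ("23257", "Sunday", "15:00:00", "16:00:00"))
# A session for a staff number that is not in Staff breaks the foreign key.
unknown_staff = rejected("INSERT INTO SupportSession VALUES (?, ?, ?, ?)",
                         ("99999", "Monday", "09:00:00", "10:00:00"))
```

Both bad_day and unknown_staff come out true, so the constraints noted in the logical design are doing real work once they reach the DBMS.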
As a conceptual ER diagram may include structures that are not compatible with the
target data model, a sensible first step is to remove them. They can be replaced with
structures that are compatible with the target data model and broadly model the same
situation. In most cases, this will mean replacing a structure that has a complex
meaning (such as an n-ary association) with some combination of simpler structures
(such as entities and binary associations). The goal is to produce an ER diagram that
can easily be used to write down a logical database design.
The most common structure in a conceptual ER diagram that is not compatible with
the relational data model is the many-to-many association. Consider the many-to-many
association Uses, which relates a course and the modules taught on that course
(Figure 9.4). The obvious way to try to implement this in a relational database is to
include a foreign key column course in the Module table (Figure 9.5).
The problem is that a column can only hold one value, so the foreign key column
course can only associate a particular module with one course. This isn’t good
enough as a module may be used by two or more courses. For example, the matching
primary and foreign keys in Figure 9.5 associate the module ‘Introduction to
programming’ with the full-time ICT course, but cannot show that it is also associated
with the part-time ICT course (which it is). The same reasoning shows that a foreign
key column module in the Course table could only ever indicate that a particular
Course used one module. Again, this isn’t good enough because a course needs to have
at least four modules. A many-to-many association cannot be represented by a foreign key.
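The limitation can be seen by trying it out. The following sketch, which assumes SQLite and cuts the two tables down to the columns that matter, stores ‘Introduction to programming’ against the full-time ICT course and then tries to record that the part-time course uses it as well. The trimmed column lists and variable names are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Course (code CHAR(4) NOT NULL PRIMARY KEY);
CREATE TABLE Module (
    code   VARCHAR NOT NULL PRIMARY KEY,
    title  VARCHAR NOT NULL,
    course CHAR(4) NOT NULL REFERENCES Course(code)  -- holds one value per row
);
INSERT INTO Course VALUES ('C3F5'), ('C3F8');
INSERT INTO Module VALUES ('CCFC0108', 'Introduction to programming', 'C3F5');
""")

# Try to record that the same module is also used by the part-time course.
try:
    conn.execute(
        "INSERT INTO Module VALUES "
        "('CCFC0108', 'Introduction to programming', 'C3F8')"
    )
    second_link_stored = True
except sqlite3.IntegrityError:
    # The primary key on Module.code forbids a second row for the same module.
    second_link_stored = False

courses_for_module = [row[0] for row in conn.execute(
    "SELECT course FROM Module WHERE code = 'CCFC0108'")]
```

The second insert is refused, and the module remains linked to C3F5 only. Overwriting the course column instead would simply swap one course for the other.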
There are two other structures that occur in ER diagrams but have no direct
equivalent in the relational data model. These are n-ary associations for n ≥ 3 (a
foreign key can represent a link between, at most, two tables) and multivalued
attributes (a column can hold, at most, one value of the given data type). The
following subsections illustrate how these three structures can be replaced in the ER
diagram by a combination of ‘simpler’ structures that do have direct equivalents in
the relational data model. The resulting ER diagram is called a logical ER diagram.
Module
code:varchar title:varchar department:varchar level:char preReq:varchar course:char
CCFC0108 Introduction to programming Computing F NULL C3F5
CIFC0084 Information systems Computing F NULL C3F8
BAFC0178 Introduction to business Business F NULL B4T1
BMFC0107 Introduction to management Business F NULL B2Y6
CIIC0053 Application development Computing I CCFC0108 C3F5
CCIC7009 Computing mathematics Computing I NULL C2Z1
CCHO0418 Formal specification Computing H NULL C2Z1
BAIC3427 Business audit Business I BAFC0178 B4T1
CCH09668 Advanced programming Computing H CCIC7009 C2Z1
CIHO6008 Soft systems Computing H NULL C3F5
Course
code:char title:varchar department:varchar level:varchar qualification:varchar mode:varchar
C3F5 ICT {Computing} Undergraduate BSc Full time
C3F8 ICT {Computing} Undergraduate BSc Part time
B4T1 Business studies {Business} Undergraduate BA Full time
B2Y6 Management {Business} Undergraduate BA Full time
B2Y8 Business computing {Computing, Business} Undergraduate HND Part time
B2Y9 Business computing {Computing, Business} Undergraduate HND Full time
C2Z1 Computer science {Computing} Undergraduate BSc Full time
Some relational database designers, recognizing that these features will cause them
problems later, choose not to use them at all. Others prefer to use them though, as
they help to make conceptual design more intuitive and flexible. Which approach to
take is, once again, a matter of personal preference.
● use a non-scalar data type, such as set, multiset or array, the members of which have
the same data type as the attribute
● replace the multivalued attribute with two or more single-valued attributes on the
same entity
● replace the multivalued attribute with a new entity and a one-to-many association
from the original entity.
Conceptual ER diagram:
«persistent» Course
code : Char(4) {PK}
title : Varchar {CK=1}
department [1..2] : Varchar
level : Varchar
qualification : Varchar
mode : Varchar {CK=1, Value one of 'FullTime', 'PartTime'}
Logical ER diagram:
«persistent» Course
code : Char(4) {PK}
title : Varchar {CK=1}
department : Varchar Array [2] {Must have 1 or 2 elements}
level : Varchar
qualification : Varchar
mode : Varchar {CK=1, Value one of 'FullTime', 'PartTime'}
array, rather than a C-style array), each of which is a varying length character string.
Note that an array has an upper bound on the number of members – in this case two –
but no lower bound. The lower bound of the original multiplicity constraint has to be
specified in an attribute constraint. It’s clear that the logical ER diagram is equivalent
to the conceptual ER diagram in terms of their meaning. The logical ER diagram
simply gives more guidance on how to implement the data requirement for a
multivalued attribute.
Code listing 9.2.1.1 shows the table schema based on this logical ER diagram. Note
that the not-null constraint on the department column will not ensure that the
array members are not null – department = [NULL, NULL] assigns a valid,
not-null value for this column. This is why there is a new business rule noted for the
department column.
Code listing 9.2.1.1 Table schema derived from the logical ER diagram
in Figure 9.6
Course (code CHAR(4) NOT NULL,
title VARCHAR NOT NULL,
department VARCHAR ARRAY [2] NOT NULL,
level VARCHAR NOT NULL,
qualification VARCHAR NOT NULL,
mode VARCHAR NOT NULL
)
PRIMARY KEY (code)
CANDIDATE KEY (title, mode)
BUSINESS RULE department has at least 1 member
BUSINESS RULE mode must be one of {'Full-time', 'Part-time'}
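The difference between the not-null constraint and the business rule on department is easy to blur, so here is a small sketch in plain Python. A list stands in for the SQL array and None stands in for NULL. The function names are hypothetical, and the second function reads the business rule as ‘at least one non-null member’, which is an interpretation of the listing rather than something it spells out.

```python
from typing import Optional, Sequence

MAX_DEPARTMENTS = 2  # upper bound taken from VARCHAR ARRAY [2]

Department = Optional[Sequence[Optional[str]]]


def column_is_not_null(department: Department) -> bool:
    """Mimic the NOT NULL column constraint: only the array itself is tested."""
    return department is not None


def satisfies_business_rule(department: Department) -> bool:
    """Assumed reading of the rule: one or two members, at least one not null."""
    if department is None or len(department) > MAX_DEPARTMENTS:
        return False
    return any(member is not None for member in department)


all_null_members = [None, None]
passes_not_null = column_is_not_null(all_null_members)
passes_rule = satisfies_business_rule(all_null_members)
```

An array of two nulls passes the first check and fails the second, which is why the extra business rule has to be written down.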
The non-scalar data types are a fairly recent innovation in relational database
technology. The SQL standard only includes multiset and array data types and even these are
still not widely supported by commercial DBMS. Because of this, it makes sense to
avoid using them unless it’s known that they will definitely be available on the target
DBMS. As there are, at most, two values held by the department attribute, it is possible
in this case to replace it with two new single-valued attributes – leadDepartment and
minorDepartment. These are shown in Figure 9.7. This approach can provide more
information than using a multivalued attribute. In the current example, when the two
departments are held in a single multivalued attribute, there is no way to identify
which is the lead department and which the minor department. With two separate
attributes, doing so is easy.
The multiplicity of the new attributes depends on the specific data requirements of
the application being modelled. In the example, where only one department is
involved in running a course, there will be no minor department, so the multiplicity
of minorDepartment must be [0..1]. Because the department attribute always has at least
one value, one of the two new attributes must have multiplicity [1..1] and this must be
leadDepartment. Again, this reflects a business rule – that there is always at least one
department involved in running a particular course. The table schema is shown in
Code listing 9.2.1.2.
Conceptual ER diagram:
«persistent» Course
code : Char(4) {PK}
title : Varchar {CK=1}
department [1..2] : Varchar
level : Varchar
qualification : Varchar
mode : Varchar {CK=1, Value one of 'FullTime', 'PartTime'}
Logical ER diagram:
«persistent» Course
code : Char(4) {PK}
title : Varchar {CK=1}
leadDepartment : Varchar
minorDepartment [0..1] : Varchar
level : Varchar
qualification : Varchar
mode : Varchar {CK=1, Value one of 'FullTime', 'PartTime'}
Code listing 9.2.1.2 Table schema derived from the logical ER diagram in
Figure 9.7
Course (code CHAR(4) NOT NULL,
title VARCHAR NOT NULL,
leadDepartment VARCHAR NOT NULL,
minorDepartment VARCHAR NULL,
level VARCHAR NOT NULL,
qualification VARCHAR NOT NULL,
mode VARCHAR NOT NULL
)
PRIMARY KEY (code)
CANDIDATE KEY (title, mode)
BUSINESS RULE mode must be one of {'Full-time', 'Part-time'}
Replacing a multivalued attribute with a number of single-valued attributes will only
work where the maximum number of values is fixed. If the maximum is many (‘*’),
then there is no way of knowing how many single-valued attributes are required. It
is also only really feasible where the maximum is a small number, so multivalued
attributes with a maximum of four values can be dealt with in this way. For example, a
person may have up to four phone numbers – home phone, work phone, mobile and
fax. It would be acceptable to store each of these in a separate attribute.
In many cases it is more effective to extract the multivalued attribute into a new
entity. This approach to multivalued attributes creates a look-up entity. The multiple
values of the original attribute are held in a separate entity and can be ‘looked up’ by
following an association between the original entity and the look-up entity. Figure 9.8
demonstrates this process, showing the original entity with its multivalued attribute
from the conceptual ER diagram and how it is represented in the logical ER diagram.
Note that there is no department attribute on the Course entity in the logical ER
diagram. Instead, each occurrence of Course is associated with at least one, and
possibly two, occurrences of the new Department entity. This entity has a single attribute,
name, which is just a renamed copy of the department attribute on the Course entity.
This is equivalent to the situation in the conceptual ER diagram where each occurrence
of Course has at least one, and possibly two, values for the department attribute.
Figure 9.8 Replacing a multivalued attribute with a look-up entity and one-to-many
association.
[Logical ER diagram: «persistent» Course (code : Char(4) {PK}, title : Varchar {CK=1}, level : Varchar, qualification : Varchar, mode : Varchar {CK=1, Value one of 'FullTime', 'PartTime'}) and «persistent» Department (name : Varchar) joined by the association RunBy, with multiplicity constraints 1..1 at the Course end and 1..2 at the Department end.]
Whenever a multivalued attribute is extracted into a look-up entity, the look-up
entity has just the one attribute. The look-up entity will also be a weak entity – in
the example, two courses may be run by the same department, so there can be two
occurrences of Department that have the same value for the name attribute, but are
associated with different occurrences of Course.
The multiplicity constraint at the look-up entity end of the association is the same
as the multiplicity constraint on the original multivalued attribute. The multiplicity
constraint at the other end of the association is always 1..1. A particular occurrence of
the look-up entity Department is supposed to hold one value from the department
attribute on some particular occurrence of the Course entity. This value was associated
with a single occurrence of Course, so an occurrence of the look-up entity must also be
associated with a single occurrence of Course.
The table schema for this logical ER diagram are shown in Code listing 9.2.1.3. The
one-to-many association RunBy leads to the inclusion of a foreign key column code
in the Department table and, hence, the primary and foreign key constraints.
Code listing 9.2.1.3 Table schema derived from the logical ER diagram in
Figure 9.8
Course (code CHAR(4) NOT NULL,
title VARCHAR NOT NULL,
level VARCHAR NOT NULL,
qualification VARCHAR NOT NULL,
mode VARCHAR NOT NULL
)
PRIMARY KEY (code)
CANDIDATE KEY (title, mode)
BUSINESS RULE mode must be one of {'Full-time', 'Part-time'}
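The Department look-up table that goes with this Course table can be sketched as follows, again assuming SQLite. The text says only that Department gains a foreign key column code. Taking (code, name) as its primary key is an assumption made here because a weak entity needs its owner's key to identify it. The rows come from the Course data shown earlier in the chapter, and the upper bound of two departments per course is not enforced in this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Course (
    code          CHAR(4) NOT NULL,
    title         VARCHAR NOT NULL,
    level         VARCHAR NOT NULL,
    qualification VARCHAR NOT NULL,
    mode          VARCHAR NOT NULL
        CHECK (mode IN ('Full-time', 'Part-time')),
    PRIMARY KEY (code),
    UNIQUE (title, mode)
);
CREATE TABLE Department (
    code CHAR(4) NOT NULL,           -- foreign key column from RunBy
    name VARCHAR NOT NULL,
    PRIMARY KEY (code, name),        -- assumed key for the weak look-up entity
    FOREIGN KEY (code) REFERENCES Course(code)
);
INSERT INTO Course VALUES
    ('B2Y8', 'Business computing', 'Undergraduate', 'HND', 'Part-time'),
    ('B4T1', 'Business studies',   'Undergraduate', 'BA',  'Full-time');
INSERT INTO Department VALUES
    ('B2Y8', 'Business'),
    ('B2Y8', 'Computing'),
    ('B4T1', 'Business');
""")

departments_for_b2y8 = [row[0] for row in conn.execute(
    "SELECT name FROM Department WHERE code = 'B2Y8' ORDER BY name")]
rows_named_business = conn.execute(
    "SELECT COUNT(*) FROM Department WHERE name = 'Business'").fetchone()[0]
```

The department name ‘Business’ is stored twice, once for each course it helps to run, which is what being a weak look-up entity amounts to in the table.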
[Figure residue, object diagram. : Course (code = 'B2Y8', title = 'Business Computing', level = 'Undergraduate', qualification = 'HND', mode = 'FullTime'); : Course (code = 'B4T1', title = 'Business Studies', level = 'Undergraduate', qualification = 'BA', mode = 'FullTime'); three : Department objects, with name = 'Business', name = 'Computing' and name = 'Business'.]
duplicate data about departments held on the database. Note, though, that the only
information held about a department is its name.
To avoid even this small amount of duplication, the binary association RunBy could
be modelled as a many-to-many association. The next section explains how to transform
many-to-many binary associations so that they can be implemented in the relational
data model and revisits this example to see what difference modelling the RunBy
association as a many-to-many association makes to the logical database design.
[Figure residue, conceptual ER diagram: «persistent» Module (code : Char(7) {PK}) and «persistent» Course (code : Char(4) {PK}) joined by the association Uses, with multiplicity constraints 5..20 and 1..*.]
Code listing 9.2.2.1 Table schema derived from the logical ER diagram in
Figure 9.10
Module (code CHAR NOT NULL,
title VARCHAR NOT NULL,
department VARCHAR NOT NULL,
level VARCHAR NULL
)
PRIMARY KEY (code)
CANDIDATE KEY (department, title)
BUSINESS RULE level must be one of {'Foundation',
'Intermediate', 'Placement', 'Honours', 'Masters'}
Course (code CHAR NOT NULL,
title VARCHAR NOT NULL,
leadDepartment VARCHAR NOT NULL,
minorDepartment VARCHAR NULL,
level VARCHAR NOT NULL,
qualification VARCHAR NOT NULL,
mode VARCHAR NOT NULL
)
CourseModule
course:char module:varchar
C3F5 CCFC0108
C3F8 CCFC0108
C3F5 CIFC0084
C3F8 CIFC0084
B2Y8 CIFC0084
B2Y9 CIFC0084
B2Y8 BAFC0178
B2Y9 BAFC0178
B4T1 BAFC0178
B2Y6 BAFC0178
B2Y6 BMFC0107
B4T1 BMFC0107
B2Y8 CIIC0053
B2Y9 CIIC0053
Module
code:varchar title:varchar department:varchar level:char preReq:varchar course:char
CCFC0108 Introduction to programming Computing F NULL C3F5
CIFC0084 Information systems Computing F NULL C3F8
BAFC0178 Introduction to business Business F NULL B4T1
BMFC0107 Introduction to management Business F NULL B2Y6
CIIC0053 Application development Computing I CCFC0108 C3F5
CCIC7009 Computing mathematics Computing I NULL C2Z1
CCHO0418 Formal specification Computing H NULL C2Z1
BAIC3427 Business audit Business I BAFC0178 B4T1
CCH09668 Advanced programming Computing H CCIC7009 C2Z1
CIHO6008 Soft systems Computing H NULL C3F5
Course
code:char title:varchar leadDepartment:varchar minorDepartment:varchar level:varchar qualification:varchar mode:varchar
C3F5 ICT Computing NULL Undergraduate BSc Full time
C3F8 ICT Computing NULL Undergraduate BSc Part time
B4T1 Business studies Business NULL Undergraduate BA Full time
B2Y6 Management Business NULL Undergraduate BA Full time
B2Y8 Business computing Business Computing Undergraduate HND Part time
B2Y9 Business computing Business Computing Undergraduate HND Full time
C2Z1 Computer science Computing NULL Undergraduate BSc Full time
[Figure annotations: one module associated with many courses; one course associated with many modules.]
diagram, where the association between the Course entity and the new link entity
CourseModule involves aggregation. Note that the aggregation symbol does not move,
but remains attached to the same entity, even though it is a new association. Again,
the original meaning of the conceptual ER diagram is retained in the logical ER
diagram. An occurrence of Course is made up of occurrences of CourseModule, each of
which is associated with exactly one occurrence of Module.
Some many-to-many associations have attributes of their own. In this case, the
association’s attributes become attributes of the link entity in the logical ER diagram.
For example, at the Pennine University, each course is reviewed every five years and, at
each review, the modules that can be taken for a course may change. This means that
the link between an occurrence of Course and an occurrence of Module is time-limited.
The dates between which each occurrence of the Uses association is valid are clearly
attributes of the association itself.
This situation is shown in Figure 9.12, as is the result of resolving the many-to-
many association. Note how the association’s attributes become attributes of the link
entity. One-to-many binary associations never have attributes of their own as such
attributes can always be included on one or other of the two entities involved in the
association.
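A link table carrying the association's attributes might look like the sketch below, which assumes SQLite. The column names validFrom and validTo are substitutes chosen here, because the attribute names from and to are both SQL keywords. The primary key on the two foreign key columns, the validity dates and the function modules_valid_on are likewise assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Course (code CHAR(4) NOT NULL PRIMARY KEY);
CREATE TABLE Module (code VARCHAR NOT NULL PRIMARY KEY);
CREATE TABLE CourseModule (
    course    CHAR(4) NOT NULL REFERENCES Course(code),
    module    VARCHAR NOT NULL REFERENCES Module(code),
    validFrom DATE    NOT NULL,   -- 'from' in the diagram
    validTo   DATE    NULL,       -- 'to' in the diagram; null means open-ended
    PRIMARY KEY (course, module)
);
INSERT INTO Course VALUES ('C3F5'), ('C3F8');
INSERT INTO Module VALUES ('CCFC0108'), ('CIFC0084');
-- Invented validity periods, stored as ISO 8601 text so they compare correctly.
INSERT INTO CourseModule VALUES
    ('C3F5', 'CCFC0108', '2005-09-01', '2010-08-31'),
    ('C3F8', 'CCFC0108', '2005-09-01', '2010-08-31'),
    ('C3F5', 'CIFC0084', '2005-09-01', NULL);
""")


def modules_valid_on(course, day):
    """List the modules linked to a course on the given ISO date."""
    rows = conn.execute(
        "SELECT module FROM CourseModule "
        "WHERE course = ? AND validFrom <= ? "
        "AND (validTo IS NULL OR validTo >= ?) "
        "ORDER BY module",
        (course, day, day),
    )
    return [row[0] for row in rows]


during_period = modules_valid_on("C3F5", "2007-10-05")
after_review = modules_valid_on("C3F5", "2011-01-01")
```

The dates sit on the link table row, not on Course or Module, because they describe one particular pairing of the two.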
There is no need to model the dates between which a module for a course can
be taken in the conceptual ER diagram for the Web Timetable database application.
The Web Timetable database represents which modules are taught for which courses
during a particular year. There is no need to indicate future changes to modules
as such changes will be incorporated into the database instance before the timetable
is published. However, for a database holding more general information about
courses and modules, the validation period is an important piece of data and would need to
be included in the model.
Figure 9.12 Resolving a many-to-many association that has attributes of its own.
[Diagram residue: association attributes from : Date and to : Date.]
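[Figure 9.13 (diagram residue). Conceptual ER diagram: «persistent» Student with the unary association Represent, role name StudentRep, multiplicity constraints 1..2 and 1..*. Logical ER diagram: «persistent» Student, role name StudentRep, and the link entity «persistent» Representation, with multiplicity constraints 1..1, 1..1, 1..* and 1..2.]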
One situation that can puzzle novices is how to deal with a unary many-to-many
association. The approach is exactly the same as for a binary many-to-many association.
Consider the unary many-to-many association Represent in Figure 9.13. This models
how student representation works at the Pennine University. Each year group elects
one or two student representatives, who attend meetings to represent the views of
students in their year to the university. The multiplicity constraints capture this situation.
Any student acting as a StudentRep will represent one or more students. Each student
is represented by at least one, possibly two, students acting as a StudentRep.
In the logical ER diagram, the unary many-to-many association is replaced by the
link entity Representation and two one-to-many binary associations. Again, note that
the original multiplicity constraints ‘hop over’ the link entity.
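In table terms, the link entity for a unary association has two foreign keys that both reference the same table. The sketch below assumes SQLite, a hypothetical studentNo primary key on Student, and column names student and studentRep chosen to keep the two roles apart. The upper bound of two representatives per student is not enforced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Student (studentNo VARCHAR NOT NULL PRIMARY KEY);  -- hypothetical key
CREATE TABLE Representation (
    student    VARCHAR NOT NULL REFERENCES Student(studentNo),  -- who is represented
    studentRep VARCHAR NOT NULL REFERENCES Student(studentNo),  -- who represents them
    PRIMARY KEY (student, studentRep)
);
INSERT INTO Student VALUES ('S001'), ('S002'), ('S003');
-- Invented data: S001 acts as representative for the whole year group.
INSERT INTO Representation VALUES
    ('S001', 'S001'), ('S002', 'S001'), ('S003', 'S001');
""")

represented_by_s001 = [row[0] for row in conn.execute(
    "SELECT student FROM Representation "
    "WHERE studentRep = 'S001' ORDER BY student")]
reps_for_s003 = [row[0] for row in conn.execute(
    "SELECT studentRep FROM Representation WHERE student = 'S003'")]
```

Because both columns point at Student, the role each one plays has to be carried by its name.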
At this point, it’s worth returning to the problem of how to model the relationship
between a course and the department offering that course. The logical ER diagram in
Figure 9.8 seemed wrong because the obvious way to model an association between
the Course and Department entities is as a many-to-many association. The conceptual
ER diagram in Figure 9.14 does this.
Notice that the name is now the primary key as each occurrence of Department
will be unique. In the logical ER diagram, the many-to-many association is replaced
by a link entity and two new one-to-many associations, which, again, is shown in
Figure 9.14.
It’s possible to write down the table schema derived from the three entities in this
logical ER diagram. They are shown in Code listing 9.2.2.2.
Code listing 9.2.2.2 Table schema derived from the logical ER diagram in
Figure 9.14
Course (code CHAR(4) NOT NULL,
title VARCHAR NOT NULL,
level VARCHAR NOT NULL,
qualification VARCHAR NOT NULL,
mode VARCHAR NOT NULL
)
Figure 9.14 A many-to-many association involving an entity with just one attribute.
[Conceptual ER diagram: «persistent» Course (code : Char(4) {PK}, title : Varchar {CK=1}, level : Varchar, qualification : Varchar, mode : Varchar {CK=1, Value one of 'FullTime', 'PartTime'}) and «persistent» Department (name : Varchar) joined by the association RunBy, with multiplicity constraints 1..* at the Course end and 1..2 at the Department end.]
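[Figure 9.15 (diagram residue): the three-ary association BookEquipment, with attributes bookedOn : Date and bookedFor : Date, between «persistent» Staff, «persistent» TeachingSession and «persistent» Equipment. Extracted multiplicity labels: 0..1 {Inner multiplicity is 0..*} beside Staff and 0..* beside TeachingSession.]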
The best way to answer this question is to give an example. The BookEquipment
association from the Web Timetable is a three-ary association. It is shown in
Figure 9.15 with both the UML outer multiplicity and the inner multiplicity constraints.
It also has attributes of its own, so is about as complex as an association gets.
In Figure 9.16, the BookEquipment association of the conceptual ER diagram is
replaced by the link entity EquipmentBooking in the logical ER diagram. This link
entity inherits the two attributes bookedOn and bookedFor. Three new binary
associations are created, associating EquipmentBooking with each of the three original
entities – Staff, TeachingSession and Equipment. The three binary associations are
named here to make it easier to discuss them, but this is not necessary.
Figure 9.16 The three-ary association from Figure 9.15 resolved by deriving
multiplicity constraints from scratch.
[Logical ER diagram: the link entity «persistent» EquipmentBooking (bookedOn : Date, bookedFor : Date) is joined to «persistent» Staff by Makes, to «persistent» TeachingSession by For and to «persistent» Equipment by Books. Each association is 1..1 at the original entity end and 0..* at the EquipmentBooking end.]
So far, the transformation has been straightforward. The n-ary association has been
replaced by a link entity and three binary associations. The association attributes have
been absorbed into the link entity. The final task is to calculate the multiplicity
constraints on each of the three binary associations.
An occurrence of the link entity EquipmentBooking should be equivalent to an
occurrence of the association BookEquipment. By definition, an occurrence of an
association links together exactly one occurrence of each of the participating entities.
Hence, a single occurrence of EquipmentBooking must be associated with exactly one
occurrence of each of the other three entities. So, a particular occurrence of
EquipmentBooking must be associated with exactly one occurrence of Staff. This means
that the multiplicity constraint at the Staff end of the Makes association must be 1..1.
Similarly, the multiplicity constraint at the Equipment end of the Books association
must be 1..1, as it must also be at the TeachingSession end of the For association. All
three 1..1 multiplicity constraints are shown in the logical ER diagram in Figure 9.16.
To calculate the multiplicity constraints at the EquipmentBooking end of the three
binary associations, ask the usual questions to establish lower and upper bounds for
each one. The lower bound of the multiplicity constraint is calculated by asking this
question.
As some staff may never book additional equipment for their teaching sessions, the
lower bound is zero. To calculate the upper bound ask the next question.
Some staff may regularly make such bookings, so the upper bound is many and the
multiplicity constraint is 0..*. Note that this multiplicity constraint is the same as the
inner multiplicity constraint at the Staff end of the original n-ary association. This will
always be the case. For each of the binary associations, the inner multiplicity at the
link entity end will be the same as the inner multiplicity at the end of the n-ary
association attached to the binary association’s other participating entity.
Similar reasoning leads to the conclusion that the two remaining multiplicity
constraints are also 0..* – a particular teaching session may not require any additional
equipment or may require lots and a particular piece of equipment may never be booked
or may be booked many times. Figure 9.16 shows these multiplicity constraints.
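The multiplicity constraints derived above translate fairly directly into column constraints. In the following sketch, which assumes SQLite, each 1..1 constraint becomes a NOT NULL foreign key column on EquipmentBooking, and each 0..* constraint needs no constraint at all. The sessionId key for TeachingSession, the primary key chosen for EquipmentBooking, the asset number and the booking dates are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Staff (staffNo VARCHAR NOT NULL PRIMARY KEY);
CREATE TABLE TeachingSession (sessionId INTEGER NOT NULL PRIMARY KEY);  -- hypothetical key
CREATE TABLE Equipment (assetNo VARCHAR(15) NOT NULL PRIMARY KEY);
CREATE TABLE EquipmentBooking (
    staffNo   VARCHAR     NOT NULL REFERENCES Staff(staffNo),               -- Makes, 1..1
    sessionId INTEGER     NOT NULL REFERENCES TeachingSession(sessionId),  -- For, 1..1
    assetNo   VARCHAR(15) NOT NULL REFERENCES Equipment(assetNo),          -- Books, 1..1
    bookedOn  DATE NOT NULL,
    bookedFor DATE NOT NULL,
    PRIMARY KEY (staffNo, sessionId, assetNo)   -- assumed for this sketch
);
INSERT INTO Staff VALUES ('35054'), ('56673');
INSERT INTO TeachingSession VALUES (1), (2);
INSERT INTO Equipment VALUES ('PROJ-0001');
INSERT INTO EquipmentBooking VALUES
    ('35054', 1, 'PROJ-0001', '2007-10-01', '2007-10-05'),
    ('35054', 2, 'PROJ-0001', '2007-10-01', '2007-10-12');
""")

# 0..* at the link entity end: a member of staff may have no bookings or many.
bookings_per_staff = dict(conn.execute("""
    SELECT s.staffNo, COUNT(b.staffNo)
    FROM Staff AS s
    LEFT JOIN EquipmentBooking AS b ON b.staffNo = s.staffNo
    GROUP BY s.staffNo
"""))

# 1..1 at the other end: a booking with no equipment is refused.
try:
    conn.execute(
        "INSERT INTO EquipmentBooking VALUES "
        "('56673', 1, NULL, '2007-10-01', '2007-10-05')"
    )
    incomplete_booking_stored = True
except sqlite3.IntegrityError:
    incomplete_booking_stored = False
```

One member of staff has two bookings, the other has none, and the booking with a missing participant is rejected.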
The transformation from a conceptual to a logical ER diagram loses information on
the outer multiplicity constraints on the original n-ary association. Because of this, the
logical ER diagram in Figure 9.16 is not equivalent to the conceptual ER diagram in
Figure 9.15. It is actually very difficult to show the outer multiplicity constraints of the
n-ary association in the logical ER diagram. For example, the outer multiplicity
constraint on the Staff end of the BookEquipment association tells us that, for any pair of
occurrences of TeachingSession and Equipment, there is at most one occurrence of
The preceding sections have discussed all the tools required to develop a logical
database design targeted at the relational data model. This section outlines a two-stage
process for applying these tools, discusses some refinements of the logical design and
summarizes what is left to be done during physical database design.
The conceptual ER diagram for the web timetable (Figure 9.17) forms a case study as
it illustrates most of the issues that need to be considered. The first stage is to create
the logical ER diagram. This removes those features of the conceptual ER diagram that
cannot be translated directly into features of the relational data model. The second
stage generates the actual table schema.
Figure 9.18 Logical ER diagram derived from the conceptual ER diagram in Figure 9.17.
Consider the first task. In the initial table schema, the table name will usually be the
same as the entity it represents. Similarly, attribute names are usually the same. If, for
some business reason, an entity or attribute name is a reserved SQL keyword, then
change the name at this stage. For example, the SQL keywords ‘order’, ‘time’, ‘date’
and ‘group’ are all fairly common nouns used by businesses to describe their data
requirements so may have been used in the conceptual ER diagram. If there is no
obvious synonym, then use a compound noun, such as ‘PurchaseOrder’ or ‘CustomerOrder’,
instead of ‘Order’.
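What happens when a keyword slips through depends on the DBMS. The sketch below tries it on SQLite, which is an assumption of this example. Other systems reserve different words, so the outcome for ‘time’ or ‘date’ may differ from one DBMS to another. The function can_create and the column orderNo are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def can_create(table_name):
    """Try to create a one-column table with the given unquoted name."""
    try:
        conn.execute(
            f"CREATE TABLE {table_name} (orderNo VARCHAR NOT NULL PRIMARY KEY)"
        )
        return True
    except sqlite3.OperationalError:
        # SQLite reports a syntax error when the name is a reserved keyword.
        return False


order_ok = can_create("Order")
purchase_order_ok = can_create("PurchaseOrder")
```

On SQLite the first attempt fails with a syntax error and the second succeeds, so renaming at the logical design stage avoids the problem before any SQL is written.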
Some link entities do not have any attributes. For these, the initial table schema
consists simply of the table name. It is worth adding a brief comment to indicate that
there are no columns as yet. Even when a link entity has attributes of its own, it will be
a weak entity, so the primary key cannot yet be identified. The initial table schema for
the CourseModule table (Code listing 9.3.2.1) is typical.
Code listing 9.3.2.1 Initial table schema for the CourseModule table
CourseModule (COLUMNS NOT YET IDENTIFIED FOR THIS LINK TABLE)
PRIMARY KEY NOT YET IDENTIFIED
The data types of attributes in the conceptual ER diagram should have counterparts in
SQL, so use the appropriate SQL data type for the column. If attribute lengths are
included in the conceptual ER diagram, then note them in the column description, but
otherwise, leave them out. The same rule holds for default values – only include them
if they are mentioned. Finally, the attribute multiplicity will indicate whether or not
the column can be null. A multiplicity of [1..1] (remember, this is the default attribute
multiplicity) indicates a not-null column and [0..1] indicates that the column may be
null. Always include a clear indication of whether the column is, or is not, allowed to
be null.
The Equipment entity (Figure 9.17) includes attributes with all these features. Code
listing 9.3.2.2 shows an initial table schema for the Equipment table. In the first
column description, for assetNo, the length of the column values is known to be, at
most, 15 characters, so this information is included in the column description, much
as it will be in the SQL create table statement.
The column description for acquired includes a default value. Note that, in
Figure 9.17, the default value is noted as Today, but, in the column description, it is
CURRENT_DATE. This is the name of the appropriate SQL value function. As SQL is
the likely implementation language, there is no harm in using the SQL terminology.
The cost and acquired attributes both have multiplicity [0..1], hence the cost and
acquired columns are both allowed to be null. This is noted by including the
keyword NULL at the end of the column descriptions. The other attributes have
multiplicity [1..1], so the associated column descriptions end with the keywords NOT NULL.
Code listing 9.3.2.2 Initial table schema for the Equipment table
Equipment (assetNo VARCHAR(15) NOT NULL,
assetType VARCHAR NOT NULL,
description VARCHAR NOT NULL,
portable BOOLEAN NOT NULL,
cost NUMERIC NULL,
acquired DATE DEFAULT CURRENT_DATE NULL
)
PRIMARY KEY (assetNo)
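To see how a default value and a nullable column interact, the initial table schema can be tried out on a DBMS. The sketch below assumes SQLite and uses invented asset numbers and descriptions. The columns described with NULL in the listing are simply left without a NOT NULL constraint here, which has the same effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Equipment (
    assetNo     VARCHAR(15) NOT NULL,
    assetType   VARCHAR     NOT NULL,
    description VARCHAR     NOT NULL,
    portable    BOOLEAN     NOT NULL,
    cost        NUMERIC,                    -- nullable: multiplicity [0..1]
    acquired    DATE DEFAULT CURRENT_DATE,  -- nullable, with a default value
    PRIMARY KEY (assetNo)
)""")

# acquired is omitted, so the default applies; cost is omitted, so it is null.
conn.execute(
    "INSERT INTO Equipment (assetNo, assetType, description, portable) "
    "VALUES ('A-001', 'Projector', 'Data projector', 1)"
)
# acquired is given explicitly as NULL, so the default does not apply.
conn.execute(
    "INSERT INTO Equipment (assetNo, assetType, description, portable, acquired) "
    "VALUES ('A-002', 'Laptop', 'Loan laptop', 1, NULL)"
)

rows = {asset: (cost, acquired) for asset, cost, acquired in conn.execute(
    "SELECT assetNo, cost, acquired FROM Equipment")}
```

The default is used only when the column is left out of the insert. Writing NULL explicitly stores a null, which the column description allows.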
The initial table schema in Code listing 9.3.2.2 includes the primary key constraint. As
any business rules will translate into table constraints, it makes sense to include them
in the initial table schema, too. This means that candidate keys and attribute
constraints should also be noted down.
Code listing 9.3.2.3 shows the initial table schema for the Module table, with its
candidate key noted down.
Code listing 9.3.2.3 Initial table schema for the Module table
Module (code CHAR(7) NOT NULL,
title VARCHAR NOT NULL,
department VARCHAR NOT NULL,
level VARCHAR NOT NULL
)
PRIMARY KEY (code)
CANDIDATE KEY (department, title)
The second task deals with the associations. The logical ER diagram can include only
unary or binary associations. Furthermore, the multiplicity can only be one-to-many
or one-to-one. This leads to a number of cases. The simplest case is a one-to-one binary
association with a lower bound of 1 on both multiplicity constraints, as shown in
Figure 9.19. In this case, merge the two entities into a single table. Each attribute from
both of the entities will have a corresponding column in this single table. This may
involve some renaming of columns, but is otherwise straightforward.
All other cases lead to a foreign key and a foreign key constraint. Section 9.1
explained how the entity at the ‘many’ end of a binary one-to-many association
becomes the referencing table, and the entity at the ‘one’ end becomes the referenced
table. A copy of the primary key columns of the referenced table is added to the
referencing table. If the referenced table is itself a weak entity, then first deal with the
associations of which it is at the ‘many’ end. This will ensure that its primary key can
be identified. For example, in Figure 9.18 TeachingSession is a weak entity, so the
Teaches, Delivers and Hosts associations must be dealt with before the associations
with the link entities EquipmentBooking and Attendance.
For the association KeptIn, the Equipment table is the referencing table, so gets a
copy of the primary key columns of the Room table.
Code listing 9.3.2.4 shows the table schema for both tables. Notice that, as Room
has a composite primary key, the foreign key on Equipment is also composite. As
noted before, the names of the foreign key columns do not need to be the same as the
matching primary key column. Here, the Equipment table’s room column matches
the Room table’s roomNo column. The data type must be the same, though, even
down to the length specified for data values. Whether or not the foreign key columns
can be null is determined by the multiplicity constraint at the referenced table end of
the association. If the lower bound is 0 (i.e. the multiplicity constraint is 0..1), then the
foreign key columns may be null. Otherwise, the lower bound is 1 (the multiplicity
constraint is 1..1) and the foreign key columns may not be null.
Code listing 9.3.2.4 Table schema for the Room and Equipment tables
showing the foreign key
Room (building VARCHAR NOT NULL,
roomNo VARCHAR NOT NULL,
capacity NUMERIC NULL
)
PRIMARY KEY (building, roomNo)
Equipment (assetNo VARCHAR(15) NOT NULL,
assetType VARCHAR NOT NULL,
description VARCHAR NOT NULL,
portable BOOLEAN NOT NULL,
cost NUMERIC NULL,
acquired DATE DEFAULT CURRENT_DATE NULL,
building VARCHAR NULL,
room VARCHAR NULL
)
PRIMARY KEY (assetNo)
FOREIGN KEY (building, room)
REFERENCES Room(building, roomNo)
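The composite foreign key can be tried out with a short sketch such as the following, which again uses Python's sqlite3 module. Only the columns needed to show the foreign key are included, the VARCHAR lengths are assumed and the sample room is invented.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Room (
    building VARCHAR(30) NOT NULL,
    roomNo   VARCHAR(10) NOT NULL,
    capacity NUMERIC,
    PRIMARY KEY (building, roomNo)
);
CREATE TABLE Equipment (
    assetNo  VARCHAR(15) NOT NULL,
    building VARCHAR(30),          -- nullable; same type and length as Room.building
    room     VARCHAR(10),          -- nullable; matches Room.roomNo despite the name
    PRIMARY KEY (assetNo),
    FOREIGN KEY (building, room) REFERENCES Room (building, roomNo)
);
"""


def make_db():
    """Create both tables and one sample room (Wilson 12)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Room VALUES ('Wilson', '12', 30)")
    return conn


def add_equipment(conn, asset_no, building, room):
    """Return True if the row was accepted, False if the foreign key rejected it."""
    try:
        conn.execute("INSERT INTO Equipment VALUES (?, ?, ?)",
                     (asset_no, building, room))
        return True
    except sqlite3.IntegrityError:
        return False
```

Equipment kept in Wilson 12 is accepted, equipment said to be in a room that does not exist is rejected, and equipment with both foreign key columns null is accepted because the lower bound at the Room end is 0.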
Each link entity is at the ‘many’ end of two or more one-to-many associations. Each of
these leads to a set of foreign key columns in the corresponding link table. The primary
key of the link table is always the set of all foreign key columns. For example, the two
associations that the CourseModule link entity participates in lead to two sets of foreign
key columns – one linking it with the Course table, the other linking it with the
Module table.
Code listing 9.3.2.5 shows the final table schema for this link table. The foreign key
columns have to be renamed as both the matching primary key columns are called
code. Neither of the foreign key columns can be null as the multiplicity at the ‘one’
end of both associations is 1..1.
recorded on the ER diagram in a note. In the table schema, it can be recorded as the
default value of the foreign key columns, as shown in Code listing 9.3.2.6.
There are two problems to note. First, if users really wanted to add a piece of equipment
that was not associated with any room (such as a plasma screen mounted in a
foyer), they would need to tell the DBMS to make the building and room columns null.
The second problem is subtler. Suppose a user creates a new row in the Equipment
table and gives the DBMS a value for building, Wilson say, but not for room. The
DBMS will use the default value for the room in this row and deduce that the equipment
is stored in Wilson 12. If there is no such room, the insert statement will fail. If
there is, the database will hold incorrect information. It may be better not to include
the default values in the foreign key after all.
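The second problem can be reproduced with a small experiment. The sketch below assumes a default room of '12', as in the Wilson 12 example, and a default building called 'Stores', which is a name invented here for illustration. The table is cut down to the columns that matter.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Room (
    building VARCHAR(30) NOT NULL,
    roomNo   VARCHAR(10) NOT NULL,
    PRIMARY KEY (building, roomNo)
);
CREATE TABLE Equipment (
    assetNo  VARCHAR(15) NOT NULL,
    building VARCHAR(30) DEFAULT 'Stores',  -- assumed default building
    room     VARCHAR(10) DEFAULT '12',      -- default room, as in the Wilson 12 example
    PRIMARY KEY (assetNo),
    FOREIGN KEY (building, room) REFERENCES Room (building, roomNo)
);
"""


def make_db(rooms):
    """Create the tables and load the given (building, roomNo) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Room VALUES (?, ?)", rooms)
    return conn


def add_with_building_only(conn, asset_no, building):
    """Insert a row that names a building but no room.

    Returns the room the DBMS stored, or None if the insert was rejected.
    """
    try:
        conn.execute("INSERT INTO Equipment (assetNo, building) VALUES (?, ?)",
                     (asset_no, building))
    except sqlite3.IntegrityError:
        return None
    row = conn.execute("SELECT room FROM Equipment WHERE assetNo = ?",
                       (asset_no,)).fetchone()
    return row[0]
```

If the Wilson building happens to have a room 12, the row is stored with that room even though the user never said so. If it does not, the insert fails.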
Table 9.5 Most likely referential actions for a foreign key constraint.
as possible, the part (the referencing table) should be kept in step with the whole (the
referenced table). For composition, which indicates that the part cannot exist independently,
the only choice is CASCADE. For plain associations, with no aggregation or
composition, consider the multiplicity constraint associated with the referenced table.
Where this is 0..1, rows of the referencing table can exist independently, so SET NULL
is an option. Otherwise, SET NULL cannot be used.
For the Equipment table, the multiplicity at the ‘one’ end of the KeptIn association
is 0..1, so any referential action is possible. In this situation, consider the meaning
of the situation being modelled. If a room number or building name is changed, it
doesn’t affect the equipment in that room. This suggests ON UPDATE CASCADE is a
good choice. By contrast, if a room is removed from the database, it is important to
know what has happened to the equipment in that room. Then, ON DELETE NO
ACTION prevents a room from being deleted until the user has reallocated all equipment
to new rooms. Code listing 9.3.2.7 shows the revised table schema.
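The effect of these two referential actions can be checked with a sketch like the one below. It uses Python's sqlite3 module, a cut-down version of the two tables, assumed VARCHAR lengths and invented sample rows.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Room (
    building VARCHAR(30) NOT NULL,
    roomNo   VARCHAR(10) NOT NULL,
    PRIMARY KEY (building, roomNo)
);
CREATE TABLE Equipment (
    assetNo  VARCHAR(15) NOT NULL,
    building VARCHAR(30),
    room     VARCHAR(10),
    PRIMARY KEY (assetNo),
    FOREIGN KEY (building, room) REFERENCES Room (building, roomNo)
        ON UPDATE CASCADE
        ON DELETE NO ACTION
);
"""


def make_db():
    """Create the tables with one room and one piece of equipment kept in it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Room VALUES ('Wilson', '12')")
    conn.execute("INSERT INTO Equipment VALUES ('A1', 'Wilson', '12')")
    return conn


def renumber_room(conn, building, old_no, new_no):
    """Change a room number; the cascade keeps Equipment in step."""
    conn.execute("UPDATE Room SET roomNo = ? WHERE building = ? AND roomNo = ?",
                 (new_no, building, old_no))


def remove_room(conn, building, room_no):
    """Try to delete a room; return False if equipment still refers to it."""
    try:
        conn.execute("DELETE FROM Room WHERE building = ? AND roomNo = ?",
                     (building, room_no))
        return True
    except sqlite3.IntegrityError:
        return False
```

Renumbering the room changes the foreign key values in Equipment automatically, whereas the room can only be deleted once its equipment has been reallocated (or its foreign key columns set to null).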
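[Figure 9.20: the «persistent» Staff entity, with attributes staffNo : Varchar {PK}, title : Varchar {Value one of Mr, Ms, Miss, Mrs, Dr, Prof}, fName : Varchar, lName : Varchar and email : Varchar {CK=1}, and the unary one-to-many association Manages. The LineManager end has multiplicity 1..1 and the other end 0..*.]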
association is 1..1. If the multiplicity were 0..1, then the foreign key columns could be
null. The referential actions are chosen to reflect the business situation. Everyone
must have a designated line manager, so no member of staff can be removed from
the database until all the employees that person manages have been given a new line
manager. If, for some reason, the staff number of a member of staff changes, that manager
will still be managing the same staff, so this change is cascaded through to the
values of the foreign key column.
Code listing 9.3.2.8 Initial, and final, table schema representing the Staff
entity in Figure 9.20
(a) Initial table schema
Staff (staffNo VARCHAR NOT NULL,
title VARCHAR NOT NULL,
fName VARCHAR NOT NULL,
lName VARCHAR NOT NULL,
email VARCHAR NOT NULL
)
PRIMARY KEY (staffNo)
CANDIDATE KEY (email)
BUSINESS RULE title must be one of {'Mr', 'Ms', 'Miss',
'Mrs', 'Dr', 'Prof'}
Type of association                               Referenced table  Referencing table  Comment
«persistent» Entity1 (p..1) – ConnectWith –       Entity1           Entity2            For n ≤ m and m ≥ 2.
(n..m) «persistent» Entity2
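[Figure 9.21: the «persistent» Staff entity, with attributes staffNo : Varchar {PK}, title : Varchar {Value one of Mr, Ms, Miss, Mrs, Dr, Prof}, fName : Varchar, lName : Varchar and email : Varchar {CK=1}, and the unary one-to-one association SharesWith, which has multiplicity 0..1 at both ends.]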
For example, consider the unary one-to-one association SharesWith, which indicates
that two academic members of staff share an office (Figure 9.21). The association does
not model information about non-academic staff. Adding a foreign key to represent
this association leads to the table schema shown in Code listing 9.3.2.9. The foreign key
consists of the single column {sharer}. The foreign key column can be null as some
staff do not share an office. The referential actions capture the business scenario outlined
above. When a member of staff leaves, whoever they shared the office with will
now have an office to themselves, so the foreign key can be set to null. If a staff number
changes, the new staff number designates the same person, who shares an office with
the same colleague, so the foreign key should be updated to the new staff number.
Code listing 9.3.2.9 Table schema representing the Staff entity in Figure 9.21
Staff (staffNo VARCHAR NOT NULL,
title VARCHAR NOT NULL,
fName VARCHAR NOT NULL,
lName VARCHAR NOT NULL,
email VARCHAR NOT NULL,
sharer VARCHAR NULL
)
PRIMARY KEY (staffNo)
CANDIDATE KEY (email)
BUSINESS RULE title must be one of {'Mr', 'Ms', 'Miss',
'Mrs', 'Dr', 'Prof'}
FOREIGN KEY (sharer) REFERENCES Staff(staffNo)
ON DELETE SET NULL
ON UPDATE CASCADE
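A runnable sketch of this self-referencing foreign key is given below, using Python's sqlite3 module. The VARCHAR lengths, the CHECK constraint used for the business rule and the two members of staff are all assumptions made for the illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Staff (
    staffNo VARCHAR(10) NOT NULL,
    title   VARCHAR(4)  NOT NULL
            CHECK (title IN ('Mr', 'Ms', 'Miss', 'Mrs', 'Dr', 'Prof')),
    fName   VARCHAR(30) NOT NULL,
    lName   VARCHAR(30) NOT NULL,
    email   VARCHAR(60) NOT NULL UNIQUE,
    sharer  VARCHAR(10),
    PRIMARY KEY (staffNo),
    FOREIGN KEY (sharer) REFERENCES Staff (staffNo)
        ON DELETE SET NULL
        ON UPDATE CASCADE
);
"""


def make_db():
    """Create the table with two invented members of staff who share an office."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Staff VALUES "
                 "('S1', 'Dr', 'Ann', 'Example', 'ann@example.org', NULL)")
    conn.execute("INSERT INTO Staff VALUES "
                 "('S2', 'Mr', 'Bob', 'Sample', 'bob@example.org', 'S1')")
    conn.execute("UPDATE Staff SET sharer = 'S2' WHERE staffNo = 'S1'")
    return conn


def sharer_of(conn, staff_no):
    """Return the staff number of the person sharing with staff_no, or None."""
    row = conn.execute("SELECT sharer FROM Staff WHERE staffNo = ?",
                       (staff_no,)).fetchone()
    return row[0] if row else None
```

Changing the staff number of S1 is cascaded to the sharer column of S2, and deleting that member of staff leaves S2 with a null sharer, that is, with an office to themselves.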
As the association only applies to academic members of staff and many of them do not
share an office, the sharer column will be null for most rows in the table. This is
not a particularly efficient design, but it is logical. Chapter 10 considers ways in which
to make the database design more efficient.
A conceptual ER diagram may include generalization relationships as well as association
relationships. As ISO SQL:2003 now includes object-orientated features, it is
● Create one table to represent all the entities. For each subclass entity, include a discriminator
column to indicate which subclass a particular row represents.
● Create one table for each pair of superclass and subclass entities. Include all attributes
of the superclass and all attributes of the particular subclass.
● Create one table for the superclass entity and a look-up table for each subclass
entity.
● Create one table for the superclass entity and a single look-up table to represent all the
subclass entities.
[Figure 9.22 (fragment): the «persistent» Room superclass entity, with attributes building : Varchar {PK}, roomNo : Varchar {PK} and capacity : Integer. A note on the diagram reads: ‘... indicates the number of separate workstations available in the lab. This will be no less than half the room's capacity (i.e. students are not expected to work in groups of more than two).’]
Code listing 9.3.2.10 Single table representing the entities in Figure 9.22
Room (building VARCHAR NOT NULL,
roomNo VARCHAR NOT NULL,
capacity NUMERIC NOT NULL,
piano VARCHAR NULL,
workstations NUMERIC NULL,
seating VARCHAR NULL,
musicRoom BOOLEAN NOT NULL,
laboratory BOOLEAN NOT NULL,
lectureTheatre BOOLEAN NOT NULL
)
PRIMARY KEY (building, roomNo)
This version of the Room table can obviously represent any occurrence of the Room
entity and any occurrence of each of the three subclasses. Even so, the translation is
not wholly satisfactory. There is no way of ensuring that rows representing an occurrence
of the MusicRoom entity do not have values for the workstations or
seating columns. The business rule that ensures this would be something like:
whenever musicRoom is TRUE, then workstations and seating are null.
To implement this requires an if ... then statement, so the business rule could be
implemented using procedural code (such as in the PHP script that gathers data on
rooms from end users and writes it to the database or in an SQL trigger).
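As an illustration of the trigger option, the sketch below enforces the rule for newly inserted rows. It uses SQLite's trigger syntax through Python's sqlite3 module, which is an assumption made here; other DBMSs write triggers differently. The table is cut down to the columns the rule mentions, and a complete solution would need a matching trigger for updates.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Room (
    building     VARCHAR(30) NOT NULL,
    roomNo       VARCHAR(10) NOT NULL,
    workstations NUMERIC,
    seating      VARCHAR(20),
    musicRoom    BOOLEAN NOT NULL,
    PRIMARY KEY (building, roomNo)
);
-- Business rule: if musicRoom is true, then workstations and seating must be null.
CREATE TRIGGER music_room_rule
BEFORE INSERT ON Room
WHEN NEW.musicRoom AND (NEW.workstations IS NOT NULL OR NEW.seating IS NOT NULL)
BEGIN
    SELECT RAISE(ABORT, 'a music room cannot have workstations or seating');
END;
"""


def make_db():
    """Create the Room table together with its trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_room(conn, building, room_no, workstations, seating, music_room):
    """Return True if the row was stored, False if the trigger rejected it."""
    try:
        conn.execute("INSERT INTO Room VALUES (?, ?, ?, ?, ?)",
                     (building, room_no, workstations, seating, music_room))
        return True
    except sqlite3.DatabaseError:
        return False
```

The WHEN clause plays the part of the ‘if’, and the RAISE call is the ‘then’: the insert is abandoned whenever a music room is given workstations or seating.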
The second approach is really a refinement of the first. Rather than lumping together
all the subclasses in one table, have a separate table for each of them. There is no need
for a discriminator column as each table represents a single subclass.
Code listing 9.3.2.11 shows the table schema. Notice that, in this approach, the default
value and not-null constraint for all columns, whether representing an attribute
of the superclass or subclass entity, are the same as for the corresponding attribute. This
is because, for example, a row in the MusicRoom table always represents an occurrence
of the MusicRoom entity. This approach works well when an occurrence of the
superclass must be an occurrence of one of the subclasses – that is, when every room is
either a music room, a laboratory or a lecture theatre.
Code listing 9.3.2.11 Representing the entities in Figure 9.22 with one table
for each superclass–subclass pair
MusicRoom (building VARCHAR NOT NULL,
roomNo VARCHAR NOT NULL,
capacity NUMERIC NOT NULL,
piano VARCHAR NOT NULL
)
Code listing 9.3.2.12 Representing the entities in Figure 9.22 as one table for
the superclass entity and one look-up table for each of the subclass entities
Room (building VARCHAR NOT NULL,
roomNo VARCHAR NOT NULL,
capacity NUMERIC NOT NULL
)
PRIMARY KEY (building, roomNo)
MusicRoom (piano VARCHAR NOT NULL,
building VARCHAR NOT NULL,
roomNo VARCHAR NOT NULL
)
PRIMARY KEY (building, roomNo)
FOREIGN KEY (building, roomNo)
REFERENCES Room (building, roomNo)
Laboratory (workstations NUMERIC NOT NULL,
building VARCHAR NOT NULL,
roomNo VARCHAR NOT NULL
)
Code listing 9.3.2.13 Representing the entities in Figure 9.22 as one table for
the superclass entity and one look-up table for all the subclass entities
Room (building VARCHAR NOT NULL,
roomNo VARCHAR NOT NULL,
capacity NUMERIC NOT NULL
)
PRIMARY KEY (building, roomNo)
RoomType (building VARCHAR NOT NULL,
roomNo VARCHAR NOT NULL,
piano VARCHAR NULL,
workstations NUMERIC NULL,
seating VARCHAR NULL,
musicRoom BOOLEAN NOT NULL,
laboratory BOOLEAN NOT NULL,
lectureTheatre BOOLEAN NOT NULL
)
PRIMARY KEY (building, roomNo)
FOREIGN KEY (building, roomNo)
REFERENCES Room (building, roomNo)
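With a look-up table, the details of a room are spread over two tables, so queries have to join them. The sketch below, using Python's sqlite3 module, shows one way of doing this with a left outer join, so that rooms belonging to no subclass still appear. The table is cut down to a few columns, and the lengths and sample rows are invented.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Room (
    building VARCHAR(30) NOT NULL,
    roomNo   VARCHAR(10) NOT NULL,
    capacity NUMERIC     NOT NULL,
    PRIMARY KEY (building, roomNo)
);
CREATE TABLE RoomType (
    building     VARCHAR(30) NOT NULL,
    roomNo       VARCHAR(10) NOT NULL,
    piano        VARCHAR(30),
    workstations NUMERIC,
    musicRoom    BOOLEAN NOT NULL,
    laboratory   BOOLEAN NOT NULL,
    PRIMARY KEY (building, roomNo),
    FOREIGN KEY (building, roomNo) REFERENCES Room (building, roomNo)
);
"""


def make_db():
    """Create the tables with one plain room and one laboratory (invented rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Room VALUES ('Wilson', '12', 30)")
    conn.execute("INSERT INTO Room VALUES ('Wilson', '14', 40)")
    conn.execute("INSERT INTO RoomType VALUES ('Wilson', '14', NULL, 20, 0, 1)")
    return conn


def rooms_with_type(conn):
    """List every room, with its subclass details where a RoomType row exists."""
    return conn.execute("""
        SELECT r.building, r.roomNo, r.capacity, t.workstations, t.laboratory
        FROM Room AS r
        LEFT JOIN RoomType AS t
               ON t.building = r.building AND t.roomNo = r.roomNo
        ORDER BY r.roomNo
    """).fetchall()
```

The plain room comes back with nulls in the subclass columns, which is how an occurrence of the superclass that is not an occurrence of any subclass shows up under this approach.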
As with associations, the constraints on the generalization relationship help decide
which approach to take. Generalization relationships do not have multiplicity constraints,
but do have disjoint and participation constraints. Table 9.7 indicates how to
use these constraints to decide which of the four approaches to take.
Table 9.7 Deciding how to represent superclass and subclass entities and
generalization relationships.
The final step in writing down the table schema is to identify the primary keys of
any weak entities. Typically, the primary key of a weak entity will include all the
foreign key columns. This is because a weak entity is only unique relative to its associated
strong entities. For example, the combination of the day and times of the support
sessions of each staff member will be unique, though two different staff members may
offer support sessions on the same day at the same time. Once this final step has been
resolved, the translation of the conceptual ER diagram into an initial logical database
design is complete.
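The support session example might be sketched as follows, using Python's sqlite3 module. The table and column names (SupportSession, dayOfWeek, startTime) and the staff numbers are assumptions made for the illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Staff (
    staffNo VARCHAR(10) NOT NULL,
    PRIMARY KEY (staffNo)
);
CREATE TABLE SupportSession (
    staffNo   VARCHAR(10) NOT NULL,   -- foreign key to the strong entity
    dayOfWeek VARCHAR(9)  NOT NULL,
    startTime VARCHAR(5)  NOT NULL,
    PRIMARY KEY (staffNo, dayOfWeek, startTime),
    FOREIGN KEY (staffNo) REFERENCES Staff (staffNo)
);
"""


def make_db():
    """Create both tables and two invented members of staff."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Staff VALUES (?)", [("S1",), ("S2",)])
    return conn


def add_session(conn, staff_no, day, start):
    """Return True if the session was stored, False if a key was violated."""
    try:
        conn.execute("INSERT INTO SupportSession VALUES (?, ?, ?)",
                     (staff_no, day, start))
        return True
    except sqlite3.IntegrityError:
        return False
```

Two different members of staff can both offer a session on Monday at 10:00, but the same member of staff cannot record that session twice, because the foreign key column is part of the primary key.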
This section has explained how to represent most of the features of the conceptual
ER diagram in the table schema of the logical database design. There are three features
that it has not covered:
● business rules
● derived attributes
● operations.
Although business rules and derived attributes were noted in the table schema, operations
were not discussed. The capabilities of the particular DBMS used to implement
the database will determine how these features can be implemented. Accordingly, they
are considered during physical database design.
9.4 Normalization
ER diagrams are a top-down design technique. The technique first identifies the
entities – the main things that the system holds information about – then asks what
particular information is held – the attributes and relationships. ER diagrams start
from the general and move down towards the particular.
Normalization is a bottom-up technique. It begins with the particular – what specific
items of information will be held in the database? Each item of data will be held in a
column. Once the list of all columns is available, they can be organized into tables.
Normalization was developed specifically for the relational data model, and is still
mostly used to organize relational databases. The goal of normalization is to minimize
repeated data and, hence, avoid the potential anomalies that can arise when repeated
data is modified in one place, but not another.
As with the relational data model itself, there is a formal mathematical theory underlying
the process of normalization. This guarantees that the process will generate
an efficient database structure for the given set of columns. The main drawback is that
the initial set of columns can be dauntingly large. Another limitation is that normalization
works best when there is good sample data. In practice, normalization is often
used to validate the table schema generated from a logical ER diagram, rather than to
generate the table schema themselves. Another common use is to turn a spreadsheet
into a relational database, because spreadsheets tend to have a limited number of
columns and ample sample data. The examples in this section are based on spreadsheet
data, but the approach can be applied to any set of data items.
Consider the spreadsheet shown in Figure 9.23 (a). There are four problems that can
arise when this spreadsheet is modified. First, when a new hall of residence is built,
there is no way of including information about it in the spreadsheet until a student actually
rents a room there. If the information about halls of residence were kept separately
from information about rental agreements, then it would be easy to add
information about a new hall. The second problem is the mirror of the first. In the
data shown, if all rental agreements with Alice Clough are deleted, then there is no
longer any mention of Maurice Wilkins hall.
Figure 9.23 (a) Sample data from the StudentRentals worksheet.
studentNo   name          accomId  hall             room  rentStart    rentFinish   rent  staffNo  bursar
2005453125  Peter Harris  WM107    William Morris   123   11 Sep 2005  10 Jun 2006  550   56673    Joshua Bittaye
                          MC078    Marie Curie      14    10 Sep 2006  9 Jun 2007   500   35054    Selma Hutchins
2004023401  Alice Clough  WM107    William Morris   123   12 Sep 2004  11 Jun 2005  450   56673    Joshua Bittaye
                          MW031    Maurice Wilkins  46    11 Sep 2005  10 Jun 2006  550   35054    Selma Hutchins
                          MW042    Maurice Wilkins  51    10 Sep 2006  9 Jun 2007   575   35054    Selma Hutchins
2004892857  Jamie Smith   MC078    Marie Curie      14    12 Sep 2004  11 Jun 2005  400   35054    Selma Hutchins
                          WM107    William Morris   123   12 Jun 2005  10 Sep 2005  450   56673    Joshua Bittaye
Figure 9.23 (b) The same data with the implied information filled in.
studentNo   name          accomId  hall             room  rentStart    rentFinish   rent  staffNo  bursar
2005453125  Peter Harris  WM107    William Morris   123   11 Sep 2005  10 Jun 2006  550   56673    Joshua Bittaye
2005453125  Peter Harris  MC078    Marie Curie      14    10 Sep 2006  9 Jun 2007   500   35054    Selma Hutchins
2004023401  Alice Clough  WM107    William Morris   123   12 Sep 2004  11 Jun 2005  450   56673    Joshua Bittaye
2004023401  Alice Clough  MW031    Maurice Wilkins  46    11 Sep 2005  10 Jun 2006  550   35054    Selma Hutchins
2004023401  Alice Clough  MW042    Maurice Wilkins  51    10 Sep 2006  9 Jun 2007   575   35054    Selma Hutchins
2004892857  Jamie Smith   MC078    Marie Curie      14    12 Sep 2004  11 Jun 2005  400   35054    Selma Hutchins
2004892857  Jamie Smith   WM107    William Morris   123   12 Jun 2005  10 Sep 2005  450   56673    Joshua Bittaye
There can also be problems when existing data is changed. Suppose that Joshua
Bittaye retires, and a new Bursar is appointed. There will be many different rows in the
spreadsheet that need to be changed. If one is missed, then the data is inconsistent.
The spreadsheet data will also be inconsistent if, when a new rental agreement is
recorded, the wrong Bursar is named.
All these problems arise because the same data is stored in more than one row of the
spreadsheet. Normalization is guaranteed to produce a set of table schema that avoids
these problems. It generates separate tables to hold information about each distinct
thing, so the first two problems do not arise. More importantly, in a fully normalized
database, only the foreign key columns hold duplicate data. The DBMS can automatically
monitor this duplicate data and prevent the problems discussed above. In fact,
this is one of the main motivations for using databases rather than files to hold an
organization’s data (see Chapter 1).
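The modification problem can be made concrete with a few lines of Python. The sketch below takes three of the rental rows (only the columns needed here), simulates an edit that changes the bursar in one row only and then looks for halls where the rows disagree. The function names and the replacement name 'A. N. Other' are invented for the illustration.

```python
def bursars_by_hall(rows):
    """Map each hall to the set of bursar names recorded against it."""
    found = {}
    for row in rows:
        found.setdefault(row["hall"], set()).add(row["bursar"])
    return found


def inconsistent_halls(rows):
    """Return the halls for which the rows disagree about who the bursar is."""
    return sorted(hall for hall, names in bursars_by_hall(rows).items()
                  if len(names) > 1)


def replace_bursar_in_one_row(rows, index, new_name):
    """Simulate a careless edit that updates a single row only."""
    edited = [dict(row) for row in rows]
    edited[index]["bursar"] = new_name
    return edited


# Three of the rental rows from the spreadsheet, cut down to three columns.
RENTALS = [
    {"studentNo": "2005453125", "hall": "William Morris", "bursar": "Joshua Bittaye"},
    {"studentNo": "2004023401", "hall": "William Morris", "bursar": "Joshua Bittaye"},
    {"studentNo": "2004892857", "hall": "Marie Curie", "bursar": "Selma Hutchins"},
]

# One row is changed, the other William Morris row is missed.
EDITED = replace_bursar_in_one_row(RENTALS, 0, "A. N. Other")
```

Before the edit no hall is inconsistent. Afterwards William Morris has two different bursars recorded, and nothing in the spreadsheet says which is correct. When the bursar is stored once, in a table of its own, this situation cannot arise.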
Normalization begins with a single table of data. This single table has a column for
each data item that will be held in the system. Normalization is the process of identifying
repeating groups of columns and generating a new table for each repeating
group. The notion of a repeating group has a formal definition, which is explored later
in this section, but, for now, it is defined as follows.
Repeating group A set of two or more columns, such that the data in these
columns repeats across two or more rows.
The idea is that a repeating group of columns should actually be a separate table. The
repeating group is pulled out into a separate table and a foreign key is left behind on
the original table. The foreign key links each row in the original table to the data in
the new table that has been pulled out of this row. Of course, any duplicate rows in the
new table are eliminated, so data duplication is reduced. Obviously, for this to work,
the repeating group should have a candidate key. If there is no candidate key, then it is
likely that there are other columns that form part of the repeating group.
This notion of a repeating group of columns is not very precise, but an example
should make things clear. The Accommodation Office at the Pennine University
keeps data on student rentals in a spreadsheet. Each worksheet within a spreadsheet is
just a table of data and can be normalized. A sample of the spreadsheet data in
the StudentRentals worksheet is shown in Figure 9.23 (a). On spreadsheets, the
empty cells do not necessarily represent missing information (that is, empty cells
are not necessarily null). Empty cells can represent implied information. For example, in
the second row, the studentNo is 2005453125 and the name is Peter Harris. These
values aren’t shown because it is obvious from the structure of the worksheet that the
cell values are the same as those in the row above it. This is a common feature of
spreadsheet data. However, it does mean that empty cells in a worksheet can either
represent implied information or indicate that the data is not known for this row.
To avoid confusion, it helps to fill in any implied information, leaving only cells
with genuinely missing information blank. Thus, Figure 9.23 (b) shows the same
spreadsheet data with the implied information filled in.
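Filling in the implied information is a mechanical job, so it is easily scripted. The following Python sketch assumes that the worksheet has been read into a list of dictionaries, with None standing for an empty cell. The function name is invented for the illustration.

```python
def fill_implied(rows, columns):
    """Copy values down into blank cells of the given columns.

    rows is a list of dicts in worksheet order; a blank cell is None.
    Only the named columns are filled, so genuinely missing values in
    other columns stay blank.
    """
    filled = []
    previous = {}
    for row in rows:
        new_row = dict(row)
        for column in columns:
            if new_row.get(column) is None and column in previous:
                new_row[column] = previous[column]   # value implied by the row above
            previous[column] = new_row.get(column)
        filled.append(new_row)
    return filled


# The first two rows of the worksheet, cut down to three columns.
WORKSHEET = [
    {"studentNo": "2005453125", "name": "Peter Harris", "accomId": "WM107"},
    {"studentNo": None, "name": None, "accomId": "MC078"},
]

FILLED = fill_implied(WORKSHEET, ["studentNo", "name"])
```

The database designer still has to decide which columns carry implied information. The function should only be given those columns, otherwise a cell that is blank because the value is not known would wrongly inherit the value above it.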
If this latter spreadsheet data was simply loaded into a database table, the table
would have the description shown in Code listing 9.4.1. Looking at the data,
{studentNo, rentStart} is a candidate key and this will do as the primary key of the
table. There are other candidate keys, but they will be easier to spot once the normal-
ization process has been completed.
Code listing 9.4.1 Table schema for the original table, before normalization
StudentRental(studentNo, name, accomId, hall, room,
rentStart, rentFinish, rent, staffNo, bursar)
PRIMARY KEY (studentNo, rentStart)
It is clear (in Figure 9.23) that the set of columns {studentNo, name} forms a
repeating group – that is, two columns, the combined data in which is identical in two
or more rows. For example, in the first two rows, these columns have the values
studentNo = '2005453125' and name = 'Peter Harris' and the next three have
studentNo = '2004023401' and name = 'Alice Clough'. Normalization seeks to minimize
repeated data by extracting such repeating groups into a new table. There are two
actions for each such repeating group:
1 write down a description for a new table, consisting of the columns of the repeating
group and identify the primary key of this table
2 write down a description for the original table, but leave out all the non-primary
key columns of the repeating group – the primary key columns of the repeating
group remain in the original table as a foreign key.
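The two actions can be sketched as a small Python function that works on sample data held as a list of dictionaries. The function name and its parameters are invented for the illustration. It assumes that the database designer has already identified the repeating group and its primary key.

```python
def extract_group(rows, group, key):
    """Split rows into (new_table, revised_original).

    group lists the columns of the repeating group; key lists the
    columns of the group that form its primary key.
    """
    # Action 1: the new table holds the group's columns, one row per key value.
    new_table = []
    seen = set()
    for row in rows:
        key_value = tuple(row[c] for c in key)
        if key_value not in seen:
            seen.add(key_value)
            new_table.append({c: row[c] for c in group})
    # Action 2: the original table loses the group's non-key columns;
    # the key columns stay behind as a foreign key.
    dropped = [c for c in group if c not in key]
    revised = [{c: v for c, v in row.items() if c not in dropped} for row in rows]
    return new_table, revised


# Three rows of the sample data, cut down to four columns.
RENTALS = [
    {"studentNo": "2005453125", "name": "Peter Harris",
     "accomId": "WM107", "rentStart": "11 Sep 2005"},
    {"studentNo": "2005453125", "name": "Peter Harris",
     "accomId": "MC078", "rentStart": "10 Sep 2006"},
    {"studentNo": "2004023401", "name": "Alice Clough",
     "accomId": "WM107", "rentStart": "12 Sep 2004"},
]

STUDENT, STUDENT_RENTAL = extract_group(RENTALS, ["studentNo", "name"], ["studentNo"])
```

Applied to the repeating group {studentNo, name}, the function produces a Student table with one row per student and a StudentRental table that keeps studentNo but no longer holds name.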
The primary key for the repeating group {studentNo, name} is {studentNo} as
it is possible that two students will have the same name, so name cannot be a candidate
key. Notice that this argument holds despite the fact that, in Figures 9.23 (a) and
(b) all the students have different names. This is because the sample data is just that –
a sample. It can only indicate the sort of data likely to be held. A comprehensive set of
data would include an example illustrating that two different values of studentNo
can be associated with the same value of name. The table schema is shown in Code
listing 9.4.2.
Implied information in a spreadsheet always indicates a repeating group, but not all
repeating groups are indicated by implied information.
Removing the non-primary key columns from the original table simply means
removing the name column. This leaves studentNo, accomId, hall, room,
rentStart, rentFinish, rent, staffNo and bursar. The studentNo column
will be a foreign key to the new Student table.
The revised table schema is also shown in Code listing 9.4.2. Figure 9.24 shows the
sample data after removing this first repeating group. Notice that removing the repeating
group means that there is less data repetition as student names are now stored
once, and once only, rather than being repeated in several rows.
Figure 9.24 The sample data after removing the first repeating group.
StudentRental
studentNo   accomId  hall             room  rentStart    rentFinish   rent  staffNo  bursar
2005453125  WM107    William Morris   123   11 Sep 2005  10 Jun 2006  550   56673    Joshua Bittaye
2005453125  MC078    Marie Curie      14    10 Sep 2006  9 Jun 2007   500   35054    Selma Hutchins
2004023401  WM107    William Morris   123   12 Sep 2004  11 Jun 2005  450   56673    Joshua Bittaye
2004023401  MW031    Maurice Wilkins  46    11 Sep 2005  10 Jun 2006  550   35054    Selma Hutchins
2004023401  MW042    Maurice Wilkins  51    10 Sep 2006  9 Jun 2007   575   35054    Selma Hutchins
2004892857  MC078    Marie Curie      14    12 Sep 2004  11 Jun 2005  400   35054    Selma Hutchins
2004892857  WM107    William Morris   123   12 Jun 2005  10 Sep 2005  450   56673    Joshua Bittaye
Student
studentNo   name
2005453125  Peter Harris
2004023401  Alice Clough
2004892857  Jamie Smith
Code listing 9.4.2 Table schema after removing the repeating group {studentNo, name}
Student(studentNo, name)
PRIMARY KEY (studentNo)
StudentRental(studentNo, accomId, hall, room, rentStart,
rentFinish, rent, staffNo, bursar)
PRIMARY KEY (studentNo, rentStart)
FOREIGN KEY (studentNo) REFERENCES Student (studentNo)
There are clearly no more repeating groups in the Student table, but there is another
obvious one in StudentRental. The values for the columns accomId, hall and
room repeat across rows. For example, in the first, third and seventh rows, these three
columns have the same values – accomId = 'WM107', hall = 'William Morris' and
room = '123'. As these three columns are listed next to each other in the sample data,
this is easy to spot. What is less obvious is that the staffNo and bursar columns
are part of this repeating group. On the first, third and seventh rows, staffNo =
'56673' and bursar = 'Joshua Bittaye'. The second and sixth rows also have identical
values in these five columns, as do the fourth and fifth. This means that the repeating
group includes the columns {accomId, hall, room, staffNo, bursar}.
Figure 9.25 shows the StudentRental table with its columns reordered so that the
repeating group {accomId, hall, room, staffNo, bursar} is obvious.
This repeating group is dealt with in the same way as the first. A new table, named
Accommodation, is defined as shown in Code listing 9.4.3 with the primary key
accomId. This makes business sense as accomId uniquely identifies a particular
room for rent. More importantly, the value of accomId determines the values of the
other columns in the repeating group, so is a candidate key. The StudentRental
table loses the four columns hall, room, staffNo and bursar and gains a foreign
key constraint to the Accommodation table.
Figure 9.25 The StudentRental table with its columns reordered.
StudentRental
studentNo   accomId  hall             room  staffNo  bursar          rentStart    rentFinish   rent
2004892857  MC078    Marie Curie      14    35054    Selma Hutchins  12 Sep 2004  11 Jun 2005  400
2005453125  MC078    Marie Curie      14    35054    Selma Hutchins  10 Sep 2006  9 Jun 2007   500
2004023401  MW031    Maurice Wilkins  46    35054    Selma Hutchins  11 Sep 2005  10 Jun 2006  550
2004023401  MW042    Maurice Wilkins  51    35054    Selma Hutchins  10 Sep 2006  9 Jun 2007   575
2004023401  WM107    William Morris   123   56673    Joshua Bittaye  12 Sep 2004  11 Jun 2005  450
2004892857  WM107    William Morris   123   56673    Joshua Bittaye  12 Jun 2005  10 Sep 2005  450
2005453125  WM107    William Morris   123   56673    Joshua Bittaye  11 Sep 2005  10 Jun 2006  550
Code listing 9.4.3 Table schema after removing the repeating group
{accomId, hall, room, staffNo, bursar}
Student(studentNo, name)
PRIMARY KEY (studentNo)
StudentRental(studentNo, accomId, rentStart, rentFinish,
rent)
PRIMARY KEY (studentNo, rentStart)
FOREIGN KEY (studentNo) REFERENCES Student(studentNo)
FOREIGN KEY (accomId) REFERENCES Accommodation(accomId)
Accommodation(accomId, hall, room, staffNo, bursar)
PRIMARY KEY (accomId)
The data in these three tables is shown in Figure 9.26. There are no more repeating
groups of columns in the StudentRental table, but there is one in Accommodation.
The data in the last two rows shows that the group of columns {hall,
staffNo, bursar} forms a repeating group (highlighted in Figure 9.26).
Code listing 9.4.4 shows the table schema, and Figure 9.27 the sample data, after
removing this repeating group. In this case, the new table is called Hall and, to avoid
confusion, the column hall from the original table is renamed name. The Accommodation
table gets a foreign key.
Code listing 9.4.4 Table schema after removing the repeating group
{hall, staffNo, bursar}
Student(studentNo, name)
PRIMARY KEY (studentNo)
StudentRental(studentNo, accomId, rentStart, rentFinish,
rent)
PRIMARY KEY (studentNo, rentStart)
Figure 9.26 The sample data after removing the second repeating group.
StudentRental
studentNo   accomId  rentStart    rentFinish   rent
2005453125  WM107    11 Sep 2005  10 Jun 2006  550
2005453125  MC078    10 Sep 2006  9 Jun 2007   500
2004023401  WM107    12 Sep 2004  11 Jun 2005  450
2004023401  MW031    11 Sep 2005  10 Jun 2006  550
2004023401  MW042    10 Sep 2006  9 Jun 2007   575
2004892857  MC078    12 Sep 2004  11 Jun 2005  400
2004892857  WM107    12 Jun 2005  10 Sep 2005  450
Student
studentNo   name
2005453125  Peter Harris
2004023401  Alice Clough
2004892857  Jamie Smith
Accommodation
accomId  hall             room  staffNo  bursar
WM107    William Morris   123   56673    Joshua Bittaye
MC078    Marie Curie      14    35054    Selma Hutchins
MW031    Maurice Wilkins  46    35054    Selma Hutchins
MW042    Maurice Wilkins  51    35054    Selma Hutchins
Figure 9.27 The sample data after removing the third repeating group.
StudentRental
studentNo   accomId  rentStart    rentFinish   rent
2005453125  WM107    11 Sep 2005  10 Jun 2006  550
2005453125  MC078    10 Sep 2006  9 Jun 2007   500
2004023401  WM107    12 Sep 2004  11 Jun 2005  450
2004023401  MW031    11 Sep 2005  10 Jun 2006  550
2004023401  MW042    10 Sep 2006  9 Jun 2007   575
2004892857  MC078    12 Sep 2004  11 Jun 2005  400
2004892857  WM107    12 Jun 2005  10 Sep 2005  450
Student
studentNo   name
2005453125  Peter Harris
2004023401  Alice Clough
2004892857  Jamie Smith
Accommodation
accomId  hall             room
WM107    William Morris   123
MC078    Marie Curie      14
MW031    Maurice Wilkins  46
MW042    Maurice Wilkins  51
Hall
name             staffNo  bursar
William Morris   56673    Joshua Bittaye
Marie Curie      35054    Selma Hutchins
Maurice Wilkins  35054    Selma Hutchins
Code listing 9.4.5 Table schema after removing the repeating group
{staffNo, bursar}
Student(studentNo, name)
PRIMARY KEY (studentNo)
StudentRental(studentNo, accomId, rentStart, rentFinish,
rent)
Figure 9.28 The sample data after removing the fourth repeating group.
StudentRental
studentNo   accomId  rentStart    rentFinish   rent
2005453125  WM107    11 Sep 2005  10 Jun 2006  550
2005453125  MC078    10 Sep 2006  9 Jun 2007   500
2004023401  WM107    12 Sep 2004  11 Jun 2005  450
2004023401  MW031    11 Sep 2005  10 Jun 2006  550
2004023401  MW042    10 Sep 2006  9 Jun 2007   575
2004892857  MC078    12 Sep 2004  11 Jun 2005  400
2004892857  WM107    12 Jun 2005  10 Sep 2005  450
Student
studentNo   name
2005453125  Peter Harris
2004023401  Alice Clough
2004892857  Jamie Smith
Functional determinant A set of columns that forms a ‘candidate key’ for a subset
of the columns in the table. The columns which have their value determined
are called the dependent columns.
This definition means that every candidate key is a functional determinant as a candidate
key determines the values of all the other columns and, in maths, the whole set
is also considered a subset of itself. This leads to a clear definition of when to stop the
normalization process – when every functional determinant is a candidate key, there
are no more repeating groups.
Figure 9.29 Checking whether or not the value in one column determines the values of other
columns.
StudentRental
studentNo   name          accomId  hall             room  rentStart    rentFinish   rent  staffNo  bursar
2005453125  Peter Harris  WM107    William Morris   123   11 Sep 2005  10 Jun 2006  550   56673    Joshua Bittaye
2005453125  Peter Harris  MC078    Marie Curie      14    10 Sep 2006  9 Jun 2007   500   35054    Selma Hutchins
2004023401  Alice Clough  WM107    William Morris   123   12 Sep 2004  11 Jun 2005  450   56673    Joshua Bittaye
2004023401  Alice Clough  MW031    Maurice Wilkins  46    11 Sep 2005  10 Jun 2006  550   35054    Selma Hutchins
2004023401  Alice Clough  MW042    Maurice Wilkins  51    10 Sep 2006  9 Jun 2007   575   35054    Selma Hutchins
2004892857  Jamie Smith   MC078    Marie Curie      14    12 Sep 2004  11 Jun 2005  400   35054    Selma Hutchins
2004892857  Jamie Smith   WM107    William Morris   123   12 Jun 2005  10 Sep 2005  450   56673    Joshua Bittaye
are any of the other columns. Choose any other column and there are two rows that
have the same value in studentNo but different values in the chosen column. This
means that name is the only dependent column. This is written:
{studentNo} → {name}
Whenever a functional determinant is written down, it must show all the dependent
columns.
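The row-by-row check described above can be sketched in a few lines of Python. This is an illustration only, not part of the book's method: the function name dependent_columns and the row layout are assumptions, and the answer is only as reliable as the sample data (here, the rows of Figure 9.29).

```python
# Sample rows copied from Figure 9.29 (StudentRental).
COLUMNS = ["studentNo", "name", "accomId", "hall", "room",
           "rentStart", "rentFinish", "rent", "staffNo", "bursar"]
DATA = [
    ("2005453125", "Peter Harris", "WM107", "William Morris", "123",
     "11 Sep 2005", "10 Jun 2006", 550, "56673", "Joshua Bittaye"),
    ("2005453125", "Peter Harris", "MC078", "Marie Curie", "14",
     "10 Sep 2006", "9 Jun 2007", 500, "35054", "Selma Hutchins"),
    ("2004023401", "Alice Clough", "WM107", "William Morris", "123",
     "12 Sep 2004", "11 Jun 2005", 450, "56673", "Joshua Bittaye"),
    ("2004023401", "Alice Clough", "MW031", "Maurice Wilkins", "46",
     "11 Sep 2005", "10 Jun 2006", 550, "35054", "Selma Hutchins"),
    ("2004023401", "Alice Clough", "MW042", "Maurice Wilkins", "51",
     "10 Sep 2006", "9 Jun 2007", 575, "35054", "Selma Hutchins"),
    ("2004892857", "Jamie Smith", "MC078", "Marie Curie", "14",
     "12 Sep 2004", "11 Jun 2005", 400, "35054", "Selma Hutchins"),
    ("2004892857", "Jamie Smith", "WM107", "William Morris", "123",
     "12 Jun 2005", "10 Sep 2005", 450, "56673", "Joshua Bittaye"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in DATA]


def dependent_columns(rows, determinant):
    """Return the columns whose value is fixed by `determinant` in this sample.

    A column is ruled out as soon as two rows agree on the determinant
    columns but disagree on that column.
    """
    first_seen = {}  # determinant value -> first row seen with that value
    candidates = set(rows[0]) - set(determinant)
    for row in rows:
        key = tuple(row[column] for column in determinant)
        first = first_seen.setdefault(key, row)
        candidates = {c for c in candidates if row[c] == first[c]}
    return candidates


print(dependent_columns(ROWS, ["studentNo"]))        # {'name'}
print(sorted(dependent_columns(ROWS, ["accomId"])))
```

The first call reproduces {studentNo} → {name}. A result like this only shows that the sample data does not contradict the dependency; whether it makes sense in the scenario still has to be judged by the designer.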
A similar argument shows that, based on the data in Figure 9.29:
{hall, room} must determine the value of accomId, which uniquely identifies a
room. Similarly, each hall is managed by a particular bursar, so {hall, room} also
determines the values of staffNo and bursar. As {accomId} → {hall, room,
staffNo, bursar}, we’ve shown that the functional determinants {accomId}
and {hall, room} are two different ‘candidate keys’ for the same repeating group.
Each repeating group will be dealt with once, so the database designer must decide
which of these two functional determinants is the main one for the repeating group.
This is just like deciding which of a table’s candidate keys becomes the primary key.
In theory, each set of columns should be examined to see whether or not it forms a
functional determinant. For a table with 10 columns there are $2^{10} - 2 = 1022$ sets of
columns that are potentially a functional determinant (every subset of the columns
apart from the empty set and the full set). For a table with n columns,
there are $2^n - 2$. This means that the size of the task increases exponentially with the
size of the table. Clearly, it is just not practical to check all the possible functional
determinants by hand, even if the sample data could be guaranteed to be complete and
comprehensive.
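To make the count concrete, the following Python sketch enumerates the sets of columns that would have to be examined. It assumes that the sets being counted are the non-empty, proper subsets of the columns; the function name candidate_determinants is hypothetical.

```python
from itertools import combinations


def candidate_determinants(columns):
    """Yield every non-empty, proper subset of the columns."""
    for size in range(1, len(columns)):
        yield from combinations(columns, size)


# The ten column names of the StudentRental table.
ten_columns = ["studentNo", "name", "accomId", "hall", "room",
               "rentStart", "rentFinish", "rent", "staffNo", "bursar"]
count = sum(1 for _ in candidate_determinants(ten_columns))
print(count)  # 1022
```

Adding one more column roughly doubles the number of sets, which is why exhaustive checking does not scale.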
There are computer programs that will do this job, but their accuracy is limited by
the quality of the sample data, and their speed by the size of the original table.
A better approach is for the database designer to use his or her understanding of the
scenario being modelled to identify likely functional determinants and use the sample
data to check these. For example, any column whose name indicates that it is
an identifier of some sort is likely to be a functional determinant. Thus
accomId is an identifier for a particular room in a particular hall of residence,
studentNo identifies a student and staffNo identifies a bursar.
It is worth running through the normalization of the StudentRental data again,
using the notion of a functional determinant to help identify repeating groups. This
will show that the order in which the repeating groups are removed does not change
the final result (though the choice of the main functional determinant for a repeating
group does). Nor should the order of the rows and columns in the spreadsheet change
which repeating groups are identified, though it may make some repeating groups
easier to spot.
The Accommodation Office at the Pennine University actually has two versions of
the data in the StudentRental spreadsheet. The version shown in Figure 9.23 (a) is
sorted by student and allows Accommodation Office staff to quickly see the rental
history of a particular student. The other, RoomRental, sorts the data by room
(Figure 9.30). This allows staff to check on room usage over a number of years.
The table schema is shown in Code listing 9.4.6. Note that the primary key is
{accomId, rentStart} and {studentNo, rentStart} is still a candidate
key for the data, but, simply because of the structure of the spreadsheet, {accomId,
rentStart} is easier to spot.
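As a quick sanity check, the sample data can be tested to see whether a set of columns has a different value on every row. The Python sketch below does this for the two sets of columns named above, using the rows of the RoomRental spreadsheet; the function name is_unique is an assumption. It tests uniqueness only, so it cannot confirm that a key is irreducible or that it will stay unique as more data arrives.

```python
# accomId, rentStart and studentNo for each row of the RoomRental spreadsheet.
COLUMNS = ["accomId", "rentStart", "studentNo"]
DATA = [
    ("MC078", "12 Sep 2004", "2004892857"),
    ("MC078", "10 Sep 2006", "2005453125"),
    ("MW031", "11 Sep 2005", "2004023401"),
    ("MW042", "10 Sep 2006", "2004023401"),
    ("WM107", "12 Jun 2005", "2004892857"),
    ("WM107", "12 Sep 2004", "2004023401"),
    ("WM107", "11 Sep 2005", "2005453125"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in DATA]


def is_unique(rows, columns):
    """True if no two rows share the same values in the given columns."""
    keys = [tuple(row[column] for column in columns) for row in rows]
    return len(keys) == len(set(keys))


print(is_unique(ROWS, ["accomId", "rentStart"]))    # True
print(is_unique(ROWS, ["studentNo", "rentStart"]))  # True
print(is_unique(ROWS, ["accomId"]))                 # False
```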
Figure 9.30 The RoomRental spreadsheet – a reordered version of the StudentRental spreadsheet
data.
accomId  room  hall             bursar          staffNo  rentStart    rentFinish   rent  studentNo   name
MC078    14    Marie Curie      Selma Hutchins  35054    12 Sep 2004  11 Jun 2005  400   2004892857  Jamie Smith
                                                         10 Sep 2006  9 Jun 2007   500   2005453125  Peter Harris
MW031    46    Maurice Wilkins  Selma Hutchins  35054    11 Sep 2005  10 Jun 2006  550   2004023401  Alice Clough
MW042    51    Maurice Wilkins  Selma Hutchins  35054    10 Sep 2006  9 Jun 2007   575   2004023401  Alice Clough
WM107    123   William Morris   Joshua Bittaye  56673    12 Jun 2005  10 Sep 2005  450   2004892857  Jamie Smith
                                                         12 Sep 2004  11 Jun 2005  450   2004023401  Alice Clough
                                                         11 Sep 2005  10 Jun 2006  550   2005453125  Peter Harris
RoomRental
accomId  room  hall             bursar          staffNo  rentStart    rentFinish   rent  studentNo   name
MC078    14    Marie Curie      Selma Hutchins  35054    12 Sep 2004  11 Jun 2005  400   2004892857  Jamie Smith
MC078    14    Marie Curie      Selma Hutchins  35054    10 Sep 2006  9 Jun 2007   500   2005453125  Peter Harris
MW031    46    Maurice Wilkins  Selma Hutchins  35054    11 Sep 2005  10 Jun 2006  550   2004023401  Alice Clough
MW042    51    Maurice Wilkins  Selma Hutchins  35054    10 Sep 2006  9 Jun 2007   575   2004023401  Alice Clough
WM107    123   William Morris   Joshua Bittaye  56673    12 Jun 2005  10 Sep 2005  450   2004892857  Jamie Smith
WM107    123   William Morris   Joshua Bittaye  56673    12 Sep 2004  11 Jun 2005  450   2004023401  Alice Clough
WM107    123   William Morris   Joshua Bittaye  56673    11 Sep 2005  10 Jun 2006  550   2005453125  Peter Harris

Bursar
staffNo  name
35054    Selma Hutchins
56673    Joshua Bittaye

RoomRental
accomId  room  hall             staffNo  rentStart    rentFinish   rent  studentNo   name
MC078    14    Marie Curie      35054    12 Sep 2004  11 Jun 2005  400   2004892857  Jamie Smith
MC078    14    Marie Curie      35054    10 Sep 2006  9 Jun 2007   500   2005453125  Peter Harris
MW031    46    Maurice Wilkins  35054    11 Sep 2005  10 Jun 2006  550   2004023401  Alice Clough
MW042    51    Maurice Wilkins  35054    10 Sep 2006  9 Jun 2007   575   2004023401  Alice Clough
WM107    123   William Morris   56673    12 Jun 2005  10 Sep 2005  450   2004892857  Jamie Smith
WM107    123   William Morris   56673    12 Sep 2004  11 Jun 2005  450   2004023401  Alice Clough
WM107    123   William Morris   56673    11 Sep 2005  10 Jun 2006  550   2005453125  Peter Harris
determines the value of bursar. The data in the first and third rows shows that this
is the only dependent column, so:
{staffNo} → {bursar}
Now create a new table consisting of all the columns in the functional determinant
and all its dependent columns. Finally, remove the dependent columns from the
RoomRental table. Figure 9.31 shows the transformation from a single RoomRental
table to a RoomRental table plus a Bursar table. Code listing 9.4.7 shows the table
schema.
Code listing 9.4.7 Table schema after dealing with the functional
determinant {staffNo} → {bursar}
RoomRental(accomId, room, hall, staffNo, rentStart,
rentFinish, rent, studentNo, name)
PRIMARY KEY (accomId, rentStart)
FOREIGN KEY (staffNo) REFERENCES Bursar(staffNo)
Bursar(staffNo, name)
PRIMARY KEY (staffNo)
Next, consider the column studentNo. Both the data and the scenario suggest a
functional determinant {studentNo} → {name} and the data in the third and
fourth rows indicates that name is the only dependent attribute.
Code listing 9.4.8 and Figure 9.32 show the tables at this stage of the normalization
process.
Figure 9.32 The RoomRental data after two functional determinants have
been dealt with.
RoomRental
accomId  room  hall             staffNo  rentStart    rentFinish   rent  studentNo
MC078    14    Marie Curie      35054    12 Sep 2004  11 Jun 2005  400   2004892857
MC078    14    Marie Curie      35054    10 Sep 2006  9 Jun 2007   500   2005453125
MW031    46    Maurice Wilkins  35054    11 Sep 2005  10 Jun 2006  550   2004023401
MW042    51    Maurice Wilkins  35054    10 Sep 2006  9 Jun 2007   575   2004023401
WM107    123   William Morris   56673    12 Jun 2005  10 Sep 2005  450   2004892857
WM107    123   William Morris   56673    12 Sep 2004  11 Jun 2005  450   2004023401
WM107    123   William Morris   56673    11 Sep 2005  10 Jun 2006  550   2005453125

Bursar
staffNo  name
35054    Selma Hutchins
56673    Joshua Bittaye

Student
studentNo   name
2004892857  Jamie Smith
2005453125  Peter Harris
2004023401  Alice Clough
Code listing 9.4.8 Table schema after dealing with the functional
determinant {studentNo} → {name}
RoomRental(accomId, room, hall, staffNo, rentStart,
rentFinish, rent, studentNo)
PRIMARY KEY (accomId, rentStart)
FOREIGN KEY (staffNo) REFERENCES Bursar(staffNo)
FOREIGN KEY (studentNo) REFERENCES Student(studentNo)
Bursar(staffNo, name)
PRIMARY KEY (staffNo)
Student(studentNo, name)
PRIMARY KEY (studentNo)
The next obvious candidate for a functional determinant is {accomId} as, like
staffNo and studentNo, it is clearly some form of unique identifier. Figure 9.32
indicates that {accomId} → {room, hall, staffNo}. This is not quite the
same repeating group as that identified when normalizing the StudentRental
version of this data. There, the repeating group also included the column bursar (see
Figure 9.25 and Code listing 9.4.3). This time, the bursar column has already been
extracted into a separate table, becoming the Bursar table’s name column.
Dealing with the functional determinant {accomId} leads to a new table,
Accommodation. The process is a little trickier than previously because the dependent
column staffNo is a foreign key (see Code listing 9.4.8). When the staffNo
column is copied to the new table Accommodation, so is the associated foreign key
constraint. Then, as staffNo is a dependent attribute, the column and its associated
foreign key constraint are removed from RoomRental. This leads to the table schema
in Code listing 9.4.9 and the sample data in Figure 9.33.
Figure 9.33 Sample data after dealing with the functional determinant
{accomId} → {room, hall, staffNo}.
RoomRental
accomId  rentStart    rentFinish   rent  studentNo
MC078    12 Sep 2004  11 Jun 2005  400   2004892857
MC078    10 Sep 2006  9 Jun 2007   500   2005453125
MW031    11 Sep 2005  10 Jun 2006  550   2004023401
MW042    10 Sep 2006  9 Jun 2007   575   2004023401
WM107    12 Jun 2005  10 Sep 2005  450   2004892857
WM107    12 Sep 2004  11 Jun 2005  450   2004023401
WM107    11 Sep 2005  10 Jun 2006  550   2005453125

Student
studentNo   name
2004892857  Jamie Smith
2005453125  Peter Harris
2004023401  Alice Clough

Accommodation
accomId  room  hall             staffNo
MC078    14    Marie Curie      35054
MW031    46    Maurice Wilkins  35054
MW042    51    Maurice Wilkins  35054
WM107    123   William Morris   56673

Bursar
staffNo  name
35054    Selma Hutchins
56673    Joshua Bittaye
Code listing 9.4.9 Table schema after dealing with the functional
determinant {accomId} → {room, hall, staffNo}
RoomRental(accomId, rentStart, rentFinish, rent,
studentNo)
PRIMARY KEY (accomId, rentStart)
FOREIGN KEY (studentNo) REFERENCES Student(studentNo)
FOREIGN KEY (accomId) REFERENCES Accommodation(accomId)
Accommodation(accomId, room, hall, staffNo)
PRIMARY KEY (accomId)
FOREIGN KEY (staffNo) REFERENCES Bursar(staffNo)
Bursar(staffNo, name)
PRIMARY KEY (staffNo)
Student(studentNo, name)
PRIMARY KEY (studentNo)
RoomRental
accomId  rentStart    rentFinish   rent  studentNo
MC078    12 Sep 2004  11 Jun 2005  400   2004892857
MC078    10 Sep 2006  9 Jun 2007   500   2005453125
MW031    11 Sep 2005  10 Jun 2006  550   2004023401
MW042    10 Sep 2006  9 Jun 2007   575   2004023401
WM107    12 Jun 2005  10 Sep 2005  450   2004892857
WM107    12 Sep 2004  11 Jun 2005  450   2004023401
WM107    11 Sep 2005  10 Jun 2006  550   2005453125

Student
studentNo   name
2004892857  Jamie Smith
2005453125  Peter Harris
2004023401  Alice Clough

Accommodation
accomId  room  hall
MC078    14    Marie Curie
MW031    46    Maurice Wilkins
MW042    51    Maurice Wilkins
WM107    123   William Morris

Hall
name             staffNo
Marie Curie      35054
Maurice Wilkins  35054
William Morris   56673

Bursar
staffNo  name
35054    Selma Hutchins
56673    Joshua Bittaye
Comparing the table schema in Code listings 9.4.5 and 9.4.10, there is one significant
difference. In Code listing 9.4.10, the RoomRental table has a primary key
{accomId, rentStart}, while the equivalent StudentRental table in Code
listing 9.4.5 has a primary key {studentNo, rentStart}. Whenever there is a
choice of candidate keys, there will be different normalized sets of table schema.
rents may go up, while competition between landlords means that they may come
down). Thus {accomId, studentNo} is not a functional determinant. In fact,
there are no other functional determinants and so the RoomRental table is also fully
normalized.
As has been demonstrated, provided all the dependent columns are identified for
each functional determinant, then it does not matter in which order the repeating
groups are dealt with. Where there are two or more functional determinants for a
single repeating group, the choice of primary key for the new table will affect the final
table schema. However, the difference is not significant for the logical design as any
candidate key could be used as the primary key. Thus, normalization can proceed
according to the following steps.
1 Identify a functional determinant and all its dependent columns. Always ensure
that for each dependent column columnName the claim ‘the functional determinant
determines the value of columnName’ makes sense within the scenario being
modelled.
2 Create a new table consisting of the columns in the functional determinant and all
its dependent columns.
(a) Make the functional determinant the primary key of this table.
(b) If the dependent columns include the columns of a pre-existing foreign key,
then include the foreign key constraint on the new table.
3 Remove the dependent columns from the original table, but leave the columns of the
functional determinant behind.
(a) If the dependent columns include the columns of a foreign key, then remove the
foreign key constraint on the original table.
(b) Add a new foreign key constraint on the original table to indicate that the
functional determinant references the primary key of the new table.
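Steps 2 and 3 can be sketched on sample data in Python. This is a minimal illustration under stated assumptions: rows are held as dictionaries, the function name split_on_determinant is hypothetical, and only the data is moved. The primary key and foreign key constraints of steps 2(a), 2(b), 3(a) and 3(b) still have to be recorded in the table schema by the designer.

```python
def split_on_determinant(rows, determinant, dependents):
    """Split a table on a functional determinant.

    Returns (new_table, remaining): new_table holds one row per distinct
    value of the determinant, with the determinant and dependent columns;
    remaining is the original table with the dependent columns removed.
    """
    new_table = []
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in determinant)
        if key not in seen:
            seen.add(key)
            new_table.append({c: row[c] for c in (*determinant, *dependents)})
    remaining = [{c: v for c, v in row.items() if c not in dependents}
                 for row in rows]
    return new_table, remaining


# Three rows and four columns taken from the RoomRental sample data.
ROOM_RENTAL = [
    {"accomId": "MC078", "rentStart": "12 Sep 2004",
     "staffNo": "35054", "bursar": "Selma Hutchins"},
    {"accomId": "MW031", "rentStart": "11 Sep 2005",
     "staffNo": "35054", "bursar": "Selma Hutchins"},
    {"accomId": "WM107", "rentStart": "12 Jun 2005",
     "staffNo": "56673", "bursar": "Joshua Bittaye"},
]

# Deal with the functional determinant {staffNo} -> {bursar}.
bursar, room_rental = split_on_determinant(ROOM_RENTAL, ["staffNo"], ["bursar"])
print(bursar)
print(room_rental)
```

The new table ends up with one row per bursar, while the original keeps staffNo so that it can act as the foreign key described in step 3(b).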
Chapter summary
■ This chapter has discussed logical database design for the relational data model,
demonstrating how to develop a relational database schema from a conceptual ER
diagram. The task was broken down into two basic steps. First, create a logical ER
diagram by eliminating those features that could not be expressed directly using
the organization structures of the relational data model. Next, write down a set of
table schema. The process is a little involved, but largely mechanical.
■ The chapter has concluded with a discussion of normalization. This is an alternative
approach to relational database design that is particularly useful in situations
when an existing data set (such as a spreadsheet) needs to be stored in a new
database. Both an informal and a formal approach to normalization have been
described. The approach to normalization taken here allowed the repeating groups to be
dealt with in any order. There is an alternative approach, which distinguishes
different kinds of repeating group and deals with them in a strict order. See the Further
reading section, next, for sources on this stepped approach to normalization.
Further reading
Logical database design for the relational data model is a well-established process.
The steps described in this chapter are also covered by most database textbooks. The
approach to normalization taken here is based on that described in Howe (2001).
Other books – Date (2004), Connolly and Begg (2004), Elmasri and Navathe (2007) –
use the stepped approach to normalization.
Review questions
9.1 Explain how logical database design differs from conceptual database design. Why
are they kept separate?
9.2 Explain the purpose of the logical ER diagram. How does it differ from the ER
diagram for the conceptual database design?
9.3 In logical database design for a relational database, how can the designer deal with
multivalued attributes?
9.5 Explain why including default values in foreign key columns can lead to the database
holding incorrect data. Under what circumstances would it be safe to include default
values?
9.6 Explain how to derive the referential action for a foreign key from the structure of the
logical ER diagram.
9.7 What is a discriminator column and when would you use one?
9.8 In the context of normalization, what is a repeating group and how is it dealt with?
9.10 Explain how the ER diagram can help to determine whether or not a foreign key
column may be null.
Exercises
9.12 Write down logical designs for all the entities in Figure 12.4, the logical ER diagram
for the Web Timetable application.
9.13 Figure E.1 is a conceptual ER diagram for the Pennine University’s Accommodation
Office.
(a) Draw the corresponding logical ER diagram.
(b) Write down the table schema.
Figure E.1 A conceptual ER diagram for the Pennine University’s Accommodation Office.
[Diagram not reproduced. The class, association and attribute labels recovered from it
are listed below; the lines and multiplicities linking the classes are not recoverable.]
Classes («persistent»): Student, Staff, Room, Hall
Associations: RaisesProblem, Manages (Staff in the role Bursar), Rents, Has
RaisesProblem attributes: dateRaised : Date; problem : Varchar
Rents attributes: start : Date; end : Date; rent : Money
Student
  studentNo : Char(10) {PK}
  fName [1..*] : Varchar
  lName : Varchar
  email [1..2] : Varchar {CK=1}
  phone [0..1] : Varchar
Staff
  staffNo : Varchar {PK}
  fName : Varchar
  lName : Varchar
  phone [0..1] : Varchar
  email : Varchar {CK=1}
  photo [0..1] : Blob
  title : Varchar {Value one of Mr, Ms, Miss, Mrs, Dr, Prof}
Room
  accomId : Varchar {PK}
  roomNo : Varchar
  ensuite : Boolean
  floor : Integer {Floor of the building the room is on: 0 for ground floor, 1 for first floor, etc.}
Hall
  name : Varchar {PK}
  /capacity : Integer {Total number of rooms}
  address : Varchar
  phone [2..3] : Varchar
  postcode : Varchar
  distanceFromCampus : Integer
Attribute whose owning class is unclear: selfCatering : Boolean
9.14 An analysis of the different sorts of students who attend the Pennine University led to
the generalization hierarchy shown in Figure E.2. This exercise examines portions of
the generalization hierarchy before attempting to write down a complete set of table
schema.
(a) An Undergraduate must be either a Foundation student (studying a Foundation
degree) or a Bachelors student (studying a Bachelors degree). Foundation
students will have a workplace and a workplace mentor while Bachelors
students may be studying on a sandwich course. Write down the table schema
for these three entities (ignore the other entities for now). Ensure that you take
account of the constraint on the generalization relationships.
[Figure E.2: generalization hierarchy diagram, not reproduced. Labels recovered from it:]
Student («persistent»)
  studentNo : Char(10) {PK}
  fName [1..*] : Varchar
  lName : Varchar
  email [1..2] : Varchar {CK=1}
  phone [0..1] : Varchar
Other classes shown («persistent»): Undergraduate, Postgraduate, ShortCourse, Foundation
Foundation
  workplace : Varchar
  mentor : Varchar
Generalization constraints shown: {overlapping, optional}, {disjoint, optional}
(ignore the other entities for now). Ensure that you take account of both the
constraints on the generalization relationships.
(e) Now write down table schema that implement the complete generalization
hierarchy. How do they differ from the table schema written in previous parts of this
question?
(f) Critically evaluate the benefits and drawbacks of modelling generalization
hierarchies in this way.
9.15 Use normalization to develop a set of table schema from the spreadsheet data
shown in Figure E.3.
LecturerModule
staffNo  fName   lName     moduleCode  module                       semester
31210    Paul    Smith     CIFC0084    Information systems          1
                           CIHO6008    Soft systems                 2
35054    Selma   Hutchins  CCFC0108    Introduction to programming  1
                           CCH09668    Advanced programming         2
                           CCHO0418    Formal specification         2
56673    Joshua  Bittaye   CIFC0084    Information systems          2
9.16 Use normalization to develop a set of table schema from the spreadsheet data
shown in Figure E.4. What problem with the normalization process does this example
illustrate?
ProgrammeOfStudy
department  head   name         programme     pManager  pManagerName    course  courseTitle                cDirector  cDirectorName
Business    12231  John Smith   Foundation    81134     Emmy Noether    B2Y6    Management                 76674      Paul Dirac
                                                                        B4T1    Business studies           81134      Emmy Noether
                                Bachelors     43188     Jeff Smith      B5Y7    Management                 43188      Jeff Smith
                                                                        B4R1    Business studies (top-up)  81134      Emmy Noether
Computing   23257  Freya Stark  Foundation    31210     Paul Smith      C1F8    Software engineering       67884      David Davies
                                Bachelors     31210     Paul Smith      C5T1    Software engineering       56673      Joshua Bittaye
                                                                        C2Z1    Computer science           31210      Paul Smith
                                Postgraduate  35054     Selma Hutchins  C3M5    Information systems        35054      Selma Hutchins
                                                                        C2M3    Computer science           31210      Paul Smith
Investigations
9.17 Investigate the stepped approach to normalization (first, second and third normal
forms), summarizing the process and explaining the similarities and differences to
the approach used in this chapter.
9.18 Although spreadsheets are a good starting point for normalization, the process can
be used with other initial sources of data. All that is needed is a list of data items (the
columns). Investigate how to use normalization when the initial source of data is a set
of entities and their attributes, a form (such as university application form) and items
on the form, and a web page for an e-commerce retailer.
9.19 In Investigation 8.19, you developed a conceptual database design for an
organization you know. Use the ER diagram to create a logical database design, including a
logical ER diagram and a set of table schema.
Chapter objectives
Chapter outline
Physical database design takes the logical database design and investigates how the
facilities provided by a particular DBMS can best be used to implement it. It is a much
more diverse process than either conceptual or logical database design. Some decisions
are relatively straightforward, such as which data types provided by the DBMS best
match the data types of the logical design. Others could have been made during the
logical design process, but were postponed because different DBMS offer different
solutions – how to implement derived attributes and business rules, for example.
There are also more general physical design decisions, such as whether or not to
reintroduce redundant data to help improve performance. The common thread linking
these decisions is that they are reliant on the facilities provided by the chosen DBMS
and improved performance is an important goal.
A crucial step is that of understanding the design’s non-functional requirements,
such as the volume of data to be stored, response times and the processing power
needed to run database transactions. Analysing storage requirements and transactions
can offer some useful guidance for database designers and developers, but the
conclusions are indicative, rather than prescriptive. Most commercial DBMS have powerful,
■ Section 10.1 completes the translation of the conceptual design into a design that
can be implemented on a relational DBMS. It discusses how to map the data types
used in the conceptual and logical designs to those supported by a particular
DBMS. It also discusses how to implement derived attributes and business rules.
These issues were not considered during logical database design as the choice of
DBMS affects which options are available. The section concludes by suggesting one
way to document the physical database design using SQL DDL statements. The
approach in this section leads to a first draft of the physical database design. As with
the chapters on conceptual and logical design, this first section provides an
overview of the basic tasks, rather than a comprehensive description of physical
database design.
■ Section 10.2 considers how to assess the likely performance of the database design.
It introduces techniques for assessing the likely data storage capacity required by
individual tables and, hence, the database as a whole, and for assessing the
performance of individual database transactions (SQL DML statements). The techniques
can be applied to either the logical database design or the first draft of the physical
database design. The conclusions may lead the database designer to reassess the
first draft of the physical database design or even change the target DBMS.
■ Section 10.3 considers some general issues that the database designer should
consider during physical database design. These include whether or not to introduce
surrogate keys, how to deal with binary data and whether or not to introduce
controlled redundancy.
■ The first three sections described above are all focused on how best to design tables
and manage the data they hold using features defined in the ISO SQL standard.
Section 10.4 considers the data storage structures and underlying data files used by
the DBMS to hold the database instance. It outlines the three basic ways in which to
organize records within data files – heap, ordered and hash files. It then discusses
the role of indexes in managing data.
■ Section 10.5 concludes the chapter with a discussion of database security issues,
based on the security model of the ISO SQL:2003 standard.
This section deals with those pieces of information from the conceptual database
design that were set aside during logical database design. It uses features of the
ISO SQL:2003 standard that are available on most DBMS, though the particulars of
implementation may vary.
Table 10.1 Differences in data types between ISO standard and two
commercial DBMS.
approximate numeric value. Most business applications only need exact numeric data,
but scientific applications will need to use approximate numerics.
The variations on date and time data types are greater than for the other data types.
The DATE and TIME data types of ISO SQL and the MySQL DBMS hold a date and a time
respectively. The Oracle database’s DATE data type and the MySQL DBMS DATETIME data
type both hold a date and time, such as ‘21 January 2006 18:36:00’. The idea of the
TIMESTAMP data types is that the values provide precision down to fractions of a second
(not provided by the Oracle DATE or MySQL DATETIME data types). Support for time zones
and automatic time stamping varies for the data types on different DBMS.
The INTERVAL data types of the ISO SQL standard offer a comprehensive and
flexible approach to dealing with intervals of time. Year and month intervals can be
expressed as a number of years, months or both. Day and time intervals can deal with
days, hours, minutes, seconds or some subsequence of these, such as days, hours and
hours, minutes, seconds (this is expressed in Table 10.1 as INTERVAL MAJOR TO
MINOR).
The Oracle database offers the same coverage with a more compact syntax, so
INTERVAL YEAR TO MONTH holds intervals such as ‘3 years and 7 months’, ‘2 years’
and ‘5 months’. INTERVAL DAY TO SECOND can hold intervals such as ‘2 days,
4 hours, 17 minutes and 27.35 seconds’, ‘12 seconds’, ‘1 hour, 7 minutes’ and so on.
The MySQL DBMS allows limited support for hour, minute and second intervals
with its TIME data type and greater support for intervals in the functions used to
manipulate dates and times.
Support for the Boolean data type is patchy, which is odd as it is the only type of
data that is actually required by the relational data model. It was included in ISO
SQL:1999 and is implemented in the MySQL 5.1 DBMS as a synonym for
TINYINT(1) – that is, a single-digit integer. This is a common approach when the
Boolean data type is not available – simply designate a single digit (or character) to
represent ‘True’ and all others to represent ‘False’.
Non-scalar data types also have patchy support. The ISO SQL standard defines
ARRAY and MULTISET. The Oracle database implements the VARRAY data type. The
MySQL DBMS implements the SET data type, though the possible elements of the set
must be defined when the table is created, so this is closer to an enumeration than a
true set data type.
The ISO SQL:2003 standard includes the ability for users to define their own data
types. User-defined types are supported by the Oracle 10g database (through its object
type), though they are not supported in the MySQL 5.1 DBMS. In many relational
databases, the system-defined data types are all that are needed.
One common use for user-defined data types is ensuring that foreign key columns
have exactly the same data type as their matching primary key column. The ISO
SQL:2003 distinct types can be used for this. A distinct type is created as a restriction
of an existing system-defined type. Code listing 10.1.1.1 shows how to do this for the
Staff Directory database tables. A new distinct type, called StaffNo, is defined using
the CREATE TYPE statement and used as the data type for the staffNo columns in
the Staff and SupportSession tables.
As SQL enforces strong typing on user-defined types, it is not possible to compare,
for example, a StaffNo value to a CHAR(5) value, even though these two data types
have the same set of possible values. There are facilities to cast a user-defined type back
to its underlying system-defined type. Check the manual for the target DBMS before
using this facility, though.
Code listing 10.1.1.1 Using user-defined data types for primary and foreign
key columns
CREATE TYPE StaffNo AS CHAR(5) FINAL;
CREATE TABLE Staff(
staffNo StaffNo NOT NULL ,
The same advice applies for those thinking of using ISO SQL:2003 structured types.
These provide many of the features of classes from object-orientated programming
languages, but there is one important difference. To SQL, every instance of a
user-defined type is a value and it isn’t possible to distinguish between two different
instances that have the same value. In a fully object-orientated language, every
instance of a class has a unique identifier, so two instances with the same value can
still be distinguished. Structured types are beyond the scope of an introductory
database textbook.
When the value of a derived attribute is stored in a column, the database application
needs to maintain the redundant data. When a row is added to the database, or
modified, no value is supplied for the derived attribute. Instead, the application
calculates the correct value based on the other values. If necessary, this calculation can be
done by the database client (such as the web server) before submitting the database
transaction to the DBMS. A better way is by using a database trigger.
Each SQL DML data modification statement causes a database event to occur. A
trigger is a piece of application code that is executed when some specified database
event occurs on a named table.
There are three database events that can cause a trigger to execute. An insert event
occurs when data is added to a table (typically using an insert statement, but a
database import may also cause an insert event). An update event occurs when data on a
table is modified and a delete event when data is removed.
A trigger that executes just once in response to a database event is called a
statement-level trigger. A row-level trigger executes once for each affected row. For
example, suppose an update statement affects three rows. A statement-level update
trigger would execute once, whereas a row-level update trigger would execute three
times. (If the affected table did not have an update trigger, then nothing would
happen.)
One additional piece of flexibility offered by triggers is that they can be timed to
execute either before or after the database event. This is where their usefulness for
dealing with derived data comes in. Before a new row is added to a table, a row-level before
insert trigger can calculate the value of any derived data from the values supplied
for the other columns. Similarly, the trigger can ensure that derived data is correctly
modified when the other column values on a row change.
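The idea can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions rather than the approach of any particular DBMS discussed here: the TimetableSlot table and its columns are hypothetical, and because SQLite does not allow a before trigger to assign to the NEW row, the sketch uses row-level after insert and after update triggers that correct the stored row instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TimetableSlot (
    slotId    INTEGER PRIMARY KEY,
    startTime INTEGER NOT NULL,   -- minutes after midnight
    duration  INTEGER NOT NULL,   -- minutes
    endTime   INTEGER             -- derived: startTime + duration
);

-- Row-level trigger: fill in the derived column after each insert.
CREATE TRIGGER TimetableSlot_insert AFTER INSERT ON TimetableSlot
FOR EACH ROW
BEGIN
    UPDATE TimetableSlot
    SET endTime = NEW.startTime + NEW.duration
    WHERE slotId = NEW.slotId;
END;

-- Row-level trigger: recalculate only when a source column changes.
CREATE TRIGGER TimetableSlot_update
AFTER UPDATE OF startTime, duration ON TimetableSlot
FOR EACH ROW
BEGIN
    UPDATE TimetableSlot
    SET endTime = NEW.startTime + NEW.duration
    WHERE slotId = NEW.slotId;
END;
""")

QUERY = "SELECT endTime FROM TimetableSlot WHERE slotId = 1"

# No value is supplied for the derived column; the trigger calculates it.
conn.execute(
    "INSERT INTO TimetableSlot (slotId, startTime, duration) VALUES (1, 540, 90)")
after_insert = conn.execute(QUERY).fetchone()[0]

conn.execute("UPDATE TimetableSlot SET duration = 120 WHERE slotId = 1")
after_update = conn.execute(QUERY).fetchone()[0]

print(after_insert, after_update)  # 630 660
```

On a DBMS that lets a before trigger change the incoming row, the same calculation can be applied to the new values directly, so the row is stored correctly the first time and no follow-up update is needed.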
Figure 10.1 shows the general form of the ISO SQL create trigger statement. It begins
with the keywords CREATE TRIGGER followed by the name of the trigger. The
triggerEvent indicates what actions by users cause the trigger to execute
(INSERT, UPDATE or DELETE). The triggerTime is either BEFORE or AFTER.
Each trigger is associated with changes to a single table and this is specified in the on
clause. The referencing clause allows the database developer to define correlation
names for the columns being changed and access both the old and new values.
Neither the MySQL DBMS nor the Oracle database supports the referencing clause.
Instead, both include standard correlation names NEW and OLD.
The statement in Figure 10.1 creates a row-level trigger, using the keywords FOR EACH
ROW. To create a statement-level trigger, use FOR EACH STATEMENT instead. The
statement in Figure 10.1 also includes a trigger condition.
Trigger conditions allow finer control over when a trigger is executed. The
TriggerCondition is a Boolean expression. When the triggering event occurs, the
DBMS checks the trigger condition. If it is true, then the triggered action is carried out,
but otherwise no further action is taken.
As an example of using triggers to maintain duplicate data, the logical design for the
TeachingSession entity on the Web Timetable database includes two derived attributes.
The value of endTime is calculated by adding duration to startTime. The value of
runsFor is calculated by subtracting startDate from endDate. Code listing
10.1.2.1 shows the create table statement for the TeachingSession table with both
these derived attributes implemented as columns. It also shows the before insert and
before update triggers that maintain the redundant data. Although insert and delete
events always affect whole rows, update events may affect only a single column. It
would be pointless recalculating the value of endTime unless one of duration or
startTime had changed. The trigger event for an update trigger can include a list of
column names, in which case the trigger is executed only if one of those columns is affected.
Notice that each trigger is triggered by a separate database event. The MySQL DBMS
also insists on this, but the Oracle database allows multiple database events to cause one
trigger to execute. The before insert trigger tbiTeachingSession (tbi stands for
‘trigger, before insert’) has two statements in its triggered action that calculate the two
redundant data values. The before update triggers tbuTeachingSessionEndTime
and tbuTeachingSessionRunsFor calculate the redundant data value for one
column each and are only triggered by changes to the values of the relevant columns.
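The behaviour of these triggers can be tried out with a small sketch. It is not a copy of Code listing 10.1.2.1. It assumes Python's sqlite3 module, and, because SQLite triggers cannot assign to NEW in a before trigger, it uses after triggers that issue an UPDATE instead. The sessionId column and the storage of times as minutes after midnight are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TeachingSession (
    sessionId INTEGER PRIMARY KEY,   -- hypothetical key, for illustration only
    startTime INTEGER NOT NULL,      -- minutes after midnight (assumption)
    duration  INTEGER NOT NULL,      -- minutes
    endTime   INTEGER                -- derived: startTime + duration
);

-- Row-level trigger: runs once for each inserted row.
CREATE TRIGGER taiTeachingSession
AFTER INSERT ON TeachingSession
FOR EACH ROW
BEGIN
    UPDATE TeachingSession
    SET endTime = NEW.startTime + NEW.duration
    WHERE sessionId = NEW.sessionId;
END;

-- The column list restricts the trigger to updates of startTime or duration.
CREATE TRIGGER tauTeachingSessionEndTime
AFTER UPDATE OF startTime, duration ON TeachingSession
FOR EACH ROW
BEGIN
    UPDATE TeachingSession
    SET endTime = NEW.startTime + NEW.duration
    WHERE sessionId = NEW.sessionId;
END;
""")

query = "SELECT endTime FROM TeachingSession WHERE sessionId = 1"

# No value is supplied for endTime; the insert trigger calculates it.
conn.execute(
    "INSERT INTO TeachingSession (sessionId, startTime, duration) VALUES (1, 540, 60)"
)
end_after_insert = conn.execute(query).fetchone()[0]

# Changing duration fires the update trigger, which recalculates endTime.
conn.execute("UPDATE TeachingSession SET duration = 120 WHERE sessionId = 1")
end_after_update = conn.execute(query).fetchone()[0]

print(end_after_insert, end_after_update)
```

In a DBMS that allows before triggers to set NEW values, the same calculation would normally be done before the row is written, as described above.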
In general, triggered actions can involve multiple statements and use the full SQL
procedural language (if statements, loops and so on).
In Code listing 10.1.2.1, there are a lot of semicolons. Each statement in the triggered
action must end with a semicolon. This is called a delimiter as it delimits the state-
ment (it shows where one statement ends and the next begins). The semicolon is also
the delimiter for SQL DDL statements. Although the DBMS should be able to sort out
what each semicolon means, be aware that it may be necessary to use a different
statement delimiter for the SQL DDL to avoid confusion. This is the case with the
MySQL DBMS.
database. If these are not available, then implement the application code on the web
server by, for example, using server-side scripting in PHP. As a last resort, implement
the calculation using browser scripting on the web browser.
Code listing 10.1.4.2 Physical table design for the logical table schema in
Code listing 10.1.4.1 as an SQL script
-- **********************************************************
-- Create the base tables with their primary key constraints
-- **********************************************************
-- **********************************************************
-- Add foreign keys
-- **********************************************************
ALTER TABLE Equipment
ADD CONSTRAINT frnEquipRoom
FOREIGN KEY (building, room) REFERENCES Room(building,
roomNo);
-- **********************************************************
-- Implement candidate keys and business rules
-- **********************************************************
ALTER TABLE Room
ADD CONSTRAINT chkRoomCapacity CHECK (capacity >= 20);
-- **********************************************************
-- Create views
-- **********************************************************
-- **********************************************************
-- Create triggers
-- **********************************************************
-- taiEquipment maintains the derived column Room.equipmentValue
CREATE TRIGGER taiEquipment
AFTER INSERT ON Equipment
FOR EACH STATEMENT
BEGIN ATOMIC
    UPDATE Room
    SET equipmentValue =
        (SELECT SUM(cost)
         FROM Equipment
         WHERE Equipment.building = Room.building
           AND Equipment.room = Room.roomNo);
END;
The structure outlined in Code listing 10.1.4.2 provides a reasonable approach to doc-
umenting the physical database design. It also provides a first draft of the actual data-
base implementation script. An alternative approach is to have a more detailed written
description of the physical table designs. This can include information on data
volumes and performance. The following section discusses how to assess the perfor-
mance of a physical database design.
Code listing 10.2.1.1 Physical table design for the Staff table
CREATE TABLE Staff(
staffNo CHAR(5) NOT NULL ,
fName VARCHAR(50) NOT NULL ,
lName VARCHAR(20) NOT NULL ,
phone VARCHAR(20) ,
email VARCHAR(50) NOT NULL ,
photo BLOB ,
department VARCHAR(20) NOT NULL ,
title VARCHAR(5) ,
jobType VARCHAR(20) ,
jobTitle VARCHAR(30) ,
CONSTRAINT pkStaff PRIMARY KEY (staffNo)
);
For example, the CHAR and VARCHAR columns in the Staff table have a total,
maximum length of 220 characters, giving a maximum of 220 bytes of character data
in each row. The photo column has data type BLOB and stores a JPEG with maxi-
mum size of 100 kB. This gives an estimated total size for each row of 102,620 bytes
(remember, 1kB = 1024 bytes, and 1MB = 1,048,576 bytes). There are, at most, 1200
staff working at the university, so a maximum of 1200 rows in the Staff table.
Thus, the Staff table has a maximum size of 123,144,000 bytes, which is roughly
117 MB.
It is also useful to do the calculation without any binary columns as one common
design decision in web database applications is to store image files (and other binary
data) on the web server, rather than the database server. In this situation, the photo
column would store a URL to the appropriate image file (see below). Assuming that the
photo column has data type VARCHAR(30), a row of the Staff table has a maxi-
mum size of 250 bytes and the table itself a maximum size of 300,000 bytes, or roughly
293kB. Clearly, removing the binary data from the database dramatically reduces the
size of the Staff table.
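The arithmetic above is easy to check with a few lines of code. The following sketch, written in Python purely for illustration, repeats the maximum-size calculation for both versions of the photo column. It assumes one byte per character and uses the column lengths from Code listing 10.2.1.1; the variable names are not part of the design.

```python
# Maximum lengths of the character columns in Code listing 10.2.1.1,
# assuming one byte per character.
char_columns = {
    "staffNo": 5, "fName": 50, "lName": 20, "phone": 20, "email": 50,
    "department": 20, "title": 5, "jobType": 20, "jobTitle": 30,
}
MAX_ROWS = 1200
KB = 1024
MB = 1024 * 1024

char_bytes = sum(char_columns.values())

# Option 1: the photo is held in the database as a BLOB of at most 100 kB.
row_with_blob = char_bytes + 100 * KB
table_with_blob = row_with_blob * MAX_ROWS

# Option 2: the photo column holds a URL in a VARCHAR(30).
row_with_url = char_bytes + 30
table_with_url = row_with_url * MAX_ROWS

print(f"With BLOB: {row_with_blob} bytes per row, "
      f"{table_with_blob} bytes ({table_with_blob / MB:.1f} MB) in total")
print(f"With URL:  {row_with_url} bytes per row, "
      f"{table_with_url} bytes ({table_with_url / KB:.1f} kB) in total")
```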
This calculation takes no account of the fact that some columns may be null or that
the length of the VARCHAR columns is a maximum. A more sophisticated approach
would also consider the average length of data values in each column, with null having
a length of zero. These give the average row size. With an estimate of the average num-
ber of rows in the table instance at any time, it is possible to calculate the average size
of the table. Good sample data is essential, but otherwise the calculation remains pretty
simple. Figure 10.2 shows a spreadsheet for calculating this data. Note that the photo
column is included twice – once as a BLOB and once as a VARCHAR. The calculations
only use one of these different versions.
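The same calculation can be sketched in code rather than a spreadsheet. The Python fragment below takes the average lengths and the average number of rows from Figure 10.2 and, as before, assumes one byte per character. It is an illustration of the method, not a replacement for good sample data.

```python
# Average column lengths from Figure 10.2 (photo is handled separately).
average_lengths = {
    "staffNo": 5, "fName": 15, "lName": 8, "phone": 4, "email": 24,
    "department": 12, "title": 3, "jobType": 15, "jobTitle": 20,
}
PHOTO_AS_BLOB = 90000      # average size of the stored JPEG
PHOTO_AS_VARCHAR = 14      # average length of the stored URL
AVERAGE_ROWS = 1100

base = sum(average_lengths.values())

# Only one version of the photo column is used in each calculation.
avg_row_excl_binary = base + PHOTO_AS_VARCHAR
avg_row_with_binary = base + PHOTO_AS_BLOB

avg_table_excl_binary = avg_row_excl_binary * AVERAGE_ROWS
avg_table_with_binary = avg_row_with_binary * AVERAGE_ROWS

print(avg_row_excl_binary, avg_table_excl_binary)
print(avg_row_with_binary, avg_table_with_binary)
```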
Figure 10.2 A spreadsheet for calculating data volumes for the Staff table.
Row size
Column name    Data type    Maximum length    Average length
staffNo        CHAR                      5                 5
fName          VARCHAR                  50                15
lName          VARCHAR                  20                 8
phone          VARCHAR                  20                 4
email          VARCHAR                  50                24
photo          BLOB                 100000             90000
photo          VARCHAR                  30                14
department     VARCHAR                  20                12
title          VARCHAR                   5                 3
jobType        VARCHAR                  20                15
jobTitle       VARCHAR                  30                20
Excl. binary                           250               120
With binary                         100220             90106
Table size
Maximum rows   1200
Average rows   1100
There are two more technical problems to face when calculating data volumes. First,
modern character sets may use more than one byte for some, or all, character encodings.
A two-byte character encoding doubles the amount of storage needed for a character
column. It is worth checking which character set is used to store data and
considering whether or not this affects the assumption that one character equates to
one byte. Second, numeric data can be stored in a range of formats – integer, floating
point, decimal – and these use different numbers of bytes, as do dates and times. The
technical documentation for the DBMS should provide information on this, though
often an estimate based on the size of literals will suffice (as literals usually take up
more space than the actual data values, this gives a generous estimate of disk space
requirements).
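The effect of the character set can be seen by encoding a few strings and counting the bytes. The Python sketch below compares UTF-8 and UTF-16; the sample strings are made up for illustration, and the encoding actually used by a particular DBMS must be checked in its documentation.

```python
def storage_lengths(text):
    """Return (characters, bytes in UTF-8, bytes in UTF-16) for a string."""
    return len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Plain ASCII, one accented letter, and two East Asian characters.
for sample in ["Hutchins", "Müller", "東京"]:
    chars, utf8_bytes, utf16_bytes = storage_lengths(sample)
    print(f"{sample}: {chars} characters, "
          f"{utf8_bytes} bytes in UTF-8, {utf16_bytes} bytes in UTF-16")
```

Only the plain ASCII string satisfies the assumption of one byte per character, and then only in UTF-8.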
Database transactions were considered briefly in the section on access path analysis
in Chapter 8.
The way a table is used can affect choices about its data storage structures (the internal
level of the ANSI/SPARC architecture). A table that is only ever queried or populated by
a database import is effectively read-only. Some DBMS offer very fast data storage
structures for read-only tables.
Tables containing data that is modified regularly may need careful design to ensure
efficient performance. The careful choice of indexes is particularly important as they
speed up some database transactions but slow down others. A QUID matrix (QUID
stands for query, update, insert, delete) records whether a database transaction queries
a table, inserts a new row, updates an existing row or deletes a row. It is reasonable to
expect every table to have database transactions that insert a row, query the table and
delete a row.
Occasionally a table will be created as a data export from another database,
in which case the database transaction inserting the rows would be part of another
application.
Occasionally a table will never have data deleted, such as an archive table. Otherwise,
a table that is never queried, or that is not used by any database transaction, suggests
that either a database transaction has been missed or the table is unnecessary.
Figure 10.3 An extract from the QUID matrix for the Web Timetable
database application.
Transactions (columns, left to right): Import timetable data; Book equipment;
View staff timetable; Choose teaching session
Entity
Attendance I
Course UID
CourseModule UID
Equipment Q
EquipmentBooking QUID
Module UID Q Q
Room
Staff UID Q Q
Student Q
StudentModule Q
TeachingSession UID Q Q
Figure 10.3 shows an extract from the QUID matrix for the Web Timetable database.
The tables (including link tables) are shown down the left-hand side and four database
transactions are shown across the top. So, for example, the ‘Book equipment’ transaction
queries the Equipment, EquipmentBooking and Staff tables and may insert,
update or delete from the EquipmentBooking table. The idea is that this
transaction represents one behaviour captured by the use cases. It is a relatively coarse-
grained transaction as the insert, update and delete behaviour are rolled into one.
The database transaction ‘Choose teaching session’ is more fine-grained as it only
represents a student choosing which teaching sessions to attend; separate database
transactions (not shown) would allow them to change their minds.
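A QUID matrix is simple enough to hold in a small data structure, which makes the checks described earlier (tables that are never used, or never queried) easy to automate. The Python sketch below is one possible representation. The ‘Book equipment’ entries follow the description above; the placement of the ‘Import timetable data’ entries is an assumption read from the figure, and only some of the tables are included.

```python
# transaction -> table -> letters from "QUID"
quid = {
    "Book equipment": {"Equipment": "Q", "EquipmentBooking": "QUID", "Staff": "Q"},
    # Assumed placement: the figure lists UID against these tables.
    "Import timetable data": {"Course": "UID", "CourseModule": "UID"},
}
tables = ["Course", "CourseModule", "Equipment", "EquipmentBooking", "Room", "Staff"]


def operations_on(table):
    """Return the set of QUID letters that any transaction applies to the table."""
    ops = set()
    for usage in quid.values():
        ops.update(usage.get(table, ""))
    return ops


# Tables that no transaction touches at all.
unused = [t for t in tables if not operations_on(t)]

# Tables that are modified by some transaction but never queried.
never_queried = [t for t in tables
                 if operations_on(t) and "Q" not in operations_on(t)]

print("Not used by any transaction:", unused)
print("Never queried:", never_queried)
```

With only two transactions recorded, the tables reported here simply show that the extract is incomplete. Run against a full matrix, the same checks point to a missed transaction or an unnecessary table.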
It’s important to realize that a transaction to insert rows into a table may use an SQL
INSERT statement or an import facility of the DBMS. It is the behaviour of the
transaction that is important, not the actual SQL statement used to carry it out.
Other names for the QUID matrix include CRUD (create, read, update, delete) and
IRUD (insert, read, update, delete).
transaction runs and how many rows are accessed is added to the access path. The idea
is that the more rows a transaction reads, the more work the DBMS must do and the
longer the transaction will take to execute. A transaction that reads many rows and
runs frequently is a good candidate for careful performance tuning.
Access paths are based on the logical ER diagram rather than the table schema. (Re-
member, there is a one-to-one correspondence between entities in the logical ER dia-
gram and table schema.) Consider the database query ‘List all staff who teach on
courses in the computing department’ (a variant of the query considered in Chapter 8;
see, for example, Figure 8.46 (b)).
An access path based on the conceptual database design will not include any link or
look-up entities created during logical database design, so the access path should be
redrawn as shown in Figure 10.4. As before, the access path indicates which tables
must be accessed by the DBMS to respond to the query. Given a particular course in
the computing department, all modules that could be taken on that course are identi-
fied via the CourseModule and Module tables. For each module, a list of teaching
sessions for that module is compiled from the TeachingSession table. Finally, for
each teaching session, the single member of staff who teaches that session is identified
from the Staff table.
Figure 10.4 Access path analysis for the database transaction 'List all staff who teach on
courses in a given department'. Access path: Course (22), CourseModule (17), Module (1),
TeachingSession (4), Staff (1). Rows accessed = 3762. Frequency < 1 per day.
For each table, the number on the right-hand side is an estimate of the average num-
ber of rows accessed by the DBMS for each row of the table above it in the access path.
The first table is a special case – the number beside it estimates the average number of
rows initially accessed by the DBMS. The query requires a list of all courses run in the
computing department and the staff teaching them. The department offers 22 different
courses, so the number 22 is written beside the Course entity on the access path.
Moving down the access path, for each course at the Pennine University, there are
typically 17 modules (6 each in years 1 and 2, 5 in the final year). This means that, for
one row of Course, the DBMS will need to access 17 rows of CourseModule. (For some
courses the number will be higher, for others lower, but 17 is the typical number.) In
turn, each row of CourseModule is linked to exactly one row of Module, hence the 1
beside that entity. A typical module has four teaching sessions – one lecture and two
tutorials. Again, this will vary, but four is a reasonable estimate to place beside
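TeachingSession. Finally, each teaching session is taught by a single member of
staff. This can be summarized by noting that the DBMS must access:
22 rows of Course;
17 rows of CourseModule for each row of Course;
1 row of Module for each row of CourseModule;
4 rows of TeachingSession for each row of Module;
1 row of Staff for each row of TeachingSession.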
It is now a simple exercise to calculate an estimate for the total number of rows
accessed by the DBMS when it executes this query. The calculation is recursive. The
number of rows accessed is:
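(22 rows in Course) + 22 × (the number of other rows accessed for each row in
Course)
The number of other rows accessed for each row of Course is:
17 + 17 × (1 + 1 × (4 + 4 × 1)) = 170
so the total number of rows accessed is 22 + 22 × 170 = 3762.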
The access path diagram should include a description of the transaction, the total
number of rows accessed and the frequency with which the transaction is run. The
transaction shown in Figure 10.4 accesses only a few rows and runs very infrequently,
so wouldn’t normally be a candidate for close scrutiny.
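The recursive calculation can be written as a short function. The Python sketch below is an illustration only: the function name and the representation of an access path as a list of numbers are assumptions. The figures are the estimates written beside each entity in Figure 10.4.

```python
def rows_accessed(path):
    """Estimate the total number of rows read along an access path.

    path[0] is the number of rows initially accessed; each later entry is
    the average number of rows accessed for each row of the table above it.
    """
    if not path:
        return 0
    first, rest = path[0], path[1:]
    # Rows in this table, plus the rows reached from each of them.
    return first + first * rows_accessed(rest)


# Course, CourseModule, Module, TeachingSession, Staff
figure_10_4 = [22, 17, 1, 4, 1]
print(rows_accessed(figure_10_4))
```

The result agrees with the total of 3762 rows shown on the access path diagram. Changing any one estimate shows how sensitive the total is to the numbers near the top of the path.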
One role of access path analysis is to help the database developer write SQL state-
ments to implement a database transaction. Some database transactions will have two
or more possible access paths – that is, two or more different SQL statements that imple-
ment them. For example, the access path for the transaction to list all staff who teach on
a course could go via the Student table rather than the Module table. This alternative
access path is shown in Figure 10.5. Notice that here the number of rows accessed is four
times greater than the access path in Figure 10.4. As both these access paths translate
into different SQL SELECT statements, it makes sense to use the one that implements
the most efficient access path.
Figure 10.5 Alternative access path analysis for the database transaction 'List all staff who
teach on courses in a given department'. Access path: Course (22), Student (100),
Attendance (2), TeachingSession (1), Staff (1). Rows accessed = 15,422. Frequency < 1 per day.
Some physical database design decisions need to be made regardless of the chosen
DBMS. These decisions usually involve trading off improved performance of database
queries against poorer performance of database modifications. It is crucial that the
database designer properly assesses the implications of any change, to ensure that it
does not worsen the situation. The analysis techniques discussed above can help to as-
sess the strengths and weaknesses of the alternative designs, but differences between
DBMS mean that decisions can only really be made by someone who understands the
technical capabilities of the target DBMS in detail. It is worth emphasizing this as a
general textbook cannot substitute for a DBMS manual or training course. That said,
the issues are the same; it’s just the solutions that might be different.
capacity INTEGER,
CONSTRAINT priRoom PRIMARY KEY (roomId)
);
There are problems with using surrogate keys. First, the DBMS assigns the value of the
surrogate key so it rarely has any meaning for end users (note that staff numbers, mod-
ule codes and so on are not surrogate keys as such values are dictated by the users, not
the DBMS). This can actually increase the complexity of some database transactions.
The database transaction ‘List all equipment and the rooms they are located in’ could
be answered by querying a single table in the logical design – all the required data is in
the Equipment table as the building and roomNo columns are included in this
table as a foreign key. With a surrogate key on the Room table, the building and
roomNo columns no longer appear in the Equipment table as foreign key columns.
The surrogate key appears instead. In this design, to find which room equipment is
stored in requires a table join.
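The extra join can be seen in a small sketch. It uses Python's sqlite3 module so that it can be run as it stands. The Equipment columns assetNo and description, and the sample data, are hypothetical; only roomId, building, roomNo and capacity come from the Room table discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Room (
    roomId   INTEGER PRIMARY KEY,      -- surrogate key
    building VARCHAR(20) NOT NULL,
    roomNo   VARCHAR(10) NOT NULL,
    capacity INTEGER,
    UNIQUE (building, roomNo)          -- logical primary key, still a candidate key
);
CREATE TABLE Equipment (
    assetNo     CHAR(6) PRIMARY KEY,   -- hypothetical column
    description VARCHAR(50),           -- hypothetical column
    roomId      INTEGER REFERENCES Room(roomId)
);
INSERT INTO Room VALUES (1, 'Priestley', '1.05', 30);
INSERT INTO Equipment VALUES ('EQ0001', 'Data projector', 1);
""")

# Equipment no longer holds building and roomNo, so Room must be joined in.
rows = conn.execute("""
    SELECT e.description, r.building, r.roomNo
    FROM Equipment e JOIN Room r ON e.roomId = r.roomId
""").fetchall()
print(rows)
```

With the logical design, the same information could be read from the Equipment table alone.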
The situation is even more complicated when the foreign key is a part of the refer-
encing table’s primary key. Replacing the referenced table’s primary key with a surro-
gate key will alter the structure of the referencing table’s primary key. One solution is
to give the referencing table a surrogate key as well. Another is to use the referenced
table’s logical primary key, which is still a candidate key, as the foreign key on the ref-
erencing table. This has the benefit of a surrogate key on the referenced table and a
meaningful foreign key on the referencing table.
One final problem arises when data is copied, or imported, into a table with a surro-
gate key. If this data already has values for the surrogate key column, then the DBMS
may struggle to cope. The ISO standard allows GENERATED ALWAYS to be replaced
by GENERATED BY DEFAULT. In this latter case, the DBMS will only generate a
value for the surrogate key if none is supplied. However, there is no guarantee that the
supplied values comply with the primary key constraint. If they do not, then the
inserts will be rejected.
Code listing 10.3.2.1 Revised column definitions for the Staff table
CREATE TYPE StaffNo AS CHAR(5) FINAL;
CREATE TABLE Staff(
staffNo StaffNo NOT NULL ,
fName VARCHAR(20) NOT NULL ,
lName VARCHAR(20) NOT NULL ,
phone VARCHAR(20) ,
email VARCHAR(30) NOT NULL ,
photo VARCHAR(100)
DEFAULT 'http://www.pennine.ac.uk/StaffDirectory/NoPhoto.jpg'
NOT NULL ,
department VARCHAR(30) NOT NULL ,
title VARCHAR(4) ,
jobType VARCHAR(15) ,
jobTitle VARCHAR(30) ,
CONSTRAINT pkStaff PRIMARY KEY (staffNo)
);
No information about contact details is needed, just the staff member’s name and the
days and times of their support sessions. However, details of the staff member’s name
are held in the Staff table, so this query must be answered using the join statement
given in Code listing 10.3.3.1.
Code listing 10.3.3.1 A join query showing details of staff support sessions
SELECT fName || ' ' || lName AS name, dayOfWeek, startTime,
endTime
FROM Staff, SupportSession
WHERE Staff.staffNo = SupportSession.staffNo
ORDER BY lName, fName;
Code listing 10.3.3.2 A join query showing details of staff support sessions
(a) Physical table schema for SupportSession with the fName and lName
columns of Staff duplicated
CREATE TABLE SupportSession(
staffNo StaffNo NOT NULL ,
dayOfWeek VARCHAR (9) NOT NULL ,
startTime TIME NOT NULL ,
endTime TIME NOT NULL ,
fName VARCHAR(20) NOT NULL ,
lName VARCHAR(20) NOT NULL ,
CONSTRAINT pkSupportSession
PRIMARY KEY (staffNo, dayOfWeek, startTime)
);
(b) The revised SQL statement to list details of staff support sessions
SELECT fName || ' ' || lName AS name, dayOfWeek, startTime,
endTime
FROM SupportSession
ORDER BY lName, fName;
Another occasion when controlled redundancy may improve performance is when
statistical information is required. For example, consider the database query:
provide a list of all teaching sessions showing the room, the room’s capacity and
the number of students attending that teaching session.
This query requires data from the TeachingSession and Room tables even before
starting to calculate the number of students attending a teaching session. It would
make sense to add a noOfStudents column to the TeachingSession table to
store the number of students attending each teaching session as this value is unlikely
to change after the start of term. This last is an important point – if statistical data is
likely to change frequently, then it is better to recalculate it on demand rather than
attempt to store the derived data.
One important point about repeated and derived data is that it must be main-
tained. If a member of staff gets married and changes her name, then her data in both
the Staff and SupportSession tables must be modified. Similarly, if a student
changes tutorial group, then the value of the derived data in the noOfStudents
column in the TeachingSession table would need to be recalculated. There is a
trade-off between better performance of database queries and more work involved
during data modification.
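One way to carry out this maintenance is with a trigger on the Staff table that copies a changed name into SupportSession. The sketch below assumes Python's sqlite3 module and a cut-down Staff table; the trigger name tauStaffName and the sample data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (
    staffNo CHAR(5) PRIMARY KEY,
    fName   VARCHAR(20) NOT NULL,
    lName   VARCHAR(20) NOT NULL
);
CREATE TABLE SupportSession (
    staffNo   CHAR(5) NOT NULL,
    dayOfWeek VARCHAR(9) NOT NULL,
    startTime TIME NOT NULL,
    endTime   TIME NOT NULL,
    fName     VARCHAR(20) NOT NULL,    -- repeated from Staff
    lName     VARCHAR(20) NOT NULL,    -- repeated from Staff
    PRIMARY KEY (staffNo, dayOfWeek, startTime)
);

-- Keep the repeated name columns in step with the Staff table.
CREATE TRIGGER tauStaffName
AFTER UPDATE OF fName, lName ON Staff
FOR EACH ROW
BEGIN
    UPDATE SupportSession
    SET fName = NEW.fName, lName = NEW.lName
    WHERE staffNo = NEW.staffNo;
END;

INSERT INTO Staff VALUES ('10001', 'Ann', 'Smith');
INSERT INTO SupportSession
VALUES ('10001', 'Monday', '10:00', '11:00', 'Ann', 'Smith');
""")

# A change of name on Staff is copied to every matching SupportSession row.
conn.execute("UPDATE Staff SET lName = 'Jones' WHERE staffNo = '10001'")
copied_name = conn.execute(
    "SELECT fName, lName FROM SupportSession WHERE staffNo = '10001'"
).fetchone()
print(copied_name)
```

Every update to a staff name now costs an extra update on SupportSession, which is the trade-off described above.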
The impact of maintaining repeated and derived data must be considered carefully.
It can actually be costlier to maintain the redundant data than it is to run the original
join and aggregate queries. If this is the case, then don’t introduce redundant data. The
section on derived attributes examined how to present such data using views and this
is a good alternative approach when data is likely to change regularly.
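As a reminder of the view-based alternative, the sketch below calculates the number of students for each teaching session on demand rather than storing it. It assumes Python's sqlite3 module; the view name TeachingSessionSize, the sessionId key and the Attendance columns are hypothetical simplifications of the Web Timetable tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TeachingSession (
    sessionId INTEGER PRIMARY KEY,     -- hypothetical key
    room      VARCHAR(10)
);
CREATE TABLE Attendance (
    studentNo CHAR(8),                 -- hypothetical columns
    sessionId INTEGER REFERENCES TeachingSession(sessionId),
    PRIMARY KEY (studentNo, sessionId)
);

-- noOfStudents is recalculated each time the view is queried.
CREATE VIEW TeachingSessionSize AS
    SELECT t.sessionId, t.room, COUNT(a.studentNo) AS noOfStudents
    FROM TeachingSession t
    LEFT JOIN Attendance a ON a.sessionId = t.sessionId
    GROUP BY t.sessionId, t.room;

INSERT INTO TeachingSession VALUES (1, 'A101'), (2, 'B202');
INSERT INTO Attendance VALUES ('S0000001', 1), ('S0000002', 1);
""")

sizes = conn.execute(
    "SELECT sessionId, noOfStudents FROM TeachingSessionSize ORDER BY sessionId"
).fetchall()
print(sizes)
```

No trigger is needed here, because nothing is stored. The cost is that the join and aggregate run on every query.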
Creating duplicate data by copying information between tables linked by a
foreign key is sometimes called denormalization. Strictly speaking, it isn’t. Denormal-
ization occurs when a table created as part of the normalization process is reabsorbed
into the table it came from. During normalization, each repeating group is extracted
into a new table, with a foreign key referencing the original table. Denormalization
reverses this – two tables linked by a foreign key are joined back together into a single
table.
Consider the Student Rental database, designed by normalization in Chapter 9. The
normalized table schema is shown in Code listing 10.3.3.3. There are three very small
tables – Student, Hall and Bursar – and all have just two columns. It’s worth
asking if it makes sense to have separate tables to hold this data.
Code listing 10.3.3.3 The logical table schema for the fully normalized
Student Rental database
Student(studentNo, name)
PRIMARY KEY (studentNo)
StudentRental(studentNo, accomId, rentStart, rentFinish,
rent)
PRIMARY KEY (studentNo, rentStart)
FOREIGN KEY (studentNo) REFERENCES Student(studentNo)
FOREIGN KEY (accomId) REFERENCES Accommodation(accomId)
Accommodation(accomId, hall, room)
PRIMARY KEY (accomId)
FOREIGN KEY (hall) REFERENCES Hall(name)
Hall(name, staffNo)
PRIMARY KEY (name)
FOREIGN KEY (staffNo) REFERENCES Bursar(staffNo)
Bursar(staffNo, name)
PRIMARY KEY (staffNo)
Consider the Bursar table. This holds information about the current bursar for each
hall of residence. Figures 10.6 (a) and (b) show table instances for the normalized and
denormalized versions of this data. The advantages of the normalized version are that,
if the name of a bursar changes, just one row of the Bursar table needs to be updated
and the name of each bursar is stored only once.
In this particular case, neither of these advantages is significant. There are only a
small number of halls of residence, so storing the bursars’ names more than once is
not a big issue. Also, the names of bursars don’t change that often.
Figure 10.6 (a) The normalized Hall and Bursar tables
Hall
name               staffNo
William Morris     56673
Marie Curie        35054
Maurice Wilkins    35054
Bursar
staffNo    name
56673      Joshua Bittaye
35054      Selma Hutchins
Figure 10.6 (b) The denormalized Hall table
Hall
name               staffNo    bursar
William Morris     56673      Joshua Bittaye
Marie Curie        35054      Selma Hutchins
Maurice Wilkins    35054      Selma Hutchins
The disadvantages come from having to perform table joins to find the name of each
hall’s bursar. In fact, the Bursar table is only ever queried as part of a table join with
the Hall table. Under these circumstances the normalized logical database design is
unnecessarily complex. Denormalizing the database, by merging the two tables back
into one, is a sensible decision. Figure 10.6 (b) shows the denormalized table.
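The effect of this denormalization can be sketched using the data from Figure 10.6. The example assumes Python's sqlite3 module, and the merged table is given the hypothetical name HallDenormalized so that both versions can exist side by side for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bursar (
    staffNo CHAR(5) PRIMARY KEY,
    name    VARCHAR(30)
);
CREATE TABLE Hall (
    name    VARCHAR(30) PRIMARY KEY,
    staffNo CHAR(5) REFERENCES Bursar(staffNo)
);
INSERT INTO Bursar VALUES ('56673', 'Joshua Bittaye'), ('35054', 'Selma Hutchins');
INSERT INTO Hall VALUES ('William Morris', '56673'),
                        ('Marie Curie', '35054'),
                        ('Maurice Wilkins', '35054');

-- Denormalized version: the bursar's name is absorbed into the hall's row.
CREATE TABLE HallDenormalized AS
    SELECT Hall.name, Hall.staffNo, Bursar.name AS bursar
    FROM Hall JOIN Bursar ON Hall.staffNo = Bursar.staffNo;
""")

# Normalized design: a join is needed to find each hall's bursar.
joined = conn.execute("""
    SELECT Hall.name, Bursar.name
    FROM Hall JOIN Bursar ON Hall.staffNo = Bursar.staffNo
    ORDER BY Hall.name
""").fetchall()

# Denormalized design: the same answer comes from a single table.
single = conn.execute(
    "SELECT name, bursar FROM HallDenormalized ORDER BY name"
).fetchall()

print(joined == single, single)
```

The price is visible in the data: Selma Hutchins now appears twice, so a change to her name must update two rows.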
In contrast, it does not make sense to denormalize the database by merging the Hall
and Accommodation tables. Some halls have over 200 rooms, so there would be a lot
of redundant data (bursars’ details repeated over and over again). Also, the hall name is
a foreign key on the Accommodation table, so queries about accommodation that
only need the hall name can be made without actually using the Hall table.
With the table designs shown in Code listing 10.3.3.3, it would probably also make
sense to merge the Student table back into StudentRental. However, it seems
unlikely that the only data held about a student is his or her name. It is much more
likely that the student’s home address, contact phone numbers and so on are also held
in the database. If this is true, then it would not be a good idea to merge these tables.
The lack of this data is a point that should be raised with the end users during concep-
tual database design. (It was left out of the example purely for practical reasons – it
makes the data set too large to fit on a page!)
The situation discussed here is analogous to that when surrogate keys are used.
Surrogate keys and the controlled reintroduction of redundant data are useful tech-
niques, but both can lead to more join queries. These are harder work for the DBMS
than single table queries. In many cases, the DBMS will easily cope with the extra
work but, when a particular join query will take up a lot of the DBMS’s resources, there is
a case for reconsidering the design.
Access path analysis can highlight join queries that are likely to cause problems.
Whenever redundant data is introduced into the physical database design, the com-
plexity of data modification operations will increase. Some form of application code
must be written to ensure that all copies of the redundant data are modified at the
same time. This could be DBMS triggers or code on the application server. The added
costs of this code should be balanced against the reduced query complexity that
results from introducing redundant data.
Both e-mail addresses and British postcodes actually have more structure than
described here, but these details aren’t important unless an application actually has
to decode them.
Both the telephone number and British postcode predate computer technology and
were developed as structured codes that people could easily use. However, the idea of
structured codes was enthusiastically adopted by early application developers and they
are widely used in computerized information systems.
The module and course codes at the Pennine University are structured codes. Mod-
ule codes begin with two letters indicating the department and subject area that deliv-
ers the module, followed by a third letter indicating the level, then a letter and four
digits. The first letter of a course code indicates the department offering the course.
In applications that use flat files to store data, structured codes have a number of
benefits. They have a regular structure that can simplify data processing, they store
complex information compactly, reducing disk space requirements, and are often easy
for experienced staff to use.
These benefits do not apply in contemporary database applications. The reduced
data storage is insignificant in most cases – disk space is cheap and plentiful. A com-
mercial DBMS can process separate data items efficiently enough that structured codes
become an impediment. For example, if the computing department were to change its
name to the informatics department, it would be easy to make this change on the
database. It would be much harder to change the first character of all the module
codes from ‘C’ to ‘I’. Another problem arises when users wish to see a list of all modules
offered by the information systems subject area. The DBMS needs to find all those
modules that have ‘I’ as the second character of their module code. In both these situations,
the string manipulation functions discussed in Chapters 5 and 6 allow the
DBMS to parse the string (pull it apart and reassemble it), but there are significant
performance problems associated with doing so.
Despite the difficulties with structured codes and the fact that they are not really
necessary in contemporary information systems, end users can be very attached to
them. Changing the codes means changing working practices and that is rarely some-
thing people welcome. So, during the conceptual and logical database design processes,
many systems will include structured codes. A good physical design will seek to mini-
mize the need for the DBMS to parse strings, so should try to avoid structured codes.
The simplest approach is to split the structured code into separate attributes and
manage each part of the code separately. In this approach, the Module table would
lose its code column and acquire four new columns – deptCode, subjectCode,
levelCode and sequenceId. This also means that there is now no need for a sep-
arate column called level as it simply duplicates information included in the mod-
ule code. The logical and physical table schema are shown in Code listing 10.3.4.1.
The end users can be shown the original structured code easily – just create a suitable
SQL view to merge the four columns back into one. However, this won’t help when
data is inserted into the database. End users will expect to type in a single string for the
module code, so the application program will need to parse this string before passing
the four parts to the DBMS.
(b) The physical table schema – the column code has been replaced by its
constituent parts and level removed
CREATE TABLE Module (
  deptCode    CHAR(1) NOT NULL,
  subjectCode CHAR(1) NOT NULL,
  levelCode   CHAR(1) NOT NULL,
  sequenceId  CHAR(5) NOT NULL,
  title       VARCHAR(100) NOT NULL,
  department  VARCHAR(20) NOT NULL,
  leader      CHAR(5) NOT NULL,
  CONSTRAINT priModule PRIMARY KEY
    (deptCode, subjectCode, levelCode, sequenceId)
);
(b) The revised physical table schema (without the business rules)
CREATE TABLE Module (
  code       moduleCode NOT NULL,
  title      VARCHAR(100) NOT NULL,
  department VARCHAR(20) NOT NULL,
  leader     CHAR(5) NOT NULL,
  CONSTRAINT priModule PRIMARY KEY (code)
);
If structured types are not supported by the target DBMS, then it makes more sense to
stick with the original definition of the Module table. In this case, the application
program will need to deal with the structured code.
Usually, the application developer will create one function that can parse a structured
code into its component parts and a second to reassemble the parts into a single string.
These functions could be implemented using stored routines on the database server or
server-side scripting on the application server. They could even be implemented on the
client, though in a web database application that would not be a sensible approach.
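The pair of functions described here might look something like the following Python sketch. It is only an illustration, not the approach used in the book's listings. The layout of the code is an assumption based on the column widths in the physical table schema: one character each for department, subject and level, followed by a sequence of up to five characters. The names ModuleCode, parse_module_code and build_module_code, and the sample code 'CI2001', are hypothetical.

```python
from typing import NamedTuple


class ModuleCode(NamedTuple):
    dept_code: str      # would be stored in deptCode CHAR(1)
    subject_code: str   # would be stored in subjectCode CHAR(1)
    level_code: str     # would be stored in levelCode CHAR(1)
    sequence_id: str    # would be stored in sequenceId CHAR(5)


def parse_module_code(code: str) -> ModuleCode:
    """Split a structured module code into its four parts.

    Assumed layout (not confirmed by the text): three single-character
    parts followed by a sequence of one to five characters.
    """
    code = code.strip()
    if not 4 <= len(code) <= 8:
        raise ValueError(f"unexpected module code length: {code!r}")
    return ModuleCode(code[0], code[1], code[2], code[3:])


def build_module_code(parts: ModuleCode) -> str:
    """Reassemble the four parts into the single string end users expect."""
    return "".join(parts)


parts = parse_module_code("CI2001")   # hypothetical module code
print(parts.subject_code)             # the second character of the code
print(build_module_code(parts))
```

Keeping both functions in one place, whether in a stored routine or a server-side script, means that a change to the format of the code only has to be made once.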
Yet another possibility is to introduce a surrogate key as the primary key of the
module table and include the component parts of the actual module code as separate
columns (see Code listing 10.3.4.3). In this case, a unique constraint should be placed
on the candidate key {deptCode, subjectCode, levelCode, sequenceId}.
room and PortableEquipment for the rest. With this design, the foreign key
columns in FixedEquipment are never null and there is no foreign key on
PortableEquipment.
A similar situation occurs in the Staff table. In the logical ER diagram, the unary
one-to-one association SharesWith indicates that two academic members of staff share
an office. The association does not model information about non-academic staff.
Adding a foreign key to represent this association leads to the physical table schema
shown in Code listing 10.3.5.1.
The foreign key consists of the single column {sharer}. The foreign key column
can be null as some staff do not share an office. The referential actions capture the
business scenario that when a member of staff leaves, whoever they shared the office
with will now have an office to themselves, so the foreign key can be set to null. If a
staff number changes, the new staff number designates the same person, who shares
the office with the same colleague, so the foreign key should be updated to the new
staff number.
where clause that are actually needed. This can be a particular problem where a select
statement includes an SQL view in the from clause.
Consider the SQL view ComputingCourse that only includes courses involving
the computing department as the lead department (Code listing 10.3.6.1 (a)). The
select statement in Code listing 10.3.6.1 (b) queries this view, but its where clause is
unnecessary as it simply repeats the where clause in the select statement that creates
the view.
Code listing 10.3.6.1 A redundant where condition arising from the meaning
of an SQL view
(a) The SQL view
CREATE VIEW ComputingCourse AS
  SELECT code, title, leadDepartment, minorDepartment,
         level, qualification, mode
  FROM Course
  WHERE leadDepartment = 'Computing';
(b) Only rooms with a capacity of 25 or greater appear on the Room table
SELECT building, roomNo, capacity
FROM Room
WHERE capacity>=25;
The application developer is on surer ground with the next guideline – only include
tables in the from clause that are actually needed. Which columns appear in which table
is entirely within the control of the database designer.
Consider the query in Code listing 10.3.6.3. There is actually no need to include the
Room table here as the only columns required exist as foreign key columns on
the Equipment table. Rewriting this query so that it retrieves all the required columns
from the Equipment table alone means the DBMS does not have to read the Room table
at all, which will improve performance on any DBMS.
The third general guideline for writing efficient SQL statements is to use equality
conditions wherever possible. For example, Code listing 10.3.6.4 (a) lists all equipment
bookings between two dates. It is the sort of select statement that might lie behind a
web form. The actual date literals in the where clause are gathered from an end user
and the select statement built using PHP. When those two dates are the same, it is
worth writing the application code in such a way that it actually builds the select
statement in Code listing 10.3.6.4 (b). There is a trade-off here, between more complex
application code and a more efficient SQL statement, but it is usually worth the effort.
Testing once, in PHP, whether or not two date literals are equal will take a lot less time
than having the DBMS test, for every row it examines, whether the value of the
bookedFor column is both greater than or equal to and less than or equal to the same
date literal.
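The application-side check can be sketched as follows. The text builds the statement in PHP; Python is used here purely for illustration, and the statement uses placeholders rather than date literals. The table name Booking and the function name build_booking_query are hypothetical; only the column bookedFor comes from the text.

```python
from datetime import date


def build_booking_query(start: date, end: date):
    """Return (sql, params) for bookings between two dates, inclusive.

    When both dates are the same, an equality condition is generated
    instead of a pair of range conditions.
    """
    # Booking is a hypothetical table name; bookedFor is the column
    # named in the text.
    base = "SELECT * FROM Booking WHERE "
    if start == end:
        return base + "bookedFor = ?", (start.isoformat(),)
    return (
        base + "bookedFor >= ? AND bookedFor <= ?",
        (start.isoformat(), end.isoformat()),
    )


sql, params = build_booking_query(date(2007, 5, 10), date(2007, 5, 10))
print(sql)
print(params)
```

The comparison of the two dates happens once, in the application, before the statement is sent to the DBMS.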
The fourth guideline is to use the group by and having clauses with care. Always be
sure that the query really needs to group data, rather than simply list it in some
particular order. If the group by clause includes every column of the table's primary
key, then each group will consist of a single row. In this case, the correct approach is
to use an order by clause instead.
These general guidelines will not ensure that the most efficient SQL statement is used,
but will avoid the most obvious flaws.
Even when SQL statements have been optimized for a particular DBMS, it is worth
listening to users’ feedback on the performance of the database application and con-
tinuing to tune the database transactions to ensure peak performance. This is particu-
larly true of applications that begin with an empty database instance – as the data is
added by end users, the performance is likely to degrade. Any serious problems should
be addressed as soon as they are reported.
10.4 Designing data storage
So far, the physical design decisions have affected the organization structures of the
logical and external views of the ANSI/SPARC architecture. In this section, the focus is
on the data storage structures of the internal view and the data files of the underlying
physical level. It covers features that are generally implemented in DBMS-specific ways
as it is in the internal view that DBMS vendors can compete most effectively. This may
be via data storage structures that offer high performance or special features for a niche
market. The ISO SQL:2003 standard mentions some of these features, but leaves a lot
of leeway in how DBMS vendors implement them.
Before examining the data storage structures of the internal view, it is useful to
delve a little deeper and consider how data is actually stored at the physical level.
When the DBMS needs to search a table, one block at a time is copied from the rel-
evant data file into memory. The DBMS turns the block into a table, with each record
becoming a row. If the required row is not present, then the next block is read, then
the next until the required row is found. When the table instance is modified, the
DBMS uses the same process to read the relevant data block into memory. It makes
the required changes to the data in memory, then gets the operating system to rewrite
the entire block back to the same place on the hard disk, overwriting the original block.
At the physical level, it is blocks, not records, that are manipulated.
There are three basic kinds of data file. The simplest is called a heap file. Heap files
are exactly what they sound like – an unordered heap of records. When a new row is
added to a table, the new record is appended to the last block in the table’s data file. If
this block is full, then the operating system allocates the data file a new block and the
record is added to that block. This makes inserting a new row into a table stored as a
heap file very fast indeed. The downside is that it is very time-consuming to find,
update or delete a row as the data file must be searched one block at a time for the
relevant record.
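The behaviour of a heap file can be imitated in a few lines of Python. This is a toy model under stated assumptions, not how any particular DBMS stores data: blocks are Python lists, every block holds at most four records, and the class name HeapFile and the sample records are hypothetical.

```python
class HeapFile:
    """Toy heap file: an unordered sequence of fixed-capacity blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = [[]]

    def insert(self, record):
        # Append to the last block; allocate a new block if it is full.
        if len(self.blocks[-1]) == self.block_size:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def find(self, key, field=0):
        """Scan block by block; return (record, number of blocks read)."""
        for blocks_read, block in enumerate(self.blocks, start=1):
            for record in block:
                if record[field] == key:
                    return record, blocks_read
        return None, len(self.blocks)


heap = HeapFile()
for n in range(10):
    heap.insert((n, f"name{n}"))   # hypothetical records

print(heap.find(0))    # found in the first block read
print(heap.find(9))    # every block has to be read to reach this record
```

Inserting never looks at more than the last block, whereas a search for a record near the end of the file, or for one that is not there at all, reads every block.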
In an ordered file (also called a sequential file, though, confusingly, this term is
sometimes used for heap files) the records are ordered according to the value of some
designated field – the ordering field. Figure 10.7 shows the Staff table and the or-
dered file that actually stores the data on disk. The file (and, hence, the table instance)
is ordered by lName. The data file starts with a file header, which holds information
such as the number of blocks used for the file, plus the location of each block on the
hard disk (as a pointer to the start of a block). In the example shown, each block holds
up to four records, and the first four blocks are full. The records are shown as comma-
separated values for convenience (the three dots – ‘. . .’ – indicate that not all the
record’s fields are shown in the illustration), though, in reality, a more sophisticated
record structure is likely.
Tables stored as ordered files can quickly be searched using the column that corre-
sponds to the ordering field, because the DBMS can hop around rather than searching
each block in order. For instance, in this example, to find the first record with the
value ‘Stark’ in the ordering field, the DBMS actually starts at the middle block,
Block 3. It checks the ordering field of the first record, which is ‘O’Connor’, and finds
that the required record must come after this record. It now checks the last record of
Block 3 and finds that the required record must be in a later block. The DBMS has now
eliminated the entire first half of the data file, simply by checking one block. In an
actual example with hundreds of blocks, this is a major saving in time. The DBMS con-
tinues by reading the middle block of the last half of the data file – Block 4 in the exam-
ple. Again, it compares the required value to the ordering field of the first record in this
block, which is ‘Smith’, and finds that the required record must come after this record.
It now checks the last record of Block 4, and finds that the required record must be in
the block. It can now scan the block for the required record.
Mathematicians reading this will recognize that this search mechanism is based on
the bisection method for finding the root of an equation.
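The block-hopping search described above is a binary search over blocks rather than over individual records. The sketch below is an illustration only. It assumes a file of five blocks and models each block as a list of ordering-field values; apart from 'O'Connor', 'Smith' and 'Stark', which are mentioned in the text, the surnames are hypothetical. It also assumes that a run of equal values does not straddle a block boundary.

```python
def find_block(blocks, target):
    """Return (block_number, blocks_read) for the block that may hold target.

    blocks is a list of non-empty lists of ordering-field values, sorted
    within and across blocks. Block numbers are 1-based, as in the text.
    """
    low, high = 0, len(blocks) - 1
    blocks_read = []
    while low <= high:
        mid = (low + high) // 2
        block = blocks[mid]
        blocks_read.append(mid + 1)
        if target < block[0]:
            high = mid - 1      # required record must be in an earlier block
        elif target > block[-1]:
            low = mid + 1       # required record must be in a later block
        else:
            return mid + 1, blocks_read   # in range: scan this block
    return None, blocks_read


# Hypothetical ordered file, ordered by surname.
ordered_file = [
    ["Adams", "Brown", "Clark", "Evans"],
    ["Fisher", "Green", "Hall", "Jones"],
    ["O'Connor", "Patel", "Quinn", "Roberts"],
    ["Smith", "Stark", "Taylor", "Timms"],
    ["Walker", "Young"],
]

print(find_block(ordered_file, "Stark"))
```

For 'Stark' the function reads Block 3 and then Block 4, mirroring the walk-through in the text. Each block read halves the part of the file still to be searched.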
block, and all subsequent blocks, need to be rewritten as the new record pushes all
subsequent records ‘down’ the data file. Clever ordered file structures leave some addi-
tional space in each block to allow for new records. However, this only postpones
the reorganization. Eventually the block fills up and, when a new record needs to be
added, the data file must be reorganized just as before. Deleting a row from the table
also causes problems. The relevant record is removed, leaving a gap in the block.
Again, the data file will need to be reorganized, though this can be postponed until a
later date (for example, the ‘compact and repair’ facility in Microsoft Access reorga-
nizes the underlying data files to remove such gaps only when the user requests it).
Updating the value of the column that corresponds to the ordering field causes both
problems as the original record must be deleted and a new record inserted at the
appropriate point in the data file.
Even faster access to records can be achieved using a hash file. This places each
record in a particular block based on the value of one of the fields. A hash function
takes the field value as input and outputs the address of one of the blocks in the data
file. The same field value always gets the same block address. Thus, to read a particular
record, the DBMS simply uses the hash function to find the correct block. It copies this
block into memory and scans it for the required data. This avoids the need to read all
the preceding blocks, which happens with a heap file, or to hop around in search of
the right block, as with an ordered file.
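A hash file can be sketched in the same toy style. The hash function used here, the remainder when the key is divided by the number of blocks, is a hypothetical choice made for simplicity, and the sketch treats staffNo as an integer. It also ignores what happens when a block fills up, which a real hash file has to handle.

```python
class HashFile:
    """Toy hash file: the hash of a field value selects the block."""

    def __init__(self, n_blocks=5):
        self.blocks = [[] for _ in range(n_blocks)]

    def block_address(self, key: int) -> int:
        # Hypothetical hash function: the same key always gives the
        # same block address.
        return key % len(self.blocks)

    def insert(self, record):
        self.blocks[self.block_address(record[0])].append(record)

    def find(self, key):
        # Only one block is read, whatever the size of the file.
        block = self.blocks[self.block_address(key)]
        for record in block:
            if record[0] == key:
                return record
        return None


staff_file = HashFile()
staff_file.insert((33935, "record A"))   # hypothetical record contents
staff_file.insert((25448, "record B"))
print(staff_file.block_address(33935), staff_file.find(33935))
```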
10.4.2 Indexes
An index provides a way to locate required information quickly. This book has an
index. To find a particular topic, use the index. The index entry will indicate on which
page(s) of the book the topic is discussed. In a similar way, a database column can have
an index. To find a particular column value, use the index. The index entries indicate
on which row(s) of the table the required column value occurs. In a relational data-
base, an index is stored in a table, but one that the DBMS recognizes as an index table
rather than a data table. The index table is itself stored on disk as an ordered file, with
each index entry a record in this file. This means that searching the index is very quick
(see the discussion above). Thus, when an index is defined on a database table, the
DBMS can copy the index file into memory and search it, rather than copy the table’s
data file. As the index will be smaller than the data file, this provides a significant
performance improvement for database searches, but only when the search criteria use
the indexed column. An index can be defined on almost any combination of columns,
but the concept is most easily explained by considering indexes on a single column.
Indexes can be used whether the table is stored as an ordered file or a heap file. For an
ordered file, there is a difference between the way indexes work when the indexed
column is the ordering field of the file and when it is not. The case where the indexed
column is the ordering field is discussed first.
When a database uses an ordered file as the data storage structure for a particular
table, it almost always includes an index on the ordering field of the data file. When
the ordering field is a candidate key on the table, the index is called the primary index
of the ordered file. (Only ordered files can have a primary index.) Rather than keep an
index record for each data file record, it makes more sense to keep an index record for
each block of the data file.
This situation is shown in Figure 10.8, where the ordering field is staffNo (the
primary key of the Staff table). Each record in the index file consists of two fields.
One field holds a pointer to a block in the data file. The other field holds a copy of the
value of the ordering field in the first record of that block. So, in Figure 10.8, the first
index record holds a pointer to the first block of the data file and the value 10780,
which is the value of the ordering field in the first record of the first block of the data
file. The second index record points to the second block of the data file and holds the
value 25448, the value of the first record of the second block, and so on. Note that
only some values of the indexed column are present in the index itself.
Searching for a particular row in the table using the value of staffNo is easy. For
example, suppose the DBMS wishes to retrieve the row of the table where the
staffNo value is 33935. Rather than searching the data file itself, the DBMS searches
the index file for an index record with this value. It doesn’t find one, but it does find
index records with staffNo values 25448 and 35054. As the data file is ordered by
staffNo, the data file record with staffNo value 33935 must lie between the data
file records with values 25448 and 35054. As both these data file records appear at the
start of a block (that is why they are in the index), the required data file record must be
in the block that starts with the data file record with staffNo 25448.
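The lookup just described can be written as a short Python sketch. The index is modelled as a sorted list of pairs, which is an assumption made for illustration; the three index records shown are the ones quoted in the text, and a full index would continue with one record for each remaining block. The function name block_for is hypothetical.

```python
from bisect import bisect_right

# One index record per data file block:
# (staffNo of the first record in the block, block number).
primary_index = [(10780, 1), (25448, 2), (35054, 3)]


def block_for(staff_no, index):
    """Return the number of the block that would hold staff_no, or None."""
    first_keys = [key for key, _ in index]
    # Position of the last index record whose key is <= staff_no.
    pos = bisect_right(first_keys, staff_no) - 1
    if pos < 0:
        return None   # smaller than every value in the data file
    return index[pos][1]


print(block_for(33935, primary_index))   # the block that starts with 25448
```

The value 33935 does not appear in the index, but the search still identifies the one block that needs to be read.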
Searching using the primary index will generally be much faster than searching the
ordered file itself. The index file always has fewer blocks than the data file, so the stan-
dard search procedure for ordered files will be faster on the index file. There are fewer
blocks for two reasons. First, the index file has fewer records – only one index record
for each data file block – and, second, each index record is smaller than the data file
record, so there are more index records per block (this is not shown in Figure 10.8).
However, inserting or deleting rows and updating the value of the column that
corresponds to the ordering field are all slower as both the data file and the index file
must be modified in each case.
The ordering field of an ordered file does not have to be a candidate key on the
table. In this situation, several rows of the data file will have the same value in the or-
dering field. Figure 10.9 shows the data for the Staff table stored in an ordered file,
but this time the ordering field is the department. There can be many rows with the
same value in this ordering field – there are six rows with the value ‘Computing’, for
example. Rows with the same value in the ordering field appear in a contiguous clus-
ter of records within the data file. There are five such clusters in Figure 10.9, one for
each distinct value of department. Note that some clusters start in one block, but
stretch into another. In practice, one cluster could span several blocks in the data file.
This data storage structure makes it very easy to find a particular cluster of records –
find the first record with the required value in the ordering field, then simply read
through the file until a record with a different value is encountered. Storing the
Staff table as an ordered file with the department as the ordering field makes it
very easy to find all staff in a particular department. If this query is run frequently,
then this data storage structure makes sense.
When many rows can have the same value in the ordering field, a different ap-
proach to creating an index is needed. Rather than have the index records point to the
start of a block in the data file, it makes more sense to have the index records point to
the start of a cluster. Such an index is called a clustering index.
In practice, the index records of a clustering index point to the data file block where
the first record of the cluster occurs. This means that two index records may point to
the same block in the data file. For example, the first and second index records in Fig-
ure 10.9 point to block 1. Also, some blocks will not have an index record pointing to
them – block 5 does not have any clusters starting in it.
In a clustering index, there is one index record for each cluster. This means that
there is exactly one index record for each distinct value of the data file’s ordering field.
This is not true of a primary index, where some values of the data file’s ordering field
do not appear in the index.
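A clustering index can be built from an ordered file in a single pass, as this Python sketch suggests. The data is hypothetical and is not a copy of Figure 10.9, though it has the same general shape: five clusters, six 'Computing' rows, and two clusters that start in block 1. The function name build_clustering_index is also hypothetical.

```python
def build_clustering_index(blocks):
    """Map each distinct ordering-field value to the block its cluster starts in."""
    index = {}
    for block_no, block in enumerate(blocks, start=1):
        for value in block:
            # Only the first occurrence of a value is recorded.
            index.setdefault(value, block_no)
    return index


# Hypothetical ordered file: each block lists the department of each record.
department_blocks = [
    ["Art", "Art", "Computing", "Computing"],
    ["Computing", "Computing", "Computing", "Computing"],
    ["Health", "Health", "Law", "Law"],
    ["Law", "Music", "Music", "Music"],
    ["Music", "Music"],
]

clustering_index = build_clustering_index(department_blocks)
print(clustering_index)
```

There is exactly one index record for each distinct department, two index records point to block 1, and no index record points to blocks 2 or 5 because no cluster starts in them.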
A clustering index provides similar advantages to those of a primary index and
suffers from similar problems when rows are inserted, deleted or have the data in the
ordering field changed. In the example given in Figure 10.9, some blocks of the data
file include records from two different clusters. A refinement of this data storage struc-
ture only allows records from the same cluster in any one block. This means that there
is usually space available to add a new record to a cluster, but does mean that the data
file is larger than necessary.
The terms ‘cluster’ and ‘clustering’ are also used to mean a cluster of databases
rather than a cluster of records. This is an advanced database technology offered by
some DBMS manufacturers, but is beyond the scope of an introductory textbook.
A table stored in an ordered file can have either a primary index or a clustering index,
but not both. Which of these two it has depends on the ordering field chosen. If the
ordering field is a candidate key, then the table will have a primary index, but, if the
ordering field is not a candidate key, then the table will have a clustering index.
There is another kind of index called a secondary index that can be defined on any
column except a column used as an ordering field. Apart from this restriction, the table’s
data storage structure doesn’t limit the choice of secondary indexes – tables stored as
heap, ordered and hashed files can all have secondary indexes. The index file is itself an
ordered file, so the fast search process for ordered files can be used to quickly search the
index file.
Figure 10.10 shows a secondary index defined on the email column of the Staff
table (the pointers are shown as arrows, but only for block 1 of the index file as includ-
ing arrows for all the pointers makes the diagram difficult to read). The table’s data
storage structure is a heap file (unordered), so the email column is definitely not used
as an ordering field for the data file. Inspection of Figure 10.10 should show that the
value of the email field does not affect the position of a record within the data file.
There is one index record for each data file record (each row of the table).
As usual, the index record has two fields. One field holds a copy of the value of the
data file record’s index field (in Figure 10.10, this is the e-mail address). The other
field holds a pointer to the data file. In Figure 10.10, the index pointer points to the
block in the data file where the record occurs. It’s also possible to have the index
pointer point at the actual record within the block. Notice that in Figure 10.10 there
are more index records in a block than there are data file records. This will typically
be the case, making the index a much smaller file and so quicker to search. Even so,
the file for a secondary index is always larger than that for a primary index, because
there is one index record for every data file record rather than one index record for
each data file block.
In Figure 10.10, the underlying data storage structure is a heap file. The secondary
index would be basically the same if it were an ordered file with some other column
acting as the ordering field, or a hash file with some other column providing the hash-
ing value.
A secondary index can be created for as many columns (or combinations of columns)
as the database designer desires. This includes a non-candidate key column. In this
case, there are different ways to structure the index file to deal with repeated values in
the indexed field. The simplest is to stick with one index record for each data file
record, which means that some index records have the same value but a different
pointer. Secondary indexes speed up data retrieval, but, like all indexes, they must be
maintained, so have the effect of slowing down insert, delete and update operations.
The main reason for defining any sort of index is to improve the performance of
database queries. The index can do this in two ways. First, it speeds up select state-
ments that use the indexed column in the where clause. For example, searching the
Staff table for the member of staff with a particular e-mail address will be quicker if
there is an index on the email column than if the DBMS must search the underlying
data file. Similarly, indexing a foreign key will speed up a table join.
Second, some queries can be evaluated simply by looking at the index, with no need
to read any records from the data file. For example, to find out if a particular e-mail
address has already been taken, the DBMS only needs to search the index file.
If indexing a particular column (or combination of columns) will speed up several
database queries, then there is a strong case for creating the index.
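Both uses of a secondary index can be seen in a small Python model. Everything in it is an assumption made for illustration: the data file is a dictionary of blocks, the index is a sorted list with one entry per record, and the e-mail addresses and function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical heap file: block number -> records of (staffNo, email).
data_file = {
    1: [(10780, "user-d@example.org"), (25447, "user-b@example.org")],
    2: [(33935, "user-e@example.org"), (89987, "user-a@example.org")],
    3: [(77712, "user-c@example.org")],
}

# Dense secondary index: one (email, block number) entry per record,
# held in e-mail order.
email_index = sorted(
    (email, block_no)
    for block_no, records in data_file.items()
    for _, email in records
)
index_keys = [email for email, _ in email_index]


def email_taken(email):
    """Answer from the index alone; no data file block is read."""
    pos = bisect_left(index_keys, email)
    return pos < len(index_keys) and index_keys[pos] == email


def find_by_email(email):
    """Use the index to pick the block, then scan only that block."""
    pos = bisect_left(index_keys, email)
    if pos == len(index_keys) or index_keys[pos] != email:
        return None
    block_no = email_index[pos][1]
    for record in data_file[block_no]:
        if record[1] == email:
            return record
    return None


print(email_taken("user-a@example.org"))
print(find_by_email("user-a@example.org"))
```

The first function never touches the data file. The second reads a single block, however many blocks the heap file contains.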
Figure 10.10 A secondary index for a column in a table that has a heap file as its underlying data storage structure.
The main reason for not creating an index is that it will usually slow down data
modifications as changes to the table’s underlying data file usually entail changes to
the index file as well. Insert statements will normally be slower.
The effect of an index on update and delete statements is more complex. The in-
creased search speed will mean that the affected records are found faster than they
would be otherwise, but the need to maintain the index can mean that the actual data
modification is slower.
It is important to consider the cost of maintaining the index, too, and weigh this
against the advantages of faster database searches. Sometimes the most effective way
to do this is to create the table with and without indexes, add a good helping of sam-
ple data and try the queries out.
Most DBMS will automatically define an index on the primary key of a table. There
are two reasons for this. First, the primary key is chosen as the main way to uniquely
identify a row. This suggests that most select statements using the table will use the pri-
mary key in their where clause, so it makes sense to index it. Second, the DBMS must
enforce the primary key constraint. This means checking that the primary key value on
a new row is distinct from the primary key value on all existing rows. It’s clearly going
to be easier to do this if there is an index. In fact, this is one circumstance where an
index can actually speed up insert statements, too. For a DBMS using SQL as its data
language, columns defined as UNIQUE would also benefit from being indexed, for the
same reason.
The unique constraint is used to define candidate keys (see above). If the table’s un-
derlying data storage structure is an ordered file and the primary key columns (or
UNIQUE columns) correspond to the ordering fields, then the DBMS could define a
primary index for the column. Otherwise, it must use a secondary index.
In most other circumstances, the database designer or DBA will instruct the DBMS
when to create an index. Although not part of the ISO SQL standard, most DBMS
implement the SQL DDL create index statement. The SQL DDL create index statement
will create a secondary index on the specified columns. It cannot normally be used to
define primary or clustering indexes, though, as they are tied to the structure of the
underlying data file (but check the documentation of your DBMS).
Code listing 10.4.2.1 shows the typical format of this statement. Code listing
10.4.2.1 (a) shows how to create a secondary index on the Staff table’s email col-
umn. Code listing 10.4.2.1 (b) creates a secondary index on the combination of
columns lName and fName. Within the index file, index records will include a value
for both columns and a pointer to the appropriate block (or record) in the data file. The
index records will be ordered first by the value of lName and then by the value of
fName, which is how names are sorted in most telephone directories.
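For readers who want to experiment, the two kinds of index described here can be tried out with the SQLite library that ships with Python. This is a sketch under stated assumptions and does not reproduce Code listing 10.4.2.1: SQLite stands in for the DBMS, its internal index structures differ from the file structures described in this chapter, and the index names idxStaffEmail and idxStaffName and the column data types are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (
        staffNo CHAR(5)     NOT NULL PRIMARY KEY,
        fName   VARCHAR(20) NOT NULL,
        lName   VARCHAR(20) NOT NULL,
        email   VARCHAR(50) NOT NULL
    );
    -- (a) an index on a single column
    CREATE INDEX idxStaffEmail ON Staff (email);
    -- (b) an index on a combination of columns: lName first, then fName
    CREATE INDEX idxStaffName ON Staff (lName, fName);
""")

# Ask SQLite how it would evaluate a search on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT staffNo FROM Staff WHERE email = ?",
    ("someone@example.org",),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

# List the indexes SQLite now holds for the table.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('Staff')")]
print(index_names)
```

The query plan names the index on email, showing that the search uses it. The list of indexes also includes one that SQLite created automatically for the primary key.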
The create index statement can include the keyword UNIQUE, as in CREATE UNIQUE
INDEX, and this has the same effect as adding a unique constraint and indexing the
column all in one go. For preference, use the unique constraint rather than the unique
index as most developers would look in the data dictionary for the constraint rather
than the index. Also, the DBMS may be able to use a primary index on the columns
with the unique constraint – the CREATE UNIQUE INDEX statement will create a
secondary index.
As mentioned above, the create index statement is not normally used to create
clustering indexes as not all DBMS support clustering indexes. One that does is the
Oracle database. Creating a clustering index for a table in Oracle involves three steps.
First, use the Oracle-specific create cluster statement to inform the Oracle database
that a clustering index is about to be used. This reserves space for the ordered file on
disk. Next, use the SQL DDL create table statement to create the table, but include a
cluster clause to indicate how to cluster data in the ordered file just created. Finally,
create an index on the cluster.
Code listing 10.4.2.2 shows how to implement the clustering index shown in Fig-
ure 10.9. The create cluster statement must indicate the data type of the column with
the clustering index and this must match the data type of the associated column in
the create table statement (the name of the column in the create cluster statement is
not significant). In the create table statement, the cluster clause comes after the list of
columns and constraints. It names the cluster being used and indicates which col-
umn in the table is used for the clustering index. Finally, the clustering index itself is
created.
This is particularly useful where there is a foreign key linking the tables and one of the
tables is derived from a weak entity. For example, the Staff and SupportSession
tables fall into this category. Data from SupportSession is not much use without
data from Staff. As SupportSession data will usually be retrieved at the same time
as the associated Staff data, it might make sense to store the data in the same file.
Figure 10.11 shows a sample data file. Notice how it includes records for both
Staff and SupportSession data. The clever bit is that support sessions offered by
a particular member of staff are stored immediately after their Staff record. For
example, in Block 2 of the data file, the Staff record for Freya Stark is followed by
her single SupportSession record. Selma Hutchins’ Staff record is followed by
her three SupportSession records, with one in Block 2 and two more in Block 3.
The associated clustering index is also shown, with a single index record for each
distinct value of staffNo and a block pointer pointing to the block each cluster
starts in.
Code listing 10.4.2.3 shows the SQL DDL to instruct the Oracle database to create
this data storage structure.
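As a minimal sketch of the kind of DDL involved, assuming Oracle syntax, the statements below create a cluster, place the Staff and SupportSession tables in it and then build the clustering index. The cluster name, index name, constraint names, column names and data types are all illustrative assumptions rather than the book's own listing.

```sql
-- Sketch only: StaffCluster, idxStaffCluster and the column definitions
-- are hypothetical names and types chosen for illustration.

-- 1. Create the cluster, stating the data type of the clustering column.
CREATE CLUSTER StaffCluster (staffNo VARCHAR2(5));

-- 2. Create both tables in the cluster. The CLUSTER clause follows the
--    list of columns and constraints and names the clustering column.
CREATE TABLE Staff (
  staffNo VARCHAR2(5)  NOT NULL,
  fName   VARCHAR2(20) NOT NULL,
  lName   VARCHAR2(20) NOT NULL,
  phone   VARCHAR2(4),
  email   VARCHAR2(50),
  CONSTRAINT priStaff PRIMARY KEY (staffNo)
)
CLUSTER StaffCluster (staffNo);

CREATE TABLE SupportSession (
  staffNo   VARCHAR2(5) NOT NULL,
  dayOfWeek VARCHAR2(9) NOT NULL,
  startTime DATE        NOT NULL,  -- Oracle has no TIME type; only the time part is used
  endTime   DATE        NOT NULL,
  CONSTRAINT priSupportSession
    PRIMARY KEY (staffNo, dayOfWeek, startTime),
  CONSTRAINT frnSupportSessionStaff
    FOREIGN KEY (staffNo) REFERENCES Staff (staffNo)
)
CLUSTER StaffCluster (staffNo);

-- 3. Create the clustering index. Oracle requires this index to exist
--    before any rows are inserted into the clustered tables.
CREATE INDEX idxStaffCluster ON CLUSTER StaffCluster;
```

The data type given in the create cluster statement (VARCHAR2(5) here) matches the staffNo column in both create table statements, as the cluster definition requires.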
The MySQL DBMS documentation calls data storage structures storage engines,
emphasizing that they are about data storage (internal view) rather than data organi-
zation (logical and external views), so this is the term adopted here. Understanding
the basics of data files and indexes is a pre-requisite for understanding storage engines,
but the discussion above is only an introduction to this topic.
Choosing the correct storage engine for a table can significantly improve perfor-
mance as each storage engine is optimized for a particular role. For example, the
MyISAM storage engine of the MySQL DBMS is optimized for data retrieval. It uses
ordered data files and indexes to achieve this. The MyISAM storage engine does not
support commit and rollback functionality – there is no ‘undo’ facility for modifica-
tions to the table instance. Nor does the MySQL DBMS remember, or enforce, foreign
key constraints on a MyISAM table (though it will happily check the SQL DDL state-
ments that define them and point out your syntax errors). If foreign keys or commit
and rollback support are required, one of the other storage engines should be used,
such as the InnoDB storage engine.
When deciding on which storage engine to use, the DBA and database designer
should work together to match the storage engine features to the requirements of the
application.
All DBMS have a default storage engine, though this can usually be changed by the
DBA. A standard ISO SQL create table statement will use the default storage engine as
the underlying data storage structure for a table. In versions of the MySQL DBMS
before 5.5, the default storage engine is MyISAM; from version 5.5 onwards, it is
InnoDB. Each DBMS will provide its own way of specifying which storage engine to use.
Code listing 10.4.3.1 shows how to create the Room table using the InnoDB storage
engine on the MySQL DBMS.
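A minimal sketch of such a statement, assuming MySQL syntax, is shown below. The columns and constraint name are assumptions made for illustration; only the ENGINE table option at the end is the point of the example.

```sql
-- Sketch only: the Room columns and the constraint name are assumed.
CREATE TABLE Room (
  building VARCHAR(10) NOT NULL,
  roomNo   VARCHAR(4)  NOT NULL,
  capacity INTEGER     NOT NULL,
  CONSTRAINT priRoom PRIMARY KEY (building, roomNo)
) ENGINE = InnoDB;   -- MySQL-specific table option naming the storage engine

-- The storage engine of an existing table can be inspected with:
SHOW TABLE STATUS LIKE 'Room';
```

Omitting the ENGINE option leaves the choice to the server's default storage engine.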
10.5 Security
The DBMS has its own security procedures and these are usually a lot more sophis-
ticated than a simple login screen. From the database perspective, there are two main
security issues. These are deciding who is allowed to connect to the DBMS and what
they are allowed to do once they have connected.
Deciding who is allowed to connect to the database is done by means of the
processes of identification and authentication. Identification states who you are and
authentication proves it. Typically, users identify themselves by their usernames and
authenticate their usernames with passwords.
The process of deciding what users are allowed to do once connected is called
authorization. Each time users request a particular action, such as query a table or cre-
ate an index, the DBMS checks whether or not they are authorized to make the
request. If they are, it complies, but, if not, it rejects the request with a suitable error
message.
The ISO SQL standard does not specify how a DBMS should identify and authenti-
cate its users. It simply assumes that this is done. Once users have successfully identi-
fied themselves and connected to the DBMS, unique authorization identifiers are
associated with their connections. In ISO SQL, the CURRENT_USER value function
returns authorization identifiers for users on their current connection. Many DBMS
include the SQL DDL create user statement to allow the DBA to assign authorization
identifiers, in the form of usernames, to end users.
Code listing 10.5.1 (a) shows the basic create user statement. It begins with the key-
words CREATE USER, followed by a string literal representing the authorization
identifier (username) for a new user – in this case, the authorization identifier is
‘angela’. The keywords IDENTIFIED BY are optional and used to define the initial
password for this user, the string literal ‘89gleo06’. Some DBMS provide extensions on this
basic scheme, such as a password expiry date, while others eschew the create user
statement and use a graphical user interface to manage user accounts. A variation on
the ubiquitous DROP statement removes a user (Code listing 10.5.1 (b)).
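The two statements described above can be sketched as follows, assuming MySQL-style syntax. The create user statement is not part of ISO SQL, so the exact form varies between DBMS.

```sql
-- (a) Create a user with the authorization identifier 'angela'
--     and an initial password (MySQL-style syntax, assumed here).
CREATE USER 'angela' IDENTIFIED BY '89gleo06';

-- Once connected, a user can check the authorization identifier
-- associated with the current connection.
SELECT CURRENT_USER;

-- (b) Remove the user again.
DROP USER 'angela';
```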
The ISO SQL standard uses the term ‘authorization identifier’ instead of ‘username’
to allow for more sophisticated identification and authentication schemes than the
traditional username and password. For example, users might identify themselves to
the system by means of a biometric smart card and authenticate the smart card infor-
mation with an iris scan or thumb print.
The process of identification and authentication provides a first line of defence for the
database by controlling who can connect to the DBMS. A DBMS can manage many
databases, each of which can have many tables. Most users only need access to a por-
tion of this data, hence the need for the authorization process.
The ISO SQL privilege system offers a relatively straightforward way for the DBA to
limit what users can do once they’ve connected to the DBMS. First, users have full con-
trol of any databases (SQL schema) or database objects (tables, views and so on) that
they have created. Each is called the owner of the database or database object. Once a
database has been created the owner is the only user who may create tables, views and so
on within the database. The ISO SQL standard is quite clear about this – only the data-
base owner can use SQL DDL statements to create or alter objects within the database.
This restriction radically simplifies the problem of defining who can do what.
Authorization of (almost) all data definition actions is now dealt with. However, it is
a little too radical a simplification for most DBMS, which, instead, extend the privilege
system to allow many users to create objects within the same database.
Authorizing other actions on the database – in particular, data manipulation
actions – is done via the privilege system. Users who don’t own a particular database
object can’t do anything with it until the object’s owner grants them suitable privi-
leges. So, for example, no user can query the Staff table in the Staff Directory data-
base until the owner grants them the privilege to run a select statement on this table.
A user who does not have this privilege but tries to run a select statement with the
Staff table in the from clause will be told that there is no such table – users who
don’t have the privilege to access a table cannot even know that it exists.
The ISO SQL standard specifies four data manipulation privileges:
● select privilege
● insert privilege
● update privilege
● delete privilege.
These allow users to use the corresponding SQL DML statement. There are also two privileges
related to data definition that need to be mentioned. The trigger privilege allows users
to create a trigger on a specified table. The references privilege allows users to add a
foreign key or check constraint to a table they own that references a table they don’t
own. An example of this situation was included in Chapter 6 in the section discussing
subqueries and the alter table statement is repeated in Code listing 10.5.2. The state-
ment adds a check constraint to the Web Timetable database’s Equipment table. The
check condition refers to the Asset Tracking database’s AssetType table. The owner
of the Asset Tracking database must grant the references privilege on the AssetType
table to the owner of the Web Timetable database or the alter table statement will
not work.
Code listing 10.5.2 An alter table statement that requires the references
privilege
ALTER TABLE Equipment
  ADD CONSTRAINT chkEquipmentAssetType
    CHECK (assetType IN (SELECT type
                         FROM AssetTracking.AssetType));
Table 10.2 lists these six privileges, the associated SQL keyword and states whether
the privilege can apply to tables and columns or just tables. For example, it is possible
to give a user the select privilege on only some columns in a table. This provides an
alternative to the view mechanism for hiding information from users.
The owner of a database object grants other users privileges to use that object with
the SQL grant statement. Code listing 10.5.3 shows a typical grant statement, giving
two users – ‘angela’ and ‘peter’ – select and update privileges on the two tables of the
Staff Directory database. It begins with the keyword GRANT, followed by a
comma-separated list of the privileges being granted. The keyword ON is followed by a
comma-separated list of the database objects affected. Then, the keyword TO is fol-
lowed by a comma-separated list of the users being granted these privileges. The list of
users can be replaced by the keyword PUBLIC to grant the listed privileges on the
listed database objects to all DBMS users. The select, update, insert and references
privileges can be restricted to individual columns (see Code listing 10.5.6, below, for an
example).
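A grant statement of the kind just described might look like the sketch below. It is written as one statement per table because not every DBMS accepts a list of objects after ON, and the PUBLIC form is ISO-style syntax that some DBMS (MySQL, for example) do not support.

```sql
-- Sketch only: give angela and peter select and update privileges
-- on both tables of the Staff Directory database.
GRANT SELECT, UPDATE ON Staff TO angela, peter;
GRANT SELECT, UPDATE ON SupportSession TO angela, peter;

-- Grant a privilege to every user of the DBMS.
GRANT SELECT ON Staff TO PUBLIC;
```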
Code listing 10.5.4 Granting a user the right to pass on her privileges
GRANT SELECT
ON Staff
TO angela
WITH GRANT OPTION;
To remove privileges from a user, use the revoke statement. Code listing 10.5.5 (a)
shows how to remove the update privilege on the Staff table from user Peter.
Code listing 10.5.5 (b) shows how to remove the right of user Angela to pass on her
select privilege on the Staff table. Angela still has the select privilege on Staff
herself, but can no longer pass it on to others.
A problem arises concerning what to do about users who were previously granted
their privileges by Angela. Do they retain them or should their privileges also be
revoked?
The ISO SQL standard does define this behaviour, but it is complicated and not all
DBMS follow the standard. Read the manual for the particular DBMS you are using
carefully before setting out to revoke privileges and always think very carefully before
allowing anyone other than the DBA to grant privileges to others.
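The two revoke statements discussed above can be sketched as follows, assuming ISO-style syntax. In ISO SQL the keywords RESTRICT and CASCADE state what should happen to privileges that depend on the one being revoked; many DBMS make them optional or handle them differently, so treat this as an illustration rather than portable code.

```sql
-- (a) Remove the update privilege on Staff from peter.
REVOKE UPDATE ON Staff FROM peter;

-- (b) Remove only angela's right to pass on the select privilege.
--     RESTRICT asks the DBMS to refuse the request if other users
--     hold privileges that angela granted to them.
REVOKE GRANT OPTION FOR SELECT ON Staff FROM angela RESTRICT;

-- Alternative to (b): CASCADE also revokes the select privilege
-- from any users who received it from angela.
-- REVOKE GRANT OPTION FOR SELECT ON Staff FROM angela CASCADE;
```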
Two significant problems with granting privileges to individual users are that it can be
time-consuming and difficult to manage. The situation illustrated in Code listings 10.5.3
and 10.5.4 is fairly common – a standard set of privileges for all users and additional
privileges for some of them. Managing this situation is helped by introducing the notion
of a role.
A role is similar to a UML actor – it represents a particular way of using the system,
rather than a particular user. Some users will play several different roles and one role
can be played by many users. Adding roles to the security model adds complexity, but
does make it easier to manage users and privileges. The DBA creates roles to represent
different kinds of users and grants privileges to the role rather than one or more indi-
viduals. Then, the DBA tells the DBMS which users play which roles and they get all
the privileges of the roles they play.
Code listing 10.5.6 illustrates the steps involved. First, it creates two roles –
UnivMember and Academic. Then, it grants the select privilege on the tables of the
Staff Directory database to both roles. The second grant statement allows academics to
modify data on SupportSession. The third allows them to change the data in the fName,
lName and phone columns of the Staff table. The final grant statement grants
the roles UnivMember and Academic to two users. These users then have all the
privileges associated with those roles.
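The steps just described can be sketched as below, assuming ISO-style role syntax. The first grant is split into one statement per table for portability, the choice of insert, update and delete as the privileges for modifying SupportSession is an assumption, and the user names angela and peter are used purely for illustration.

```sql
-- Sketch only: privileges and user names are illustrative assumptions.

-- Create the two roles.
CREATE ROLE UnivMember;
CREATE ROLE Academic;

-- Both roles may query the tables of the Staff Directory database.
GRANT SELECT ON Staff TO UnivMember, Academic;
GRANT SELECT ON SupportSession TO UnivMember, Academic;

-- Academics may modify support session data.
GRANT INSERT, UPDATE, DELETE ON SupportSession TO Academic;

-- Academics may change only three columns of the Staff table.
GRANT UPDATE (fName, lName, phone) ON Staff TO Academic;

-- Finally, tell the DBMS which users play which roles.
GRANT UnivMember, Academic TO angela, peter;
```

Removing a user from a role later (for example, REVOKE Academic FROM peter) withdraws all the privileges that came with it in one step.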
Chapter summary
■ This chapter has discussed a wide range of issues related to the physical design of a
database. The issues either must or should be considered before beginning to im-
plement a database application.
■ Much of the chapter has focused on changes to the logical and external views of
the ANSI/SPARC architecture – that is, to tweaking the logical design to fit a partic-
ular DBMS. The section on data files, indexes and storage engines delved a little
deeper into the internal view of the ANSI/SPARC architecture and the underlying
data files. The chapter concluded by discussing the ISO SQL standard’s security
model.
Further reading
Most database textbooks cover physical database design. Kroenke (2006) has a couple
of good chapters and Connolly and Begg (2004) provide a nice summary of the
process.
The websites of DBMS suppliers provide useful material on physical design for their
particular products. It is only possible to give a brief introduction to data files and in-
dexing in an introductory textbook, so take a look at these websites. In particular,
there’s been no discussion of how best to implement the index file itself. Needless to
say, a simple ordered file is not the best way to store an index file.
Elmasri and Navathe (2007) and Silberschatz et al. (2002) provide a more technical dis-
cussion of these issues. Silberschatz et al. (2002) also have interesting case studies of
three of the major commercial DBMS – the Oracle database, IBM DB2 Universal Data-
base and Microsoft SQL Server. Troels Arvin (2006) is developing a comparison of the
facilities provided by the ISO SQL:2003 standard and various commercial DBMS and
publishing it as a web page.
Review questions
10.1 List the data types of the ISO SQL:2003 standard, including its distinct types,
together with a brief summary of how they should be used.
10.2 Outline the two main approaches to implementing columns that hold derived data.
10.4 What is a surrogate key and how does it differ from a logical primary key? When
would you use one?
10.5 Outline the usual approach to storing binary data in a web database application.
10.8 Describe the record structure of each of the following types of data file:
(a) heap
(b) ordered
(c) hash.
Exercises
10.11 Code listing 10.1.4.2 documents the physical design for the Room and
Equipment tables. Correct the following weaknesses in this design.
(a) When a new row is added to the Equipment table, the equipment is stored in,
at most, one room. This means that the value of the column Room.
equipmentValue will change in, at most, one row. Replace the statement-level
after insert trigger taiEquipment with a row-level after insert trigger, so that,
when new rows are added to the Equipment table, only the corresponding row
in the Room table has the value of the column equipmentValue recalculated.
(b) Write a new trigger to maintain the value of the column Room.equipmentValue
when the cost column in a row of the Equipment table is updated.
(c) Write a new trigger to maintain the value of the column Room.equipmentValue
when a row of the Equipment table is deleted.
10.12 Estimate the storage requirements for the Room and Equipment tables in Code list-
ing 10.1.4.2. You will need to make estimates of the maximum and average number
of rows for each table. Note that all teaching rooms have an OHP, a PC and a digital
projector.
provide a list of all teaching sessions showing the room, the room’s capacity
and the number of students attending that teaching session.
The idea is to check that each room has sufficient capacity for each teaching ses-
sion timetabled to take place in it.
(a) Draw an access path and calculate the transaction workload based on the
physical table schema below and the following assumptions:
• there are 4000 teaching sessions
• there are, on average, 25 students per teaching session.
-- *********************************************************
-- Create the base tables with their primary key constraints
-- *********************************************************
CREATE TABLE TeachingSession(
  moduleCode   CHAR(7)         NOT NULL,
  staffNo      VARCHAR(5)      NOT NULL,
  building     VARCHAR(10)     NOT NULL,
  roomNo       VARCHAR(4)      NOT NULL,
  day          VARCHAR(9)      NOT NULL,
  startTime    TIME            NOT NULL,
  duration     INTERVAL MINUTE NOT NULL,
  endTime      TIME            NOT NULL,
  startDate    DATE            NOT NULL,
  endDate      DATE            NULL,
  runsFor      INTERVAL DAY    NULL,
  type         VARCHAR(9)      NOT NULL,
  sessionCode  CHAR(1)         NULL,
  CONSTRAINT priModule PRIMARY KEY
    (moduleCode, building, roomNo, day, startTime)
);
-- *********************************************************
-- Add foreign keys
-- *********************************************************
ALTER TABLE TeachingSession
  ADD CONSTRAINT frnTeachingSessionRoom
    FOREIGN KEY (building, roomNo)
    REFERENCES Room (building, room);
(b) Amend the physical table schema for the TeachingSession table to include a
noOfStudents column, holding derived data. Repeat the transaction path
analysis for the database query.
(c) Amend the physical table schema for the TeachingSession table, replacing
the composite primary keys with surrogate keys. What effect does this have on
your transaction path analysis?
10.14 The table instances in Exercise 10.13 rarely change after the start of the academic
year. Given this fact, answer the following questions.
(a) Outline the benefits and drawbacks of using surrogate keys and derived data on
the TeachingSession table.
(b) Explain which kind of data file (heap, ordered or hashed) you would recommend
for the four tables. If an ordered file, what would you choose as the ordering fields?
(c) What secondary indexes would you recommend to improve the performance of
the query examined in Exercise 10.13? Why?
(d) What other factors need to be considered before deciding between the different
physical table schema for the TeachingSession table? Which design would
you recommend, and why?
10.15 Using Code listing 10.3.4.1 as a guide, implement two versions of the Module table
– one with the single code column to hold a structured code and one with struc-
tured code implemented as four separate columns (its constituent parts).
(a) Write SQL select statements for both tables that list all ‘F’ (Foundation) level
modules, all modules offered by the computing department (code is ‘C’) and all
modules offered by the information systems subject area (subject area code ‘I’).
(b) For each table, write a PHP script that allows end users to search for modules
by department, subject area and level.
(c) For each table, write a PHP script that allows end users to add a new module or
modify an existing one, including modifying the module code.
(d) Which table is easiest to use? Why?
10.16 Examine your solutions to Exercises 9.15 and 9.16. Would it be appropriate to
denormalize the database schema? Explain your reasoning.
Investigations
10.17 Choose one or more of the physical database design issues discussed in this chap-
ter and investigate the advice given in the manuals or technical support documen-
tation for two commercial DBMS. What facilities does each DBMS provide to deal
with the issue? Write a short, technical report explaining your findings and evaluat-
ing the facilities provided by the two DBMS to deal with your chosen design issues.
10.18 The discussion of indexes in this chapter has focused on the underlying concepts
rather than the technicalities of implementation. One popular approach to imple-
menting indexes is the B-tree index. Using the resources suggested in the Further
reading section, investigate this approach. Write a short report explaining B-tree
indexes and discussing their strengths and weaknesses.
10.19 In Investigation 9.19, you developed a logical database design for an organization
you know. Develop a physical database design for your DBMS. Write a report
documenting your design decisions and the evidence that informed them.
10.20 Choose a particular DBMS and investigate its security mechanisms. What security
threats do these mechanisms guard against? What security threats are not dealt
with? How does the approach compare to that of the ISO SQL:2003 standard?
Chapter objectives
Chapter outline
Database design is only one part of the design for a web database application. In partic-
ular, it only considers those aspects of the application’s behaviour that directly affect
the structure of the database, such as referential actions and triggers that maintain re-
dundant data. It does not address the application’s behaviour in general, nor does it
replicate the user experience design.
There are good reasons for this ‘divide and conquer’ approach to application de-
sign. The different aspects require different skills, so different members of the design
team can work in parallel on these different aspects. This speeds up the development
process and can lead to all the designs being completed at around the same time. The
drawback is that, once all the designs have been completed, it’s necessary to check that
they all fit together.
On a well-managed project, the different design teams will seek to ensure that their
designs are compatible as they go along. Even so, it is always worth checking things
through before beginning to implement the design. Mapping the other designs on to
the final database design provides an effective means of comparing them. Mismatches
could indicate that one or other of the designs has missed or misinterpreted a require-
ment or that the users were not clear about them in the first place. If this is spotted,
then the design can be corrected before implementation.
In this book, the process of reconciling the different designs is called web data
design, as it shows how the database design supports the web application’s data
requirements. As yet, there is no standard name for this process.
Web data design works best when there is a design for both the database and the
application’s behaviour. Sometimes, though, the database design is completed before
any of the other design tasks have started. This is always true when a web database
application uses an existing database. In such a case, the web data design can be
based on the use cases. The designer scans the use cases to identify an initial set of
data-centric web pages – web pages that draw their content from, or allow users to
modify, the database instance. (This technique is also used to identify entities, attrib-
utes and relationships, as we saw in Chapter 8). Consider the Get staff contact details
use case, part of the users’ requirements for the Staff Directory (see Figure A.5, Appen-
dix A). It includes the sentences, ‘This allows any user to get the contact details
(name, phone and e-mail) for a named member of staff at the university’ and ‘The
user can see the full contact details for any entry in the list, along with a photo of the
member of staff if one is available, by selecting that entry.’ Together, these two sen-
tences clearly suggest that data about an individual member of staff is delivered to
users. The second sentence also suggests that the user interface includes a way to list
members of staff. Producing a model showing both the database and user interface
structures provides additional reassurance that the database design can support the
users’ requirements.
Web data design can be split into conceptual, logical and physical stages, just like
database design. The conceptual web data design models what data is delivered to, or
gathered from, end users and where in the database it comes from or goes to. It shows
the links between the database entities and those parts of the user interface that inter-
act directly with them. The design is documented as a UML class diagram (in effect, an
enhanced ER diagram), but should also be included in the application glossary (see
Chapter 7). At this stage, there is no Web-specific terminology and so the resulting
model could be used as a basis for developing a traditional GUI application. It’s this in-
dependence from the specific approach to developing a user interface that suggests the
name conceptual web data design.
The logical web data design focuses on the particular approach to implementing a
user interface taken by web technology and also on the chosen application architec-
ture. It’s logical by analogy with the logical database design, which focuses on a partic-
ular approach to data organization. In both cases, it is the approach, rather than
particular technologies, that is important. So, for example, decisions about which
browser technologies to support or which server-side scripting language to use are not
relevant at this stage.
As most web database applications follow a three-tier architecture, the logical web
data design needs to recognize the split between browser-side and server-side process-
ing. The browser side deals with the presentation of data-centric content to users.
This may be delivering data drawn from the database or gathering it from users to
store in the database. The server side deals with the actual communication with the
DBMS.
Physical web data design recognizes that each screen will be implemented using
web pages and these will use certain web technologies. Several screens may be used to
build a single web page. For example, the login screen may appear as part of every web
page on a website. Thus, physical web data design tries to ensure that the communica-
tion between the web pages and the database is as efficient as possible. It is focused
closely on data-centric issues and does not attempt to provide a full physical design for
the website.
An important issue in physical web data design is whether or not to define data-
base views, and stored functions, so that each web page queries a single database ob-
ject. This simplifies the implementation of the server-side processing on the web
server by shifting data-centric tasks firmly on to the database server. The performance
implications of the alternative approaches should be assessed before a decision is
made.
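As a minimal sketch of this idea, suppose one web page lists the support sessions offered by a member of staff. A view can do the join on the database server, so the server-side script queries a single object. The view name, column names and the staffNo value are assumptions for illustration.

```sql
-- Sketch only: StaffSupportSessions and its columns are hypothetical.
CREATE VIEW StaffSupportSessions AS
  SELECT s.staffNo, s.fName, s.lName,
         ss.dayOfWeek, ss.startTime, ss.endTime
  FROM Staff s
       INNER JOIN SupportSession ss ON ss.staffNo = s.staffNo;

-- The web page's query then involves one database object only.
SELECT fName, lName, dayOfWeek, startTime, endTime
FROM StaffSupportSessions
WHERE staffNo = '56673';
```

Whether this performs better or worse than issuing the join directly from the script depends on the DBMS, which is why the alternatives need to be assessed.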
Another issue for physical web data design is whether or not to implement paging
for database queries that will return many rows. Delivering thousands of results in a
single web page will not endear the application to users.
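One common way to implement paging is to ask the DBMS for one page of rows at a time. The sketch below assumes the LIMIT and OFFSET keywords accepted by MySQL and PostgreSQL; other DBMS provide their own syntax for the same task. The page size of 20 rows is an arbitrary choice.

```sql
-- Sketch only: fetch the third page of a staff list, 20 rows per page
-- (rows 41 to 60 of the ordered result).
SELECT staffNo, fName, lName
FROM Staff
ORDER BY lName, fName, staffNo   -- a fixed ordering keeps pages consistent
LIMIT 20 OFFSET 40;
```

Including staffNo at the end of the ordering breaks ties between staff with the same name, so a row cannot appear on two different pages.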
There is no single, widely used approach to designing web database applications, so there is
no single diagramming technique for web data design. Atzeni et al. (1999) suggest an
approach that adds symbols for web pages to the traditional ER diagram (the
same approach is used in Eaglestone and Ridley, 2001). The WebML initiative
(www.webml.org) is a similar attempt to extend ER diagramming. It uses its own spe-
cialized diagramming notation to allow conceptual design for both the database and
the website to be carried out in parallel (see Ceri et al., 2003). Conallen (2003) suggests
extensions to the UML designed to facilitate web application design in its entirety. All
three approaches recognize the need to model:
● what data is delivered by a web page
● which part of the database this data comes from
● navigation between data-centric web pages
● how the data is transported from the database to the web page.
The first three requirements are appropriate to the conceptual or logical stages of de-
sign. The fourth is closely related to the chosen implementation technologies, so is
more of a physical design issue. It’s important to model navigation because data-
centric behaviour may involve more than one web page. For example, in the Staff
Directory application, members of staff can update their support sessions. This entails
first listing the support sessions for a member of staff. After they have chosen one, its
details are displayed ready to be updated. This can involve two separate web pages. No-
tice that there is no consideration of the actual presentation of the web pages – only
their data-centric content is considered during web data design. Conallen’s work
provides the basis for the approach discussed in this chapter as UML is the modelling
language used in the rest of the book.
■ Section 11.1 introduces the notion of a screen and explains its use in conceptual
web data design. It discusses examples of screens to model web search pages, result
pages and data entry forms.
■ Section 11.2 examines the split between client-side and server-side behaviour.
■ Section 11.3 considers some issues that arise during physical design. In particular,
it introduces the notion of wrapper functions for database updates.
Conceptual web data design expands the traditional ER diagram to include screens to
represent those parts of the user interface that deliver data to end users. A screen may
represent only a portion of a web page (or other GUI) that gathers data from, or deliv-
ers data to, end users. In both cases, the screen may manipulate the data in some way.
It is important to emphasize that, in web data design, screens model data-centric behav-
iour only.
Consider the Get staff contact details use case (see Figure A.5, Appendix A). As
mentioned above, the sentences ‘This allows any user to get the contact details (name,
phone and e-mail) for a named member of staff at the university’ and ‘The user can
see the full contact details . . . along with a photo of the member of staff if one is avail-
able . . .’ suggest the need for a screen to deliver the contact details for a single member
of staff. This screen is represented in the diagram as a class with the stereotype
«screen». Each screen has a set of attributes that represent its dynamic content – that
is, the data it gathers from or delivers to users. In Figure 11.1 (a), the conceptual design
for the StaffDetails screen has four attributes. Note that the data types of the screen’s
attributes are different from those used for an entity’s attributes. Rather than the data
types typically available in a DBMS, they are the data types available on web pages.
PCData is parsed character data – the data type of plain text content in an HTML doc-
ument. The photograph of a member of staff has data type image as HTML documents
treat images separately from character data. Other kinds of content, such as audio,
video and so on, could be given the catch-all HTML data type Object or a more explicit
data type, such as Video. The multiplicity constraint on the photo attribute indicates
that a photograph of the member of staff is optional. Static content is not taken from
the database, so is omitted from the screens of the conceptual web data design.
Using UML class diagrams is not the only possible approach to web data design. An-
other is to use HTML prototypes. These are better for modelling the static rather than
the dynamic content, but can be used to model both. Figure 11.1 (b) shows the
StaffDetails web page modelled using an HTML prototype. The image placeholder and
hyperlink in Figure 11.1 (b) are dynamic content, but also indicate positioning on the
final web page (not an issue covered by web data design). The labels ‘Name’, ‘Phone’
and ‘Email’ indicate positioning for the other dynamic content. If more information
about the dynamic data is needed, then the image is annotated, typically by hand.
Figure 11.1 Designs for dynamic and static aspects of the Staff Details
web page.
«screen»
StaffDetails
name: PCData
phone: PCData
email: PCData
photo: [0..1]: Image
(a) The conceptual web data design for the StaffDetails screen, showing dynamic
content.
(b) An HTML prototype design for the StaffDetails web page, showing static and
dynamic content.
Each screen either delivers data drawn from the database to end users or gathers data
from them to query or modify the database instance. This means that a screen is al-
ways associated with one or more database tables. Figure 11.2 shows the DisplayedOn
association between the StaffDetails screen and the Staff entity. Unlike associations
between two entities, this association can be navigated in one direction only – from
screen to entity. Clearly an occurrence of the screen needs access to data held by an oc-
currence of the entity as it is the entity occurrence that holds the official version of
these details. The screen merely delivers the data, it does not store it – screen data is
transient, not persistent. That is why an occurrence of the entity does not need access
to data delivered to end users by the screen. To indicate that the association is naviga-
ble in one direction only, an arrowhead is added to the association end by the Staff
entity.
Don’t mix up association navigability and the association name direction. In
Figure 11.2, the association is navigable from screen to entity, but the association name
is read in the other direction. Association navigability changes the meaning of the
diagram. Making the DisplayedOn association navigable in one direction only means that
the Staff entity cannot access the attributes or operations of the StaffDetails screen. The
association name direction does not alter the meaning, so renaming the association
Displays, say, would change the name direction but not the meaning.
Another common mistake is to assume that the arrow notation indicates the direction
that data flows between the two classes. In fact, the actual data flow in Figure 11.2
is in the opposite direction to the association navigation. The StaffDetails screen can
access the attributes of the Staff entity to display data about a member of staff. Thus, the
data must flow from the Staff entity to the StaffDetails screen – the opposite direction
to the association navigability. In fact, in general, data can flow in either direction
even if the association is navigable in one direction only. Suppose the StaffDetails
screen was supposed to add details for a new member of staff to the database. It would
still need to access the Staff entity, so the association would be navigable in the same
direction. However, data now flows from the StaffDetails screen to the Staff entity!
UML class diagrams are static models – the system is at rest, with no data flowing and
no messages being passed. A one-way navigable association indicates that one class
has access to the attributes and operations of the other class, but there is no such
accessibility in the other direction. It says nothing about data flow.
Association navigability is important when it comes to implementing the design.
The DisplayedOn association is navigable in one direction only, so there is no need to
implement a mechanism to give the Staff entity access to the StaffDetails screen. On
the other hand, the application developer must implement some mechanism to give
the StaffDetails screen access to the Staff entity. As the StaffDetails screen displays a
single member of staff, it must query the database using a candidate key.
A simple mechanism to implement the association is to ensure that the web page
implementing the StaffDetails screen has, in addition to the listed attributes, a copy of
the appropriate staffNo (or other candidate key for the Staff entity). This is not explicit
in the use case description of the behaviour, but is a reasonably obvious solution.
Just as foreign keys aren’t included in ER diagrams, the staff number is not included as
an attribute of the StaffDetails screen as it is an implementation mechanism, not a part
of the conceptual design.
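The mechanism just described can be sketched in a few lines. This is only an illustration, not the book's implementation: Python's standard sqlite3 module stands in for the PHP and DBMS a real application would use, the Staff table is cut down to a few columns (the optional photo is left out), and the function name staff_details and the sample row are hypothetical.

```python
import sqlite3

# Hypothetical, cut-down Staff table; column names follow the chapter's Staff entity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, lName TEXT, "
    "phone TEXT, email TEXT UNIQUE)"
)
conn.execute(
    "INSERT INTO Staff VALUES ('10780', 'John', 'Smith', '5555', 'j.smith@example.org')"
)


def staff_details(conn, staff_no):
    """Return the data for one StaffDetails screen, or None if there is no match."""
    # The page holds a copy of staffNo and uses it to find exactly one Staff row.
    row = conn.execute(
        "SELECT fName, lName, phone, email FROM Staff WHERE staffNo = ?",
        (staff_no,),
    ).fetchone()
    if row is None:
        return None
    f_name, l_name, phone, email = row
    return {"name": f"{f_name} {l_name}", "phone": phone, "email": email}


details = staff_details(conn, "10780")
```

The staff number is a parameter of the query rather than one of the values returned, which mirrors the point that it is an implementation mechanism and not an attribute of the screen.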
The multiplicity constraints on the DisplayedOn association provide additional
information. Each occurrence of the StaffDetails screen displays a single occurrence of the
Staff entity. A particular occurrence of the Staff entity may be displayed on zero or
many occurrences of the StaffDetails screen – if no one asks the application for a
person’s details, then they will never be displayed, but if many users ask for those
details, they will be displayed on many different occurrences of the screen.
Information about screens can be described using text rather than UML diagrams.
Simply adapt the application glossary to hold information on screens and associations
between screens and entities. Text-based approaches to web data design work well with
prototyping, as it is awkward to add this information to a prototype web page. HTML
comments could be used, but it is difficult to gather the information together.
The conceptual web data design will include a screen for each data-centric behaviour,
just as the ER diagram includes an entity for each data organization structure.
Typically, a use case mentions several related pieces of data-centric behaviour. For
example, the Get staff contact details use case includes the following information:
● ‘The user provides the staff member’s surname and first initial and requests his or
her contact details’ – implying a screen that gathers data from users
● ‘A list showing the name and phone number of each member of staff matching the
search criteria is delivered, together with a count of the number of matches from
the Staff Directory’ – implying a screen that lists several members of staff.
In fact, these two pieces of behaviour are closely related. The first – gathering data
from users – provides the search criteria for the second – listing staff who match these
criteria. Thus, although this situation could be modelled as two separate screens, it
makes sense to recognize that this is a single piece of data-centric behaviour consisting
of two related actions. Modelling these as a single screen leaves open the option of
splitting them at a later stage.
It is also possible to implement all the screens from the Get staff contact details use
case as a single web page. The web page would consist of an HTML form to gather the
search criteria, a section to list matches and a section to display details for a chosen
match. The conceptual web data design does not limit the implementation options.
The Get staff contact details use case states that ‘A list showing the name and phone
number of each member of staff matching the search criteria is delivered, together with
a count of the number of matches from the Staff Directory.’ The screen for this behaviour
clearly lists data from several occurrences of the Staff entity – a group of repeated data.
One simple approach to modelling a group of repeated data is to use attribute
multiplicity. Figure 11.3 shows the StaffList screen modelled using this approach. The
attributes name and phone both have a multiplicity of [0..*], indicating that they are
part of the repeating group of data. The single-valued attribute numMatches tells users
how many members of staff match their search criteria. Notice that its data type is
PCData, rather than Numeric, as there is no way to distinguish between character and
numeric data on a web page.
This approach is simple to use, but has its drawbacks. There is no easy way to insist
that each attribute repeats the same number of times, for example. An entity constraint
attached as a note would do this, but seems a little inelegant. Also, if there were
two or more different repeating groups on the same screen, more notes would be
required.
The alternative, more sophisticated, approach to repeating groups is to include two
screens. Figure 11.4 (a) shows how this is done. The original StaffList screen retains
the single-valued attributes, while the attributes of the repeating group are extracted
into a subscreen, StaffEntry. The attributes of StaffEntry are single-valued because each
occurrence of StaffEntry represents a single entry on the list of staff members matching
the search criteria. The composition association joining StaffList with StaffEntry indicates
that the StaffEntry screen is really a part of the StaffList screen and does not have
an independent existence. The multiplicity on this composition association says that
the StaffList screen may contain zero or more StaffEntry on its list of staff members.
This is what makes it a repeating group. The DrawnFrom association is still between the
StaffList screen and the Staff entity. Because StaffEntry is a part of StaffList, it has access
to the Staff entity via StaffList.
Figure 11.4 (b) shows an alternative way to model composition, which is by
drawing the subscreens inside the main screen. In this case, an additional subscreen,
InfoSection, is used to hold the non-repeating attributes. Also, each subscreen is given
a multiplicity constraint in the name compartment. Thus, for each occurrence of
StaffList, there is exactly one occurrence of InfoSection and zero or many occurrences
of StaffEntry. This alternative to modelling composition is a little closer to the
prototyping approach to web data design.
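For readers who think more easily in code than in diagrams, the composition of StaffList and StaffEntry can be mirrored by a pair of small classes. This is a sketch only, written in Python for brevity. The class and attribute names follow the screens in the text, but deriving num_matches from the entries is an assumption made here; the chapter models numMatches as an ordinary attribute.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StaffEntry:
    """One entry in the repeating group: single-valued attributes only."""
    name: str
    phone: str


@dataclass
class StaffList:
    """Main screen: owns zero or more StaffEntry parts (composition)."""
    entries: List[StaffEntry] = field(default_factory=list)

    @property
    def num_matches(self) -> int:
        # Assumption: derived from the entries rather than stored separately.
        return len(self.entries)


staff_list = StaffList()
staff_list.entries.append(StaffEntry("John Smith", "5555"))
staff_list.entries.append(StaffEntry("Jane Smith", "5556"))
```

Because name and phone travel together inside each StaffEntry, they necessarily repeat the same number of times, which is the guarantee that the attribute multiplicity approach could only give by attaching a note.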
Screens such as StaffDetails and StaffList deliver data drawn from the database to
end users. Clearly this involves a database query. The association identifies the target
entity, but there is no explicit indication of the search criteria. When the multiplicity
at the entity end of the association is 0..1 or 1..1, it’s clear that the screen must identify
which data it wants using the entity’s candidate key (as discussed above). When
the upper bound on the multiplicity is greater than one, it clearly does not use a
candidate key. In this case, it is useful to indicate what the search criteria are. The
mechanism for this is the association qualifier. Figure 11.5 shows an association
qualifier at the StaffList end of the DrawnFrom association. The association qualifier has two
attributes – lName and initial. They represent the search criteria specified in the use
case ‘The user provides the staff member’s surname and first initial and requests his or
her contact details.’ The attributes of an association qualifier could be given a data
type, though this is not done in Figure 11.5.
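A qualified association of this kind corresponds to a query whose search condition uses the qualifier attributes rather than a candidate key. The following sketch, in Python with the standard sqlite3 module, assumes a cut-down Staff table and invented sample rows. The function name staff_search and the choice to match initial against the first letter of fName are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, lName TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?, ?, ?)",
    [
        ("1", "John", "Smith", "5555"),
        ("2", "Jane", "Smith", "5556"),
        ("3", "Paul", "Smith", None),
        ("4", "Joe", "Jones", "5557"),
    ],
)


def staff_search(conn, l_name, initial):
    """Find zero or more Staff rows for given values of the qualifier attributes."""
    rows = conn.execute(
        "SELECT fName || ' ' || lName, phone FROM Staff "
        "WHERE lName = ? AND substr(fName, 1, 1) = ? "
        "ORDER BY fName",
        (l_name, initial),
    ).fetchall()
    # numMatches is single-valued; entries is the repeating group.
    return {"numMatches": len(rows), "entries": rows}


result = staff_search(conn, "Smith", "J")
```

Note that initial appears only in the search condition. It is not a column of Staff and is not returned to the caller.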
When an association end has a qualifier, the meaning of the multiplicity constraint
at the other end alters. It now represents the number of occurrences of the Staff entity
linked with the StaffList screen for particular values of the qualifier attributes. In this
case, adding the association qualifier hasn’t changed this multiplicity constraint as the
search criteria will match zero or more occurrences of Staff. Multiplicity at the end of
the association with the qualifier is not affected.
The attributes of the association qualifier are part of the association itself, and
provide additional information on the data requirements of the application. Notice that
the attributes of the association qualifier need not be attributes of either of the classes
that participate in the association. Neither the StaffList screen nor the Staff entity has
an attribute called initial. This allows for the situation where the final web page does
not display the search criteria to end users. In other situations, the search criteria will
be displayed on the final web page.
Suppose that the Staff Directory application includes a screen SuppSessSearch that
allows users to enter a day and time and lists all staff who offer support sessions at that
time. It makes sense to echo back the search criteria in this situation, so the screen
SuppSessSearch includes searchDay and searchTime attributes (Figure 11.6). Notice
that these screen attributes have the data type PCData, while the association qualifier
attributes have data types appropriate to data drawn from the database.
The screens named StaffEntry in Figures 11.5 and 11.6 are identical. It would be
feasible, in a combined diagram, to use a single StaffEntry screen and give it
composition associations with both StaffList and SuppSessSearch. The alternative is to
rename one of the screens.
The SuppSessSearch screen draws its data from the Staff entity, but searches the
entity based on attributes of the SupportSession entity. Code listing 11.1.1 shows an
SQL select statement that will achieve this for particular values of the day and start
time. Hence, this is a viable application behaviour.
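As an independent illustration of why such a search is viable (this is not a reproduction of the book's code listing), the sketch below joins the two tables on the staffNo foreign key. It uses Python's sqlite3 module, stores times as 'HH:MM' text because SQLite has no dedicated time type, and uses invented sample data. The function name search_by_support is hypothetical, chosen to echo the operation discussed in the chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, lName TEXT, phone TEXT);
    CREATE TABLE SupportSession (
        staffNo   TEXT REFERENCES Staff(staffNo),
        dayOfWeek TEXT,
        startTime TEXT,   -- 'HH:MM' text; an assumption forced by SQLite
        endTime   TEXT
    );
    INSERT INTO Staff VALUES ('1', 'John', 'Smith', '5555');
    INSERT INTO Staff VALUES ('2', 'Jane', 'Jones', '5556');
    INSERT INTO SupportSession VALUES ('1', 'Monday', '09:00', '10:00');
    INSERT INTO SupportSession VALUES ('2', 'Monday', '09:00', '11:00');
    INSERT INTO SupportSession VALUES ('2', 'Friday', '14:00', '15:00');
""")


def search_by_support(conn, day, start):
    """Return name and phone of staff offering a session on this day at this time."""
    return conn.execute(
        "SELECT s.fName || ' ' || s.lName, s.phone "
        "FROM Staff s JOIN SupportSession ss ON ss.staffNo = s.staffNo "
        "WHERE ss.dayOfWeek = ? AND ss.startTime = ? "
        "ORDER BY s.lName, s.fName",
        (day, start),
    ).fetchall()


monday_nine = search_by_support(conn, "Monday", "09:00")
```

The data returned comes only from Staff, while the search condition uses only SupportSession columns, matching the split between the screen's attributes and the qualifier's attributes.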
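UML class diagram (fragment). The «persistent» Staff entity has the attributes
staffNo : Varchar {PK}, fName : Varchar, lName : Varchar, phone [0..1] : Varchar,
email : Varchar {CK=1}, photo [0..1] : Blob, department : Varchar, title : Varchar,
jobType : Varchar and jobTitle : Varchar, and the operation
searchBySupport(day Varchar, start Time). The method note attached to the operation reads:
‘Finds all occurrences of Staff who offer a SupportSession on the stated day, and at the
stated time’. The Offer association (1..1 to 0..*) links Staff to the «persistent»
SupportSession entity, which has the attributes dayOfWeek : Varchar, startTime : Time and
endTime : Time. The diagram also shows a Searches association with a multiplicity of 0..*.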
The names of these parameters match the two attributes of the association qualifier.
This is the simplest way to indicate that the attributes of the association qualifier are
inputs to the searchBySupport( ) operation.
The operation has, as usual, a method note attached to it to document what the
operation does. Figure 11.7 also shows that the association qualifier can be attached to
any side of the relevant class, as long as its box is drawn a little shorter or narrower
than the class itself.
Only data-centric behaviour, such as database searches, should be included in the
conceptual web data design. Figure 11.7 could include an operation on StaffEntry to
model the merging of the Staff.fName and Staff.lName attributes into the StaffEntry
screen’s name attribute. An alternative would be to model this as a derived attribute on
Staff, possibly with an operation to maintain its value. The SuppSessSearch could also
have an operation to calculate its numMatches attribute, although the attribute name
alone makes it fairly clear how to calculate this particular value. It is a matter of
personal style how far to take this. If the behaviour is already fully worked out in a
separate part of the application design, then there is little benefit to be gained from
redoing that work during web data design. However, it is worth checking that the
database design meets the data requirements of any operations already defined.
The examples so far have only involved a screen retrieving data from the database.
Some screens will modify the database. For example, members of staff can maintain
their support session details.
Figure 11.8 shows a design for the data requirements of this behaviour. The
MaintainSuppSess screen is the main screen. It shows the number of support sessions
found for the member of staff named and contains a repeating group of support
sessions, modelled by the SuppSessEntry screen.
The Searches association indicates that the support session data for MaintainSuppSess
and, by composition, for SuppSessEntry too, is retrieved from the SupportSession
entity. The association qualifier has an attribute staffNo, so support sessions are
retrieved based on their association with a particular occurrence of Staff, via the Offer
association.
A separate association, Modifies, models the actual data modification (either
adding a new support session or modifying an existing one). As it is the individual
support session entries that are modified, this association links SuppSessEntry with
SupportSession. Notice that the multiplicity constraint on Searches means that a
search may not retrieve any support sessions, while that on Modifies means that each
data modification affects exactly one support session.
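The difference between the two multiplicities shows up clearly in code. In the sketch below (Python with sqlite3, offered as an illustration only), the retrieval may return any number of rows, while the modification insists that exactly one row is changed. The sessionId surrogate key is a hypothetical addition, since the text does not say how a single support session is identified, and the function names are likewise invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SupportSession (
        sessionId INTEGER PRIMARY KEY,   -- hypothetical surrogate key
        staffNo TEXT, dayOfWeek TEXT, startTime TEXT, endTime TEXT
    );
    INSERT INTO SupportSession VALUES (1, '1', 'Monday', '09:00', '10:00');
    INSERT INTO SupportSession VALUES (2, '1', 'Friday', '14:00', '15:00');
""")


def retrieve_sessions(conn, staff_no):
    """Searches: zero or more sessions for the qualifier value staffNo."""
    return conn.execute(
        "SELECT sessionId, dayOfWeek, startTime, endTime FROM SupportSession "
        "WHERE staffNo = ? ORDER BY sessionId",
        (staff_no,),
    ).fetchall()


def modify_session(conn, session_id, day, start, end):
    """Modifies: change exactly one session; raise if no such session exists."""
    cur = conn.execute(
        "UPDATE SupportSession SET dayOfWeek = ?, startTime = ?, endTime = ? "
        "WHERE sessionId = ?",
        (day, start, end, session_id),
    )
    if cur.rowcount != 1:
        raise LookupError(f"no support session with id {session_id}")
    conn.commit()


modify_session(conn, 2, "Friday", "15:00", "16:00")
sessions = retrieve_sessions(conn, "1")
```

Only the modification of an existing session is shown; adding a new session would be an insert statement with the same one-row effect.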
One last point is worth noting. The conceptual web data design can actually be
based on the logical database design. This is because it is concerned with developing
a conceptual understanding of the way that the user interface interacts with the
database. The logical database design is a better representation of the actual database,
so should be used in preference to the conceptual database design. In particular, in
a conceptual database design it is possible for associations to have attributes of
their own. If a web page needs to deliver these attributes to end users, then there
will be an association between the screen and the association that has these
attributes rather than between the screen and some persistent entity. This is allowed
in UML, but looks very odd. In a logical database design, all attributes belong to
entities, so the oddity of an association between a screen and another association
will not arise.
Conceptual web data design identifies what data is required by the user interface. Data
delivered to users from the database or gathered from them to update the database is
represented as attributes of a screen. Data gathered from users as search criteria is
represented as attributes of an association qualifier.
Logical web data design takes these requirements and begins to describe how they
will be met. The main focus is on adapting the design to the technology used to deliver
a typical three-tier web database application. The logical web data design models the
split between the web client and web server. Database communication is carried out by
server-side scripts, while the client side deals with user interaction. If the conceptual
web data design was based on the conceptual database design, then the first task of
logical web data design is to revise the UML class diagram to take account of any
changes made in moving from the conceptual to the logical ER diagram. This is mostly
concerned with adding new link and look-up entities and considering how they affect
associations between the screens and database entities. Once this has been done, the
main task of logical web data design begins.
Each screen in the conceptual web data design communicates with the database, so
there will be a server page in the logical web data design. The server page is represented
as a class with the stereotype «server page», as shown in Figure 11.9. The server page
inherits the original screen’s association with the database entity. As before, this
association is navigable in one direction only – from the server page to the database entity.
Figure 11.9 The client page and server page generated from the StaffDetails screen.
The attributes of the StaffDetails server page in Figure 11.9 are slightly different from
those of the StaffDetails screen in Figure 11.2. The staff member’s name is modelled as
fName and lName rather than the single attribute name. This is a minor change and
simply emphasizes that the server page will need to combine the fName and lName
values drawn from the Staff entity to deliver the required data to end users. The data
types of these attributes are different from those of the screen and entity and are
the ones likely to be supported by a server-side scripting language (string, float, int
and so on).
Each screen also communicates with end users. This aspect of the screen’s behaviour is
modelled as a client page – a class with the stereotype «client page» (Figure 11.9).
In the logical web data design, a client page will always be built by some server
page. This reflects how server-side scripting works. An end user actually requests the
server page and this builds an HTTP response, which the web server sends to the end
user as a client page (see Chapter 4). To indicate this, an association stereotyped «build»
is drawn between the client page and the server page that builds it.
The build association is navigable from the server page to the client page. An
occurrence of the server page builds the associated occurrence of the client page, line by
line. Clearly, the server page needs access to the client page, but the client page does
not even need to know which server page built it. The multiplicity constraints on a
build association are always 1..1 at both ends. Each occurrence of the server page is
executed by the web server in response to a single HTTP request and builds a single
occurrence of the client page as the HTTP response. Similarly, each occurrence of
the client page is an HTTP response to a single HTTP request, so was itself built by a
single server page.
In fact, Figure 11.9 represents the simplest possible situation – one screen from the
conceptual web data design has been replaced by a server page and a client page with
a build association between them. This is typical of how the screens in the conceptual
web data design transform into client pages and server pages in the logical web data
design. For each association between a screen and an entity, create a server page
associated with that entity. Turn the original screen into a client page and introduce a
build association between the server page and the client page.
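The build relationship can be pictured as a function that takes database values and produces the text of a web page. The sketch below is a simplified illustration in Python rather than a server-side scripting language such as PHP. The function name build_staff_details_page and the sample values are hypothetical, and the optional photo is left out.

```python
from html import escape


def build_staff_details_page(staff):
    """Server page: turn one Staff row (a dict) into the client page's HTML."""
    # The server page holds fName and lName; the client page shows a single name.
    name = f"{staff['fName']} {staff['lName']}"
    lines = [
        "<html><body>",
        f"<p>Name: {escape(name)}</p>",
        f"<p>Phone: {escape(staff['phone'])}</p>",
        f"<p>Email: {escape(staff['email'])}</p>",
        "</body></html>",
    ]
    return "\n".join(lines)


page = build_staff_details_page(
    {"fName": "John", "lName": "Smith", "phone": "5555", "email": "j.smith@example.org"}
)
```

The function knows how the page is laid out, but the resulting text carries no reference back to the code that produced it, which is the one-way navigability of the build association. Each call yields exactly one page, matching the 1..1 multiplicity at both ends.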
The conceptual web data design for the StaffList screen (Figure 11.5) provides a
more complex example. Here, the screen is split into two parts – StaffList and StaffEntry,
joined by a composition association. This division reflects the need to deliver a
repeating group of attributes to end users, along with a single-valued attribute numMatches.
Clearly, this is a client-side issue, which is reflected in the design of the client pages
in the logical web data design (Figure 11.10). A build association links the main client
page, StaffList, to the server page StaffSearch. There is no need to include a build
association between the server page StaffSearch and the StaffEntry client page as the
composition association indicates that StaffEntry is really just a part of StaffList, so will be
built by the same server page.
Figure 11.10 introduces two additional features of the logical web data design:
Figure 11.10 Logical web data design for searching for staff by surname and initial.
The StaffSearchCriteria class is the form, indicated by the stereotype «form». Notice
that the data types of form attributes are different again as they represent the different
types of control on an HTML form – text, password, tick box and so on.
Each form gathers data from end users and submits the data to a server page for
processing, indicated by a submit association between the form and server page. A submit
association is only navigable from the form to the server page. It is the form that calls
an occurrence of the server page into existence, so clearly the form needs to access the
server page. In contrast, once the server page is up and running, it does not care where
its search criteria came from, so does not need access to the form. The multiplicity at
the server page end of a submit association is always 1..1. The multiplicity at the form
end could be 1..1 or 0..1. The multiplicity 1..1 indicates that the form is the only way
to request the server page and 0..1 that there are other ways to request the server page.
It is useful to include forms such as StaffSearchCriteria in the logical web data design
as they complete the specification of the search behaviour – search criteria are gathered
from users, used to search the database and the search results are delivered back to users.
The alternative is to rely on the association qualifier to indicate what search criteria are
used by a server page.
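The server page's indifference to where its search criteria came from can be seen in a short sketch. Here Python's standard urllib.parse module encodes the data as a browser would when submitting a form and then decodes it on the server side. The field names lName and initial come from the form in the text, while the function name read_search_criteria and its validation rules are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode

# What a browser would send when the StaffSearchCriteria form is submitted.
submitted = urlencode({"lName": "Smith", "initial": "j"})


def read_search_criteria(query_string):
    """Server page side: extract lName and initial, wherever the request came from."""
    fields = parse_qs(query_string)
    l_name = fields.get("lName", [""])[0].strip()
    # Keep only the first character and normalise its case (an assumed rule).
    initial = fields.get("initial", [""])[0].strip()[:1].upper()
    if not l_name or not initial:
        raise ValueError("both lName and initial are required")
    return l_name, initial


criteria = read_search_criteria(submitted)
```

The same function would work equally well for a request typed directly into the browser's address bar, which is the situation a multiplicity of 0..1 at the form end allows for.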
Conallen states that a form must be part of a client page as, on a valid web page, an
HTML form element is always contained in an HTML element. This is true, but it
seems simpler to regard the «form» stereotype as a special kind of «client page» –
one that gathers data from end users. As one client page can include other client
pages (such as repeating groups), there is no problem with a client page containing a
form.
Screens that modify the database instance are more complex again. There are, typically,
two pieces of server-side behaviour:
Figure 11.8 shows the conceptual web data design for the screen that allows users to
maintain support session data. In this case, users are presented with a list of support
sessions and can modify any one of them. The main screen, MaintainSuppSess, is
involved in retrieving data from the database (the Searches association). Clearly it must
split into a client page and a server page. The subscreen SuppSessEntry displays an
existing support session, gathers changes to this from users and writes the changed
support session back to the database (the Modifies association). It, too, must split –
this time into a form and a server page.
Figure 11.11 shows the logical web data design for the above. The original
MaintainSuppSess screen becomes the client page MaintainSuppSess and the server
page RetrieveSuppSess. The server page RetrieveSuppSess inherits the Searches
association with the SupportSession entity. The SuppSessEntry screen becomes the
SuppSessEntry form and the server page ModifySuppSess. SuppSessEntry must be a
form rather than a plain client page because it allows users to change its attribute values.
Figure 11.11 Logical web data design for database retrieval and update behaviour.
The diagram shows the «form» SuppSessEntry (dayOfWeek, startTime and endTime, all PCData)
joined by a «submit» association to the «server page» ModifySuppSess, which has the
Modifies association with the «persistent» SupportSession entity. The Searches association
carries the qualifier attribute staffNo : Varchar, and the Offer association links
SupportSession to the «persistent» Staff entity. The attributes staffName : PCData and
numSuppSess : PCData also appear in the diagram.
The original SuppSessEntry screen was associated with a database entity via Modifies
and, here, the server page ModifySuppSess inherits this association. ModifySuppSess is
also associated with the SuppSessEntry form, but it is not a build association. The
purpose of the server page is to accept data from the form and write it to the database.
Hence, the two are connected by a submit association. The composition association
between MaintainSuppSess and SuppSessEntry is inherited directly from the conceptual
web data design. It implies that SuppSessEntry is part of the MaintainSuppSess client
page, so RetrieveSuppSess must build both the client page and the form it contains.
This example is more complex than that of the search behaviour, but the same
general rules clearly apply. Screens associated with a database entity split into a server
page, which handles communication with the database, and a client page or form,
which handles communication with end users. If the original screen only delivered
data to users, then it would become a client page and have a build association with the
server page. If the original screen only gathered data from users, then it would become
a form and have a submit association with its server page. When, however, a single
screen both retrieves data from the database and gathers data from end users, it will
have two separate associations with the database – one to represent data retrieval and
one to represent data modification. The screen itself therefore becomes a form as it
must both deliver data to and gather data from end users. Each of the associations
generates a server page – one to build the form and one for the form to submit its data to.
An alternative approach is to have a single server page and model the different parts of
the behaviour as different operations of that page.
The actual implementation as, for example, PHP scripts may bundle two server
pages into a single script, or split the operations of a single server page into two or
more scripts. This is because UML is an object-orientated modelling language, but
server-side scripts tend to be procedural (though PHP does support the
object-orientated approach).
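One common way to bundle two server pages into a single procedural script is to branch on the HTTP request method. The sketch below illustrates the idea in Python, with a dictionary standing in for the SupportSession table. All of the names, the sessionId key and the choice of GET for retrieval and POST for modification are assumptions made for the example, not something the design dictates.

```python
# Hypothetical in-memory stand-in for the SupportSession table.
sessions = {1: {"staffNo": "1", "dayOfWeek": "Monday", "startTime": "09:00"}}


def retrieve_supp_sess(params):
    """Plays the role of the RetrieveSuppSess server page."""
    return [s for s in sessions.values() if s["staffNo"] == params["staffNo"]]


def modify_supp_sess(params):
    """Plays the role of the ModifySuppSess server page."""
    sessions[params["sessionId"]].update(
        dayOfWeek=params["dayOfWeek"], startTime=params["startTime"]
    )
    return sessions[params["sessionId"]]


def handle_request(method, params):
    """One script, two server pages: dispatch on the HTTP request method."""
    if method == "GET":
        return retrieve_supp_sess(params)
    if method == "POST":
        return modify_supp_sess(params)
    raise ValueError(f"unsupported method: {method}")


updated = handle_request(
    "POST", {"sessionId": 1, "dayOfWeek": "Tuesday", "startTime": "10:00"}
)
listed = handle_request("GET", {"staffNo": "1"})
```

The two functions remain recognisable as the two server pages of the logical design, even though a single entry point receives every request.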
Figure 11.12 Logical web data design for the complete staff search behaviour.
The diagram chains together the «form» StaffSearchCriteria (lName : Text, initial : Text),
the «server page» StaffSearch, the «client page» StaffList (numMatches : PCData) with its
StaffEntry part (name : PCData, phone : PCData), the «server page» StaffDetails and the
«client page» StaffContactDetails (name, phone and email, all PCData, and
photo [0..1] : Image). The form is joined to StaffSearch by a «submit» association and each
server page to its client page by a «build» association. A «link» association is shown
beside StaffEntry. StaffSearch has the DrawnFrom association, qualified by lName and
initial, and StaffDetails has the DisplayedOn association with the «persistent» Staff entity.
Finally, plain associations connect server pages to the database entities they retrieve
data from or modify. Where a search is not by candidate key, an association qualifier at
the server page end of the association indicates the search criteria (the DrawnFrom
association). Where the search is by candidate key, the association qualifier is usually
omitted (the DisplayedOn association).
Table 11.1 summarizes the stereotypes used in logical web data design to represent
the different elements of a web database application.
The aim of physical web data design is to consider how the particular choice of web
technologies affects the way the web pages communicate with the database. This is
quite a narrow scope, excluding many Web-only matters, such as optimizing the
workings of the web server itself. Another factor narrowing the scope of this design task is
that physical database design may already have been done (see Chapter 10). If not,
then it is likely that the issues discussed here will be considered at the same time as the
general physical database design issues.
In physical web data design, improving response times is an important task. Web
database applications tend to have slower response times than those using a traditional
graphical user interface (GUI). There are three good reasons for this. The stateless
request–response model of the HTTP protocol means that web client and web server
must re-establish contact for each request whereas a traditional GUI client will have a
permanent connection to its server, even in a three-tier architecture.
The distance (in network terms) between the web client and the web server is
greater as they tend to run over an internet (public or private) whereas GUI
applications tend to run over local area networks (LANs).
Finally, web database applications have a three-tier architecture. The web server will
usually need to communicate with the database server before responding to an HTTP
request and may need to re-establish a connection first. In a traditional database
application, however, the connection is persistent.
One approach to improving the response times for a web page is to ensure that it
queries a single database table. Single-table selects tend to be quicker than table joins
and are certainly quicker than executing two select statements and trying to combine
their data in the PHP (or other server-side) code. This is the approach taken in the Staff
Directory – one web page delivers staff contact details and a second delivers details of
his or her support sessions (Figures 11.13 (a) and (b)).
However, each separate web page entails an additional HTTP request. If the
additional information is viewed infrequently, then the performance cost of the occasional
additional HTTP request is probably acceptable. If, instead, it is almost always viewed
by end users, then both the performance cost of the additional HTTP request and the
plain inconvenience of having to click on that hyperlink will outweigh the cost of
implementing a table join. In this latter case, the design should combine staff contact
details and support sessions on a single web page.
Once the decision to deliver data from two tables on a single web page has been
taken, the design should ensure that the database query is quick and easy. The data-
base designer should certainly consider adding an index on the foreign key columns –
the SupportSession.staffNo column in this case. Another refinement is to
create a view. As mentioned in Chapter 6, the view definition – a join query in this
case – is stored in the data dictionary so the DBMS can optimize how the query is
executed. This means that querying the view can be quicker than querying
the underlying base tables with a join query, although the gain depends on the DBMS.
Another benefit of views is that they simplify the server page. The application
developer must still write an SQL select statement, but it involves only a single
table (remember, a view is a derived table).
Figure 11.13 Using separate pages for staff details and their support sessions.
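The index and the view described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module rather than the MySQL DBMS and PHP used in the book. The view name StaffSupport, the index name, the SupportSession columns other than staffNo, and the sample data are all assumptions made for the example. The sketch shows only how the view simplifies the query written in the server-side code; it says nothing about performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Staff (
        staffNo TEXT PRIMARY KEY,
        fName   TEXT,
        lName   TEXT,
        phone   TEXT
    );
    -- Column names other than staffNo are assumed for this sketch.
    CREATE TABLE SupportSession (
        staffNo   TEXT REFERENCES Staff(staffNo),
        dayOfWeek TEXT,
        startTime TEXT,
        endTime   TEXT
    );
    -- Index on the foreign key column used by the join.
    CREATE INDEX idxSupportSessionStaffNo ON SupportSession (staffNo);
    -- The join query is written once, in the view definition.
    CREATE VIEW StaffSupport AS
        SELECT s.staffNo, s.fName, s.lName, s.phone,
               ss.dayOfWeek, ss.startTime, ss.endTime
        FROM Staff s JOIN SupportSession ss ON ss.staffNo = s.staffNo;
    INSERT INTO Staff VALUES ('31210', 'Paul', 'Smith', '8706');
    INSERT INTO SupportSession VALUES ('31210', 'Monday', '09:00', '10:00');
    INSERT INTO SupportSession VALUES ('31210', 'Thursday', '14:00', '15:00');
""")

def support_sessions(connection, staff_no):
    """Query the view as if it were a single table."""
    return connection.execute(
        "SELECT lName, dayOfWeek, startTime FROM StaffSupport "
        "WHERE staffNo = ? ORDER BY dayOfWeek",
        (staff_no,),
    ).fetchall()

rows = support_sessions(conn, "31210")
```

The select statement in support_sessions names only the view, so the server-side code never mentions the join.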
Database modification statements can be shifted entirely to the database server by
using a stored function. A stored function is a piece of application code stored in the data-
base, like a trigger. The main difference is that triggers execute in response to database
events, whereas stored functions are executed on the instructions of the database client.
SQL stored functions have the same role as PHP functions and are defined in a similar
way. The SQL create function statement instructs the DBMS to create a stored function
on the database. Figure 11.14 shows the structure of the SQL create function statement. It
begins with the keywords CREATE FUNCTION, followed by an SQL identifier – the
name of the stored function. There is an optional list of parameters, consisting of a para-
meter name (an SQL identifier as usual) and data type. The parameter list is followed by
the keyword RETURNS and the data type of the value returned by the function.
SQL also allows you to create stored procedures. These have no return value.
Instead, each parameter declaration begins with one of the three keywords
IN, OUT and INOUT. These indicate whether the parameter passes data into the
procedure from the calling program, back to the calling program, or both. Data can
only be passed back to the calling program via OUT and INOUT parameters. Because
a procedure can declare several such parameters, it can pass more than one value
back to the calling program, whereas a stored function returns a single value.
Figure 11.14 The ISO SQL:2003 syntax for creating a stored function.
CREATE FUNCTION functionName
  (parameter1 VARCHAR,    -- a list of parameters to the function
   parameter2 NUMERIC)
RETURNS NUMERIC
BEGIN
  functionCode            -- the code to execute; may include SQL procedural
                          -- and data manipulation statements
END;
Stored functions are often used as wrapper functions for SQL insert, update and delete
statements. They are particularly useful when data must be validated before an insert
or update statement or when a delete statement needs to remove rows from two tables
linked by a foreign key. Their main advantage is that they can simplify the coding of
server pages. A server page does not need to include the data manipulation statements –
it simply passes data to the stored function. Code listing 11.3.1 shows how to create a
wrapper function for deleting staff and their support sessions.
Within the function code, each statement is terminated by a semicolon. This causes
problems as the first semicolon would usually terminate the create function statement
itself! The MySQL DBMS gets round this by allowing the application developer
to set a different SQL statement delimiter. The first line in Code listing 11.3.1 sets
the SQL statement delimiter to ‘//’ and the last line sets it back to the usual ‘;’.
The MySQL DBMS then regards every semicolon up to and including the one after END
as part of the stored function definition and the ‘//’ on the line after END as the
create function statement delimiter.
Notice that the parameter name begins with the prefix prm. This is a coding convention
to ensure that parameter names and column names don’t clash. As the function is
declared to return a Boolean value, its body must finish with a return statement. The
MySQL DBMS also requires a length for a VARCHAR parameter; the length of 10 used
here is an assumption.
Code listing 11.3.1 A stored function to remove a member of staff and all his
or her support sessions using the MySQL DBMS
DELIMITER //
-- The VARCHAR length (10) is an assumption; match it to Staff.staffNo.
CREATE FUNCTION DeleteStaff (prmStaffNo VARCHAR(10))
RETURNS BOOLEAN
BEGIN
  -- Remove the referencing rows first, then the member of staff.
  DELETE FROM SupportSession
  WHERE staffNo=prmStaffNo;
  DELETE FROM Staff
  WHERE staffNo=prmStaffNo;
  RETURN TRUE;
END;
//
DELIMITER ;
Some forms and client pages include repeating groups of data. If the maximum
number of entries in the repeating group will always be small, then they can easily be
delivered on a single web page. This is the case with the list of support sessions offered
by a member of staff (Figure 11.13 (b)), but may not be the case with the list of staff
members matching some given search criteria.
In a traditional GUI, such repeating groups are displayed in a scrolling list. This is
an option here as HTML form controls include scrolling lists, but such lists limit the
presentation options.
The most common alternative is to present a long list as a series of web pages, each
delivering a fixed, maximum number of rows. This complicates the task of the server
page, though. First, it will be called repeatedly as the user navigates through the list.
Second, each time it is called, it must retrieve from the database the next set of matches
to the original database query and provide a navigation mechanism so that users can
move forwards or backwards through the list.
Some DBMS have extended the syntax of the select statement to allow users to specify
which rows of the result set to return (see Chapter 7). Code listing 11.3.2 shows the
MySQL DBMS limit clause. Its first number is the offset of the first row to return
(the first row is row 0, the second row is row 1 and so on) and its second number is
the maximum number of rows to return, so the select statement will return the sixth
to tenth rows of the result set. The order by clause keeps the rows in the same order
from one request to the next, so that the chunks do not overlap. This provides
the application developer with a way to control which chunks of the result set the server
page displays. On the first page, it will display the first chunk of records, say five, and
a link to the ‘next’ page. This link simply calls the same server page, but instructs it to
change the limit clause on the select statement so it shows the next chunk of records.
Code listing 11.3.2 Specifying which rows of the result set to return using
the limit clause of the MySQL DBMS
SELECT staffNo, CONCAT(fName, ' ', lName), phone
FROM Staff
WHERE lName='Smith'
AND fName LIKE 'a%'
ORDER BY staffNo
LIMIT 5,5
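The way a server page might turn a page number into a limit clause can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module in place of PHP and the MySQL DBMS. The function name fetch_page, the page size and the sample rows are assumptions made for the example. SQLite writes the clause as LIMIT count OFFSET offset and concatenates strings with ||.

```python
import sqlite3

PAGE_SIZE = 5  # rows per web page (an assumed value)

def fetch_page(connection, last_name, page):
    """Return one page of matching staff; pages are numbered from 0."""
    offset = page * PAGE_SIZE
    return connection.execute(
        "SELECT staffNo, fName || ' ' || lName, phone FROM Staff "
        "WHERE lName = ? ORDER BY staffNo LIMIT ? OFFSET ?",
        (last_name, PAGE_SIZE, offset),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Staff (staffNo TEXT PRIMARY KEY, fName TEXT, lName TEXT, phone TEXT)"
)
# Twelve hypothetical members of staff, all called Smith.
conn.executemany(
    "INSERT INTO Staff VALUES (?, ?, 'Smith', ?)",
    [(f"{n:05d}", f"Ann{n}", f"87{n:02d}") for n in range(1, 13)],
)

first_page = fetch_page(conn, "Smith", 0)   # rows 1 to 5
second_page = fetch_page(conn, "Smith", 1)  # rows 6 to 10
third_page = fetch_page(conn, "Smith", 2)   # the remaining 2 rows
```

A ‘next’ link only needs to carry the page number. A page that comes back with fewer than PAGE_SIZE rows tells the server page that the end of the list has been reached.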
Deciding whether to present lists as single or multiple web pages involves assessing
the performance of the two approaches. Where the results are presented in a single
web page, there is a single database query and a single HTTP request. Where results
are presented across multiple web pages, there are multiple database queries, which
increase the workload of the database server, and multiple HTTP requests, which
increase the workload of the web server. For medium-sized result sets, of a few tens of
records, the single-page option is clearly more efficient. Even so, the user experience
design may include a non-functional requirement banning ‘long’ web pages. In this
case, the elegance of the presentation may outweigh the efficiency gains.
For very large result sets, a single web page may actually be less efficient than mul-
tiple web pages. There are two reasons for this. First, an HTTP response is sent across
the network as a series of IP packets (see Chapter 2). These must be reassembled at
the client into the original HTTP response message. The larger the HTTP response, the
more packets there are and the longer it can take to do this and actually display
the web page. Sending just the first few rows of the result set will mean that they are
displayed quickly and, with luck, users will find what they are looking for on this first
page. This is the approach most search engines take – send the top ten results first and
you may never have to send the rest. The second reason for multiple pages perhaps
being more efficient is that the DBMS will remember the execution plan for the select
statement, so, when the second chunk of data is requested, the select statement will
execute faster – the additional burden on the database server is therefore not as great
as might be expected.
The physical web data design issues discussed above focus on how to improve per-
formance by changing the server-side script’s code. However, there is one technical
issue that should be considered as part of physical web data design – how to manage
connections to the DBMS.
Chapter 4 explained that it is good practice to close a DBMS connection at the end
of each PHP script using the mysql_close() function. If the application developer
does not close the connection, then, usually, the application server will. This is because
the DBMS must devote some of its resources to checking all open connections
for activity. If there are a lot of open, but unused, connections, then the DBMS is wasting
its resources. Thus, it makes sense to close the DBMS connection when a server-side
script finishes.
The problem with this is that opening a new DBMS connection also takes up re-
sources, as discussed in Chapter 2. One option is to try and reuse the DBMS connection
by having a server-side script open a persistent connection. Persistent connections are
simply DBMS connections that do not close when the server-side script finishes. In-
stead, the application server keeps the connection open. The next time a server-side
script tries to open a connection to the same DBMS using the same authorization iden-
tifier, the application server simply reuses the one it currently has open. Persistent con-
nections make sense when a web database application gets a lot of use. In such cases,
there is always likely to be a server-side script that can reuse the persistent connection.
As the persistent connection is then rarely idle, the DBMS is not wasting resources by
checking it for activity. PHP includes the function mysql_pconnect() to establish a
persistent connection, as shown in the PHP fragment in Code listing 11.3.3.
‘HTTP subservers’ is not a widely used term, though it captures the general idea.
Only multiprocess (such as Apache) or multithreaded (such as Microsoft Internet In-
formation Server) HTTP servers create them.
The specific techniques for managing the HTTP subservers in multiprocess and
multithreaded environments aren’t important here. What matters is that such HTTP
servers will hang on to any persistent connections for quite a while. PHP persistent
connections cannot be closed by the mysql_close() function, either. They really do
persist, so use them with care.
Chapter summary
■ This chapter has discussed web data design, which acts as an additional check that
the database design can meet all the requirements of the web database application.
It extends the ER diagram, which only includes database objects, to model objects
from the user interface using the same UML notation.
■ Conceptual web data design gives a good overview of what data is used by which
pieces of user behaviour. Logical web data design provides a way to map these data
requirements on to the client and server pages of a web database application. Phys-
ical web data design considers performance issues. The names for the three stages
are not standard, but are used by analogy with the well-established names for the
three stages of database design.
Further reading
The stereotypes used in web data design represent only a portion of Conallen’s web
application extensions for UML. A fuller discussion of his approach is presented in
his book on the topic (Conallen, 2003). Ceri et al. (2003) provides the official WebML
reference.
Review questions
11.1 Explain what is done during the process of web data design. Distinguish between
the conceptual, logical and physical stages of web data design.
11.2 Explain the following concepts used during conceptual web data design:
(a) screen
(b) association navigability
(c) association qualifier.
11.3 Explain the two approaches to representing repeating groups during conceptual
web data design.
11.4 Explain the following concepts used during logical web data design:
(a) server page
(b) client page
(c) build association
(d) form
(e) submit association
(f) link association.
11.5 Explain how a screen from the conceptual web data design is represented in the
logical web data design.
11.6 Explain the following concepts used during physical web data design:
(a) stored function
(b) wrapper function.
11.7 Explain why it is better to design web pages that query a single table. When might
this approach be inappropriate?
11.9 Explain the benefits and drawbacks of presenting a result set across multiple web
pages.
Exercises
11.11 The Web Timetable case study is documented in Appendix B. Figure 9.18 is a logi-
cal ER diagram for this case study. Using this logical ER diagram as a basis, do the
following.
(a) Develop a conceptual web data design for the ‘View room facilities’ use case in
Appendix B.
(b) Develop a conceptual web data design for the ‘Book additional facilities’ use
case in Appendix B. Explain why it is better to use the logical ER diagram in
Figure 9.18 as the basis for this conceptual web data design than the concep-
tual ER diagram in Figure 9.17.
(c) Complete the conceptual web data design for the Web Timetable case study in
Appendix B by drawing one or more diagrams for each of the remaining use
cases.
11.13 In Exercise 9.13 you developed a logical ER diagram for the Pennine University’s
Accommodation Office. Using that ER diagram as a basis, develop conceptual
web data designs for the following web pages:
(a) a web page to allow staff to list all rental agreements for a particular student
(b) a web page to allow staff to amend the details of a hall of residence
(c) a web page to allow staff to add a new rental agreement for an existing student
to rent an existing room.
11.14 Using your solutions to Exercise 11.13 above as a starting point, develop logical
web data designs for each of those web pages.
11.15 What physical design issues should be addressed regarding the web data designs
for Exercise 11.14 above?
11.16 Write two sets of server-side scripts to allow users to add, modify and remove sup-
port sessions. One set of scripts should code the SQL data modification state-
ments themselves, while the second should use wrapper functions, calling these
from the server-side script. Critically evaluate the two approaches. Which is easier
to write? Which will be easier to maintain?
Investigations
11.17 Investigate alternative approaches to web data design. Possible sources include
Ceri et al. (2003) for WebML, Atzeni et al. (1999) and Eaglestone and Ridley (2001).
How do they compare to the approach described in this chapter?
11.18 It should be clear from the exercises that a lot of web data designs follow a simi-
lar pattern. The study of such design patterns is a well-established area of re-
search. Investigate the use of design patterns in the context of web database
applications.
11.19 In Investigation 9.19 you developed a logical database design for an organization
that you know. Use this as the basis for developing a set of conceptual and logical
web data designs for a representative range of web application functions (such as
search, item details, data modification). What physical web data design decisions
would you make for your application?
11.20 Most books on web design pay little attention to the sort of functional design dis-
cussed in this chapter. Conversely, this chapter has not covered the user experi-
ence design issues that are more familiar to web designers. Investigate the sorts of
web design issues more commonly discussed in web design textbooks.
12 XML
Chapter objectives
Chapter outline
There is a lot of activity around XML at the moment, and a lot of interest in using it as
an alternative to relational databases. The pros and cons of XML as a data model were
discussed in Chapter 3. This chapter focuses on XML technologies and approaches to
database design for XML.
As XML technologies emerged more recently than relational DBMS, there is no sta-
ble list of widely used XML technologies. This chapter focuses on two technologies to
support the database approach that are now fairly stable.
The RELAX NG schema language for XML allows the XML database developer to
specify the general structure of documents for an XML application. Together with a
validating parser, it plays a role analogous to the SQL data definition language’s create
table statement. RELAX NG forms part of a suite of proposed ISO international stan-
dards for document schema definition languages, so is likely to be in use for some
time.
The XPath language provides a way to locate and, to some extent, extract data from
an XML instance document. The analogy with the SQL data manipulation language’s
select statement is a little more tenuous, though XPath could be regarded as a more
powerful version of the where clause. There are more powerful XML query languages,
but as yet they are neither widely implemented nor widely used.
Approaches to logical design for XML are more stable. The availability of relational
style keys in XML means that many of the techniques used in logical design for the re-
lational data model can be applied, with some tweaks, to logical design for XML. The
added complexity of the XDM organization structures allows greater flexibility when
translating from the organization structures of the conceptual design to those of the
XDM. However, the general approach should be familiar.
Before delving into XML technologies, one omission from the discussion in Chap-
ter 3 needs to be remedied: XML namespaces. They were omitted before as the XDM is
complicated enough in itself. Namespaces allow two or more XML languages to be
used simultaneously in the same XML document without the parsers getting confused.
They are discussed first.
■ Section 12.1 discusses XML namespaces.
■ Section 12.2 introduces the XML schema language RELAX NG and discusses how
to use it to define the structure of an XML document model. It also discusses sup-
port for relational style keys by means of the RELAX NG DTD compatibility data
type library.
■ Section 12.3 introduces the XPath language, as an example of a simple query lan-
guage for XML. XPath is used in more powerful query languages, such as XQuery
and SQL/XML, so makes a good first XML query language.
■ Section 12.4 discusses logical design for XML.
A common problem in computing is how to deal with the situation when two differ-
ent things share the same name. This problem arises in XML because an XML instance
document can use mark-up from two or more different document models.
Code listing 12.1.1 shows a fragment of an XML instance document from the Web
Timetable database (the line numbers are not part of the XML document). This exam-
ple shows a module element from the Web Timetable XML document model. Each
module element has code, title, department and level child elements.
The title element (lines 4–7) has two child elements, which are a mandatory
mainTitle element and an optional subTitle element.
The module element also includes a leader child element, which captures infor-
mation on the module leader. As module leaders are members of staff, the child ele-
ments of the leader element are drawn from the Staff Directory document model.
Unfortunately, this document model also includes a title element. This title
element contains only text. It is clear to human eyes that the two title elements – in
lines 4 and 12 – are different kinds of element that just happen to share a name, but an
XML parser expects all title elements to have the same content model.
Code listing 12.1.1 A fragment of an XML instance document from the Web Timetable database
2. <module validated="2004-04-21">
3.   <code>CCFC0108</code>
4.   <title>
5.     <mainTitle>Introduction to programming</mainTitle>
6.     <subTitle>C# for beginners</subTitle>
7.   </title>
8.   <department>Computing</department>
9.   <level>F</level>
10.  <leader>
11.    <staffNo>31210</staffNo>
12.    <title>Dr</title>
13.    <name>Paul Smith</name>
14.    <email>[email protected]</email>
15.  </leader>
16. </module>
If both title elements had the same content model (if both contained only text, for
example), then it would be acceptable, from an XML perspective, to reuse the title
element in these two different contexts. Even so, from the database perspective, the sit-
uation could prove confusing as they represent different kinds of title – the title of a
module and the title, or honorific, appropriate to a person. When, as in Code listing
12.1.1, two elements with different content models have the same name, the XML parser
needs to know which document model each element belongs to, so that it can check
each element’s structure against the right document model. If the XML parser checked
the title element in line 12 against the Web Timetable document model, it would re-
port an error – there is no mainTitle child element, and this title element cannot
itself contain text.
A similar problem occurs with attributes. An XML document model can define at-
tributes independently of elements. This provides a way to reuse attribute declara-
tions. Standard sets of attributes are declared once and used in many different
elements. XHTML provides a good example of this. Most of the XHTML elements
include the attributes class, id, lang, style and title and they always
have the same meaning. It makes sense to define this list of attributes once and
include the whole list in every element that needs it. As one document model can use
attribute lists declared in another, there is the potential for two different attributes to
share the same name. For XML to work, there must be a mechanism for resolving
ambiguous element and attribute names.
The problem of ambiguous names also cropped up in the relational data language
SQL. In the WebTimetable database, the Module and Course tables both include
a column called code. An SQL join query using these two tables must distinguish
between these two different code columns. The approach taken is to prefix the
column name with the name of the table it belongs to – Course.code and
Module.code remove the ambiguity. This solution works because, within a table, all
column names must be unique. In an XML document model, all element names must
be unique. By analogy, any ambiguity over which document model an element
belongs to can be resolved by prefixing the element name with the document model
name.
The approach taken with attributes is slightly different. Within an element, attribute
names must be unique. This might suggest that, to disambiguate attribute names, they
should be prefixed with the element and document model names. However, attributes
may be declared in a different document model from their parent element. This means
that only the document model name is used to disambiguate attribute names.
The XML namespaces specification (W3C, 2004b) implements this general approach
to dealing with ambiguous names in mark-up. The name of any element or attribute
used in a particular document model is called its local name. The document model it-
self is given a namespace name. The namespace name is a globally unique identifier –
no two document models will ever share a namespace name. The expanded name of
any element or attribute consists of its namespace name and its local name. This ex-
panded name uniquely identifies a particular kind of element or attribute. It also tells
the XML parser which document model to use to check the element’s structure or the
attribute’s data type.
As namespace names are supposed to be globally unique identifiers, there needs to
be a mechanism for ensuring that two document model authors choose different
namespace names for their document models. Rather than invent a new set of standards,
the W3C decided to use URIs as namespace names. A URI based on a web domain can be
kept globally unique because only the owner of that domain is entitled to assign URIs
within it. There is no intention
for the document model to actually exist at the URI used as its namespace name. In
fact, there may be nothing at the URI. The URI is simply a convenient, widely used
mechanism for defining unique names.
Using URIs as namespace names solves the uniqueness problem, but introduces
another. URIs can include characters such as ‘/’, ‘~’ and ‘%’, which cannot be used in
XML element and attribute names. For example, the namespace name for the Web
Timetable document model is http://www.pennine.ac.uk/WebTimetable,
but this cannot be used in an XML instance document as part of an element or
attribute name.
The solution to this problem is to use a local alias for the namespace name,
called the namespace prefix. Each XML instance document defines its own name-
space prefix for a document model, though each namespace prefix is bound to
the same namespace name. So, for example, the namespace name http://
www.pennine.ac.uk/WebTimetable could sensibly be bound to the name-
space prefix wtt. Less sensibly it could be bound to the namespace prefix
onlyFoolsUseMeaninglessPrefixes. The choice is down to the author of the
particular XML instance document. The only restriction is that the namespace prefix
must be a valid XML name token and cannot include the colon – ‘:’. Together, the
namespace prefix and the local name form the qualified name. Within an XML
instance document, the format for a qualified name is:
namespacePrefix:localName
This is very similar to the SQL qualified identifier, which takes the form
tableName.columnName. The main difference is that the separator for an XML-
qualified name is the colon, not a full stop. (That is why colons are not permitted in
the namespace prefix.) Code listing 12.1.2 adds namespace prefixes to the code from
Code listing 12.1.1 – wtt for the Web Timetable document model and sdir for the
Staff Directory document model. Every element and attribute name in this example is
a qualified name.
In fact, it’s possible to bind different namespace prefixes to the same namespace
name at different points in an XML instance document. This may be necessary when
combining XML mark-up from multiple sources, but remember that unnecessary
complexity should be avoided.
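A short sketch may make the point about rebinding clearer. It uses Python's standard xml.etree.ElementTree parser, which reports each element by its expanded name in the form {namespaceName}localName. The second prefix, tt, is a hypothetical choice made for this example, and the fragment is a cut-down version of the module element.

```python
import xml.etree.ElementTree as ET

WTT = "http://www.pennine.ac.uk/WebTimetable"

# Two different prefixes (wtt and tt) are bound to the same namespace name.
doc = f"""
<wtt:module xmlns:wtt="{WTT}">
  <wtt:code>CCFC0108</wtt:code>
  <tt:level xmlns:tt="{WTT}">F</tt:level>
</wtt:module>
""".strip()

root = ET.fromstring(doc)
# The parser discards the prefixes and keeps the expanded names.
tags = [child.tag for child in root]
```

Both child elements end up in the Web Timetable namespace, so the choice of prefix makes no difference to the parser. It affects only how easy the document is for a person to read.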
xmlns:namespacePrefix="namespaceName"
First comes the reserved string of characters ‘xmlns:’. Following this comes the
namespace prefix, an equals sign, and finally the actual namespace name, enclosed in
quotation marks. The XML specification actually describes namespace declarations as
‘a family of reserved attributes’ (W3C, 2004b). What this means is that, within the
mark-up of an XML instance document, namespace declarations are treated just like
attributes of the element they appear in. However, a validating XML parser treats
namespace declarations quite differently from attributes.
First, a namespace declaration becomes a namespace node in the tree representa-
tion of the XML document, not an attribute node. Second, an element’s descendants
inherit its namespace declarations. This means that the namespace prefix declared in a
namespace declaration can be used in any child element of the current element and in
their child elements and so on. Attributes are definitely not inherited by an element’s
descendants.
Code listing 12.1.3 includes two namespace declarations in the start tag of the module
element. The first, in line 2, binds the namespace prefix wtt to the namespace
name http://www.pennine.ac.uk/WebTimetable. This namespace prefix is
used in the qualified name of the module element itself, which comes before the
namespace declaration. The scope of a namespace declaration is the whole of the
element it appears in, including the start tag itself, so this is allowed. Once declared,
the wtt namespace prefix can be used to create a qualified name for any descendants
and their attributes. Indeed, the scope of the namespace declaration includes the
module element’s descendants and their attributes.
There is another namespace declaration in the module element’s start tag. Line 3
binds the prefix sdir to the namespace name
http://www.pennine.ac.uk/StaffDirectory. This namespace prefix isn’t used until line 13, where it is
used in the qualified name sdir:staffNo for the staffNo element. This element
is a child of a child of the module element, so is within the scope of the namespace
declaration. If the sdir namespace prefix is only ever used in descendants of the
leader element, then it might make more sense to declare this namespace in the
start tag of the leader element. Whether to declare all the namespaces early in the doc-
ument tree or only when they are required is a matter of personal style. However, if
namespaces are declared further down the document tree, they may need to be de-
clared more than once.
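The effect of declaring both namespaces in the start tag of the module element can be checked with a short sketch. It uses Python's standard xml.etree.ElementTree parser, and the XML is a cut-down fragment written for this example rather than the full listing. Because descendants inherit the declarations, the sdir prefix can be used two levels down, and the two title elements end up with different expanded names.

```python
import xml.etree.ElementTree as ET

WTT = "http://www.pennine.ac.uk/WebTimetable"
SDIR = "http://www.pennine.ac.uk/StaffDirectory"

# Both prefixes are declared once, on the root element.
doc = f"""
<wtt:module xmlns:wtt="{WTT}" xmlns:sdir="{SDIR}">
  <wtt:title>
    <wtt:mainTitle>Introduction to programming</wtt:mainTitle>
  </wtt:title>
  <wtt:leader>
    <sdir:staffNo>31210</sdir:staffNo>
    <sdir:title>Dr</sdir:title>
  </wtt:leader>
</wtt:module>
""".strip()

root = ET.fromstring(doc)
# The prefix map used for searching is local to this program;
# only the namespace names have to match the document.
prefixes = {"wtt": WTT, "sdir": SDIR}
module_title = root.find("wtt:title", prefixes)
leader_title = root.find("wtt:leader/sdir:title", prefixes)
```

Here module_title.tag and leader_title.tag differ only in the namespace name, which is the information a validating parser needs to pick the right document model for each title element.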
namespace. Every element name that doesn’t have a prefix has an empty prefix, so is
bound to the default namespace.
unprefixed attribute names, particularly XML parsers. The documentation for the
XML processor you are using should provide guidance on this.
Code listing 12.1.5 Unprefixed attributes are not in the same namespace as
their parent element (W3C, 2004b)
1. <!-- http://www.w3.org is bound to n1 and is the default -->
2. <x xmlns:n1="http://www.w3.org"
3.    xmlns="http://www.w3.org" >
4.   <good a="1" b="2" />
5.   <good a="1" n1:a="2" />
6. </x>
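How one particular XML processor treats the unprefixed attributes in this listing can be seen in the following sketch, which feeds the same mark-up to Python's standard xml.etree.ElementTree parser. Other processors may present the information differently, so this shows one processor's behaviour rather than a general rule.

```python
import xml.etree.ElementTree as ET

doc = """
<x xmlns:n1="http://www.w3.org" xmlns="http://www.w3.org">
  <good a="1" b="2" />
  <good a="1" n1:a="2" />
</x>
""".strip()

root = ET.fromstring(doc)
second = list(root)[1]  # the second good element
# Elements without a prefix take the default namespace ...
element_name = second.tag
# ... but attributes without a prefix are in no namespace at all.
attribute_names = sorted(second.attrib)
```

The second good element is legal because its two attributes have different expanded names: a is in no namespace, whereas n1:a is in the http://www.w3.org namespace.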
A schema language for XML provides a language for specifying XML document mod-
els. Each schema document describes the document model for a particular XML appli-
cation. The structure of an instance document of a particular XML document model
must match the description in the relevant schema document. A validating XML
parser checks the instance document against the schema document. The validating
XML parser needs to check that the instance document only contains elements speci-
fied in the schema document and the content of these elements is correct. It also
needs to check that each element in the instance document has the right attributes.
This is the very minimum that a validating XML parser should do, so any schema lan-
guage for XML must be able to describe elements, their attributes and content models.
Most validating XML parsers will ensure that each attribute value has the correct data
type and many will do this for text nodes, too, adding another requirement for XML
schema languages.
Beyond this basic structural validation, an XML parser may also want to check a
range of integrity constraints. The XML analogues of relational candidate and foreign
key constraints are a good example. Business rules need to be enforced, too, and it
might even be necessary to check that the data in attribute values and text nodes
matches a particular pattern (such as a regular expression). Finally, some XML parsers
might want to add information to the tree representation of a document as they parse
the textual representation. Parsers for the two W3C XML schema languages, DTD and
W3C XML Schema, both add default values for attributes, if the attributes were omit-
ted from the textual representation. Parsers for W3C XML Schema also add annota-
tions to attribute and text nodes detailing the data types of these nodes. Parsers for
RELAX NG check that the data type is correct, but don’t add data type information to
the tree representation. This only makes a difference if the tree representation is passed
to another XML processor that needs detailed data type information, not just an assur-
ance that the data type is correct.
This section focuses on basic structural validation, data types and standard integrity
constraints. Basic structural validation is essential since the core purpose of any XML
schema language is to support the specification of element content models. Data types
are important for database applications, so they also need to be covered. The standard
integrity constraints (not null, candidate keys and foreign keys) are all as important in XML
databases as they were in relational databases.
The XML schema language used in this section is RELAX NG. As part of an ISO standard,
RELAX NG is certainly here to stay, although currently the most widely used
XML schema language is W3C XML Schema. RELAX NG has a firm basis in mathematics
and is a clean and easy to understand language. There are few ad hoc restrictions
and, once the basic ideas have been learnt, applying them is as much a matter of common
sense as it is technical facility.
RELAX NG is based on the notion of a pattern. In the tree representation of a valid
instance document, every node will match a pattern defined in the schema document.
‘Matching a pattern’ means that the tree structure below the node (the node and all its
descendants, plus their attributes) is the same as the structure described by the pattern.
A pattern can be written out in English, as in the following example.
This is the pattern for an element node with node name ‘session’. This element
node contains:
This section uses the XML syntax because it is the syntax style of other XML schema
languages, so learning to work with this style of syntax is important.
Code listing 12.2.1 The two syntaxes of the XML schema language RELAX NG
(a) The XML syntax, used in the rest of the section
<element name="session">
  <element name="dayOfWeek">
    <text />
  </element>
  <element name="startTime">
    <text />
  </element>
  <element name="endTime">
    <text />
  </element>
</element>
● text
● data
● attribute
● element.
A schema document combines these four basic patterns to produce a complete description
of an XML document model. It does this by specifying a pattern for the attributes
and content model of every element in the document model. The XML parser compares
each element in an instance document to the appropriate element pattern in the
schema document. If they all match their patterns, then the document is valid. Otherwise,
it is invalid. This subsection explains the basic patterns and the next explains
how to combine them to create more complex patterns.
A text node, or attribute value, in the instance document must match a text pattern
or a data pattern in the schema document. Whenever the XML parser encounters a
text node, or attribute value, in the instance document, it checks that the pattern of
the parent element contains a matching text or data pattern. As text nodes don’t have
attributes or further content, the text pattern is written as an empty element:
<text />
Whenever the text pattern appears in a schema document, it means:
<dayOfWeek>Monday</dayOfWeek>
matches the pattern for the dayOfWeek element:
As RELAX NG treats the attribute and attribute value as separate nodes, attribute
values can also match the text pattern. Although this is a divergence from the XDM
(where an attribute value is represented by the value attribute of the corresponding
attribute node), it is easy to convert a RELAX NG tree representation of an instance
document into an XDM one – simply absorb every attribute value node into its parent
attribute node. The RELAX NG approach means that elements and attributes are
treated consistently, making the language simpler and easier to use than XDM.
In mixed content models, there may be more than one text node in the content of
an element, separated by other element nodes. For example, the content model for the
XHTML paragraph element allows the following paragraph elements in an instance
document:
at this point in the instance document, there is some data with this data type,
which is defined in this datatype library.
As an example, the following data pattern is matched by text nodes and attribute values
that contain a date literal, in the format yyyy-mm-dd.
<data datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
      type="date"> </data>
If the datatypeLibrary attribute is omitted, then the built-in RELAX NG data
types are used. There are two built-in data types, both of which represent character
string data. The token data type is the default data type of XML documents. A token
data value is a character string in which every sequence of white space characters
(spaces, tabs, carriage returns and linefeeds) has been replaced by a single space character
and any leading or trailing spaces have been trimmed. The other data type is
string. The data values of the string data type are character strings in which
white space is preserved.
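The difference between the two built-in data types can be sketched in a few lines of Python. This is only an illustration of the white space rules just described, not part of RELAX NG itself. The function names as_token and as_string are made up for the example.

```python
import re

# The four XML white space characters named in the text:
# space, tab, carriage return and linefeed.
XML_WHITE_SPACE = re.compile(r"[ \t\r\n]+")


def as_token(text):
    """Normalize text the way the token data type does (hypothetical helper)."""
    # Collapse every run of white space to one space, then trim both ends.
    return XML_WHITE_SPACE.sub(" ", text).strip(" ")


def as_string(text):
    """The string data type preserves white space, so nothing changes."""
    return text


raw = "  Monday \t morning\n"
print(repr(as_token(raw)))   # 'Monday morning'
print(repr(as_string(raw)))  # the input, unchanged
```

Two pieces of text are equal as token values when as_token gives the same result for both, whereas as string values they must be identical character for character.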
The data pattern can contain other patterns. These further restrict the data values
allowed to occur in matching nodes from the instance document. This lets the schema
author implement simple user-defined types. However, if there are no restrictions, the
data pattern will be empty. As usual, this empty XML element can be abbreviated to:
At this point in the instance document there is an attribute with the specified
name and its value matches this pattern.
RELAX NG recognizes that most attribute values will match the text pattern, so the pattern
above is equivalent to:
Text, data and attribute patterns are all fairly simple. So is the element pattern.
Every element in the instance document must match an element pattern in the
schema document. Each element has a name, set of attributes and some content, so an
element in the instance document matches a particular element pattern, provided:
<element name="session">
  <element name="dayOfWeek"> <text /> </element>
  <element name="dayOfWeek"> <text /> </element>
  <element name="startTime"> <text /> </element>
  <element name="endTime"> <text /> </element>
</element>
This pattern insists that there must be two dayOfWeek elements at the start of every
session element.
Including patterns multiple times is not very efficient. RELAX NG includes the concept
of a named pattern, which is simply a pattern that has been named. The name of
the pattern is used in content models instead of the pattern itself, ensuring that, if the
pattern changes, it only needs to be changed once (just like a PHP function). In this
brief introduction to XML schema languages there isn’t time to cover named patterns
in detail. See the Further reading section at the end of this chapter for books that do.
RELAX NG includes three different patterns to match the other common multiplicities.
The optional pattern matches zero or one occurrence of the pattern it contains.
Whenever the optional pattern appears in a schema document it means:
at this point in the instance document, the following pattern may appear at
most once, but may not appear at all.
A good example comes from the contacts database used by the Pennine University to
track telephone enquiries from potential students. The potential applicant provides a
name and phone number and may optionally give an e-mail and home address.
Code listing 12.2.2.1 shows a RELAX NG pattern for this situation. The first optional
pattern contains a single element pattern, specifying the email element. An instance
of the contact element may contain an email element or it may not.
The second optional pattern contains four element patterns. These four elements
are optional as a group. To be valid, an instance of the contact element either contains
all four child elements, in the specified order, or it contains none of them.
Code listing 12.2.2.1 (c) shows an invalid instance of the contact element. It is
invalid because it contains a houseNumber element, so should also contain street,
town and postcode elements. This is an important point, so is worth reiterating –
part of an instance document matches an optional pattern when it matches the whole
of the content of that pattern or when the whole of the content of the optional pattern
is missing.
Code listing 12.2.2.1 Using the optional pattern for groups of elements that
may occur zero or one time in an element instance
(a) The pattern
<element name="contact">
  <element name="name"> <text /> </element>
  <element name="phone"> <text /> </element>
  <optional>
    <element name="email"> <text /> </element>
  </optional>
  <optional>
    <element name="houseNumber"> <text /> </element>
    <element name="street"> <text /> </element>
    <element name="town"> <text /> </element>
    <element name="postcode"> <text /> </element>
  </optional>
</element>
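The all-or-nothing behaviour of the second optional pattern can be illustrated with a small hand-written check in Python. This is a sketch only: it hard-codes the structure of the contact pattern rather than reading a RELAX NG schema, it looks at element names and ignores their content, and the function name matches_contact is hypothetical.

```python
import xml.etree.ElementTree as ET

# The four elements that are optional as a group, in the required order.
ADDRESS = ["houseNumber", "street", "town", "postcode"]


def matches_contact(xml_text):
    """Return True if the child elements of contact fit the pattern."""
    contact = ET.fromstring(xml_text)
    if contact.tag != "contact":
        return False
    tags = [child.tag for child in contact]
    # name and phone are required and must come first, in this order.
    if tags[:2] != ["name", "phone"]:
        return False
    rest = tags[2:]
    # First optional pattern: zero or one email element.
    if rest[:1] == ["email"]:
        rest = rest[1:]
    # Second optional pattern: the whole address group, or nothing at all.
    return rest == [] or rest == ADDRESS
```

An instance with a houseNumber element but no street, town and postcode elements leaves a partial group behind, so the final comparison fails and the instance is rejected.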
at this point in the instance document, there is a choice between the following
patterns and the content of the current node must match one of these
patterns.
There is a problem with choice patterns, nicely illustrated by people’s names. Sometimes
names are stored as a single character string, with no distinction between first
and last names. At other times, the first and last names are stored separately.
Code listing 12.2.2.3 shows two instances of the name element from the Staff Directory
database that illustrate the two possibilities. Clearly these are two different content
models for the same element. The content model for the element in Code listing
12.2.2.3 (a) contains a text node, while the content model for the element in Code
listing 12.2.2.3 (b) contains two child element nodes and no text nodes (it is the child
elements – fName and lName – that contain text nodes).
Now, the choice pattern allows a choice between two child patterns. Either the contents
of the name element match the text pattern or they match the pattern contained
in the group pattern.
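The two alternatives can be mimicked by a short Python check. It is a sketch under stated assumptions: the structure of the choice is hard-coded, the function name matches_name is hypothetical and white space between the child elements is treated as insignificant.

```python
import xml.etree.ElementTree as ET


def matches_name(xml_text):
    """Return True if a name element fits either branch of the choice."""
    name = ET.fromstring(xml_text)
    if name.tag != "name":
        return False
    children = list(name)
    if not children:
        # First alternative: the text pattern (no child elements at all).
        return True
    # Second alternative: the group, fName followed by lName.
    if [child.tag for child in children] != ["fName", "lName"]:
        return False
    # The group allows no text of its own, only white space between
    # the child elements, and each child holds text only.
    stray = [name.text] + [child.tail for child in children]
    no_stray_text = all(s is None or s.strip() == "" for s in stray)
    return no_stray_text and all(len(child) == 0 for child in children)
```

The content must match one alternative as a whole, so a name element that mixes free text with fName and lName children matches neither branch.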
Another use for the choice pattern is to define enumerations. Remember, an enumeration
is a very simple kind of user-defined data type – the user specifies a list of
valid values. To do this, the schema document needs to allow the XML parser to
choose between literals for the allowed data values.
The obvious solution is to enclose the list of valid literals in a choice pattern. Each
literal is enclosed in a value pattern. The value pattern has a type attribute to indicate
the intended data type of the matching literal and a datatypeLibrary attribute
(see the description of the data pattern). Its content is a single literal defining the
data value. The value pattern is matched by any literal representing the same value as
the one it contains. So, for example, the string literals ‘ This string has white
space ’ and ‘This string has white space’ both match the value pattern:
As the data type of the value is token, the XML parser strips out any additional white
space characters before it compares the text in the instance document to the literal contained
by the value pattern. Whenever the value pattern appears in a schema document
it means:
at this point in the instance document, this particular data value appears.
In contrast, the data pattern indicates that some data value appears at this point in the
instance document, without stating which.
The dayOfWeek element is a good example of an element that can be specified by
an enumeration as it should only contain text that represents a day of the week. The
content of the dayOfWeek element is valid if it matches one of the allowed literals.
Combining the value and choice patterns gives an enumeration of its allowed content.
Code listing 12.2.2.4 shows a pattern for the dayOfWeek element that ensures only
dayOfWeek elements containing the name of a day of the week will be accepted by
the XML parser. Values are case-sensitive, so a dayOfWeek element with content
‘Monday’ will match the pattern, whereas one that has ‘monday’ will not.
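The effect of an enumeration built from choice and value patterns can be imitated in Python. The sketch below assumes the five weekday names used in the final schema for the Staff Directory and the token data type for every value. The name matches_day_of_week is hypothetical.

```python
import re

# Allowed literals, assumed to be the weekdays Monday to Friday.
WEEKDAYS = {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday"}


def matches_day_of_week(text):
    """Return True if text is one of the enumerated token values."""
    # token semantics: collapse runs of white space and trim the ends
    # before comparing with the allowed literals.
    token = re.sub(r"[ \t\r\n]+", " ", text).strip(" ")
    # The comparison itself is case-sensitive.
    return token in WEEKDAYS


print(matches_day_of_week(" Monday\n"))  # True
print(matches_day_of_week("monday"))     # False
```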
Unordered content models are specified with the interleave pattern. The most common
unordered content model is one in which two or more elements can appear in
any order. This is the usual situation with database applications. For example, to indicate
that the contents of the session element can appear in any order, enclose them
all in an interleave pattern:
<element name="session">
  <interleave>
    <element name="dayOfWeek"> <text /> </element>
    <element name="startTime"> <text /> </element>
    <element name="endTime"> <text /> </element>
  </interleave>
</element>
Any instance of the session element that contains exactly one of each of the
dayOfWeek, startTime and endTime elements, in any order, matches this
pattern.
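In Python terms, this simple use of interleave amounts to comparing the child element names without regard to order. The following is a sketch, with the hypothetical name matches_session, that checks element names only and ignores their content.

```python
import xml.etree.ElementTree as ET

# One of each of these must appear, in any order.
REQUIRED = ["dayOfWeek", "startTime", "endTime"]


def matches_session(xml_text):
    """Return True if session has exactly one of each required child."""
    session = ET.fromstring(xml_text)
    tags = [child.tag for child in session]
    # Sorting both lists removes the effect of document order while
    # still rejecting missing, repeated or unexpected elements.
    return session.tag == "session" and sorted(tags) == sorted(REQUIRED)
```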
The full effect of interleave is rather subtle. To understand how it works, consider
the more complex pattern shown in Code listing 12.2.2.5 (a). The idea is that the child
elements of the contact element ought to be allowed to appear in any order, except
for those that are part of the address. These elements must appear in the order
houseNumber, street, town and postcode. The group pattern ensures that this
happens.
The contents of a group pattern within an interleave pattern must appear in the
specified order. However, the other child patterns of the interleave pattern may be
interleaved in between the patterns from the group. This is what has happened in
the instance of the contact element shown in Code listing 12.2.2.5 (b). The four
elements houseNumber, street, town and postcode appear in this order, but
the other elements are interleaved between them. Notice that these other elements are
not in the same order as they appear in the pattern.
Code listing 12.2.2.5 Using the interleave pattern to allow contact details to
appear in any order
(a) The pattern
<element name="contact">
<interleave>
<element name="name"> <text /> </element>
<oneOrMore>
<element name="phone"> <text /> </element>
</oneOrMore>
<optional>
<element name="email"> <text /> </element>
</optional>
<optional>
<group>
<element name="houseNumber"> <text /> </element>
Code listing 12.2.2.6 Using the interleave pattern in mixed content models
(a) The pattern
1. <element name="session">
2. <interleave>
3. <element name="dayOfWeek"> <text /> </element>
4. <text />
5. <group>
6. <element name="startTime"> <text /> </element>
7. <element name="endTime"> <text /> </element>
8. </group>
9. </interleave>
10. </element>
(b) Three instances of the session element matching the pattern in (a)
<session>
I offer support sessions on a
<dayOfWeek>Monday</dayOfWeek>.
They start at <startTime>09:00</startTime> and end at
<endTime>10:00</endTime>. Feel free to drop in.
</session>
<session>
I offer support sessions running from
<startTime>09:00</startTime> to <endTime>10:00</endTime>
on a
<dayOfWeek>Monday</dayOfWeek>. Feel free to drop in.
</session>
<session>
The support session starts at <startTime>09:00</startTime> on
a <dayOfWeek>Monday</dayOfWeek>, and ends at
<endTime>10:00</endTime>. Feel free to drop in.
</session>
(c) An instance of the session element that does not match the pattern in (a)
<session>
The support session ends at <endTime>10:00</endTime> on a
<dayOfWeek>Monday</dayOfWeek>. Feel free to drop in. I’ll be
there from <startTime>09:00</startTime>.
</session>
Each of the instances of session shown in Code listing 12.2.2.6 (b) makes perfect
sense and there is no good reason to stop end users from using any of these variants.
The instance of the session element in Code listing 12.2.2.6 (c) does not match the
pattern as the endTime element appears before the startTime element, which is
not allowed. It seems reasonable to disallow this example as stating the time that a
support session ends before saying when it starts could easily confuse people. There
are likely to be students turning up at 10:00 to find that their tutor is no longer
available.
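The rule enforced by the group inside the interleave can be expressed as a short Python check. It is a sketch only, with the hypothetical name matches_mixed_session. Text may appear anywhere, so only the child elements are inspected.

```python
import xml.etree.ElementTree as ET


def matches_mixed_session(xml_text):
    """Return True if the session children fit the mixed content model."""
    session = ET.fromstring(xml_text)
    tags = [child.tag for child in session]
    # Exactly one of each element, in any position.
    if sorted(tags) != ["dayOfWeek", "endTime", "startTime"]:
        return False
    # The group pattern fixes the relative order of its members:
    # startTime must come before endTime, although dayOfWeek and
    # text may be interleaved between them.
    return tags.index("startTime") < tags.index("endTime")
```

Applied to shortened versions of the instances above, the check accepts a session whose dayOfWeek element sits between startTime and endTime and rejects one whose endTime comes first.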
The patterns optional, oneOrMore, zeroOrMore, interleave and group can be thought
of as operators on patterns as they take one or more patterns and operate on them to
make a new pattern. The choice pattern allows choices between different patterns.
Starting with the simple patterns – text, element, attribute, data and value – it is possible
to use the operator patterns and the choice pattern to develop a RELAX NG schema
to describe almost every possible XML document model. That ‘almost’ sounds slightly
disappointing, but isn’t really. It’s possible to determine how expressive an XML
schema language is. RELAX NG is more expressive than W3C XML Schema, which, in
turn, is more expressive than DTD. All these languages describe classes of trees and
there are an awful lot of tree structures that no one would ever want to use as document
models.
time zones. That said, the date and time data types are broadly similar to the SQL
DATE and TIME data types. The duration data type represents a length of time, but
is quite different from the SQL INTERVAL data type. A duration includes year,
month, day, hour, minute and second parts in a single value, so mixes the two incompatible
time intervals of year and month and of day, hour and minute. The consequence
is that, in some cases, two duration values are neither longer nor shorter
than each other. Handle with care.
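A small Python experiment shows why some durations cannot be ordered. It compares a duration of one month with a duration of thirty days (written P1M and P30D as W3C XML Schema literals) by adding each to a start date. The helper names are hypothetical and the dates are arbitrary. add_months assumes the day of the month exists in the target month.

```python
from datetime import date, timedelta


def add_months(start, months):
    """Add whole calendar months to a date (day must exist in the result)."""
    index = start.year * 12 + (start.month - 1) + months
    return start.replace(year=index // 12, month=index % 12 + 1)


def one_month_versus_thirty_days(start):
    """Say whether one month is shorter, longer or equal to thirty days."""
    after_month = add_months(start, 1)
    after_days = start + timedelta(days=30)
    if after_month < after_days:
        return "shorter"
    if after_month > after_days:
        return "longer"
    return "equal"


print(one_month_versus_thirty_days(date(2007, 2, 1)))  # shorter
print(one_month_versus_thirty_days(date(2007, 3, 1)))  # longer
```

As the answer depends on the start date, neither duration can be said to be the longer one in general.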
The datatypeLibrary attribute specifies the datatype library being used. Its
value is a URI, though, like the namespace name, this URI need not actually point to
anything. It is simply a globally unique identifier for the datatype library. Most of the
patterns in RELAX NG can have a datatypeLibrary attribute in their start tag, not
just the data pattern. The datatypeLibrary attribute acts rather like a default
namespace declaration – all descendants of the pattern can use the data types from this
datatype library, even if they haven’t got their own datatypeLibrary attribute. As
a simple example, the following attribute pattern is perfectly acceptable:
<attribute
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
    name="lastUpdate">
  <data type="date" />
</attribute>
The attribute pattern declares that it, and its descendants, will use the W3C XML
Schema datatype library. Its child data pattern does so, without restating this fact.
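The way a datatypeLibrary declaration is inherited can be sketched as a recursive walk over the schema tree in Python. The function name effective_libraries is hypothetical. To keep the example short, the schema fragment is written without the RELAX NG namespace, so the tag names are compared as plain strings. An empty string stands for the built-in RELAX NG data types.

```python
import xml.etree.ElementTree as ET

XSD_TYPES = "http://www.w3.org/2001/XMLSchema-datatypes"


def effective_libraries(pattern, inherited=""):
    """Yield (type name, datatype library) for every data pattern."""
    # A pattern's own datatypeLibrary attribute overrides the inherited one.
    library = pattern.get("datatypeLibrary", inherited)
    if pattern.tag == "data":
        yield pattern.get("type"), library
    for child in pattern:
        # Descendants inherit whatever applies to this pattern.
        yield from effective_libraries(child, library)


schema = ET.fromstring(
    '<attribute datatypeLibrary="%s" name="lastUpdate">'
    '<data type="date" /></attribute>' % XSD_TYPES
)
print(list(effective_libraries(schema)))
```

The data pattern has no datatypeLibrary attribute of its own, yet the walk reports it as using the library declared on its parent attribute pattern.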
Datatype library declarations in an ancestor of a pattern can easily be overridden.
Consider the definition of the staffMember element in Code listing 12.2.3.1. The
element pattern declares that it and its descendants will use the W3C XML Schema
datatype library. The element pattern contains two attribute patterns, each of which
contains a data pattern.
The first data pattern does use the W3C XML Schema datatype library, although
there is no mention of this datatype library in the data pattern itself, just the data type
name. The second attribute pattern does not use the W3C XML Schema datatype library.
Instead, it uses its own datatypeLibrary attribute to indicate that it is using the
DTD Compatibility datatype library.
One reason for allowing different datatype libraries is that they offer different features.
The DTD compatibility datatype library will check the relational-style candidate and
foreign key constraints. The W3C XML Schema datatype library allows the data values
for a data type to be further restricted using facets. A facet is simply a Boolean expression
that restricts the allowed values for a data type. RELAX NG uses the param pattern
to specify a facet. This pattern can only appear in the content of a data pattern.
For example, the minInclusive facet applies to numeric and date data types and
indicates that only values greater than or equal to the given value are permitted. The
maxInclusive facet is its twin, placing an upper bound on the allowed data values.
To restrict an integer to a value between 10,000 and 99,999 inclusive, use the following
pattern:
<data datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
      type="integer">
  <param name="minInclusive">10000</param>
  <param name="maxInclusive">99999</param>
</data>
There are a lot of facets defined by the W3C XML Schema specification (W3C, 2004e)
but they all work the same way. Use them if required, but, if not, leave them be.
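Read as Boolean expressions, the two facets amount to a simple range test. The Python sketch below imitates an integer data pattern with minInclusive and maxInclusive facets. The function name in_range is hypothetical, and the lexical check is a simplified reading of the integer data type (an optional sign followed by digits).

```python
import re

# Simplified lexical form of an integer: optional sign, then digits.
INTEGER = re.compile(r"[+-]?[0-9]+")


def in_range(text, min_inclusive=10000, max_inclusive=99999):
    """Return True if text is an integer within the inclusive bounds."""
    literal = text.strip(" \t\r\n")
    if not INTEGER.fullmatch(literal):
        return False
    # Each facet is one Boolean test on the value.
    value = int(literal)
    return min_inclusive <= value <= max_inclusive


print(in_range("31210"))   # True
print(in_range("9999"))    # False
```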
ignored. Otherwise, this is a fairly typical XML document for holding data, though
perhaps with fewer attributes than usual.
Code listing 12.2.4.1 A typical instance document for the Staff Directory
XML database
<?xml version="1.0" encoding="utf-8"?>
<staff>
<staffMember lastUpdate="2006-01-12">
<staffNo>31210</staffNo>
<name>Paul Smith</name>
<email>[email protected]</email>
<supportSessions>
<session>
<dayOfWeek>Wednesday</dayOfWeek>
<startTime>11:00</startTime>
<endTime>13:00</endTime>
</session>
</supportSessions>
</staffMember>
<staffMember lastUpdate="2005-03-07">
<staffNo>35054</staffNo>
<name>Selma Hutchins</name>
<phone>8706</phone>
<email>[email protected]</email>
<supportSessions>
<session>
<dayOfWeek>Monday</dayOfWeek>
<startTime>09:00</startTime>
<endTime>10:00</endTime>
</session>
<session>
<dayOfWeek>Monday</dayOfWeek>
<startTime>15:00</startTime>
<endTime>16:00</endTime>
</session>
<session>
<dayOfWeek>Tuesday</dayOfWeek>
<startTime>11:00</startTime>
<endTime>12:00</endTime>
</session>
</supportSessions>
</staffMember>
<staffMember lastUpdate="2005-03-07">
<staffNo>23257</staffNo>
<name>Freya Stark</name>
<phone>8660</phone>
<phone>8661</phone>
<email>[email protected]</email>
<supportSessions>
<session>
<dayOfWeek>Monday</dayOfWeek>
<startTime>15:00</startTime>
<endTime>16:00</endTime>
</session>
</supportSessions>
</staffMember>
</staff>
In terms of the tree representation, the root node is the document itself and its only
child is the staff element.
Code listing 12.2.4.2 uses this simple observation to start off the schema document.
Line 1 is an XML declaration as the schema document is itself an XML document.
Note that the encoding for this schema document is ISO-8859-1, which is not the
same encoding as the instance document. This is perfectly acceptable.
Line 2 is the start tag for an element pattern. The name attribute identifies this as
the pattern for the staff element. Thus, the staff element of any instance document
must match this pattern. The namespace declaration sets the default namespace
for the entire XML document to http://relaxng.org/ns/structure/1.0.
This is the namespace name for the RELAX NG document model, so an XML parser
knows that the mark-up in this document is specified in the RELAX NG document
model.
Line 3 is blank, indicating that the pattern for the staff element is missing its
content model.
Line 4 is the closing tag </element>, indicating the end of this element pattern.
The RELAX NG document model has its own schema document so that validating
XML parsers can check the structure of RELAX NG schemas and ensure that, for
example, every choice element in the schema document has at least one child
element and every element element in the schema document has a name attribute.
The schema document is written using the RELAX NG XML schema language.
skeleton for the final schema document. All the remaining patterns will appear between
the start and end tags of this RELAX NG element pattern.
Having identified that the root node has the staff element as its single child node
and written down an element pattern for this child node, it is time to move down one
level of the tree and consider the content model for the staff element. To do this,
first identify the child nodes of the staff element, but, at this stage, don’t bother
about their content models.
Looking at the instance document in Code listing 12.2.4.1, it is clear that there is
only one kind of child node – the staffMember element. However, there are several
of them. The business rule is that the staff element must contain one or more
staffMember elements and Code listing 12.2.4.3 shows the pattern for that content
model. Line 3 now begins the RELAX NG oneOrMore pattern. This pattern contains a
single element pattern, for the staffMember element. The blank line, line 5, indicates
that the content model for the staffMember element has yet to be defined.
Code listing 12.2.4.3 The staff element contains one or more elements
called staffMember and that’s all
1. <?xml version="1.0" encoding="ISO-8859-1"?>
2. <element xmlns="http://relaxng.org/ns/structure/1.0"
name="staff">
3. <oneOrMore>
4. <element name="staffMember">
5.
6. </element>
7. </oneOrMore>
8. </element>
Once again, there is a single kind of child element, though it could occur many times
in the instance document, so the content model for the staff element is fully specified
by the pattern in Code listing 12.2.4.3.
Time to move down a level and consider the content model for the staffMember
element. Again, the instance document in Code listing 12.2.4.1 provides clues. Every
staffMember element has an attribute called lastUpdate.
Code listing 12.2.4.4 builds on Code listing 12.2.4.3. In RELAX NG, attributes are
indicated by an attribute pattern. Line 5 has the start tag for this attribute pattern,
with the name attribute in this start tag indicating that the attribute pattern is
matched by an attribute called lastUpdate. Line 6 is left blank to indicate that the
pattern for the value of the lastUpdate attribute hasn’t been specified yet and line 7
is the closing tag for the attribute pattern.
Every staffMember element also contains a staffNo element. Clearly, there
will only ever be one staffNo element for each member of staff, so a single element
pattern is needed in the content model for staffMember, as shown in lines 8–10.
The same is true of the name and email elements and, again, these also have a single
element pattern in the content model for staffMember.
The phone element is trickier. In the instance document shown in Code listing
12.2.4.1 the first staffMember element doesn’t contain a phone element, but the
third has two. This suggests a multiplicity of ‘zero or more’. Lines 14–18 in Code
listing 12.2.4.4 place the zeroOrMore pattern around a single element pattern to
capture this part of the content model for staffMember.
The only other child of the staffMember element is the supportSessions
element. Every staffMember element in Code listing 12.2.4.1 has a supportSessions
element. However, the Staff Directory also includes contact details for non-academic
staff. Such staff never offer support sessions, so their staffMember
elements won’t contain a supportSessions element. This means that the
supportSessions element is optional. The pattern in lines 22–26 in Code listing
12.2.4.4 indicates this.
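The multiplicities identified so far for the children of staffMember can be summarized as a table of minimum and maximum occurrences and checked with a few lines of Python. This is a sketch: it counts child elements but ignores their order and content, and the names MULTIPLICITY and multiplicity_errors are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# (minimum, maximum) occurrences; None means no upper limit.
MULTIPLICITY = {
    "staffNo": (1, 1),
    "name": (1, 1),
    "phone": (0, None),         # zeroOrMore
    "email": (1, 1),
    "supportSessions": (0, 1),  # optional
}


def multiplicity_errors(xml_text):
    """Return a list of multiplicity problems in a staffMember element."""
    member = ET.fromstring(xml_text)
    counts = Counter(child.tag for child in member)
    errors = []
    for tag in counts:
        if tag not in MULTIPLICITY:
            errors.append("unexpected element: " + tag)
    for tag, (low, high) in MULTIPLICITY.items():
        n = counts[tag]  # a Counter gives 0 for absent elements
        if n < low or (high is not None and n > high):
            errors.append("wrong number of %s elements: %d" % (tag, n))
    return errors
```

An empty list means every child element occurs an acceptable number of times. A staffMember with no phone element, or with several, passes, as does one with no supportSessions element.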
The simplest way to deal with these missing content models is to start at the top of
Code listing 12.2.4.4 and work downwards. The first blank line, line 6, indicates that
the content model for the lastUpdate attribute is missing. The value of the
lastUpdate attribute is always a date, so use a data pattern with the W3C XML
Schema datatype library to define the attribute’s value (Code listing 12.2.4.5).
Code listing 12.2.4.5 The content model for the lastUpdate attribute
5. <attribute name="lastUpdate">
6. <data datatypeLibrary=
"http://www.w3.org/2001/XMLSchema-datatypes"
type="date" />
7. </attribute>
Line 9 in Code listing 12.2.4.4 is also blank, indicating that the content model for the
staffNo element needs to be defined. At the Pennine University, the staff number is
always a five-digit number and there is often a need to list members of staff in staff
number order. For this reason, its value should be numeric.
One way to define a five-digit numeric value is to say that it is an integer between
10,000 and 99,999. Code listing 12.2.4.6 shows the relevant data pattern, with its
param children already written out. It is always best to deal with data patterns in one
go – identify that a data pattern is required and define its content. This is because its
child elements are always leaf nodes, so things don’t get too complicated. (The line
numbers have been adjusted to allow for the extra lines.)
Code listing 12.2.4.6 Restricting data values using the param pattern
8. <element name="staffNo">
9. <data datatypeLibrary=
"http://www.w3.org/2001/XMLSchema-datatypes"
type="integer">
10. <param name="minInclusive">10000</param>
11. <param name="maxInclusive">99999</param>
12. </data>
13. </element>
The next blank line occurs in line 12 in Code listing 12.2.4.4, indicating that the content
model for the name element needs to be defined.
The simplest approach is to use the text pattern as it matches all the sample data in
the instance document in Code listing 12.2.4.1. However, as discussed above, the
name element might be split into fName and lName elements.
Code listing 12.2.4.7 uses the choice pattern to allow either of these two content
models in lines 15–25. Note that the content models for the child elements fName
and lName have been left blank (lines 19 and 22).
The content models for the phone and email elements in lines 16 and 20 in Code
listing 12.2.4.4 are simply the text pattern. Both could be specified in more detail as
RELAX NG allows a data pattern to be restricted using regular expressions. However,
plain old text will do fine for this example. Code listing 12.2.4.7 shows the complete
content models for the phone and email elements.
Code listing 12.2.4.7 Using the choice pattern to allow different content
models for the name element
14. <element name="name">
15. <choice>
16. <text />
17. <group>
18. <element name="fName">
19.
20. </element>
21. <element name="lName">
22.
23. </element>
24. </group>
25. </choice>
26. </element>
27. <zeroOrMore>
28. <element name="phone">
29. <text />
30. </element>
31. </zeroOrMore>
32. <element name="email">
33. <text />
34. </element>
Returning to Code listing 12.2.4.4, the next, and final, blank line occurs in line 24.
This is the content model for the supportSessions element. This element must
always contain at least one session element and may contain several. The session
elements hold details of individual support sessions. Although some staff members do
not offer support sessions, this is captured by making the supportSessions
element optional. When the supportSessions element exists, however, it must
contain at least one session element.
Code listing 12.2.4.8 shows how to use the oneOrMore pattern to achieve this.
the next level down (the blank lines). Code listings 12.2.4.7 and 12.2.4.8 include all
the missing content models.
The first is the content model for the fName element (line 19 in Code listing
12.2.4.7). This element will only ever contain text, as will the lName element
(line 22).
This just leaves the content model for the session element (line 39 in Code listing
12.2.4.8). The session element contains exactly one of each of the dayOfWeek,
startTime and endTime elements. The startTime and endTime elements must
appear in this order, but that is the only restriction on the content model. Code listing
12.2.4.9 shows the RELAX NG pattern for this content model.
Code listing 12.2.4.10 The final schema document for the Staff Directory
1. <?xml version="1.0" encoding="ISO-8859-1"?>
2. <element xmlns="http://relaxng.org/ns/structure/1.0"
name="staff">
3. <oneOrMore>
4. <element name="staffMember">
5. <attribute name="lastUpdate">
6. <data datatypeLibrary=
"http://www.w3.org/2001/XMLSchema-datatypes"
type="date" />
7. </attribute>
8. <element name="staffNo">
9. <data datatypeLibrary=
"http://www.w3.org/2001/XMLSchema-datatypes"
type="integer">
10. <param name="minInclusive">10000</param>
11. <param name="maxInclusive">99999</param>
12. </data>
13. </element>
14. <element name="name">
15. <choice>
16. <text />
17. <group>
18. <element name="fName"><text /></element>
19. <element name="lName"><text /></element>
20. </group>
21. </choice>
22. </element>
23. <zeroOrMore>
24. <element name="phone"><text /></element>
25. </zeroOrMore>
26. <element name="email"><text /></element>
27. <optional>
28. <element name="supportSessions">
29. <oneOrMore>
30. <element name="session">
31. <interleave>
32. <element name="dayOfWeek">
33. <choice>
34. <value type="token">Monday</value>
35. <value type="token">
Tuesday</value>
36. <value type="token">
Wednesday</value>
37. <value type="token">
Thursday</value>
38. <value type="token">Friday</value>
39. </choice>
40. </element>
41. <group>
42. <element name="startTime">
as part of the DTD schema language. The approach to defining candidate and foreign
keys is quite different from the approach taken in the relational data model. First, only
attributes can be defined as candidate or foreign keys. Second, this is done by assigning
the attribute a special data type, rather than by declaring an integrity constraint.
Any attribute with the data type ID from the DTD compatibility datatype library is
recognized as representing a candidate key. The value of such an ID attribute must be
a character string that meets the definition for an XML name token, with the added
restriction that it cannot contain the colon character.
Be careful here. The value of an ID attribute is not actually a name; it is a value.
This value is a character string that just happens to be restricted to the same set of
characters as that used for XML element and attribute names (among other things).
The XML specifications use the same description of this set of allowed characters for
both name tokens and values. When a document instance is validated, the DTD
compatibility datatype library checks that the data type of the value of any ID attribute
is correct. It also checks that the value of each occurrence of an ID attribute is unique
within the instance document. This includes ID attributes in different elements. Every
attribute node in the instance document that has a data type of ID must have a value
that is unique within the instance document. This makes the ID attribute a kind of
global surrogate key. A good analogy for the ID attribute is the sequence object used by
the Oracle DBMS to generate values that are unique within a database. Of course, the
DTD compatibility datatype library does not actually generate the values for an ID
attribute, but it does check that they are unique.
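The document-wide nature of this uniqueness check can be sketched in a few lines of Python. This is only an illustration of the rule, not how a RELAX NG validator is implemented. The function name duplicate_ids, the attribute names key and code, and the sample document are all hypothetical, and the caller has to say which attributes the schema gives the data type ID.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text, id_attribute_names):
    """Return, in sorted order, the ID values used more than once.

    id_attribute_names is supplied by the caller and stands in for the
    attributes that the schema declares with the data type ID.
    """
    root = ET.fromstring(xml_text)
    counts = Counter(
        value
        for element in root.iter()  # every element in the document
        for name, value in element.attrib.items()
        if name in id_attribute_names
    )
    return sorted(value for value, count in counts.items() if count > 1)


# Hypothetical instance document: element and attribute names are invented.
SAMPLE = """<directory>
  <person key="a1"/>
  <person key="a2"/>
  <room code="a1"/>
</directory>"""

print(duplicate_ids(SAMPLE, {"key", "code"}))
```

The value a1 is reported even though it is used by two different attributes on two different elements, which is exactly the 'global' behaviour described above.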
As usual, there is an example in the Staff Directory as the staff number must be
unique. So far, the staff number has been modelled as a staffNo element, a child
element of the staffMember element. In RELAX NG there is currently no way to
check that the value of an element is unique within the instance document (the W3C
XML Schema language does now support this feature). So, although the content of the
staffNo element is a candidate key, there is no way to enforce the candidate key
constraint.
To enforce a candidate key constraint on staff numbers, then, they must be modelled
as an attribute with the data type ID. Code listing 12.2.5.1 shows how to do this.
Code listing 12.2.5.1 (a) shows the first few lines of the schema document and
(b) shows a typical instance document. Note that three changes have been made to the
schema document since Code listing 12.2.4.10, lines 4–13. First, the staffNo is now
defined by an attribute pattern, not an element pattern. Second, its data type has
changed from the W3C XML Schema data type integer to the DTD compatibility
data type ID (obviously). Third, the data values are no longer constrained by param
patterns. This is because the DTD compatibility datatype library does not support
facets. With this new schema document, the XML parser will check the value of all the
staffNo attributes in an instance document and report an error if any two have the
same value.
There is also a subtle change to the instance document. Now that staffNo is an
attribute with the ID data type, its value must be a valid XML name token. Name
tokens cannot start with a digit, so the actual staff number values can’t be used.
Instead, the staff number needs to be prefixed with a letter or underscore. Code listing
12.2.5.1 (b) therefore prefixes the actual staff number with the letter ‘s’. This is one of
the main problems with using attributes to hold candidate key values.
There are two further problems with the schema in Code listing 12.2.5.1 (a). The first
is a conceptual problem. XML attributes are intended to model meta-data, which is
data about the element itself rather than data about the thing the element describes.
The staff number is clearly not meta-data – it is data about a member of staff, so
should be part of the content of the staffMember element, not part of its meta-data.
The second problem is a practical one. A staff number should be a five-digit number.
The only way to enforce this constraint is to use the W3C XML Schema datatype
library. Unfortunately, it is also a candidate key and the only way to enforce this
constraint is to use the DTD compatibility datatype library. As each attribute value or
text node can only have one data type and that data type can come from only one
datatype library, the schema author must choose to enforce either the candidate key
constraint or the constraint that data values are five-digit integers.
This second problem can be partially solved by using a surrogate key (see Chapter 10).
Code listing 12.2.5.2 shows the first few lines of an amended schema document.
The staffMember element has an attribute called surrogateKey with the
data type ID. The values of this attribute are guaranteed to be unique, but might not
be five-digit integers.
It also has a child element called staffNo. This element contains a five-digit integer,
but the values might not be unique. The final twist is to insist that the value of the
surrogateKey attribute (minus its leading ‘s’) must be the same as the content of
the staffNo element. This will ensure that both constraints apply to the attribute’s
value and the element’s content.
Of course, the XML parser won’t do this. An additional XML processor needs to be
written to do this final check. However, this additional XML processor will be quite
simple. All it needs to do is check that, in each staffMember element, the value of
the surrogateKey attribute is the same as the content of the staffNo element. It
does not need to check that the surrogateKey attribute value is unique within the
document, nor that the content of the staffNo element is a five-digit integer as the
XML parser (actually, the datatype library) has already done both these checks.
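A minimal sketch of such an additional processor is shown below, written in Python with the standard library. It assumes, as in the discussion above, that the surrogate key is the staff number prefixed with the letter ‘s’. The function name mismatched_keys and the cut-down sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def mismatched_keys(xml_text):
    """Return (surrogateKey, staffNo) pairs that fail the final check."""
    problems = []
    for member in ET.fromstring(xml_text).iter("staffMember"):
        key = member.get("surrogateKey", "")
        staff_no = (member.findtext("staffNo") or "").strip()
        # The key must be the letter 's' followed by the staff number.
        if key != "s" + staff_no:
            problems.append((key, staff_no))
    return problems


# Hypothetical instance document, reduced to the parts being checked.
SAMPLE = """<staff>
  <staffMember surrogateKey="s10780"><staffNo>10780</staffNo></staffMember>
  <staffMember surrogateKey="s31210"><staffNo>31211</staffNo></staffMember>
</staff>"""

print(mismatched_keys(SAMPLE))
```

The sketch deliberately does nothing else. Uniqueness of the key and the five-digit range of the staff number are left to the schema validation that has already taken place.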
Code listing 12.2.5.2 Using a surrogate key for the staff element
<?xml version="1.0" encoding="utf-8"?>
<element xmlns="http://relaxng.org/ns/structure/1.0"
         name="staff">
  <oneOrMore>
    <element name="staffMember">
      <attribute name="lastUpdate">
        <data datatypeLibrary=
              "http://www.w3.org/2001/XMLSchema-datatypes"
              type="date" />
      </attribute>
      <attribute name="surrogateKey">
        <data datatypeLibrary=
              "http://relaxng.org/ns/compatibility/datatypes/1.0"
              type="ID" />
      </attribute>
      <element name="staffNo">
        <data datatypeLibrary=
              "http://www.w3.org/2001/XMLSchema-datatypes"
              type="integer">
          <param name="minInclusive">10000</param>
          <param name="maxInclusive">99999</param>
        </data>
      </element>
Attributes with the data type ID play the role of candidate keys. Attributes with the
data type IDREF play the role of foreign keys. The value of an IDREF attribute must
be an XML name token and must also match the value of some ID attribute in the
same instance document.
Code listing 12.2.5.3 shows a typical example of this mechanism in action. Code
listing 12.2.5.3 (a) is a RELAX NG schema document for the Staff Directory application.
The schema has been simplified to make it easier to see the foreign key mechanism
in action. As before, the root element is called staff and this will contain one
or more staffMember elements. Each staffMember element has a lastUpdate
attribute and a staffNo attribute. The staffNo attribute has the data type ID, so it
is a candidate key. The staffMember element contains name, phone and email
elements, as before. However, it does not contain a supportSessions element.
Instead, the staff element may also contain zero or more session elements. The
session element has a staffNo attribute with the data type IDREF, so this attribute
is a foreign key. The session element contains dayOfWeek, startTime and endTime
elements, as before.
This schema models the relationship between members of staff and the support sessions
they offer in quite a different way from the earlier ones. Here, elements representing
staff and support sessions appear as child elements of the root element. In an
instance document, all the staffMember elements come first, followed by all the
session elements. The value of the staffNo attribute of each session element
will match the value of exactly one staffNo attribute of some staffMember element.
Code listing 12.2.5.3 (b) shows a simple document instance, with two members
of staff, each offering one support session.
      <attribute name="staffNo">
        <data datatypeLibrary=
              "http://relaxng.org/ns/compatibility/datatypes/1.0"
              type="ID" />
      </attribute>
      <element name="name"> <text /> </element>
      <zeroOrMore>
        <element name="phone"> <text /> </element>
      </zeroOrMore>
      <element name="email"> <text /> </element>
    </element>
  </oneOrMore>
  <zeroOrMore>
    <element name="session">
      <attribute name="staffNo">
        <data datatypeLibrary=
              "http://relaxng.org/ns/compatibility/datatypes/1.0"
              type="IDREF" />
      </attribute>
      <element name="dayOfWeek"> <text /> </element>
      <element name="startTime"> <text /> </element>
      <element name="endTime"> <text /> </element>
    </element>
  </zeroOrMore>
</element>
<endTime>16:00</endTime>
</session>
</staff>
There is no need for the element with the ID attribute and that with the IDREF
attribute to appear at the same level in the document hierarchy. In fact, ID and IDREF
attributes can be used to link an element with one of its descendants.
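The foreign key check that the ID and IDREF data types provide can be imitated in Python as a rough sketch. It assumes the simplified document structure described above, in which staffMember elements carry a staffNo attribute of type ID and session elements carry a staffNo attribute of type IDREF. The function name dangling_references and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def dangling_references(xml_text):
    """Return the IDREF values that match no ID value in the document."""
    root = ET.fromstring(xml_text)
    # Candidate key values: the staffNo attribute of every staffMember.
    ids = {member.get("staffNo") for member in root.iter("staffMember")}
    # Foreign key values: the staffNo attribute of every session.
    return [
        session.get("staffNo")
        for session in root.iter("session")
        if session.get("staffNo") not in ids
    ]


# Hypothetical instance document; child elements are omitted for brevity.
SAMPLE = """<staff>
  <staffMember staffNo="s10780"/>
  <staffMember staffNo="s31210"/>
  <session staffNo="s10780"/>
  <session staffNo="s99999"/>
</staff>"""

print(dangling_references(SAMPLE))
```

Because root.iter() visits elements at every depth, the sketch does not care whether the referencing and referenced elements sit at the same level of the hierarchy.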
There is a further twist to the ID and IDREF mechanism for implementing foreign
key constraints. There is also an IDREFS data type. An attribute that has the data
type IDREFS holds a list of space-separated XML name tokens, each of which matches
the value of some ID attribute in the same instance document. The IDREFS data
type offers a way to solve the problem raised at the start of this subsection. Making the
staffNo attribute of the session element an IDREFS means that it can list all
members of staff who attend this support session. The IDREFS data type provides a
mechanism for implementing a foreign key that references multiple candidate keys,
directly implementing a many-to-many relationship. This goes beyond the capability
of foreign keys in the relational data model. Link tables achieve a similar effect in a
relational database, but require two foreign keys, not one.
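As a small illustration, the check on an IDREFS value amounts to splitting the attribute value on white space and testing each token against the set of known ID values. The Python function below is a sketch under that assumption. Its name, unmatched_idrefs, and the sample values are hypothetical.

```python
def unmatched_idrefs(idrefs_value, known_ids):
    """Return the tokens of an IDREFS value that match no known ID."""
    # str.split() with no argument splits on any run of white space.
    return [token for token in idrefs_value.split() if token not in known_ids]


# Hypothetical ID values taken from the staffNo attributes of staffMember elements.
known = {"s10780", "s31210"}

print(unmatched_idrefs("s10780 s31210", known))
print(unmatched_idrefs("s10780 s55555", known))
```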
The ‘or attribute’ in the above description is necessary because, in XML, an attribute
isn’t a child of its parent. Apologies if this is getting tedious, but the distinction is
important.
Figure 12.1 shows the tree representation of an instance document for the Staff
Directory XML database. The number labelling each node indicates its position in
the document order. The instance document holds contact details for two members
of staff – Paul Smith and Freya Stark. The dashed lines indicate four different
paths through this tree. All four paths start at the root node, though, in general,
paths can start from any node in the instance document. All four paths always move
down the tree structure, though, again, in general, paths can move back up the tree,
too.
The leftmost path in Figure 12.1 passes through the nodes 1, 2, 3 and 5, in that order.
This means that the path can be described by a sequence of nodes, written (1, 2,
3, 5). The other paths can also be described by the sequences of nodes (1, 2, 3,
7), (1, 2, 11, 13) and (1, 2, 11, 15).
Figure 12.1 The tree representation of an instance document for the Staff Directory XML
database with four paths indicated.
[Tree diagram not reproduced. Recoverable node labels: 1 Document; 2 Element,
node-name="staff"; 3 and 11 Element, node-name="staffMember".]
All four paths can be described by the same set of instructions, shown in Code
listing 12.3.1. These instructions do not define a unique path because, for example,
following the instruction in Step 3 leads to two nodes – the two staffMember
element nodes numbered 3 and 11. The nodes each step leads to form the starting
points for the next step. As Step 3 leads to nodes 3 and 11, nodes 3 and 11 form the
starting points for Step 4. As each step may have a number of starting points, it makes
sense to consider them in document order. So, in Step 4, first follow the instructions
with node 3 as the starting point. This leads to the two nodes 5 and 7. Next, follow the
instructions again with node 11 as the starting point. This leads to the two nodes 13
and 15. This means that Step 4 leads to four nodes – the staffNo and name child
elements of node 3, and the staffNo and name child elements of node 11.
As sequences of nodes describe both the starting point for a step and where it leads to,
they provide a concise description of what happens when the instructions in Code
listing 12.3.1 are followed, as follows.
Step 1: Starts with the whole document and leads to the sequence of nodes (1).
Step 2: Starts from the sequence of nodes (1) and leads to the sequence of nodes
(2).
Step 3: Starts from the sequence of nodes (2) and leads to the sequence of nodes
(3, 11).
Step 4: Starts from the sequence of nodes (3, 11) and leads to the sequence of
nodes (5, 7, 13, 15).
Clearly, sequences of nodes are a good way to describe both individual paths through
a tree and the result of following the instructions that describe a set of similar paths.
That is why the formal description of the XPath 2.0 and XQuery 1.0 data model
(XDM) uses sequences of nodes as its basic organization structure.
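The four steps listed above can be followed mechanically, carrying a list of nodes from one step to the next. The Python sketch below does this with the standard library. It is an approximation: ElementTree has no separate document node and does not number text nodes, so element names are shown instead of the node numbers from Figure 12.1. The second staff number and both e-mail addresses are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document modelled on Figure 12.1.
DOC = """<staff>
  <staffMember>
    <staffNo>10780</staffNo><name>Paul Smith</name><email>p.smith@example.org</email>
  </staffMember>
  <staffMember>
    <staffNo>31210</staffNo><name>Freya Stark</name><email>f.stark@example.org</email>
  </staffMember>
</staff>"""


def child_elements(context):
    """Follow the child axis from every node in the context, in order."""
    return [child for node in context for child in node]


# Steps 1 and 2: the only child element of the document is <staff>.
step2 = [ET.fromstring(DOC)]
# Step 3: the child elements of <staff>.
step3 = child_elements(step2)
# Step 4: the child elements called staffNo or name.
step4 = [node for node in child_elements(step3) if node.tag in ("staffNo", "name")]

print([node.tag for node in step3])
print([node.text for node in step4])
```

Each list plays the role of the sequence of nodes that a step leads to, and is then used as the starting point for the next step.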
The XPath language is basically a computer language for describing instructions
like the ones above. In XPath terminology, each step is called a location step and a
sequence of location steps is called a location path. The sequence of nodes forming the
starting point for a particular location step is the context for that step. The context
size is the number of nodes in the context. Each node in the context is considered in
turn. At any stage in the evaluation of a location step, the node being considered is
called the context node, and the context position is its position in the current context.
Each location step equates to a sequence of nodes. The location path also
equates to a sequence of nodes – that produced by following each of the location
steps in turn.
This means that a location path does not actually equate to a path through the tree.
Instead, it describes a set of paths through the tree (just as the instructions above do).
The sequence of nodes a location path equates to is the sequence of nodes at the end
of all the paths it describes.
Before introducing the syntax for XPath, it is worth looking in a little more detail at
the concept of a location step. Location steps can be broken up into three distinct
parts. Some more examples based on the instance document shown in Figure 12.1 will
help illustrate these three parts.
First, consider the location path described by the instructions in Code listing 12.3.2.
Code listing 12.3.2 A second list of plain English instructions for navigating a
document tree
Step 1: Move to the root node.
Step 2: From the current context, move to the descendant element nodes called
staffMember.
Step 3: From the current context, move to the child element nodes that are
called either staffNo or name.
Step 1: Starts with the whole document and sets the context to the sequence of
nodes (1).
Step 2: Starts from the context (1) and sets the context to the sequence of nodes
(3, 11).
Step 3: Starts from the context (3, 11) and sets the context to the sequence of
nodes (5, 7, 13, 15).
This means that the location path in Code listing 12.3.2 leads to the same sequence of
nodes as that in Code listing 12.3.1, but with fewer location steps.
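The equivalence of the two routes can be checked with a short Python sketch. It assumes a hypothetical instance document modelled on Figure 12.1 (the second staff number and the e-mail addresses are invented). In ElementTree, iterating over an element follows the child axis, while iter() on the root element visits the root and all its descendants, which matches the descendants of the document node.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document modelled on Figure 12.1.
DOC = """<staff>
  <staffMember>
    <staffNo>10780</staffNo><name>Paul Smith</name><email>p.smith@example.org</email>
  </staffMember>
  <staffMember>
    <staffNo>31210</staffNo><name>Freya Stark</name><email>f.stark@example.org</email>
  </staffMember>
</staff>"""

WANTED = ("staffNo", "name")
root = ET.fromstring(DOC)

# Route 1: a child step from <staff>, then a child step from each result.
by_child_steps = [c for member in root for c in member if c.tag in WANTED]

# Route 2: one descendant step to every staffMember, then a child step.
by_descendant_step = [
    c for member in root.iter("staffMember") for c in member if c.tag in WANTED
]

print(by_child_steps == by_descendant_step)
print([c.tag for c in by_descendant_step])
```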
The first difference between the two is that in Step 2 in Code listing 12.3.2, the
instruction says ‘move to the descendant’, whereas, in Code listing 12.3.1, it says ‘move
to the child’. This illustrates the first way to refine the notion of a location step. Each
location step begins at a given context and moves to a new context. The initial
move could be from the context nodes to all their child nodes or to all their
descendant nodes or some other sequence of nodes defined relative to the context.
In XPath, this initial sequence of nodes for a location step is called the axis. The
axis is always defined relative to the context. XPath defines 12 different axes.
● The child axis consists of all child nodes of the context node.
● The parent axis consists of the parent node of the context node.
● The self axis consists of the context node.
● The descendant axis consists of all descendant nodes of the context node.
● The ancestor axis consists of all ancestor nodes of the context node.
● The descendant or self axis consists of the context node plus all descendant nodes
of the context node.
● The ancestor or self axis consists of the context node plus all ancestor nodes of the
context node.
● The attribute axis consists of the attribute nodes of the context node.
● The following sibling axis consists of all sibling nodes of the context node that
occur after it in document order. An attribute node has no siblings.
● The preceding sibling axis consists of all sibling nodes of the context node that
occur before it in document order. An attribute node has no siblings.
● The following axis consists of all nodes that occur after the context node in
document order, other than its descendants.
● The preceding axis consists of all nodes that occur before the context node in
document order, other than its ancestors.
Attribute nodes only ever appear in the attribute axis or in the self axis when the
context node is an attribute node. The two sibling axes are always empty when the context
node is an attribute. The parent and ancestor axes only ever contain element nodes
or the document node – only these kinds of node can contain other nodes, so only
they can be parents. The ancestor or self axis contains the same nodes plus the context
node itself. The other axes can contain element nodes, text nodes, namespace nodes,
comment nodes and processing instruction nodes.
An XPath axis is either a forward axis or a reverse axis. The difference between
them is that, for a forward axis, the nodes appear in document order, while for a
reverse axis they appear in reverse document order. This is important as it determines
the order of nodes in the node sequence for this step. The reverse axes are parent,
ancestor, ancestor or self, preceding and preceding sibling. The others are all forward
axes. Note that the order of attribute nodes in the attribute axis is implementation-
defined, but will always be the same.
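The difference between forward and reverse axes is easy to see in code. The Python sketch below builds three of the axes for element nodes only. It is a simplification: ElementTree does not expose text nodes or the document node as separate objects, so they are missing from the results. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document with a single member of staff.
DOC = (
    "<staff><staffMember>"
    "<staffNo>10780</staffNo><name>Paul Smith</name><email>p.smith@example.org</email>"
    "</staffMember></staff>"
)
root = ET.fromstring(DOC)

# ElementTree elements do not know their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestor_axis(node):
    """Reverse axis: nearest ancestor first, i.e. reverse document order."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result


def following_sibling_axis(node):
    """Forward axis: later siblings in document order."""
    siblings = list(parent_of[node])
    return siblings[siblings.index(node) + 1:]


def preceding_sibling_axis(node):
    """Reverse axis: earlier siblings, nearest first."""
    siblings = list(parent_of[node])
    return siblings[:siblings.index(node)][::-1]


email = root.find("staffMember/email")
print([n.tag for n in ancestor_axis(email)])
print([n.tag for n in preceding_sibling_axis(email)])
```

For the email element, the preceding sibling axis lists name before staffNo, the reverse of their order in the document.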
Returning to the two examples, Step 2 in Code listing 12.3.1 works as follows.
This also shows the second way to refine a location step – specify which nodes on the
chosen axis should be selected from the initial sequence of nodes identified by the
axis. In XPath, this is called a node test.
XPath includes two sorts of node test. The first, called a name test, selects nodes by
name. Code listing 12.3.2 uses a name test. Name tests only work for element and
attribute nodes as they are the only nodes that have names.
The second, called a kind test, selects a particular kind of node. The simplest
kind test selects nodes based on the different kinds of nodes allowed in the XDM – text
nodes, element nodes or attribute nodes (or comment, processing instruction
or namespace nodes). Code listing 12.3.1 uses a simple kind test to select just the
element nodes. A more complex sort of kind test selects nodes based on which kind
of XDM node they are and what their data type is.
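The two sorts of node test can be imitated with the DOM API in the Python standard library, which, unlike ElementTree, represents text and comment nodes as objects in their own right. This is a sketch only. The functions kind_test and name_test, and the comment in the sample document, are hypothetical.

```python
from xml.dom import Node, minidom

# Hypothetical fragment; the comment is included to give a third kind of node.
DOC = (
    "<staffMember><staffNo>10780</staffNo>"
    "<!-- hypothetical comment --><name>Paul Smith</name></staffMember>"
)
member = minidom.parseString(DOC).documentElement
children = list(member.childNodes)  # the child axis: nodes of every kind


def kind_test(nodes, node_type):
    """Keep only the nodes of one kind, e.g. Node.ELEMENT_NODE."""
    return [n for n in nodes if n.nodeType == node_type]


def name_test(nodes, name):
    """Keep only the element nodes with the given name."""
    return [n for n in nodes if n.nodeType == Node.ELEMENT_NODE and n.nodeName == name]


print([n.nodeName for n in kind_test(children, Node.ELEMENT_NODE)])
print(len(kind_test(children, Node.COMMENT_NODE)))
print([n.nodeName for n in name_test(children, "name")])
```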
Step 3 in Code listing 12.3.2 (and Step 4 in Code listing 12.3.1) starts with the
context (3, 11) and uses the child axis. The children of node 3 are nodes 5, 7 and 9. The
children of node 11 are nodes 13, 15 and 17. This gives an initial sequence of nodes
(5, 7, 9, 13, 15, 17). Step 3 then has the further information that only element
nodes called ‘staffNo’ or ‘name’ should be included. This looks like it should be
a name test. Unfortunately, a name test can only include one name. In fact, it is a
general rule that a location step can only have one node test. Using and and or to
produce Boolean combinations of node tests simply isn’t allowed. That is why a location
step can have a predicate. A predicate is a Boolean combination of XPath expressions
(of location paths or location steps). This is a little odd, because XPath expressions are
node sequences, not Boolean values. It is possible because XPath coerces each node
sequence into a Boolean value as follows:
● the empty sequence is coerced into being the Boolean value false
● a sequence consisting of a single text node or attribute with Boolean data type is
coerced into being the Boolean value the node represents
● a sequence consisting of a single text node or attribute with a string data type is
coerced into being the Boolean value true for non-empty strings and false for
empty strings
● a sequence consisting of a single text node or attribute with a numeric data type is
coerced into being the Boolean value true for non-zero numbers and false for
zero or the special value NaN (which stands for ‘not a number’)
● any other non-empty sequence is coerced into being the Boolean value true.
This coercion applies to both the results of XPath expressions making up the predicate
and the result of the predicate itself, if this is a node sequence.
Returning to Step 3 in Code listing 12.3.2, the only way to capture the final
requirement – that only element nodes called staffNo or name should be included
in the node sequence for this step – is to use a predicate. Consider the location path
‘From the current context, use the self axis and move to the node called staffNo’.
This location path has a single location step, so can be written using just an axis and
a node test. For a particular context node, the location path works as follows.
1 Start from the context node.
2 Use the self axis, giving an initial node sequence consisting of just the context node
itself.
3 Select only the nodes called staffNo. If the context node is called staffNo, this
leads to the node sequence consisting of just the context node itself; otherwise it
leads to the empty sequence.
4 Finally, set the context to this node sequence.
Similarly, the location path ‘From the current context, use the self axis and move to
the node called name’ evaluates to a sequence consisting of the context node if the
context node is called name, but to the empty sequence in all other circumstances.
The predicate for Step 3 in Code listing 12.3.2 combines these two location paths with
a Boolean or operator:
(From the current context, use the self axis and move to the element node called
staffNo) or (From the current context, use the self axis and move to the ele-
ment node called name)
Step 3 in Code listing 12.3.2 works as follows.
1 Start from the sequence of nodes (3, 11).
2 Use the child axis, giving an initial node sequence (5, 7, 9, 13, 15, 17).
3 Apply the node test to select all the element nodes from the node sequence (5, 7,
9, 13, 15, 17). This gives the node sequence (5, 7, 9, 13, 15, 17).
4 For each node in the sequence (5, 7, 9, 13, 15, 17) apply the predicate
described above. For nodes 5, 7, 13 and 15, the predicate evaluates to true. For nodes
9 and 17, both called email, the predicate evaluates to false. This gives the node
sequence (5, 7, 13, 15).
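The way a location step combines an axis, a node test and a predicate can be sketched as one Python function that takes each part as a parameter. The function location_step and the sample document are hypothetical, and ElementTree only exposes element nodes, so the node test used here accepts everything the axis returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document modelled on Figure 12.1.
DOC = """<staff>
  <staffMember>
    <staffNo>10780</staffNo><name>Paul Smith</name><email>p.smith@example.org</email>
  </staffMember>
  <staffMember>
    <staffNo>31210</staffNo><name>Freya Stark</name><email>f.stark@example.org</email>
  </staffMember>
</staff>"""


def location_step(context, axis, node_test, predicate):
    """Apply axis, then node test, then predicate to each context node in turn."""
    result = []
    for context_node in context:
        for candidate in axis(context_node):
            if node_test(candidate) and predicate(candidate):
                result.append(candidate)
    return result


def child_axis(node):
    return list(node)  # iterating an element yields its child elements


def any_element(node):
    return True  # every node ElementTree yields here is an element


def is_staff_no_or_name(node):
    # Mirrors the predicate [self::staffNo or self::name].
    return node.tag == "staffNo" or node.tag == "name"


context = ET.fromstring(DOC).findall("staffMember")
selected = location_step(context, child_axis, any_element, is_staff_no_or_name)
print([node.tag for node in selected])
```

Keeping the three parts as separate functions mirrors the order of evaluation in the numbered list above: the axis produces the initial sequence, the node test narrows it and the predicate is applied last.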
Using a combination of axis, node test and predicate, a location path can point to any
node in the XML document. This makes XPath one of the core XML technologies.
XPath expressions form the basis for XQuery, are used extensively in XSLT to identify
portions of a document for transformation and are used in some XML schema
languages to apply constraints to the document structure (W3C XML Schema and
Schematron both use XPath expressions).
So far, the discussion of XPath has been at a conceptual level. The syntax for XPath
expressions is based on the syntax for the UNIX file system’s pathnames. This makes
sense, because the UNIX file system is a hierarchical structure and most computing
professionals are familiar with the pathname syntax. In UNIX, a pathname is a
sequence of directory names, separated by a forward slash, and a final filename. For
example, the pathname /usr/local/httpd/files/index.html starts at the
root directory (the initial forward slash), moves to the directory called usr, then the
directory called local, then the directory called httpd, then the directory called
files and, in that directory, points to the file called index.html.
Similarly, an XPath location path is written as a sequence of location steps separated
by a forward slash. If there is only one location step in the location path, then there is
no need for a trailing forward slash. Each location step has three parts. First comes the
axis, represented by its name followed by a double colon – ::. So, in an XPath
location step, the child axis is written child:: and the following sibling axis is written
following-sibling::. The other axes all follow this pattern.
After the axis comes the node test. Node tests are either name tests or kind tests. The
XPath syntax for a name test is simply the name of the node required. If this is a local
name, then XPath will determine the namespace from the context. A node satisfies the
name test provided its expanded name matches the expanded version of the name
given in the name test. There is no need to specify whether the node is an element or
an attribute as XPath works this out from the axis. An axis includes attribute nodes if,
and only if, it is either the attribute axis or it is the self axis in the context of an
attribute node. In these situations, XPath assumes that the name test selects attributes
with the given name. In all other situations, it assumes that it selects elements with
the given name. This default kind of node for an axis, in a given context, is called the
principal node kind for that axis in that context.
Name tests can include wildcards. The wildcard symbol is ‘*’ and it selects all nodes
of the principal node kind. As node names can be qualified by a namespace prefix, it is
possible to use name tests of the form *:myName, which selects nodes of the principal
node kind called myName regardless of their namespace. Similarly, myPrefix:*
selects nodes of the principal node kind in the namespace with prefix myPrefix
regardless of their local name.
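Python's ElementTree module offers a similar pair of wildcards, which gives a convenient way to experiment with the idea. The sketch below assumes Python 3.8 or later, where the {*}name and {namespace}* forms are supported. These are ElementTree notations, not XPath syntax, and the namespace URIs and element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing two invented namespaces.
DOC = """<d:staff xmlns:d="urn:example:directory" xmlns:h="urn:example:hr">
  <d:name>Paul Smith</d:name>
  <h:name>P. Smith</h:name>
  <h:grade>7</h:grade>
</d:staff>"""
root = ET.fromstring(DOC)

# Comparable to the XPath name test *:name (any namespace, local name 'name').
any_namespace_name = root.findall("{*}name")

# Comparable to the XPath name test h:* (one namespace, any local name).
all_hr_elements = root.findall("{urn:example:hr}*")

print([e.text for e in any_namespace_name])
print([e.text for e in all_hr_elements])
```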
The XPath syntax for a simple kind test is the kind of the node followed by
parentheses. So, the kind test for element nodes is written element(). This syntax
generalizes to allow more complex kind tests by including further information between the
parentheses, but these more complex kind tests are not covered in this book. The
text() and attribute() kind tests are matched by text nodes and attribute nodes
respectively. The special kind test node() selects any kind of node at all from the
nodes matching the axis. The node() kind test is useful for emphasizing that all
nodes from the axis are required, not just the principal node kind.
Putting the syntax together, the XPath syntax for Step 2 in Code listing 12.3.1 –
‘Move to the child element nodes’ – is:
child::element()
child::* says ‘select all child element nodes, regardless of their name’. Most XPath
developers seem to use child::*, presumably because it is easier to type.
For Step 2 in Code listing 12.3.2 – ‘From the current context, move to the
descendant element nodes called staffMember’ – the XPath syntax is:
descendant::staffMember
The final part of a location step is the predicate. The predicate is simply a Boolean
combination of location paths, enclosed in square brackets. The final location step in
Code listing 12.3.2 – ‘From the current context, move to the child element nodes that
are called either staffNo or name’ – is written:
child::*[self::staffNo or self::name]
This location step can be read as follows.
Select all child nodes of nodes in the current context. From these, select all
element nodes (the principal node kind), regardless of their node name (the
wildcard ‘*’). Finally, for each node selected so far, test whether or not it satisfies
the predicate ‘This node is called either staffNo or name’.
A single forward slash at the start of a location path selects the root node of the
document. This is the first step in our two examples, so completes the syntax required to
write XPath expressions for them. The location path described by Code listing 12.3.1
can be written:
/child::element()/child::element()
/child::element()[self::staffNo or self::name]
or, using wildcards:
/child::*/child::*/child::*[self::staffNo or self::name]
Code listing 12.3.2 can be written:
/descendant::staffMember
/child::*[self::staffNo or self::name]
XPath can be used as a simple query language for XML documents. Using the instance
document in Figure 12.1 as the database instance, the business query ‘What is the
name of the member of staff with staffNo “10780”?’ gives the answer ‘Paul Smith’.
The XPath expression that retrieves this data from the instance document is shown in
Code listing 12.3.3.
[child::staffNo eq '10780']
This predicate uses the XPath comparison operator eq, which makes equality
comparisons between two sequences of values. XPath comparison operators compare XPath
expressions, which are always sequences. As XPath doesn’t deal with ‘raw’ values, the
literal '10780' actually represents the sequence ('10780'), which is a sequence
consisting of a single value. This is a standard feature of XPath syntax – a literal for a
‘raw’ value is interpreted as the sequence with that single value as its only member.
There is also a problem with the other part of this predicate. The XPath expression
child::staffNo doesn’t represent a sequence of values but a sequence of element
nodes. XPath automatically converts this into a new sequence before making the
equality comparison required by the eq operator. Each element node is replaced by its
string value. The string value of an element node (or the root node) is the concatena-
tion of the content of its descendant text nodes into a single string, taking the text
nodes in document order. Thus, a sequence of two element nodes is converted into a
sequence of two string values before the comparison is made. XPath also converts text
nodes and attribute nodes into a single string value – for text nodes this is their con-
tent, while for attribute nodes it is their value. In this way every sequence of nodes is
replaced by a sequence of string values. This process is known as atomization.
There are two further atomizations, for comment nodes and processing instruction
nodes. Each becomes the string representing its text.
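The string value of an element node can be sketched in a few lines. This is an illustration of the idea only, written with Python's standard xml.etree.ElementTree module rather than an XPath processor; the function names string_value and atomize are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

def string_value(element):
    """Concatenate the descendant text nodes of an element in document order."""
    return "".join(element.itertext())

def atomize(elements):
    """Replace each element node in a sequence by its string value."""
    return [string_value(e) for e in elements]

# A small hypothetical fragment; the nested <b> element shows that text from
# all descendants is joined into a single string.
member = ET.fromstring(
    "<staffMember><staffNo>10780</staffNo>"
    "<name>Paul <b>Smith</b></name></staffMember>"
)
```

Here atomize(member.findall("staffNo")) turns a sequence of one element node into a sequence of one string, which is the form the eq operator needs.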
Now that both sides of the eq operator have been turned into sequences of string val-
ues, XPath compares them according to the following rules:
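There are five other value comparison operators covering the obvious comparisons: ne
(not equal), lt (less than), le (less than or equal to), gt (greater than) and ge
(greater than or equal to).
Evaluating the XPath expression in Code listing 12.3.3 proceeds in the following steps: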
Step 1: Start from the document and return the context consisting of the root
node (1).
Step 2: From the context (1) select all child element nodes called staff. This
leads to the context (2).
Step 3a: From the context (2) select all child element nodes called
staffMember. This leads to the sequence of nodes (3, 11). Now, apply the
predicate child::staffNo eq '10780' to each node in this sequence.
Step 3b: With node 3 as the context node, evaluate the location path
child::staffNo. This leads to the sequence of nodes (5) (there is only one
child node called staffNo). Atomize this sequence. This leads to the sequence
of string values ('10780'). Now consider the right-hand side of the eq opera-
tor. This is the string value '10780', which is converted to the sequence
('10780'). As both sides of the eq operator evaluate to the sequence
('10780'), return the value true. Node 3 is added to the final context.
Step 3c: With node 11 as the context node, evaluate the location path
child::staffNo. This leads to the sequence of nodes (13) (there is only one
child node called staffNo). Atomize this sequence. This leads to the sequence
of string values ('23257'). Now consider the right-hand side of the eq opera-
tor. This is the string value '10780', which is converted to the sequence
('10780'). As the two sides evaluate to single-member sequences with differ-
ent string values, return the value false. Node 11 is not added to the final
context.
Step 3d: Finally, pass the context (3) to the next step.
Step 4: From the context (3) select all child element nodes called name. This
leads to the context (7).
Step 5: From the context (7) select all child text nodes. This leads to the
context (8).
As node 8 is the text node with value ‘Paul Smith’, the XPath expression does answer
the query.
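Reading the steps back into a single path, the full expression is presumably /child::staff/child::staffMember[child::staffNo eq '10780']/child::name/child::text(). The sketch below runs the equivalent query with Python's standard xml.etree.ElementTree module, whose limited XPath subset writes the predicate as [staffNo='10780']. The instance document is a hypothetical stand-in for Figure 12.1, and the second staff member's name is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the instance document in Figure 12.1.
DOC = """
<staff>
  <staffMember>
    <staffNo>10780</staffNo>
    <name>Paul Smith</name>
  </staffMember>
  <staffMember>
    <staffNo>23257</staffNo>
    <name>A. N. Other</name>
  </staffMember>
</staff>
"""

def names_for_staff_no(xml_text, staff_no):
    # Steps 1 and 2: fromstring returns the staff document element.
    root = ET.fromstring(xml_text)
    # Step 3: keep staffMember children whose staffNo child has the given text.
    # Step 4: move on to their name children.
    path = f"./staffMember[staffNo='{staff_no}']/name"
    # Step 5: take the text content of each name element.
    return [name.text for name in root.findall(path)]
```

A staffNo that matches no staffMember gives an empty list, which mirrors the predicate excluding every node from the final context.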
This example demonstrates the power of predicates. For each node that satisfies the
axis and node test of the third location step, the predicate burrows deeper into the tree
to test a Boolean condition. Only if this test is satisfied is the node included in the
context for the next location step. This is a little like a tourist using a street map to
navigate to a hotel in Venice (where there are no cars, only pedestrians and boats). The
alleys are the edges and the junctions are the nodes. At each junction, the tourist
checks whether or not this alleyway leads to the hotel. If it does, the person turns down
the alleyway. If not, the alleyway is ignored.
This introduction to XPath covers the key concepts in the language and introduces
the concepts at the heart of XML query languages. Beyond the basic features discussed,
XPath provides other comparison operators and a large collection of functions. The
Further reading section at the end of this chapter points to resources that can help in
exploring the XPath language, along with the more powerful XML query languages
XQuery and SQL/XML. XQuery builds on the ability of XPath to locate specific por-
tions of an XML document to deliver an XML query language along the lines of the
relational query language SQL. The W3C adopted XQuery as an official recommenda-
tion in January 2007. In cases where XML documents or document fragments are
stored in relational databases, the SQL/XML query language will also be important.
This extends the traditional SQL query language to deal with column values that are
one of the XDM data types. SQL/XML has already been at least partially implemented
by several commercial RDBMS.
The differences between the relational data model and the XDM do not affect
conceptual database design as it doesn’t target any particular data model. Even so, if a
project is known to be using an XML rather than a relational database, then it
would make sense to use data types from the XDM rather than the relational model. Other
than that, conceptual database design is the same for XML as it is for relational
databases.
Logical database design follows a similar process to that for logical design for the
relational data model, but targets the XDM. First, any structures in the conceptual ER
diagram that cannot be implemented directly using structures from the XDM are
removed. This gives a logical ER diagram for the XDM. The structures in this logical ER
diagram are then implemented using elements and attributes – the core organization
structures of the XDM. It’s important to note that the UML attributes of the ER
diagram do not necessarily become XML attributes – many become XML elements.
Logical database design for the XDM does not aim to produce a complete schema
document, like that in Code listing 12.2.4.10. Instead, it produces descriptions for each
of the main elements – their names, attributes and content models. This raises the
problem of how best to write down these descriptions.
In logical database design for the relational data model, the table schema are written
in such a way that they are almost, but not quite, SQL create table statements. The
main reason for not going directly to SQL is that table schema are a little simpler than
create table statements. XML, and the XDM, are not as simple as the relational data
model. In fact, the RELAX NG compact syntax probably provides the simplest language
for describing the content models of XML elements. This makes it a strong candidate
for the logical design language for XML databases. However, that means learning yet
another syntax (the seventh in this book, which has used UML, PHP, SQL, XML, RELAX
NG XML syntax and XPath). As the aim here is to understand the process of logical
database design for the XDM it makes sense to use the RELAX NG XML syntax instead.
This leads to slightly longer descriptions for the content models, but avoids cluttering
up the discussion with explanations of yet another syntax.
Physical database design for XML turns the element content models produced dur-
ing logical design into a complete schema document for the database. It also addresses
the general physical database design issues discussed in Chapter 10. Physical database
design for XML databases is even more closely tied to the choice of DBMS than it is for
relational databases. The best general advice is to bear in mind the issues discussed in
Chapter 10 and read the documentation for the chosen XML DBMS very carefully. The
remainder of this section focuses on logical database design for XML.
Logical database design for the XDM has two main goals. First, it must produce a set
of content models for XML elements that implement the requirements documented
by the conceptual database design. Second, it should minimize data duplication. This
requires a different style of content model from more traditional, document-centric
XML applications, with a lot more use of relational style keys. If relational style keys
aren’t used, then there will be lots of data duplication.
It is important to realize that there is more than one mechanism for implementing
relational style keys. Only one mechanism has been discussed in this chapter – that of
using attributes with the special data types ID, IDREF and IDREFS. This mechanism
matches surrogate keys with single- or multiple-valued foreign keys to manage referen-
tial integrity. The surrogate key is a single attribute and its value is unique within the
document rather than within the containing element.
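The referential integrity that ID, IDREF and IDREFS provide can be illustrated with a small check. This is a sketch only: in a real document the schema or DTD declares which attributes have these types, whereas here the attribute names id and keptIn, and the element names, are hypothetical and passed in by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; the second equipment element refers to a
# room that does not exist.
DOC = """
<timetable>
  <room id="r1"><roomNo>2.14</roomNo></room>
  <equipment id="e1" keptIn="r1"><description>Projector</description></equipment>
  <equipment id="e2" keptIn="r9"><description>Laptop</description></equipment>
</timetable>
"""

def check_references(xml_text, id_attr="id", ref_attrs=("keptIn",)):
    root = ET.fromstring(xml_text)
    # ID values must be unique across the whole document.
    ids = [e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None]
    duplicates = {value for value in ids if ids.count(value) > 1}
    known = set(ids)
    dangling = []
    for element in root.iter():
        for attr in ref_attrs:
            value = element.get(attr)
            if value is None:
                continue
            # An IDREFS value is a white space separated list of IDREF values.
            for ref in value.split():
                if ref not in known:
                    dangling.append((element.tag, attr, ref))
    return duplicates, dangling

duplicates, dangling = check_references(DOC)
```

Because ID values are unique within the document, the check collects them from every element rather than from one element type at a time.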
The W3C XML Schema language includes this mechanism, but also includes a more
flexible mechanism for implementing referential integrity. A schema document writ-
ten using W3C XML Schema can define a collection of attributes, or child elements
with only text or data content, as a key. In effect, each key definition in an element’s
content model is a candidate key constraint as the combination of data values must
be unique within each instance of the containing element. Defining a set of attributes
or elements in an element’s content model as a keyref is equivalent to defining a
foreign key. The keyref definition must explicitly reference a named key and the
combination of values in an instance of the keyref must be the same as that in an
instance of the matching key.
To complete the set of relational style keys, W3C XML Schema includes a unique
constraint – the counterpart of the SQL UNIQUE constraint – that enforces uniqueness
but allows the attributes or elements to be optional (the XML equivalent of NULL). All
three of these constraints use XPath expressions to identify the set of attributes or ele-
ments that comprise the relational style key.
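A key made up of several child elements can be illustrated by checking for repeated combinations of values. The sketch below is not W3C XML Schema validation; it only mimics the uniqueness part of a key constraint, and it checks across the whole document rather than within each instance of a containing element. The rooms element and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical instance data with a composite key (building, roomNo).
DOC = """
<rooms>
  <room><building>Main</building><roomNo>101</roomNo></room>
  <room><building>Main</building><roomNo>102</roomNo></room>
  <room><building>Main</building><roomNo>101</roomNo></room>
</rooms>
"""

def duplicate_keys(xml_text, element_name, key_fields):
    root = ET.fromstring(xml_text)
    # Build one tuple of key values per element and count how often each occurs.
    counts = Counter(
        tuple(element.findtext(field) for field in key_fields)
        for element in root.iter(element_name)
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

An empty result means the combination of values behaves as a candidate key for the sample data.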
Both these mechanisms for implementing relational style keys in XML
(ID/IDREF/IDREFS and unique/key/keyref) only work within a single XML
instance document (although the instance document may be stored in more than one
physical file).
There is a third mechanism, using the W3C’s XLink recommendation, that allows
foreign key constraints that apply to elements (and attributes) from two document
instances with different document models. Particular XML-based DBMS may provide
other mechanisms to implement relational style keys. This means that, during logical
design, there is a choice of two general mechanisms for implementing the associations
of the conceptual ER diagram – element containment and relational style keys.
The multiple mechanisms for implementing relational style keys mean only a gen-
eral indication that relational style keys are being used is made during logical database
design. The particular implementation mechanism is described during physical data-
base design. Deciding to use an XML schema language or some dedicated hyperlink
mechanism, such as XLink, is a task left to the physical design stage.
The organization structures of the XDM are documents, elements, attributes
and relational style keys. An XDM relational-style foreign key can reference multiple
matching candidate keys – something that the relational data model does not allow.
This means that the XDM can directly support the many-to-many associations of
the conceptual ER diagram. It also directly supports multivalued UML attributes as
child elements that may occur more than once. Even so, there are some ER diagram
organization structures that are best replaced by simpler structures at the start of the
logical design process. These are associations that have attributes of their own and
n-ary associations.
As with logical design for relational databases, the first task is to create a logical ER
diagram from which these two features have been removed.
Binary and unary associations that have attributes of their own will be many-to-
many associations. They are resolved in exactly the same way as they were for the rela-
tional data model. The many-to-many association is replaced by a link entity and its
attributes are those of the original many-to-many association.
The link entity participates in a one-to-many association with each of the two orig-
inal entities. Figure 12.2 repeats an example of a binary association seen in Chapter 9
(Figure 9.12). It’s important to emphasize that only associations with attributes need
to be replaced by a link entity. Many-to-many associations that don’t have attributes
can be left as they are.
Each n-ary association is also resolved in exactly the same way that it was for the
relational data model. The n-ary association is replaced by a link entity and n binary
associations connecting the link entity to each of the n entities that participated in the
original n-ary association. Figures 12.3 (a) and (b) repeat an example of how to resolve
a three-ary association discussed in Chapter 9 (Figures 9.15 and 9.16).
Figure 12.4 shows the logical ER diagram for the Web Timetable database using the
XDM rather than the relational data model. There are far fewer changes here than
there were for the logical ER diagram produced in Chapter 9 (Figure 9.17). Only the
three-ary association has been replaced as the Uses association in the conceptual ER
diagram did not have any attributes of its own. One problem with this logical ER
diagram is that the conceptual ER diagram was written using data types appropriate to the
relational data model (specifically, to the SQL data language). Rather than redo the
conceptual design using data types appropriate to XML, examples of how to translate
between SQL and XML data types are discussed.
Figure 12.2 Replacing a binary association that has attributes with a link entity.
Figure 12.3 Replacing a three-ary association with a link entity and three
binary associations. (a) The three-ary association BookEquipment (bookedOn : Date,
bookedFor : Date) between Staff, Equipment and TeachingSession. (b) The link entity
EquipmentBooking (bookedOn : Date, bookedFor : Date), connected by the binary
associations Makes (to Staff), Books (to Equipment) and For (to TeachingSession).
The move from the logical ER diagram to an XML schema document is reasonably
simple. Each entity becomes an element. The entities’ attributes either become child
elements or attributes of the element implementing the entity. Deciding which
depends on the intended interpretation of the UML attribute. In XML, attributes
represent meta-data, and element content represents data. In UML, most attributes
represent data, but they can also represent meta-data. UML attributes that represent
meta-data should be implemented as XML attributes. Those that represent data should
be implemented as XML child elements. Note that, as XML attributes can only have
multiplicity 0..1 or 1..1, a UML attribute with any other multiplicity must therefore be
implemented as an XML element. UML attribute multiplicity is dealt with using XML
element or attribute multiplicity (including optionality). Primary and candidate keys,
along with any business rules, are initially recorded as comments. The
implementation of these constraints depends on the choice of mechanism for implementing
relational style keys, which is a physical design issue.
Figure 12.4 The logical ER diagram for the Web Timetable database, targeting the XML
data model XDM.
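The mapping from UML attribute multiplicity to RELAX NG patterns can be sketched as a small lookup. This is an illustrative helper, not part of any RELAX NG tool; the function names are hypothetical, an upper bound of None stands for '*', and every element is given plain text content for simplicity.

```python
def relaxng_wrapper(lower, upper):
    """Name of the RELAX NG pattern that wraps the element, or None for 1..1."""
    if (lower, upper) == (1, 1):
        return None
    if (lower, upper) == (0, 1):
        return "optional"
    if (lower, upper) == (0, None):
        return "zeroOrMore"
    if (lower, upper) == (1, None):
        return "oneOrMore"
    # Bounded multiplicities such as 1..2 have no single wrapper pattern.
    raise ValueError(f"multiplicity {lower}..{upper} needs a hand-written pattern")

def can_be_xml_attribute(lower, upper):
    """XML attributes can only have multiplicity 0..1 or 1..1."""
    return upper == 1 and lower in (0, 1)

def element_pattern(name, lower=1, upper=1):
    """Build a text-only element pattern with the right multiplicity wrapper."""
    body = f'<element name="{name}"> <text /> </element>'
    wrapper = relaxng_wrapper(lower, upper)
    return body if wrapper is None else f"<{wrapper}> {body} </{wrapper}>"
```

For example, element_pattern("phone", 0, 1) wraps the phone element in an optional pattern, while a multiplicity such as 1..2 is rejected because it has to be spelled out by hand.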
This first step leads to an initial element pattern for each entity in the logical ER
diagram, but doesn’t deal with relationships. Consider the Student entity in Figure 12.4.
An initial RELAX NG pattern looks like that in Code listing 12.4.1 (a). The entity
becomes an XML element, with the same name, apart from the initial lower-case ‘s’.
Both UML and XML are case-sensitive, but have different conventions. UML entity
names always begin with an upper-case letter, whereas XML element names tend to
begin with a lower-case letter.
The Student entity has six attributes, all of which represent data so all become child
elements of the student element. Code listing 12.4.1 (b) shows the definition of the
studentNo element.
Notice the use of the data and param patterns to capture the data type and length
defined for the UML studentNo attribute. The data type is from the W3C XML Schema
datatype library because the only way to limit the length of data values to exactly ten
characters is to use the W3C XML Schema length facet. This facet specifies the
length of data values for character string data types.
The data type chosen is token rather than string. The string data type pre-
serves all white space, including leading and trailing spaces, new lines, tabs and so on.
This seems close to the SQL CHAR data type, which will pad short values with spaces.
However, the value for studentNo will always have exactly ten characters – this is
a Pennine University business rule. As there shouldn’t be any white space in a
studentNo value, token seems a better choice.
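The difference between string and token comes down to white space handling, which can be sketched as follows. This is an illustration of the white space collapsing that the token data type implies, combined with the length check from the business rule; it is not a schema validator, and the function names are hypothetical.

```python
import re

def as_token(value):
    """Collapse white space the way the token data type does."""
    # Runs of spaces, tabs, carriage returns and new lines become one space.
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    # Leading and trailing spaces are then removed.
    return collapsed.strip(" ")

def is_valid_student_no(value):
    """A studentNo is valid here if its token form has exactly ten characters."""
    return len(as_token(value)) == 10
```

Under the string data type the value "  0123456789\n" would keep its thirteen characters, whereas as a token it is the ten-character value the business rule expects.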
Although this logical design uses data types from W3C XML Schema, it might be
that a different schema language is used to implement the logical design. If that
happens, then, during the physical database design, the data pattern in Code listing
12.4.1 (b) is translated into the nearest equivalent data type provided by the chosen
schema language. Even if it is known in advance that the available data types will be
more restricted than those from W3C XML Schema, it makes sense to document in the
logical database design any stronger requirements identified during conceptual data-
base design. The data pattern is a good way to do this.
must be changed in two different places. Use a RELAX NG named pattern (see below)
to overcome this problem.
The content of these elements will be plain text as there is no length specified in the
ER diagram. Note that this is subtly different from the SQL VARCHAR, which does not
perform white space normalization. If white spaces must be preserved in data values,
use the string data type instead.
Another point worth noting is that the RELAX NG pattern for the student ele-
ment doesn’t use the interleave pattern, so the child elements must appear in the
order they are defined. Although this isn’t required by the UML of the logical ER dia-
gram, it seems a reasonable design decision for an XML database.
Code listing 12.4.3 The complete initial element pattern for the UML Student
entity
<element name="student">
  <element name="studentNo">
    <data datatypeLibrary=
        "http://www.w3.org/2001/XMLSchema-datatypes"
        type="token">
      <param name="length">10</param>
    </data>
  </element>
  <oneOrMore>
    <element name="fName"> <text /> </element>
  </oneOrMore>
  <element name="lName"> <text /> </element>
  <element name="email"> <text /> </element>
  <optional>
    <element name="email"> <text /> </element>
  </optional>
  <optional>
    <element name="phone"> <text /> </element>
  </optional>
  <!-- PRIMARY KEY CONSTRAINT: (studentNo) -->
  <!-- CANDIDATE KEY CONSTRAINT: (email) -->
</element>
The initial element patterns capture information from the logical ER diagram about
entities and attributes. The next step is to capture information about relationships.
This is more complex because there is a choice of mechanisms for implementing rela-
tionships and it is where XML document models designed to hold data differ most
from those designed to hold documents. Data-centric documents make much more
use of relational-style keys than document-centric ones. This is because using element
containment to implement a relationship may lead to data duplication. For example,
in Figure 12.4, there is an aggregation association between the Course and Module en-
tities. This indicates that a Module is part of a Course. It might seem natural to make
the corresponding module element a child of the course element, using contain-
ment to implement the association. Unfortunately, at the Pennine University, some
modules are used by many courses (indicated by the multiplicity constraints on Uses).
To reflect this, a copy of the module element must appear in each of the course
elements using the module, leading to unnecessary data duplication. A key goal of the
database approach is to avoid unnecessary data duplication, so, under these
circumstances, relational-style keys should be used to implement the association.
Another problem with using containment to represent associations is that some-
times it simply doesn’t work. Consider the association KeptIn between the Equipment
and Room entities (Figure 12.4). It might seem reasonable to use containment to
implement the association and make the equipment element a child of the room
element. Again, this doesn’t work – this time because some occurrences of Equipment
are not linked to any occurrence of Room (check the lower bound on the multiplicity
constraint). This leads to two general rules for guiding us as to whether to use contain-
ment or relational-style keys.
1 Where both entities may occur independently of one another, the association must
be implemented using relational-style keys.
These rules help to ensure that each of the many different sorts of association can
be implemented effectively.
One sort of association that is always implemented using element containment is a
composition association. This is because composition associations model a whole–part
relationship between entities, the part not being able to exist independently of the
whole. This is exactly what containment does – the child element is a part of the par-
ent and cannot exist independently. The entity representing the whole is designated
the parent element and the entity representing the part is the child element.
There is one unusual circumstance that leads to a problem here. Consider the
unary composition association on the Component entity in Figure 12.5. This ER dia-
gram models the scenario of a piece of coursework being made up of components
and each component itself being made up of other components. It isn’t possible to
define recursive content models such as this using the Russian doll style of schema
document.
Code listing 12.4.4 shows the RELAX NG syntax for defining a named pattern. Once
defined, using the RELAX NG define pattern, it can be referred to anywhere else in the
schema document using the RELAX NG ref pattern. Named patterns allow the
schema’s author to define recursive content models. Code listing 12.4.4 (a) can be read
as follows.
Code listing 12.4.4 (b) shows a slightly different version of the define pattern. The idea
of this version is that a coursework component can only have one component of its
own. The problem is that the element pattern actually insists that every component
contains a component. This leads to an infinite descent – a component must have a
child component, which must itself have a child component, which must itself have a
child component, ad infinitum. To avoid this, always enclose a recursive ref pattern in
an optional pattern or a zeroOrMore pattern.
Figure 12.5 The Coursework and Component entities, with a unary composition
association on the Component entity.
Code listing 12.4.4 Using named patterns to define recursive content models
(a) A recursive content model
<element name="coursework">
  <element name="title"> <text /> </element>
  <element name="rubric"> <text /> </element>
  <oneOrMore>
    <ref name="elmComponent" />
  </oneOrMore>
</element>
<define name="elmComponent">
  <element name="component">
    <element name="sequenceNo"> <text /> </element>
    <element name="title"> <text /> </element>
    <element name="instructions"> <text /> </element>
    <zeroOrMore>
      <ref name="elmComponent" />
    </zeroOrMore>
  </element>
</define>
(b) An invalid content model for the component element that leads to an
infinite descent
<define name="elmComponent">
  <element name="component">
    <element name="sequenceNo"> <text /> </element>
    <element name="title"> <text /> </element>
    <element name="instructions"> <text /> </element>
    <ref name="elmComponent" />
  </element>
</define>
There are other situations in which using containment in a recursive content model
can lead to infinite descents or to infinite loops, where one element contains a sec-
ond, which contains the first, which contains the second, ad infinitum. When an ele-
ment must have a recursive content model, it is worth writing out a few element
instances, just to check that there are no hidden infinities lurking in the content
model.
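The advice to enclose a recursive ref pattern in an optional or zeroOrMore pattern can be turned into a simple check. The sketch below is a rough illustration, not a RELAX NG validator: it assumes define patterns written without a namespace, as in the listings in this chapter, it only looks for direct self-reference within one define pattern, and it treats only the two guarding patterns named in the text as safe.

```python
import xml.etree.ElementTree as ET

GUARDS = {"optional", "zeroOrMore"}

def has_unguarded_self_ref(define_xml):
    """True if the define pattern refers to itself outside any guarding pattern."""
    define = ET.fromstring(define_xml)
    own_name = define.get("name")

    def walk(node, guarded):
        for child in node:
            is_self_ref = child.tag == "ref" and child.get("name") == own_name
            if is_self_ref and not guarded:
                return True
            # Everything below an optional or zeroOrMore pattern is guarded.
            if walk(child, guarded or child.tag in GUARDS):
                return True
        return False

    return walk(define, False)

SAFE = """
<define name="elmComponent">
  <element name="component">
    <element name="title"> <text /> </element>
    <zeroOrMore> <ref name="elmComponent" /> </zeroOrMore>
  </element>
</define>
"""

UNSAFE = """
<define name="elmComponent">
  <element name="component">
    <element name="title"> <text /> </element>
    <ref name="elmComponent" />
  </element>
</define>
"""
```

The SAFE pattern follows the shape of Code listing 12.4.4 (a) and the UNSAFE pattern follows (b). Indirect loops between two different named patterns are not detected by this sketch, so writing out a few element instances by hand remains worthwhile.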
A good example of this situation is the association between the Staff and SupportSession
entities in the Staff Directory database. These are the only two entities in the database.
Moreover, the SupportSession entity cannot exist independently of the Staff entity.
The one-to-many association between them is implemented by designating the Staff
entity the parent and the SupportSession entity the child. The multiplicity constraint
at the ‘many’ end of the association determines the multiplicity of the child element.
The child element is optional if the lower bound on the multiplicity constraint is zero.
Otherwise it is required.
Code listing 12.4.5 shows three (incomplete) schema demonstrating alternative
ways to implement this association using parent–child element containment. Code
listing 12.4.5 (a) uses the approach taken in Section 12.2. The whole Staff Directory
database is represented by the staff element. The Staff entity is implemented as the
staffMember element. This element contains an optional supportSessions
element, representing the collection of all support sessions offered by a particular
member of staff. The supportSessions element itself contains one or more
session elements and it is the session element that implements the
SupportSession entity.
The supportSessions element isn’t really necessary. Code listing 12.4.5 (b)
makes the session elements direct children of the staffMember element. This is
both simpler and closer to the meaning of the ER diagram.
Code listing 12.4.5 (c) takes this one stage further, dispensing with the session
element as well and simply including the support session data items as a repeating
group of child elements – dayOfWeek, startTime and endTime. This will
only work because the content model fixes the order of these elements. Any applica-
tion reading the instance document will see that each dayOfWeek element is
followed immediately by a startTime and endTime element and so can treat
them as a group. Relying on the order of elements to convey meaning is not a good
idea in a database application, so, of the three approaches, the second is probably
the best.
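The advantage of the second approach shows up when an application reads the instance document. In the sketch below, each session element keeps its own dayOfWeek, startTime and endTime together, so no meaning depends on the order of sibling elements. The instance data is hypothetical and the code uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document in the style of approach (b): session
# elements are direct children of staffMember.
DOC = """
<staff>
  <staffMember>
    <staffNo>10780</staffNo>
    <session>
      <dayOfWeek>Monday</dayOfWeek>
      <startTime>09:00</startTime>
      <endTime>10:00</endTime>
    </session>
    <session>
      <dayOfWeek>Thursday</dayOfWeek>
      <startTime>14:00</startTime>
      <endTime>15:00</endTime>
    </session>
  </staffMember>
  <staffMember>
    <staffNo>23257</staffNo>
  </staffMember>
</staff>
"""

def sessions_by_staff(xml_text):
    root = ET.fromstring(xml_text)
    result = {}
    for member in root.findall("staffMember"):
        # Each session element groups its own three data items.
        result[member.findtext("staffNo")] = [
            (s.findtext("dayOfWeek"), s.findtext("startTime"), s.findtext("endTime"))
            for s in member.findall("session")
        ]
    return result
```

A staff member with no support sessions simply has no session children, which corresponds to a lower bound of zero at the ‘many’ end of the association.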
</element>
</zeroOrMore>
</element>
</oneOrMore>
</element>
When the lower bound on both multiplicity constraints is zero, the association cannot
be implemented using containment. This is because a lower bound of zero on a multi-
plicity constraint means that the entity at the opposite end may occur independently.
The association KeptIn between the Equipment and Room entities (Figure 12.4) is a
good example. The lower bound on both multiplicity constraints is zero, which means
that both entities can occur independently of each other – some equipment is not kept
in any room and some rooms have no equipment kept in them, for example. Clearly,
neither entity can be implemented as a child element of the other, so the association
must be implemented by relational-style keys. The question is, which element is the
referenced element (so gets the ‘primary key’) and which is the referencing element
(getting the ‘foreign key’)?
The multiplicity constraints provide the answer, just as they did during logical
design for the relational model. Consider the KeptIn association between the
Equipment and Room entities. The two entities are implemented as separate XML
elements – equipment and room. XML foreign keys can refer to multiple matching
candidate key values, so it is possible to place a foreign key in the room element that
references the primary key in the equipment element. Each foreign key would
include multiple values, matching the primary key value of multiple occurrences of
the equipment element.
The alternative is to place the foreign key in the equipment element. Each foreign
key would include a single value, matching the primary key value of a single
occurrence of the room element.
This second approach is a lot simpler, so is the best one to take. The general rule is
as follows.
When the lower bound on both multiplicity constraints is zero, use relational-
style keys to implement the association. The entity at the association end with
the larger upper bound on its multiplicity constraint is designated the referenc-
ing entity. The other is the referenced entity. If the upper bounds are equal, then
choose as the referenced entity the one that is more significant in terms of the
scenario being modelled (the most important to the business).
A foreign key is added to the element representing the referencing entity. A primary key
is added to the element representing the referenced entity (unless it already has one).
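The general rule can be written as a small decision function. This is a sketch under stated assumptions: each association end is given as a pair of entity name and upper bound, None stands for an upper bound of '*', and the tie-breaking judgement about business significance is supplied by the caller. The function name is hypothetical. The usage example assumes that the Equipment end of KeptIn has upper bound '*' and the Room end has upper bound 1, which is how the discussion above reads.

```python
def choose_referencing(end_a, end_b, more_significant=None):
    """Return (referenced, referencing) entity names for an association.

    Each end is (entity_name, upper_bound); an upper bound of None means '*'.
    """
    def rank(upper):
        return float("inf") if upper is None else upper

    (name_a, upper_a), (name_b, upper_b) = end_a, end_b
    # The end with the larger upper bound is the referencing entity.
    if rank(upper_a) > rank(upper_b):
        return name_b, name_a
    if rank(upper_b) > rank(upper_a):
        return name_a, name_b
    # Equal upper bounds: the more significant entity is the referenced one.
    if more_significant not in (name_a, name_b):
        raise ValueError("equal upper bounds: say which entity is more significant")
    referencing = name_b if more_significant == name_a else name_a
    return more_significant, referencing

kept_in = choose_referencing(("Equipment", None), ("Room", 1))
```

With these bounds the room element is referenced and the equipment element carries the single-valued foreign key, matching the simpler of the two options described above.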
Code listing 12.4.6 shows the element schema for the Equipment and Room entities,
with comments indicating their primary and foreign keys and describing a default
value. Notice that the comment describing the foreign key is limited to mentioning
the referenced element. This is because there are at least three possible mechanisms for
implementing foreign keys. If the final schema document is written in RELAX NG,
then the foreign key will be implemented using the ID/IDREF/IDREFS mechanism.
If it is written in W3C XML Schema, then the foreign keys may use its key/keyref
mechanism. Some native XML DBMS may use XLink or a proprietary mechanism.
Giving a natural language description is the best that can be done during the logical
design process.
  <optional>
    <element name="cost">
      <data datatypeLibrary=
          "http://www.w3.org/2001/XMLSchema-datatypes"
          type="decimal" />
    </element>
  </optional>
  <optional>
    <element name="capacity">
      <data datatypeLibrary=
          "http://www.w3.org/2001/XMLSchema-datatypes"
          type="date" />
      <!-- DEFAULT VALUE = Today -->
    </element>
  </optional>
  <!-- PRIMARY KEY CONSTRAINT: (assetNo) -->
  <!-- FOREIGN KEY CONSTRAINT: Single value, references the
       room element -->
</element>
Another relationship best implemented using relational-style keys is the unary associ-
ation. Using relational-style keys avoids defining a recursive content model for the
element. Even the unary composition association, discussed above, would probably be
better implemented by means of relational-style keys.
Finally, many-to-many associations should always be implemented using
relational-style keys. Many-to-many associations rely on the ability of XDM foreign keys to refer
to multiple matching candidate keys. There are no strict rules on designating the
referencing and referenced entities. The entity at the association end with the smaller
upper bound on its multiplicity constraint is a good candidate for the referenced
entity as this keeps the list of foreign key values on the referencing entity to a
minimum. Tables 12.1 and 12.2 summarize the guidance given on when to use
containment and when to use relational-style keys.
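A multiple-valued foreign key can be followed in much the same way as a single-valued one. The sketch below assumes the ID/IDREFS mechanism, with a hypothetical uses attribute on the course element holding a white space separated list of module identifiers; the element names, attribute names and data are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document: module m1 is used by both courses but is
# stored only once.
DOC = """
<timetable>
  <module id="m1"><title>Databases</title></module>
  <module id="m2"><title>Web Design</title></module>
  <course id="c1" uses="m1 m2"><title>Computing</title></course>
  <course id="c2" uses="m1"><title>Business IT</title></course>
</timetable>
"""

def modules_for_course(xml_text, course_id):
    root = ET.fromstring(xml_text)
    # Index the referenced elements by their key value.
    modules = {m.get("id"): m for m in root.findall("module")}
    course = next(c for c in root.findall("course") if c.get("id") == course_id)
    # Split the multiple-valued foreign key and follow each reference.
    return [modules[ref].findtext("title") for ref in course.get("uses", "").split()]
```

Storing each module once and referring to it from every course that uses it is what avoids the data duplication that containment would cause.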
Generalization relationships are much easier to deal with than associations. One very
simple approach is to create an element pattern for the more general entity and include
in its content model element patterns for each of its more specialized entities. Disjoint
constraints are implemented using the choice pattern, while participation constraints
are implemented using the optional pattern. To see this approach on a reasonably
complex example, consider the five generalization relationships in Figure 12.6.
Code listing 12.4.7 presents the element schema for the Room entity, although it
doesn’t include the specializations for the Laboratory entity. The constraints on this
first set of three generalization relationships are disjoint and optional. The optional
constraint means that an occurrence of the Room entity need not be an occurrence of
any of its more specialized versions. This is implemented by the optional pattern,
which means that an occurrence of the room element need not contain any of the
data for its more specialized versions. The disjoint constraint means that an
occurrence of the Room entity can be an occurrence of, at most, one of the more specialized
versions. The choice pattern achieves this, by insisting that only one of the three child
Type of association: Entity1 and Entity2 (both «persistent») joined by ConnectWith
Parent element: Entity2
Child element: Entity1
Comment: For unary composition associations, use relational-style keys instead.

Type of association: Entity1 (0..m) and Entity2 (0..n), both «persistent», joined by ConnectWith
Referenced element: Entity1
Referencing element: Entity2
Comment: Assuming m<n. If m = n, choose the entity that is most significant to the
business as referenced entity.

Type of association: unary ConnectWith on «persistent» Entity1 (multiplicities n..m and p..q)
Referenced element: Entity1
Referencing element: Entity1
[Figure 12.6, class details recoverable from the diagram]
«persistent» Room: building : Varchar {PK}; roomNo : Varchar {PK}; capacity : Integer
Generalization constraint: {disjoint, optional}
«persistent» ComputingLab: specialistSoftware : Set Varchar
«persistent» ScienceLab: safetyRating : Varchar; notes [0..1] : Varchar
</optional>
</element>
</optional>
</group>
<group>
<optional>
<element name="computingLab">
<element name="specialistSoftware"> <text />
</element>
</element>
</optional>
<element name="scienceLab">
<element name="safetyRating"> <text /> </element>
<optional>
<element name="notes"> <text /> </element>
</optional>
</element>
</group>
</choice>
</element>
To round off this subsection, here is a summary of the process of logical database
design for XML.
1 For each entity in the logical ER diagram, write down an initial element schema.
(a) Write down the element name.
(b) Write down an initial content model for the element. For each attribute of the
entity, include a child element (or attribute) pattern in the content model. Give
this child element the same name as the attribute and indicate its multiplicity.
Include a comment about any default value.
(c) If the entity is a strong entity, add a comment recording primary and candidate
keys.
(d) Add comments to record any business rules associated with the entity.
2 For each binary association in the logical ER diagram, do the following.
(a) Decide whether the binary association should be represented by containment or
by relational-style keys.
(b) Where the association is best represented by containment, designate one entity
to be a child entity. Move the element schema for the child entity into the
content model of the element schema for the parent entity. Use the multiplicity
constraint at the child end of the original association to determine which
multiplicity pattern to enclose the child in.
(c) Where the binary association is best represented by relational-style keys,
designate one element as the referenced element and the other as the referencing
element. Add a comment describing the foreign key to the element schema for the
referencing element.
3 For each generalization relationship, write down an element schema for the more
general entity, which includes the more specialized entities in its content. Use the
choice and optional patterns to implement the disjoint and participation constraints.
1 Identify a functional determinant and all its dependent elements. Always ensure
that, for each dependent element elementName, the claim ‘the functional
determinant determines the value of elementName’ makes sense within the scenario
being modelled.
2 Consider whether or not this repeating group of child elements may lead to
unnecessary data duplication. If so, create a new child element of the root element
consisting of the elements in the functional determinant and all its dependent elements.
(a) Note that the functional determinant is a candidate key for this element.
(b) If the dependent elements include the elements of a pre-existing foreign key,
then include the comment documenting the foreign key constraint on the new
element.
3 Remove the dependent elements from the original element, but leave the elements of
the functional determinant behind.
(a) If the dependent elements include the elements of a foreign key, then remove the
comment documenting the foreign key constraint from the original element.
(b) Add a new foreign key constraint on the original element to indicate that the
functional determinant references the candidate key of the new element.
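The three steps above can be sketched on instance data with Python's standard ElementTree module. This is only an illustration under assumptions: the teachingSession, moduleCode and moduleTitle elements, the new module element and the split_out function are hypothetical names, and the sketch reshapes an instance document rather than a RELAX NG schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document: moduleCode is taken to be the functional
# determinant and moduleTitle its dependent element (illustrative names).
DOC = """
<webTimetable>
  <teachingSession><day>Mon</day><moduleCode>CS101</moduleCode><moduleTitle>Databases</moduleTitle></teachingSession>
  <teachingSession><day>Tue</day><moduleCode>CS101</moduleCode><moduleTitle>Databases</moduleTitle></teachingSession>
  <teachingSession><day>Wed</day><moduleCode>CS102</moduleCode><moduleTitle>Web Design</moduleTitle></teachingSession>
</webTimetable>
"""


def split_out(root, parent_tag, determinant, dependents, new_tag):
    """Move dependent elements out of each parent into new children of the root."""
    seen = {}
    for parent in root.findall(parent_tag):
        key = parent.findtext(determinant)
        values = {}
        for name in dependents:
            child = parent.find(name)
            values[name] = child.text if child is not None else None
            if child is not None:
                parent.remove(child)  # step 3: the determinant stays behind
        if key in seen:
            # Step 1 check: the determinant must really determine the values.
            if seen[key] != values:
                raise ValueError(f"{determinant}={key} does not determine {dependents}")
            continue
        seen[key] = values
        new = ET.SubElement(root, new_tag)  # step 2: new child of the root element
        ET.SubElement(new, determinant).text = key
        for name, text in values.items():
            ET.SubElement(new, name).text = text
    return root


root = split_out(ET.fromstring(DOC), "teachingSession", "moduleCode",
                 ["moduleTitle"], "module")
print(ET.tostring(root, encoding="unicode"))
```

After the call, each module title is stored once, and the moduleCode left in every teachingSession plays the part of the new foreign key.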
Chapter summary
■ This chapter has completed the discussion of the core XML features by discussing
XML namespaces. Namespaces are widely used in XML documents, including both
schema and instance documents.
■ The chapter has introduced RELAX NG as a schema language for XML and XPath
as a simple XML query language. Both languages are largely stable and widely
implemented. XPath is also used in more sophisticated XML query languages.
■ The chapter has concluded with a discussion of XML database design. Conceptual
design is unchanged and logical design is very similar to logical design for the
relational data model. Because of the lack of standard XML database technologies,
physical design is tied closely to the particular XML DBMS being used. Even so, the
guidance given in Chapter 10 should prove useful, provided the documentation for
the XML DBMS is read carefully first.
Further reading
The early part of Steegmans et al. (2004) provides a more recent version of this
material.
Møller and Schwartzbach (2006) is probably the best general introduction to XML
and its related technologies. They cover XML for both document and database
applications.
Harold and Means (2004) give an overview of the various XML technologies, with
emphasis on those developed by the W3C. The W3C website (www.w3.org) is the official
source for XML and those related technologies developed by the W3C.
Eisenberg and Melton (2002b, 2005) provide an introduction to XQuery, while some
of their other papers (2001, 2002a, 2004) cover SQL/XML.
The approach to logical design for XML is adapted directly from the traditional
approach to relational database design. Provost (2002) gives a more detailed account of
the value of normalization for designing XML schema documents.
Review questions
12.2 What is the purpose of a namespace declaration? How do you declare a default
namespace?
12.3 Explain the difference between an instance document and a schema document.
When is an XML document both a schema document and an instance document?
12.4 What are the four basic patterns of RELAX NG? Explain the purpose of each of
them.
12.7 What is the purpose of the RELAX NG param pattern? When and where can it be
used?
12.8 Briefly describe how to implement relational-style foreign keys and foreign key con-
straints using the RELAX NG DTD compatibility data type library.
12.10 Which structures from a conceptual ER diagram should not be included in the
logical ER diagram for the XDM?
Exercises
12.12 For each of the following descriptions, write down a RELAX NG pattern. If no
content is specified for an element, assume that it contains text. Assume that elements
appear in the order stated unless an unordered content model is specified.
(a) The book element has a contents element, optional foreword, preface and
acknowledgements elements, plus one or more chapter elements.
(b) A chapter element has an objectives element, introduction element,
outline element, one or more section elements, summary element and
furtherReading element. The objectives element, outline element,
summary element and furtherReading element all have a heading element
and text content.
(c) The section element may consist solely of text content, have some text
followed by one or more subsection elements or consist entirely of
12.13 Write down instances of the elements defined by the following RELAX NG patterns,
showing the different ways the content can interleave.
(a)
<element name="session">
  <interleave>
    <group>
      <element name="term"> <text /> </element>
      <element name="dayOfWeek"> <text /> </element>
    </group>
    <group>
      <element name="startTime"> <text /> </element>
      <element name="endTime"> <text /> </element>
    </group>
  </interleave>
</element>
(b)
<element name="session">
  <interleave>
    <group>
      <element name="term"> <text /> </element>
      <element name="dayOfWeek"> <text /> </element>
    </group>
    <group>
      <element name="startTime"> <text /> </element>
      <element name="endTime"> <text /> </element>
    </group>
    <text />
  </interleave>
</element>
(c)
<element name="session">
  <interleave>
    <group>
      <element name="term"> <text /> </element>
      <element name="dayOfWeek"> <text /> </element>
    </group>
    <element name="time">
      <element name="startTime"> <text /> </element>
      <element name="endTime"> <text /> </element>
    </element>
    <text />
  </interleave>
</element>
(d)
<element name="session">
  <interleave>
    <group>
      <element name="term"> <text /> </element>
      <element name="dayOfWeek"> <text /> </element>
    </group>
    <interleave>
      <element name="time">
        <element name="startTime"> <text /> </element>
        <element name="endTime"> <text /> </element>
      </element>
      <text />
    </interleave>
    <text />
  </interleave>
</element>
12.14 The student element contains each student’s ID number, first and last names,
e-mail address and may also contain a phone number. For each of the following
additional restrictions, write down a RELAX NG pattern for the student element.
(a) The content appears in the order specified.
(b) A student may have several first names and the content appears in the order
specified.
(c) A student may have up to two e-mail addresses (one personal, one university)
and the content appears in the order specified. (Hint: You’ll need to define a
type attribute for the email element, the content of which is an enumeration).
(d) The first name must come before the last name, but otherwise no specific order-
ing is required.
(e) The first and last names must appear in this order with no intervening content,
but, otherwise no specific ordering is required.
12.15 Figure 12.4 shows the logical ER diagram for the Web Timetable application,
targeting the XDM.
(a) Using RELAX NG, complete the logical design for the Web Timetable
application.
(b) Assuming that all the various elements are enclosed in a single
webTimetable element, write three or more XPath expressions to locate data
in your XML database.
12.16 Figure E.1 is a conceptual ER diagram for the Pennine University’s Accommodation
Office.
(a) Draw the corresponding logical ER diagram for the XDM.
(b) Using RELAX NG, write down a logical design for this database.
(c) What are the significant differences between this logical database design and
the logical database design for the relational data model developed in Exercise
9.13?
Investigations
12.17 Evaluate RELAX NG (OASIS, 2001) against one of the other XML schema lan-
guages. Possible choices include the W3C XML Schema language (W3C, 2004d,
2004e), DSD2 (Møller, 2002) or Schematron (ISO/IEC JTC 1, 2004).
Figure E.1 A conceptual ER diagram for the Pennine University’s Accommodation Office.
«persistent» Staff
  staffNo : Varchar {PK}
  fName : Varchar
  lName : Varchar
  phone [0..1] : Varchar
  email : Varchar {CK=1}
  photo [0..1] : Blob
  title : Varchar {Value one of Mr, Ms, Miss, Mrs, Dr, Prof}
«persistent» Student
  studentNo : Char(10) {PK}
  fName [1..*] : Varchar
  lName : Varchar
  email [1..2] : Varchar {CK=1}
  phone [0..1] : Varchar
«persistent» Hall
  name : Varchar {PK}
  /capacity : Integer {Total number of rooms}
  address : Varchar
  phone [2..3] : Varchar
  postcode : Varchar
  selfCatering : Boolean
  distanceFromCampus : Integer
«persistent» Room
  accomId : Varchar {PK}
  roomNo : Varchar
  ensuite : Boolean
  floor : Integer {Floor of the building the room is on: 0 for ground floor, 1 for
  first floor, etc.}
Associations: RaisesProblem (dateRaised : Date, problem : Varchar); Manages (role
Bursar); Rents (start : Date, end : Date, rent : Money); Has
Multiplicity labels, in the order they appear on the diagram: 0..*, 0..*, 1..1, 0..*,
0..*, 0..*, 0..*, 1..*, 1..1
12.18 Investigate one of the more powerful query languages for XML. Choices include
XQuery (Eisenberg and Melton, 2002b, 2005; W3C, 2005c, 2005e) or SQL/XML
(Eisenberg and Melton, 2001, 2002a, 2004).
12.19 In Investigation 8.19, you developed a conceptual database design for an
organization you know. Use the ER diagram to create a logical database design targeting the
XDM, including a logical ER diagram and a RELAX NG schema document. How
does it differ from the logical design targeting the relational data model you
developed in Investigation 9.19?
12.20 Investigate the support for XML and its associated technologies offered by
commercial DBMS. You should consider both native XML DBMS (DBMS that have as
their underlying data model an XML data model such as the XDM) and the
relational DBMS, many of which include support for storing XML documents.
Appendix
A.1 Requirements
The Staff Directory is a simple web database application, though not as simple as most
staff directories. In addition to the usual requirements to deliver staff contact details
via a web browser, it must also provide details of support sessions offered to students
by academics. The idea for the application was first documented in an e-mail from
the Dean of the School of Computing to the head of Library and Computing Services
(Figure A.1).
Additional input at the requirements stage came from the School of Computing’s
Staff List application and the printed staff directory (Figures A.2 (a) and (b)). It was
quickly decided to develop a web database application in-house as the Library and
Computing Services had the necessary staff and other resources available. All the
Schools and Services staff agreed that the Staff Directory would use a database and it
would be the definitive source for staff contact details. Accuracy was considered crucial,
so a separate administrator’s interface would be developed to allow administrative staff
to add, modify and remove staff details from the database. An option in this
administrator’s interface would allow users to generate the printed directory.
Both these requirements were put outside the scope of the first phase of the project,
which focused on retrieving staff contact details from a database and delivering them
to end users via a web browser. Consequently, in the first phase of the project, there is
no requirement to include room numbers on the Staff Directory database.
The requirement to include details of academics’ support sessions caused some
difficulties. As the times of support sessions might change each term to fit into the
academics’ new timetables, it was felt that academics themselves should maintain this
data. However, doing so raised a security problem – how to ensure that academics’
support sessions were not changed by anyone but them. The solution adopted was
documented in an e-mail from the head of Library and Computing Services (Figure A.3).
The requirements were documented as use cases. The use case diagram (Figure A.4)
provides an overview of who uses the system and how they expect to use it. The use
case specifications (Figure A.5) give further details of the expected behaviour. Together,
they form the formal definition of users’ requirements for the Staff Directory web
database application.
Dilip
It was interesting to chat after yesterday’s meeting. You are right that we
need a university-wide, searchable staff directory on the intranet. I promised
to recount our recent experience of developing a similar web application, so
here are the salient details.
Two years ago, our web administrator (Frank Rose) wrote an application to
generate a series of web pages listing contact details for staff, filtered by job
role – admin, academic, technical and research. All the data were kept in a
single text file, accessed using a PHP script (ask Frank!) This is proving
difficult to manage. Originally, Frank made the data file accessible to all
members of staff, so they could amend their own details. This caused
problems, with several instances of data loss or corruption. On one memorable
occasion, Frank had just saved the details of seven new researchers, only to
lose them all when I overwrote the Staff List data file with the copy I’d been
editing in Notepad! Only Frank can access the data file now, but he can’t
always make immediate changes when staff move offices or new staff arrive.
The application is also a little inflexible – in particular, new staff usually want
to search by name, not role. Finally, the academics want it to show the details
of their support sessions. These are unbooked ‘drop-in’ sessions, open to any
student (all schools run these, not just computing). As all academics spend at
least two hours each week sitting in their offices waiting for students to drop
in, they want the times widely publicised. A staff directory seems the sensible
place for this.
I hope you find these brief notes useful. If you do decide to sponsor a Staff
Directory, I’m sure Frank will be happy to share his experiences with your
development team.
Regards
Freya
[Figure A.2: SCHOOL OF COMPUTING staff list, with the job role headings
Administrative Support, Technical Support, Academic and Research]
Thanks for the memo setting out the options for securing access to the
‘Support Sessions’ function. The first two options seem to be too
complex – asking them to remember another password won’t be popular
and using OS authentication will mean Support Session data can’t be
changed by staff working from home over the Web (which was the
point of developing this application). The idea of using the staff number
as a password/PIN seems best, though I accept that it is a bit of a kludge.
In its favour, the staff number is printed on staff ID cards, so is easily
accessible to the member of staff concerned and hard for others to find
out. Anyone who loses a card will just have to wait to change the
support sessions – after all, this data isn’t business-critical. Also, support
staff can easily amend the details of staff on long-term sick leave. So,
my decision is that we go with the ‘staff number option’. Can you
update the estimate for completion?
Thanks
Dilip
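[Figure A.4: use case diagram for the Staff Directory. Labels recoverable from the
diagram: University member, Academic, Visitor and one «extend» relationship]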
This allows any user to get the contact details (name, phone and e-mail)
for a named member of staff at the university. The user provides the
staff member’s surname and first initial and requests his or her contact
details. A list showing the name and phone number of each member of
staff matching the search criteria is delivered, together with a count of the
number of matches from the Staff Directory. The user can see the full
contact details for any entry in the list, along with a photo of the member
of staff if one is available, by selecting that entry. If there are no matches,
a message stating this is displayed.
The full contact details for an academic provide access to a list of support
sessions offered by him or her.
The user will have accessed the full contact details for an academic
member of staff and will request a list of that person’s support
sessions. The day, start time and end time of all the support sessions
are displayed. If there are no support sessions for the chosen
academic, a message stating this is displayed.
Notice that the browse behaviour is omitted. This is because the different schools
could not agree on what browse behaviour the application should support.
Computing wanted to browse by job role (as for its current ‘staff list’), health alphabetically by
surname and music by instrument. It was agreed that each school would develop its
own browse pages, hosted on its local web server but drawing its data from the central
Staff Directory database. This page would interface with the Staff Directory web
database application to deliver full contact details for listed staff.
The use cases will include descriptions of the data requirements for the application, so
are the main source for the conceptual database design. This section suggests a way to
extract information on the entities, attributes and relationships by scanning the use
case specifications. It also suggests how to document these requirements in an
application glossary – an alternative to using a UML modelling tool. Note that, during
conceptual design, new requirements may be identified. If so, the use cases should be revised.
This is one consequence of recognizing that design is an iterative and incremental
process – earlier stages may need revisiting and their documentation updating in the
light of discoveries later on in the development lifecycle.
it is a thing that the database needs to store data about, then it is an entity. If it is data
about a thing, then it is an attribute. Note that most things named in the use case
specifications will not be entities or attributes. They might, for example, be users of
the application or outputs of it.
The nouns ‘name’, ‘surname’ and ‘initial’ are all identified as attributes and there is
clearly some overlap in meaning. For any given member of staff, ‘name’ refers to his or
her full name, so includes the surname, whereas ‘initial’ is the first letter of his or her
first forename.
Rather than have three separate attributes, it makes sense to store just enough data
to allow the application to derive the full name, surname and first initial. The simplest
approach is to have surname and forename attributes as the full name and first initial
can be derived from these. Derived attributes can be included in the model.
The full name is retrieved from the database in Get staff contact details, so it makes
sense to include the derived attribute /fullName in the model. The first initial is
gathered from users as part of the search criteria, but not stored in the database, so there is
no need to indicate this as an attribute.
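A short Python sketch may make the idea of derived attributes concrete. It is an illustration only: the Staff class and its full_name and first_initial properties are hypothetical names, and the example name is borrowed from the Dean's e-mail rather than from the sample data.

```python
from dataclasses import dataclass


@dataclass
class Staff:
    """Only the stored attributes are fields; the rest are derived on demand."""
    forename: str
    surname: str

    @property
    def full_name(self) -> str:
        # Corresponds to the derived attribute /fullName in the model.
        return f"{self.forename} {self.surname}"

    @property
    def first_initial(self) -> str:
        # Derived for matching search criteria; never stored.
        return self.forename[:1].upper()


frank = Staff(forename="Frank", surname="Rose")
print(frank.full_name, frank.first_initial)
```

Because neither derived value is stored, a change to the forename or surname can never leave the full name or initial out of date.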
There are two ways to document the process of identifying data entities. Figure A.6
shows one of the original use case specifications with comments for each noun and
noun phrase. Table A.1 shows the result of a similar analysis of all three use cases,
documented in a table. Note that only the first occurrence of the noun or noun phrase is
commented on. This means that more potential entities are identified in the first use
case than in later ones.
Once it’s decided which things mentioned in the use cases are entities and which
attributes, they are recorded in the application glossary. This may be managed within a
modelling tool or as a separate document, when a format like that in Tables A.2 and
A.3 can be used.
Appendix A: The Pennine University’s Staff Directory case study
Figure A.6 Documenting possible entities using comments within a word-processed document.
named [member of staff] at the [University]. The user provides the [staff
member]’s [surname and first initial] and requests their contact details. A [list]
showing the name and phone number of each member of staff matching the
[search criteria] is delivered, together with a [count of the number of matches]
from the [Staff Directory]. The user can see the full contact details for any [entry
on the list], along with a [photo] of the member of staff if one is available, by
selecting that entry. If there are no matches, a [message] stating this is
displayed.
The full contact details for an [academic] provide access to a list of [support
sessions] offered by them.
There is also an option to view a [“help” page], giving instructions on how to
use the [search facility].
Comments, in the order in which they appear beside the text:
Comment: Not entities - this is data held about members of staff: i.e. attributes
Comment: Entity - synonym for “staff member”
Comment: Not an entity - there is only one university, so it cannot be an entity.
Comment: Entity - the application holds contact details, photo and staff number.
Comment: Not entities - this is data held about members of staff: i.e. attributes
Comment: Not an entity - this is an output from the application, not persistent data.
Comment: Not an entity - synonym for “surname and first initial”
Comment: Not an entity - this data will be calculated for each search, so cannot be
stored in the database.
Comment: Not an entity - the staff directory is the application, not something the
application holds data about.
Comment: Not an entity - this is an output from the application, not persistent data.
Comment: Not an entity - this is data held about members of staff: i.e. an attribute
Comment: Not an entity - it may make sense to hold messages in the database, but
this is an implementation issue, not part of conceptual design.
Comment: Not an entity - an academic is a member of staff. Could model them as a
separate entity, but no need in this simple application.
Comment: Entity - data about support sessions is held in the database.
Comment: Not an entity - it may make sense to hold help information in the database,
but this is an implementation issue, not part of conceptual design.
Comment: Not an entity - the search facility is part of the application’s functionality.
Figure A.7 Documenting possible associations using comments within a word-processed document.
This allows any [user to get the contact details] (name, phone and email) for a
named member of staff at the University. The user provides the staff
member’s surname and first initial and requests their contact details. A list
showing the name and phone number of each member of staff matching the
search criteria is delivered, together with a count of the number of matches
from the Staff Directory. The user can see the full contact details for any entry
on the list, along with a photo of the member of staff if one is available, by
selecting that entry. If there are no matches, a message stating this is
displayed.
The full contact details for an academic provide access to a list of [support
sessions offered by them].
Comments, in the order in which they appear beside the text:
Comment: Not an association between entities. Although “contact details” is a
synonym for the Staff entity, “user” is not an entity.
Comment: An association - indicates that academics (who are all members of staff)
offer support sessions. Both Staff and SupportSession are entities.
A.2 Deriving the ER diagram from the use case specifications
Relationship: Offer
Synonyms: OfferedBy
Participating entities: Staff, SupportSession
Description: An association. Each occurrence of Offer indicates that a particular
support session is offered by a particular academic.
Table A.6 Extract from Staff Directory application glossary: association multiplicity constraints.
The following code allows you to create a sample Staff Directory database for use with
the exercises.
-- *******************************************************
-- Create the tables using MySQL Syntax.
-- Use the DROP TABLE statement first, to ensure no name
-- conflicts arise. Useful if the first version fails
-- half-way through and the script must be run again.
--
-- NOTES
-- 1) The "IF EXISTS" syntax is MySQL specific.
-- *******************************************************
DROP TABLE IF EXISTS Staff;
CREATE TABLE Staff(
staffNo CHAR(5) NOT NULL ,
fName VARCHAR(50) NOT NULL ,
lName VARCHAR(20) NOT NULL ,
phone VARCHAR(20) ,
email VARCHAR(50) NOT NULL ,
photo BLOB ,
department VARCHAR(20) NOT NULL ,
title VARCHAR(4) ,
jobType VARCHAR(20) ,
jobTitle VARCHAR(30) ,
CONSTRAINT pkStaff PRIMARY KEY (staffNo)
);
DROP TABLE IF EXISTS SupportSession;
CREATE TABLE SupportSession(
staffNo CHAR(5) NOT NULL ,
dayOfWeek VARCHAR (9) NOT NULL ,
startTime TIME NOT NULL ,
endTime TIME NOT NULL ,
CONSTRAINT pkSupportSession PRIMARY KEY (staffNo,
dayOfWeek, startTime)
);
-- *******************************************************
-- Add the other constraints.
-- Adding foreign keys after creating all the tables means
-- we can create the tables in any order.
--
-- NOTE The default MySQL table type is MyISAM, which does
-- not enforce the foreign key constraint!
-- *******************************************************
ALTER TABLE SupportSession
ADD CONSTRAINT frnSupportSessionStaff
FOREIGN KEY (staffNo)
REFERENCES Staff (staffNo)
ON UPDATE CASCADE
ON DELETE CASCADE;
-- *******************************************************
-- Insert sample data.
-- *******************************************************
-- Staff data
-- *******************************************************
-- *******************************************************
-- SupportSession data
-- *******************************************************
INSERT INTO SupportSession
(staffNo, dayOfWeek, startTime, endTime)
VALUES
('56673', 'Friday', '15:30', '16:30');
INSERT INTO SupportSession
(staffNo, dayOfWeek, startTime, endTime)
VALUES
('56673', 'Thursday', '15:30', '16:30');
INSERT INTO SupportSession
(staffNo, dayOfWeek, startTime, endTime)
VALUES
('35054', 'Monday', '09:00', '10:00');
INSERT INTO SupportSession
(staffNo, dayOfWeek, startTime, endTime)
VALUES
('45965', 'Wednesday', '11:00', '12:00');
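The following Python sketch shows one way the Get staff contact details and View support sessions behaviour might be tried out against this schema. It rests on several assumptions: it uses SQLite (through Python's standard sqlite3 module) in place of MySQL, it keeps only some of the Staff columns, the find_staff and support_sessions functions are hypothetical, and the single Staff row contains placeholder values invented for the sketch. The two SupportSession rows are taken from the listing above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE Staff(
  staffNo CHAR(5) NOT NULL,
  fName VARCHAR(50) NOT NULL,
  lName VARCHAR(20) NOT NULL,
  phone VARCHAR(20),
  email VARCHAR(50) NOT NULL,
  CONSTRAINT pkStaff PRIMARY KEY (staffNo)
);
CREATE TABLE SupportSession(
  staffNo CHAR(5) NOT NULL,
  dayOfWeek VARCHAR(9) NOT NULL,
  startTime TIME NOT NULL,
  endTime TIME NOT NULL,
  CONSTRAINT pkSupportSession PRIMARY KEY (staffNo, dayOfWeek, startTime),
  CONSTRAINT frnSupportSessionStaff FOREIGN KEY (staffNo)
    REFERENCES Staff (staffNo) ON UPDATE CASCADE ON DELETE CASCADE
);
""")

# Placeholder staff row: the names and e-mail address are invented for this sketch.
conn.execute(
    "INSERT INTO Staff (staffNo, fName, lName, phone, email) VALUES (?, ?, ?, ?, ?)",
    ("56673", "Placeholder", "Person", None, "placeholder@example.org"),
)
conn.executemany(
    "INSERT INTO SupportSession (staffNo, dayOfWeek, startTime, endTime) "
    "VALUES (?, ?, ?, ?)",
    [("56673", "Friday", "15:30", "16:30"), ("56673", "Thursday", "15:30", "16:30")],
)


def find_staff(conn, surname, initial):
    """Return the matching staff rows and the number of matches."""
    rows = conn.execute(
        "SELECT staffNo, fName, lName, phone FROM Staff "
        "WHERE lName = ? AND fName LIKE ? ORDER BY lName, fName",
        (surname, initial + "%"),
    ).fetchall()
    return rows, len(rows)


def support_sessions(conn, staff_no):
    """Return (day, start time, end time) for each of one academic's sessions."""
    return conn.execute(
        "SELECT dayOfWeek, startTime, endTime FROM SupportSession "
        "WHERE staffNo = ? ORDER BY dayOfWeek, startTime",
        (staff_no,),
    ).fetchall()


matches, count = find_staff(conn, "Person", "P")
print(count, matches)
print(support_sessions(conn, "56673"))
```

Deleting the Staff row afterwards also removes its support sessions, which is the effect the ON DELETE CASCADE clause is meant to have when the constraint is enforced.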
Appendix
B.1 Requirements
The Pennine University teaches a variety of courses at both undergraduate and
postgraduate levels. Students study several modules each term. Each week, a module will
have a lecture, tutorial, practical or seminar session. Typical arts and humanities
modules have one lecture and one tutorial each week, while typical science and
engineering modules have a lecture, tutorial and practical. Some modules have two lectures
each week – one for day students and one for evening students. Others have no
lectures, simply a long seminar session. This makes timetabling very tricky, especially as
more students than before are taking combined degrees, so are taking modules with
more than one school.
Although the existing timetabling application will continue to be used by the
timetable administrators, it has been decided to develop a new approach to allocating
students to tutorial groups and so on. Timetables for courses, modules and academic
staff will be created with the existing timetabling application, but students will not be
allocated to particular tutorial, practical or seminar groups. Instead, a new application
will allow students to choose these themselves. For example, where a student takes a
module with only one lecture, but a choice of four tutorial slots, he or she must attend
the lecture but may choose which of the four tutorial slots to go to.
The application must be Web based and accessible to students over the Internet,
allowing them to easily sort out their timetable before the start of term. The application
must also allow academic staff to book any additional equipment that they may need
for specific teaching sessions. The requirements were documented as use cases. The use
case diagram (Figure B.1) provides an overview of who uses the system and how they
expect to use it. The use case specifications below give further details of the expected
behaviour. Together, they form the formal definition of the users’ requirements for the
Web Timetable web database application.
These requirements are simplified versions for use with the exercises in Chapter 8.
The ER diagrams in Chapters 8 and 9 include many details not mentioned in these
use cases, such as attribute and entity constraints.
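[Figure B.1: Use case diagram for the Web Timetable application. Labels recovered from the figure: Web Timetable, Staff, Academic, View room usage.]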
View timetable
A member of the University may ask to view a timetable. Timetables are available for
courses, modules, rooms, academics and students. Valid identifiers for particular
timetables are, respectively, course code, module code, building and room number,
staff number and student ID number. Timetables should include the following informa-
tion under each heading.
Course timetable: course code and title, name of school delivering the course.
Module timetable: module code and title, name of module leader.
Room timetable: building and room number.
Academic timetable: staff member’s title and name.
Student timetable: student’s name and number and the course he or she is enrolled on.
Provided the user supplies a valid identifier for the requested timetable, the
teaching sessions for that timetable are delivered. For each teaching session, the
module code, module title, type of class, building, room and tutor are shown, as well as
the start and end time of the session and the teaching weeks for which it runs.
For practical, tutorial, seminar and lab sessions, the group ID is also included to dis-
tinguish between teaching sessions that deliver the same lesson, but to different
groups of students.
For an academic timetable, users may optionally include details of that staff member’s
support sessions (see View support sessions).
For each of the requested dates, the application allocates a suitable item of equip-
ment and records both the date the booking was made and the date on which the
equipment is required. If no suitable item is available, then a message stating this is
displayed.
For each module, students choose which of the available teaching sessions they
wish to attend. Once a teaching session has only four places left, it is considered
unavailable. These last four places are allocated by the course leader and allow, for
example, students with special needs to be accommodated.
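The availability rule above can be sketched in a few lines of Python. This is only an illustration of the stated requirement: the function name, parameter names and the idea of tracking a session's capacity and current allocations are assumptions, not part of the use case.

```python
# Number of places held back for the course leader (taken from the use case).
RESERVED_PLACES = 4


def is_available(capacity: int, allocated: int) -> bool:
    """Return True while students may still choose this teaching session.

    `capacity` and `allocated` are hypothetical inputs: the total number of
    places in the session and the number already taken.
    """
    places_left = capacity - allocated
    # Once only four places remain, the session is considered unavailable.
    return places_left > RESERVED_PLACES


five_left = is_available(capacity=20, allocated=15)
four_left = is_available(capacity=20, allocated=16)
```

With five places left the session can still be chosen; with four left it cannot, and the remaining places are for the course leader to allocate.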
Glossary
absolute URL A URL that includes the full address of the resource. See relative URL
and URL.
actor A role that a user, human or otherwise, of the information system might take.
aggregation association In UML, an association indicating that one entity forms part
of another.
alternate key A candidate key of a table other than the primary key.
application software Computer software that people interact with, usually via a
graphical user interface (GUI), using windows, buttons, menus and so on.
application state Information about the current and previous state of an application.
assertion In SQL, an integrity constraint that applies to the database instance rather
than a table instance.
association A meaningful connection between, for example, an actor and a use case in
a use case diagram or between entities in an ER diagram.
attribute, derived In UML, an attribute the value of which can be derived (or calcu-
lated) from the values of the other attributes.
attribute, multivalued In UML, an attribute that can hold more than one value. Similar
to a non-scalar data type. See attribute multiplicity.
attribute compartment In UML, that part of the representation of an entity where the
attribute definitions are written.
attribute multiplicity In UML, indicates how many values an attribute may hold. See
attribute, multivalued.
attribute property list In UML, a list of constraints, tagged values and comments that
apply to the attribute.
authentication In database security, the way in which users prove their identity to the
DBMS. See identification and authorization.
axis In XPath, the first sequence of nodes reached in a location step. See also node test
and predicate.
base table A table instance stored on disk. See also table and derived table. A base table
may be a persistent base table or a temporary base table.
binary operator An operator that takes two input values. See operator and unary
operator.
block A contiguous strip of the hard disk (or other persistent storage medium) that
can be read in a single action.
bridge A device that connects two subnetworks, allowing devices in one subnetwork
to broadcast messages to the other subnetwork.
build association In logical web data design, a UML stereotyped association between a
client page and a server page indicating that an occurrence of the server page builds the
associated occurrence of the client page.
candidate key A set of columns from a table, the combination of the values of which
is unique and irreducible, meaning no subset of those column values is unique.
candidate key constraint An integrity constraint that enforces uniqueness and irre-
ducibility for a set of columns.
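As an informal illustration of uniqueness and irreducibility, the Python sketch below tests a set of columns against one table instance. Strictly, a candidate key is a statement about every valid instance of a table, so a check like this can only show that a set of columns is not a candidate key, or that the sample data is consistent with it being one. The table, column names and function names are invented for the example.

```python
from itertools import combinations


def is_unique(rows, columns):
    """True if no two rows share the same combination of values in `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def is_candidate_key(rows, columns):
    """True if `columns` is unique and irreducible for this table instance."""
    if not is_unique(rows, columns):
        return False
    # Irreducible: no proper, non-empty subset of the columns is itself unique.
    for size in range(1, len(columns)):
        for subset in combinations(columns, size):
            if is_unique(rows, subset):
                return False
    return True


# Hypothetical Room table instance.
rooms = [
    {"building": "A", "room": 1, "capacity": 30},
    {"building": "A", "room": 2, "capacity": 30},
    {"building": "B", "room": 1, "capacity": 40},
]

key_ok = is_candidate_key(rooms, ("building", "room"))
key_reducible = is_candidate_key(rooms, ("building", "room", "capacity"))
```

Here `("building", "room")` passes both tests, while adding `capacity` gives a set that is unique but reducible, so it is not a candidate key.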
child In a tree, a node immediately below some other node, called its parent.
client page In logical web data design, a UML stereotyped class representing a client-side
web page.
commit An instruction to the DBMS to ‘save’ all changes made to the database
instance. After a commit, the changes are made permanent. See also rollback.
comparison operator An operator that compares two values of a given data type and
returns a Boolean value. See operator.
composite candidate key A candidate key with two or more columns.
composition association A stronger form of aggregation association in which the part
cannot exist independently of the containing entity.
conceptual database design A model showing the structures required to organize the
data mentioned in the user requirements specification.
conceptual web data design Shows the links between the database entities and those
parts of the user interface that interact directly with them.
connection In SQL, a user’s link with the DBMS. See session.
connection trap A structure in an ER diagram such that a connection between two
entities is not adequately represented by existing associations.
cookie A set of data stored by the web browser at the request of the web server.
correlated subquery A subquery that refers to a column of the outer SQL statement.
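A small sketch may make the idea concrete. It uses Python's built-in sqlite3 module and an invented teaching_session table; the inner query refers to s.module_code, a column of the outer statement, so it is evaluated afresh for each outer row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE teaching_session (module_code TEXT, group_id TEXT, places INTEGER);
    INSERT INTO teaching_session VALUES
        ('DB101', 'T1', 20), ('DB101', 'T2', 30),
        ('WEB201', 'T1', 15), ('WEB201', 'T2', 15);
""")

# Sessions with more places than the average for their own module.
# The subquery is correlated: it mentions s.module_code from the outer query.
larger_than_average = conn.execute("""
    SELECT s.module_code, s.group_id
    FROM teaching_session AS s
    WHERE s.places > (SELECT AVG(s2.places)
                      FROM teaching_session AS s2
                      WHERE s2.module_code = s.module_code)
    ORDER BY s.module_code, s.group_id
""").fetchall()
```

Only DB101 group T2 is returned: its 30 places exceed the DB101 average of 25, whereas both WEB201 groups equal their module's average.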
cross join The simplest, and least useful, sort of join in SQL. Joins each row in one
table to every row in the other.
data-centric A part of the user interface that draws (part of) its content from, or allows
the user to modify, the database instance.
data communications manager (DCM) Software that works alongside the DBMS to
allow it to use network system software.
data definition language (DDL) The set of statements for creating and modifying the
database schema.
data dictionary Maintained by the DBMS, the data dictionary holds the description of
a database’s data structures and constraints. Also called the system catalogue.
data file A file on a hard disk or other persistent data storage medium, used to store
data.
data language A computer language designed to allow the manipulation of data and
data structures on a database.
data manipulation language (DML) The statements used to manipulate the database
instance.
data type A named set of data values, typically sharing common characteristics and
operations. See user-defined data type and system-defined data type.
data validation The process of correctly interpreting the data entered by the user and
generating an appropriate response.
database A persistent, self-describing, structured collection of related items of
data.
database client Application software designed to allow end users to interact with the
DBMS.
database design A description or model of the data requirements of a particular
organization.
database event An SQL DML statement causes a database event to occur. The database
events are insert, update and delete.
database instance The actual data held on a database.
database management system (DBMS) The software that manages all interaction
with the database.
database query A request to view selected data held on a database.
database schema The description of a database structure consisting of all the table
schemas.
datagram A chunk of a message being sent using the IP internetworking protocol.
DBMS See database management system.
DCM See data communications manager.
DDL See data definition language.
degree of a table The number of columns in the table.
default namespace In XML, if a default namespace is declared, then every unprefixed
element name belongs to that default namespace.
default value The value used (for an attribute, column, parameter and so on) when
end users do not specify a value.
delete statement An SQL statement that allows users to remove rows from a table
instance.
delete event Occurs when a row is removed from a table.
dependent column One for which the value is determined by the value of a functional
determinant.
derived table A table instance that is only ever stored in memory. See base table, table,
view.
descendant In a tree, a child, child of a child and so on of a particular node.
disjoint constraint On a generalization relationship, indicates whether or not the occur-
rence of the more general entity can, at the same time, be an occurrence of more
than one of its more specific entities.
distinct type In SQL, a user-defined data type created as a restriction of an existing
system-defined data type.
document In XML, one of the three structures used to organize data. See also element
and attribute.
document element In XML and HTML, the top-level element that encloses all the ele-
ments and character data held in the document.
document type declaration, in HTML Indicates which version of HTML is being used.
domain name system (DNS) Allows an Internet application to translate a domain name
into the corresponding IP address.
dynamic content The content of a dynamic web page that can change between differ-
ent instances of that document.
entity A set of things about which the information system must hold, or process, data.
entity, strong Its occurrences can be uniquely identified by its attribute values.
entity, weak Its occurrences cannot be uniquely identified by its attribute values.
ephemeral port number Any port number that is not a well-known port number. Value is
1024 or larger.
equi-join A join in which the join condition is based on an equality match, usually be-
tween the columns of a foreign key and its matching primary key.
expanded name In XML, the full name of an element or attribute consisting of its
namespace name and its local name.
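For illustration, Python's xml.etree.ElementTree module reports each element's tag in a form that pairs the namespace name with the local name, written as {namespace}local. That notation is the library's own convention, and the namespace URI and element names below are invented for the example.

```python
import xml.etree.ElementTree as ET

# 'tt' is only a prefix; the namespace name is the URI it is bound to.
doc = (
    '<tt:timetable xmlns:tt="http://example.org/timetable">'
    '<tt:session/>'
    '</tt:timetable>'
)
root = ET.fromstring(doc)

# ElementTree writes the namespace name and local name as '{uri}local'.
expanded = root.tag
namespace_name, local_name = expanded[1:].split("}")
```

The prefix tt does not appear in the result: the same element written with a different prefix, or with a default namespace, would yield the same pair.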
extend relationship Indicates that the behaviour of one use case (the extending use
case) may be added on to the behaviour of another (the extended use case).
Extensible Mark-up Language (XML) A language for defining the document model of
other mark-up languages.
external view Part of the ANSI/SPARC database systems architecture. Provides a tai-
lored view of a database for a group of end users with similar information needs.
field Within a record, a single item of data; the equivalent of a column or attribute
value.
file transfer protocol An Internet application that allows files to be transferred from
one computer to another across the Internet.
foreign key A column (or set of columns) in one table, the values of which match
those of the primary key on some other table. It acts as a link between the two tables.
foreign key constraint An integrity constraint that states, for every row of the referenc-
ing table, the value of the foreign key column(s) must match those in the primary
key column(s) in some row of the referenced table. Also called a referential integrity
constraint.
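The sketch below shows a foreign key constraint being enforced, using Python's built-in sqlite3 module. The table and column names are invented. Note that SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on for the connection; other DBMSs typically enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable enforcement

# module is the referenced table; teaching_session is the referencing table.
conn.execute("CREATE TABLE module (code TEXT PRIMARY KEY, title TEXT)")
conn.execute("""
    CREATE TABLE teaching_session (
        id INTEGER PRIMARY KEY,
        module_code TEXT NOT NULL REFERENCES module (code)
    )
""")
conn.execute("INSERT INTO module VALUES ('DB101', 'Databases')")

# This row is accepted: 'DB101' matches a primary key value in module.
conn.execute("INSERT INTO teaching_session (module_code) VALUES ('DB101')")

# This row is rejected: there is no module with code 'XX999'.
try:
    conn.execute("INSERT INTO teaching_session (module_code) VALUES ('XX999')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

session_count = conn.execute("SELECT COUNT(*) FROM teaching_session").fetchone()[0]
```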
form In logical web data design, a UML stereotyped class representing an HTML form; a
special kind of client page.
forward axis In XPath, an axis in which the nodes appear in document order. See also
reverse axis.
from clause In an SQL select statement, indicates which table(s) the columns named in
the select list come from.
functional determinant A set of columns that forms a candidate key for a subset of the
columns in the table. See dependent column, normalized table, repeating group.
generation clause Used with a surrogate key to instruct the DBMS to automatically
generate a unique value for this column in new rows.
grant statement Allows the owner of a database or database object to grant privileges
to use that object to other DBMS users.
group by clause Allows rows to be grouped by the value of one or more columns and
summary information calculated for each group. Rows with the same values in the
grouping columns are placed in the same group.
hash file File organization that places each record in a particular block on the hard disk
(or other persistent storage medium).
hash function A function that takes a character string and returns a string of characters,
typically of fixed length, that is very unlikely to be shared by any other input. In
security, often used to protect stored passwords. In a hash file, a hash function
determines which block a record is placed in.
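Both uses can be sketched with SHA-256 from Python's standard hashlib module. The choice of SHA-256, the block_for function and the modulo scheme for picking a block are assumptions made for the example; a real storage engine will use its own hash function and placement rule.

```python
import hashlib

# The digest has the same length whatever the size of the input.
digest_short = hashlib.sha256(b"a").hexdigest()
digest_long = hashlib.sha256(b"a" * 1000).hexdigest()


def block_for(key: str, number_of_blocks: int) -> int:
    """Pick the block of a (hypothetical) hash file that holds the record for `key`."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the digest as a large integer and reduce it to a block number.
    return int.from_bytes(digest, "big") % number_of_blocks


block = block_for("DB101", number_of_blocks=8)
```

Because the function is deterministic, looking up a record later with the same key leads straight to the same block.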
having clause In an SQL select statement indicates which groups have their summary
information shown. Only ever used in conjunction with a group by clause.
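The following sketch, using Python's built-in sqlite3 module and an invented booking table, shows a group by clause on its own and then combined with a having clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE booking (equipment TEXT, required_on TEXT);
    INSERT INTO booking VALUES
        ('projector', '2007-10-01'), ('projector', '2007-10-02'),
        ('laptop', '2007-10-01'), ('projector', '2007-10-03');
""")

# group by: one summary row per distinct value of equipment.
all_groups = conn.execute("""
    SELECT equipment, COUNT(*) FROM booking
    GROUP BY equipment
    ORDER BY equipment
""").fetchall()

# having: keep only the groups whose summary value passes the test.
busy_groups = conn.execute("""
    SELECT equipment, COUNT(*) FROM booking
    GROUP BY equipment
    HAVING COUNT(*) >= 2
    ORDER BY equipment
""").fetchall()
```

The where clause filters rows before they are grouped; the having clause filters the groups afterwards, which is why it can test COUNT(*).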
heap file A data file in which records are stored in no particular order; an unordered
heap of records.
hidden form control An HTML form control, specified by setting the input element’s type
attribute to ‘hidden’, which is not displayed in the web browser, but can be viewed in
the HTML source of the web page. Used as one technique for maintaining application
state in a web application by writing hidden form controls at runtime.
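As a rough sketch of writing a hidden form control at runtime, the Python function below builds the HTML for one control. The function name and the control names are invented; a server-side script in any language would do the same job. Values are escaped so that quotation marks in the data cannot break the attribute.

```python
from html import escape


def hidden_control(name: str, value: str) -> str:
    """Return the HTML for a hidden form control carrying one item of state."""
    return '<input type="hidden" name="{}" value="{}">'.format(
        escape(name, quote=True), escape(value, quote=True)
    )


plain = hidden_control("moduleCode", "DB101")
quoted = hidden_control("note", 'say "hi"')
```

Because the control is visible in the page source and can be altered by the user, the server should validate its value when the form is submitted.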
host number The part of the IP address identifying the particular network device.
HTTP authentication A method for authenticating users built into the HTTP protocol.
HTTP method Allows a web client to request specific services of a web server.
HTTP request message HTTP message sent from the web client to the web server.
HTTP response message HTTP message sent from the web server to the web client.
HTTP server Application software used to manage access to a website. Often referred to
as the web server.
hyperlink A link from one part of a document to another or between two documents.
Hypertext Mark-up Language (HTML) The specification of the document model for web
pages and the language used to write them.
hypertext transfer protocol (HTTP) The application protocol used by web clients and web
servers to communicate. Aims to facilitate the transfer of web resources.
identification In database security, the way in which users identify themselves to the
DBMS. See authentication and authorization.
identifier, regular In SQL, a valid name for a database object that is not case-sensitive.
identifier, delimited In SQL, a valid name for a database object that is case-sensitive.
include file A PHP script suitable for inclusion in other PHP scripts.
index Provides a way to quickly locate a record within a data file based on the value of
a field within that record.
index, clustering An index on the ordering field of an ordered file when that field does not
correspond to a candidate key of the table being stored.
index, primary An index on the ordering field of an ordered file when that field corre-
sponds to a candidate key of the table being stored.
index, secondary An index on any field of an ordered file, apart from the ordering field.
logical view Part of the ANSI/SPARC database systems architecture. Provides a com-
plete description of the data structures and data on the database.
logical web data design Adapts the conceptual web data design to reflect the particular
approach to implementing a user interface taken by web technology and the cho-
sen application architecture.
look-up entity An entity (usually in a logical ER diagram) that plays a role similar to that
of a multivalued attribute. Each occurrence of a look-up entity holds a list of values as-
sociated with a single occurrence of some other entity.
mark-up Special text within a document that describes its content, especially the struc-
ture of document content. Used in HTML and XML.
matching candidate key In the description of a foreign key constraint, this is the candi-
date key of the referenced table, the value of which matches the foreign key of the ref-
erencing table.
method note In UML, the way to document the specific behaviour represented by an
operation.
mixed content Refers to an XML element, the content of which is a mixture of other
elements and text.
multimedia document A document combining different media types, such as text, im-
ages, sound and video.
multiplicity constraint, outer A constraint on an association end that limits the number
of occurrences of the entity at this association end that may be linked, by occur-
rences of the association, to a given set of occurrences of entities from each of the
other ends of the association.
name compartment In UML, that part of the representation of an entity where the
entity name and stereotype are written.
name server Part of the domain name system (DNS), it maintains a database linking
domain names to IP addresses.
namespace name In XML, the globally unique identifier for a document model.
network number The part of the IP address that identifies the network the device is
connected to.
no value In XML, indicates that no value has been supplied for a particular property of
a node.
node-name The name of a node (such as element or attribute) in an XML document. See
also type name.
node test In XPath, the second sequence of nodes needed in a location step, after using
the axis. See also predicate.
non-functional requirement Any requirement not directly tied to what data or behav-
iour an information system should support.
non-scalar data type A data type that has values with user-accessible subparts.
normalized table A table description is said to be normalized if all repeating groups have
been removed into separate tables. More formally, a table is normalized if every
functional determinant is a candidate key.
null In the relational data model, null is used to indicate that there is no data to hold
in a column for a particular row.
numeric Numeric data is that used in arithmetic or which needs to be sorted into
numerical order.
numeric, exact A numeric data value that can be represented exactly as a standard
decimal literal – 0.2, 1756, 6.5 and so on.
object diagram Shows particular occurrences of entities and the relationship occur-
rences that link them. An instance of an ER diagram.
one-to-many In an ER diagram, a binary association with the upper bound of the multiplicity
constraint at one association end being 1, and at the other * (many).
operator Language construct that takes one or more values, manipulates them in
some way and outputs a single value.
operation In UML, operations document the behaviour of an entity. The specific be-
haviour is documented in a method note.
ordered file A data file in which records are stored in a particular order determined by
the ordering field.
ordering field The field used to order records within an ordered file.
outer join A join in which data for unmatched rows from one, or all, tables is included
in the result set. For such rows, the column values from the other table are set to null
in the result set. See also equi-join, inner join, join.
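A left outer join can be sketched with Python's built-in sqlite3 module. The tables are invented: one module has a teaching session and one does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE module (code TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE teaching_session (module_code TEXT, kind TEXT);
    INSERT INTO module VALUES ('DB101', 'Databases'), ('WEB201', 'Web Design');
    INSERT INTO teaching_session VALUES ('DB101', 'lecture');
""")

# Every module appears in the result, even if it has no matching session.
rows = conn.execute("""
    SELECT m.code, s.kind
    FROM module AS m
    LEFT OUTER JOIN teaching_session AS s ON s.module_code = m.code
    ORDER BY m.code
""").fetchall()
```

The unmatched module WEB201 is kept, with null in the column drawn from teaching_session; Python shows that null as None. An inner join would have dropped the row.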
owner In database terms, the user who created a particular database or database
object.
parent In a tree, a node immediately above some other node, called its child.
participation constraint Indicates whether or not every occurrence of the more general
entity is also an occurrence of one of the more specific entities.
path In a tree, a path is a sequence of nodes where each node is a child of the preceding
node.
physical web data design Adapts the logical web data design to use particular web
technologies.
persistent connection A DBMS connection that does not close when the server-side
script finishes.
PHP resource A reference to some external resource, that is used within a PHP script.
physical data independence The property of the ANSI/SPARC database systems archi-
tecture that means changes to the internal view need not affect the logical view.
physical database design Shows how to implement the logical database design using
particular software and hardware technologies.
port number A unique identifier for a program using TCP to send messages across an
internet.
precision For numeric data, the total number of digits that can be stored, including
those after the decimal point.
predicate In XPath, the third way to refine the sequence of nodes reached by a location
step. See also axis and node test.
primary key The candidate key chosen as the main identifier for a table.
privilege In SQL, the ability to instruct the DBMS to carry out some particular action.
privilege system In SQL, the system of privileges that can be assigned to a user.
property list In UML, a list of additional properties for an attribute.
protocol The rules governing communication between members of a group, such as
for networked computing devices.
qualified identifier In SQL, a column name prefixed with the name of the table
and, optionally, the database it belongs to, such as tableName.columnName or
databaseName.tableName.columnName. Used to distinguish between two
columns in different tables that have the same name.
qualified name In XML, the namespace prefix and the local name form the qualified name.
QUID matrix Records whether a database transaction queries a table, inserts a new row,
updates an existing row or deletes a row.
RDBMS A DBMS that is based on the relational data model.
realm See security realm.
record Within a data file, holds information on one thing. The equivalent of a row in
a table or an occurrence of an entity.
redundant association One where the connection it represents is adequately repre-
sented by other associations.
referenced table In the description of a foreign key constraint, this is the table with the
matching candidate key.
referential action Tells the RDBMS how to maintain referential integrity when a user
requests changes to candidate key values that would violate a foreign key constraint.
referential integrity constraint See foreign key constraint.
referencing table In the description of a foreign key constraint, this is the table with the
foreign key.
relation See table.
relationship A meaningful connection between entities that are said to participate in
the relationship.
relationship occurrence A particular instance of a relationship. It links together one
occurrence of each entity that participates in the relationship.
relative URL A URL defined relative to some base URL. See absolute URL and URL.
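The way a relative URL is resolved against a base URL can be tried out with urljoin from Python's standard urllib.parse module. The host and paths are invented for the example.

```python
from urllib.parse import urljoin

# The base URL is typically the URL of the page containing the link.
base = "http://www.example.org/timetable/index.php"

# A relative path is resolved against the directory of the base URL.
same_directory = urljoin(base, "rooms/view.php?room=12")

# A path starting with '/' is resolved against the root of the same server.
from_root = urljoin(base, "/staff/")
```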
reload problem When reloading a web page causes a previous action to be repeated – a
particular problem for web database applications that allow data modification.
repeating group A set of two or more columns such that the data in these columns
repeats across two or more rows. See functional determinant.
requirement Statement of what an information system must do. See also functional
requirement and non-functional requirement.
resolving the many-to-many association A process carried out during logical database
design that replaces a single many-to-many association with a new entity (called a link
entity) and two new one-to-many associations.
resource In PHP, a data type for variables used to hold a reference to an external
resource being used by a script, such as a DBMS connection.
result set The set (or multiset) of rows returned by a DBMS in response to a database
query.
reverse axis In XPath, an axis in which the nodes appear in reverse document order. See
also forward axis.
role In SQL, this represents a group of users who share the same privileges.
rollback An instruction to the DBMS to ‘undo’ all changes made to the database instance
since the last point at which the changes were made permanent. See also commit.
root In a tree, the top-level node. Every tree has a single root node. Sometimes called
the root node.
router A device that interconnects two or more networks and forwards messages from
one network to another.
Russian doll schema An XML schema document in which the content model for each
element appears inside its parent.
safe method An HTTP method that will not result in any changes on the web server
other than the HTTP server logging the request.
scalar data type A data type that has complete values – they have no subparts that are
directly accessible to end users.
scale For numeric data, the number of digits after the decimal point.
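Precision and scale, as defined in this glossary, can be read off a decimal literal. The sketch below uses Python's standard decimal module; the function name is invented and the input is assumed to be a finite decimal literal.

```python
from decimal import Decimal


def precision_and_scale(literal: str):
    """Return (precision, scale) for a finite decimal literal such as '123.45'."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    if exponent >= 0:
        # No digits after the decimal point; trailing zeros still count.
        return len(digits) + exponent, 0
    scale = -exponent  # digits after the decimal point
    # Total digits stored, including those after the decimal point.
    return max(len(digits), scale), scale


money = precision_and_scale("123.45")
whole = precision_and_scale("1756")
short = precision_and_scale("6.5")
```

So the value 123.45 needs a precision of 5 and a scale of 2, which in SQL would fit a column declared as NUMERIC(5, 2).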
schema valid In XML, indicates that a particular XML document instance meets all the
structural requirements set out in the document model’s XML schema.
schema document In XML, a document that holds the description of the document
structures allowed by a particular document model. Used by a validating XML parser to
check that a particular instance document conforms to the document model.
scheme In a URL, the scheme identifies which Internet application the URL relates to.
scope The scope of a variable is that portion of a program within which the variable
can be used.
security realm In HTTP authentication, a set of web resources that can only be accessed
by authorized users.
select list In an SQL select statement, the list of columns the values of which should be
retrieved for a database query.
select statement The SQL statement that allows users to query a database.
self join A select statement that joins two copies of the same table.
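A self join is easiest to see with a table that refers to itself. In the sketch below, which uses Python's built-in sqlite3 module, the invented staff table records each person's line manager, who is also a member of staff. The two aliases, s and m, stand for the two copies of the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (staff_no INTEGER PRIMARY KEY, name TEXT, manager_no INTEGER);
    INSERT INTO staff VALUES
        (1, 'Ayesha', NULL), (2, 'Ben', 1), (3, 'Carla', 1);
""")

# s is the copy read as 'member of staff'; m is the copy read as 'manager'.
rows = conn.execute("""
    SELECT s.name, m.name
    FROM staff AS s
    JOIN staff AS m ON s.manager_no = m.staff_no
    ORDER BY s.name
""").fetchall()
```

Ayesha has no manager, so she appears only in the manager column of the result.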
server page In web data design, a UML stereotyped class representing a server-side
script.
session (SQL) In SQL, each user’s connection to the DBMS has an associated session.
Users can set certain default properties for their sessions, such as language, charac-
ter set, date format and so on.
session ID In web applications, a unique identifier used by the web browser and web
server to determine which session a user is associated with.
sibling In a tree, a node’s siblings are the other child nodes of its parent.
socket For TCP, the combination of IP address and port number that uniquely identi-
fies the program using TCP to send messages.
SQL schema Term used in the SQL standard to refer to a collection of tables of related
data – that is, an SQL schema is simply a database.
SQL statement A statement written in the SQL language, used to instruct a DBMS.
Standard Generalized Mark-up Language (SGML) A language for defining the docu-
ment model of other mark-up languages.
state constraint A statement about the information held by the information system
that must always be true.
stateful application An application that maintains a record of the current and past state
of communications between client and server.
static content The content of a dynamic web page that is the same for all instances of
that document.
storage engine A combination of a particular kind of data file (hash, heap, ordered) and
additional structures such as index files.
stored function A piece of application code stored on the database and executed as
part of an SQL expression. Stored functions return a value to the calling program. See
also stored procedure and trigger.
stored procedure A piece of application code stored on the database and executed on
the instructions of the database client. Stored procedures do not return a value to the
calling program – instead, they use parameters to pass values between the stored
procedure and the calling program. See also stored function and trigger.
strongly typed A computer language that has operators that will only work for values of
the correct data type.
structured type In SQL, a user-defined data type with a complex internal structure.
submit association In logical web data design, a UML stereotyped association between a
form and a server page indicating that an occurrence of the form submits data it gath-
ers to the associated occurrence of the server page for processing.
surrogate key A single numeric column that has values set by the DBMS itself, which
act as the primary key of a table. The values of a surrogate key have no business signif-
icance.
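As one possible illustration, SQLite generates surrogate key values for a column declared INTEGER PRIMARY KEY AUTOINCREMENT. The syntax differs between DBMSs, and the booking table here is invented. The sketch uses Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE booking (
        booking_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- value set by the DBMS
        equipment TEXT NOT NULL
    )
""")

# No value is supplied for booking_id; the DBMS generates one for each row.
first_id = conn.execute(
    "INSERT INTO booking (equipment) VALUES ('projector')"
).lastrowid
second_id = conn.execute(
    "INSERT INTO booking (equipment) VALUES ('laptop')"
).lastrowid
```

The generated numbers identify the rows but say nothing about the bookings themselves, which is what is meant by having no business significance.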
system boundary Separates what is part of the information system from what is not.
system-defined data type A data type defined by the data model, or database software,
usually as part of the data language.
system software Computer software that runs the computer hardware and forms the
infrastructure for developing application software.
table heading The set of all column descriptions for a table. Sometimes called a table
schema.
tag In a mark-up language, tags delimit (mark the start and end of) elements. A start tag
may contain attributes.
transaction A series of modifications to a database that must all succeed. If any one
fails, then all must be rolled back (see rollback).
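The all-or-nothing behaviour can be sketched with Python's built-in sqlite3 module, where using the connection in a with statement commits if the block succeeds and rolls back if an exception escapes it. The allocation table and the allocate function are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE allocation (student TEXT, session TEXT, UNIQUE (student, session))"
)
conn.commit()


def allocate(conn, student, sessions):
    """Allocate a student to every session in the list, or to none of them."""
    try:
        with conn:  # commit on success, rollback if an exception escapes
            for session in sessions:
                conn.execute(
                    "INSERT INTO allocation VALUES (?, ?)", (student, session)
                )
        return True
    except sqlite3.IntegrityError:
        return False


succeeded = allocate(conn, "s001", ["DB101-T1", "WEB201-T2"])
# The second insert repeats the first, so it fails and the first is rolled back too.
failed = allocate(conn, "s002", ["DB101-T1", "DB101-T1"])

total = conn.execute("SELECT COUNT(*) FROM allocation").fetchone()[0]
s002_rows = conn.execute(
    "SELECT COUNT(*) FROM allocation WHERE student = 's002'"
).fetchone()[0]
```

After the failed call no rows exist for student s002, even though the first of its two inserts had succeeded before the error.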
transaction path analysis A technique for checking if the entities and associations in
the ER diagram can support a particular database query.
Transmission Control Protocol (TCP) A transport protocol ensuring that there is a reli-
able connection between networked devices.
trigger A piece of application code stored on the database and executed when some
specified database event occurs in a named table. See also stored function and stored
procedure.
trigger, row-level Fires once for each row affected by the database event, either before
or after the row is modified.
trigger, statement-level Fires once, either before or after the database event occurs.
trigger condition A Boolean expression that controls whether or not the trigger is
executed.
type casting Changing the data type of a value, for example from a numeric data type to
a character string.
unary operator An operator that takes one input value. See binary operator and operator.
Unified Modeling Language (UML) ‘A visual language for specifying, constructing and
documenting the artefacts of systems.’ (OMG, 2003, p. 20).
uniform resource locator (URL) Identifies an application server and allows a client to
specify exactly which resource it wants from that server.
unique constraint Ensures that, in every row of the table, the combination of column
values is unique or that one of the columns is null.
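The sketch below shows a unique constraint in SQLite, via Python's built-in sqlite3 module. The staff table is invented. SQLite allows any number of rows with null in a unique column, which matches the definition above, but some DBMSs treat nulls differently, so this behaviour should be checked for the DBMS in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (staff_no INTEGER PRIMARY KEY, email TEXT UNIQUE)")

conn.execute("INSERT INTO staff VALUES (1, 'a.khan@example.org')")
# Two rows with no e-mail address: both are accepted by SQLite.
conn.execute("INSERT INTO staff VALUES (2, NULL)")
conn.execute("INSERT INTO staff VALUES (3, NULL)")

# A repeated, non-null value violates the constraint.
try:
    conn.execute("INSERT INTO staff VALUES (4, 'a.khan@example.org')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_rows = conn.execute(
    "SELECT COUNT(*) FROM staff WHERE email IS NULL"
).fetchone()[0]
```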
unknown In XML, indicates that it is possible a value has been specified for a property
of an information item, but is currently inaccessible.
update statement An SQL statement that allows users to modify the column values in
specified rows of a table instance.
use case A description of one way in which the information system is used.
use case diagram UML diagram showing the various use cases, how they are connected
and which actors use them.
user-defined data type A data type defined for use as part of a particular database
design.
user experience design The design of the look and feel of a website and the naviga-
tion paths through it.
validation test The process of checking that data submitted by end users is valid.
validating XML parser An XML parser that also checks whether or not an XML document
satisfies the constraints specified in the appropriate schema document.
value The binary number stored on the database to represent a particular data item.
view In SQL, a view is a derived table and its definition is stored in the data dictionary.
See also external view, internal view and logical view.
web data design The process of reconciling the database design with the design of the
rest of the web database application. Split into conceptual, logical and physical
stages, as with the established approach to database design.
web database applications Application software that utilizes web and database
technologies.
web page A multimedia document, usually containing hyperlinks, that is available via
the Internet.
web server Collection of hardware and software used to manage a website. See HTTP
server.
well-known port number A standard port number used by specific programs, such as
web servers, e-mail servers and so on. Lies in the range 1 to 1023.
where clause Used in several SQL statements to indicate which rows of a table should
be affected by the statement.
wrapper function A programming approach that ‘wraps up’ SQL data modification
statements in a stored function.
XML declaration A statement at the start of a physical XML document indicating that
the document is XML. States the XML version and character set used and indicates
whether or not data held in other physical files is required to fully understand the
XML document.
XML document A physical, plain text file consisting of Unicode characters that represents
well-formed XML. Also called an instance document.
XML parser An XML processor that checks whether or not a particular document satisfies
the well-formedness constraints in the XML specification.
XML processor Application software that can read, and correctly interpret, an XML
document.
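Several of the entries above (trigger, row-level trigger, trigger condition, unique constraint, update statement and where clause) can be seen working together in a small example. The following is a minimal sketch only, written with Python's built-in sqlite3 module rather than any particular DBMS discussed in the book. The member and audit_log tables, their columns and the log_grade_change trigger are hypothetical names invented for this illustration. SQLite supports only row-level (FOR EACH ROW) triggers, so a statement-level trigger is not shown.

```python
import sqlite3


def run_demo():
    """Return (log_rows, unique_violation) from a small in-memory demo."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical tables for illustration. UNIQUE on email is a unique constraint.
    cur.execute(
        "CREATE TABLE member (id INTEGER PRIMARY KEY, email TEXT UNIQUE, grade INTEGER)"
    )
    cur.execute(
        "CREATE TABLE audit_log (member_id INTEGER, old_grade INTEGER, new_grade INTEGER)"
    )

    # Row-level trigger: fires once for each row modified by the update.
    # The WHEN clause plays the role of the trigger condition.
    cur.execute(
        """
        CREATE TRIGGER log_grade_change
        AFTER UPDATE OF grade ON member
        FOR EACH ROW
        WHEN OLD.grade <> NEW.grade
        BEGIN
            INSERT INTO audit_log VALUES (OLD.id, OLD.grade, NEW.grade);
        END
        """
    )

    cur.executemany(
        "INSERT INTO member (email, grade) VALUES (?, ?)",
        [("a@example.org", 1), ("b@example.org", 1), ("c@example.org", 3)],
    )

    # Update statement: the where clause restricts it to rows with grade = 1.
    cur.execute("UPDATE member SET grade = grade + 1 WHERE grade = 1")

    log_rows = cur.execute(
        "SELECT member_id, old_grade, new_grade FROM audit_log ORDER BY member_id"
    ).fetchall()

    # Inserting a duplicate email violates the unique constraint.
    try:
        cur.execute(
            "INSERT INTO member (email, grade) VALUES (?, ?)", ("a@example.org", 5)
        )
        unique_violation = False
    except sqlite3.IntegrityError:
        unique_violation = True

    conn.close()
    return log_rows, unique_violation


if __name__ == "__main__":
    print(run_demo())
```

The update modifies two rows, so the trigger adds two rows to audit_log, one per modified row. The third member is untouched because it does not satisfy the where clause.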
The Harvard style of referencing is used, except in two respects: where the full forename
of an author is known, it is included, rather than just their initials; and where there are
three or more authors, all are listed. A separate list of websites is also included.
ABRSM (no date) Guide to Music Medals: Entry and administration. Available at:
<www.musicmedals.org/?page=introduction/entryAdmin.html> [accessed 16 June
2007].
Atzeni, Paolo, Ceri, Stefano, Paraboschi, Stefano, and Torlone, Riccardo (1999) Database
Systems: Concepts, languages, architectures. Maidenhead, UK: McGraw-Hill.
Arvin, Troels (2006) Comparison of Different SQL Implementations. Available at:
http://troels.arvin.dk/db/rdbms [accessed 16 June 2007].
Avison, David, and Fitzgerald, Guy (2003) Information Systems Development:
Methodologies, techniques and tools, 3rd edn. Maidenhead, UK: McGraw-Hill.
Bennett, Simon, Skelton, John, and Lunn, Ken (2005) UML, Schaum’s Outlines, 2nd
edn. Maidenhead, UK: McGraw-Hill.
Berners-Lee, T. (1994) RFC1738 – Uniform Resource Locators. Available at:
www.ietf.org/rfc/rfc1738.txt [accessed 16 June 2007].
Bourret, Ron (2003) XML and Databases. Available at:
www.rpbourret.com/xml/XMLAndDatabases.htm [accessed 16 June 2007].
Bulger, Brad, Greenspan, Jay, and Wall, David (2004) MySQL/PHP Database Applications,
2nd edn. Indianapolis, Indiana: Wiley.
Castro, Elizabeth (2007) HTML, XHTML and CSS Visual Quickstart Guide, 6th edn.
Berkeley, California: Peachpit Press.
Ceri, Stefano, Fraternali, Piero, Bongio, Aldo, Brambilla, Marco, Comai, Sara, and
Matera, Maristella (2003) Designing Data-intensive Web Applications. San Francisco,
California: Morgan Kaufmann.
Conallen, Jim (2003) Building Web Applications with UML, 2nd edn. Harlow, UK:
Addison-Wesley.
Connolly, Thomas, and Begg, Carolyn (2004) Database Systems: A practical approach to
design, implementation and management, 4th edn. Harlow, UK: Addison-Wesley.
Date, C. J. (2004) An Introduction to Database Systems, 8th edn. Harlow, UK: Addison-
Wesley.
Duerst, M., and Suignard, M. (2005) Request for Comments: 3987 – Internationalized
Resource Identifiers (IRIs). Available at: www.ietf.org/rfc/rfc3987.txt [accessed 16 June
2007].
Eaglestone, Barry, and Ridley, Mick (2001) Web Database Systems. Maidenhead, UK:
McGraw-Hill.
Eisenberg, Andrew, and Melton, Jim (2001) ‘SQL/XML and the SQLX informal group
of companies’, Sigmod Record, 30(3). Available at:
www.sigmod.org/sigmod/record/issues/0109/standards.pdf [accessed 12 April 2006].
Eisenberg, Andrew, and Melton, Jim (2002a) ‘SQL/XML is making good progress’,
Sigmod Record, 31(2), pp. 101–8. Available at:
www.sigmod.org/sigmod/record/issues/0206/standard.pdf [accessed 12 April 2006].
Eisenberg, Andrew, and Melton, Jim (2002b) ‘An early look at XQuery’, Sigmod Record,
31(4). Available at: www.sigmod.org/sigmod/record/issues/0212/AndrewEJimM.pdf
[accessed 12 April 2006].
Eisenberg, Andrew, and Melton, Jim (2004) ‘Advancements in SQL/XML’, Sigmod Record,
33(3). Available at: www.sigmod.org/sigmod/record/issues/0409/11.JimMelton.pdf
[accessed 12 April 2006].
Eisenberg, Andrew, and Melton, Jim (2005) ‘XQuery 1.0 is nearing completion’, Sigmod
Record, 34(4), pp. 78–84. Available at:
www.sigmod.org/sigmod/record/issues/0512/p78-column-eisenberg-melton.pdf
[accessed 12 April 2006].
Elmasri, Ramez, and Navathe, Shamkant B. (2000a) Chapter 10: The network data model
and the IDMS system. Available at: ftp://ftp.aw.com/cseng/authors/elmasri/Dbase3e
[accessed 6 June 2007].
Elmasri, Ramez, and Navathe, Shamkant B. (2000b) Chapter 11: The hierarchical data
model and the IMS system. Available at:
ftp://ftp.aw.com/cseng/authors/elmasri/Dbase3e [accessed 6 June 2007].
Elmasri, Ramez, and Navathe, Shamkant B. (2007) Fundamentals of Database Systems,
5th edn. Harlow, UK: Addison-Wesley.
Fielding, R. (1995) RFC1808 – Relative Uniform Resource Locators. Available at:
www.w3.org/Addressing/rfc1808.txt [accessed 27 September 2005].
Génova, Gonzalo, Llorens, Juan, and Martínez, Paloma (2002) ‘The meaning of multi-
plicity of n-ary associations in UML’, Software and Systems Modeling, 1(2), pp. 86–97.
Gourley, David, and Totty, Brian (2002) HTTP: The definitive guide. Farnham, UK:
O’Reilly.
Graves, Mark (2002) Designing XML Databases. Upper Saddle River, New Jersey:
Prentice Hall.
Halpin, Terry A. (2001) Information Modeling and Relational Databases: From conceptual
analysis to logical design. San Francisco, California: Morgan Kaufmann.
Harold, Elliotte Rusty, and Means, W. Scott (2004) XML in a Nutshell: A desktop quick
reference, 3rd edn. Farnham, UK: O’Reilly.
Hoffer, J. A., Prescott, M. B., and McFadden, F. R. (2004) Modern Database Management,
6th edn. Harlow, UK: Prentice Hall.
Holt, Jon (2001) UML for Systems Engineering: Watching the wheels. London: The
Institution of Electrical Engineers.
Howe, David (2001) Data Analysis for Database Design, 3rd edn. Oxford:
Butterworth-Heinemann.
Ray, Erik T. (2003) Learning XML, 2nd edn. Farnham, UK: O’Reilly.
Rodriguez, Adolfo, Gatrell, John, Karas, John, and Peschke, Roland (2001) TCP/IP
Tutorial and Technical Overview, IBM Redbooks, 6th edn. New York: IBM.
Savage, Ron (2005) BNF Grammar for ISO/IEC 9075-2:2003 – Database Language SQL
(SQL-2003) SQL/Foundation. Available at:
http://savage.net.au/SQL/sql-2003-2.bnf.html [accessed 16 June 2007].
Silberschatz, Abraham, Korth, Henry F., and Sudarshan, S. (2002) Database System
Concepts, 4th edn. Maidenhead, UK: McGraw-Hill.
Skidmore, Steve, and Eva, Malcolm (2004) Introducing Systems Development. Houndmills,
Basingstoke, UK: Palgrave Macmillan.
Steegmans, Bart, Bourret, Ronald, Cline, Owen, Guyennet, Olivier, Kulkarni, Shrinivas,
Priestley, Stephen, Sylenko, Valeriy, and Wahli, Ueli (2004) XML for DB2 Information
Integration, IBM Redbooks. New York: IBM.
UKAIS (1999) The Definition of Information Systems. Available at:
www.cs.york.ac.uk/ukais/isdefn.pdf [accessed 16 June 2007].
Unicode Consortium (2005) Unicode-enabled Products. Available at:
www.unicode.org/onlinedat/products.html [accessed 16 June 2007].
van der Vlist, Eric (2003) RELAX NG. Available at:
http://books.xmlschemata.org/relaxng [accessed 16 June 2007].
Vaswani, Vikram (2004) PHP101: PHP for the absolute beginner [on line]. Available at:
http://devzone.zend.com/article627-PHP-101-PHP-for-the-Absolute-Beginner
[accessed 2 August 2007].
W3C (2001) XHTML™ 1.1 – Module-based XHTML Second Edition – W3C Working Draft,
16 February 2007. Available at: www.w3.org/TR/xhtml11 [accessed 16 June 2007].
W3C (2004a) Why Validate? Available at: http://validator.w3.org/docs/why.html
[accessed 16 June 2007].
W3C (2004b) Namespaces in XML 1.1 Second Edition – W3C Recommendation, 16 August
2006. Available at: www.w3.org/TR/xml-names11 [accessed 16 June 2007].
W3C (2004c) XML Information Set Second Edition – W3C Recommendation, 4 February
2004. Available at: www.w3.org/TR/2004/REC-xml-infoset-20040204 [accessed
16 June 2007].
W3C (2004d) XML Schema Part 1: Structures Second Edition – W3C Recommendation,
28 October 2004. Available at: www.w3.org/TR/xmlschema-1 [accessed 16 June 2007].
W3C (2004e) XML Schema Part 2: Datatypes Second Edition – W3C Recommendation,
28 October 2004. Available at: www.w3.org/TR/xmlschema-2 [accessed 16 April 2007].
W3C (2004f) Extensible Markup Language (XML) 1.0 Third Edition – W3C
Recommendation, 4 February 2004. Available at: www.w3.org/TR/2004/REC-xml-20040204
[accessed 16 April 2007].
W3C (2005a) Markup validation service. Available at: http://validator.w3.org [accessed
16 June 2007].
W3C (2005b) Tutorial: Character sets & encodings in XHTML, HTML and CSS. Available
at: www.w3.org/International/tutorials/tutorial-char-enc [accessed 16 June 2007].
W3C (2005c) XQuery 1.0: An XML Query Language – W3C Recommendation, 23 January
2007. Available at: www.w3.org/TR/xquery [accessed 16 June 2007].
W3C (2005d) XML Path Language (XPath) 2.0 – W3C Recommendation, 23 January 2007.
Available at: www.w3.org/TR/xpath20 [accessed 16 April 2007].
W3C (2005e) XQuery 1.0: An XML Query Language – W3C Candidate Recommendation,
3 November 2005. Available at: www.w3.org/TR/2005/CR-xquery-20051103 [accessed
16 June 2007].
W3C (2005f) XQuery 1.0 and XPath 2.0 Functions and Operators – W3C Recommendation,
23 January 2007. Available at: www.w3.org/TR/xpath-functions [accessed 16 June
2007].
W3C (2005g) XQuery 1.0 and XPath 2.0 Data Model (XDM) – W3C Recommendation,
23 January 2007. Available at: www.w3.org/TR/xpath-datamodel [accessed 16 June
2007].
W3C (2005h) XQuery 1.0 and XPath 2.0 Formal Semantics – W3C Recommendation,
23 January 2007. Available at: www.w3.org/TR/xquery-semantics [accessed 16 June
2007].
W3C (2005i) XSL Transformations (XSLT) Version 2.0 – W3C Recommendation,
23 January 2007. Available at: www.w3.org/TR/xslt20 [accessed 16 June 2007].
W3C (2006a) Extensible Stylesheet Language (XSL) Version 1.1 – W3C Candidate
Recommendation, 20 February 2006. Available at:
www.w3.org/TR/2006/CR-xsl11-20060220/#d0e155 [accessed 16 June 2007].
W3C (2006b) XQuery Update Facility – W3C Working Draft, 11 July 2006. Available at:
www.w3.org/TR/xqupdate [accessed 16 June 2007].
W3C (2006c) XML Schema Part 2: Datatypes Second Edition – W3C Recommendation,
28 October 2004. Available at: www.w3.org/TR/xmlschema-2 [accessed 16 June 2007].
Welling, Luke, and Thompson, Laura (2005) PHP and MySQL Web Development, 3rd
edn. Indianapolis, Indiana: Sams.
Wikipedia (2005) Wikipedia. Available at: http://en.wikipedia.org/wiki/Wikipedia
[accessed 16 June 2007].
Wilde, Erik (1999) Wilde’s WWW: Technical foundations of the World Wide Web. Berlin:
Springer.
Williams, Hugh E., and Lane, David (2004) Web Database Applications with PHP and
MySQL, 2nd edn. Farnham, UK: O’Reilly.
Williams, Rob (2001) Computer Systems Architecture: A networking approach. Harlow, UK:
Addison-Wesley.
Websites
Index
access path analysis 396, 400–2, 488–91, 498
actors 6
Adobe® Acrobat® Reader® 64
aggregate functions 252–5
aggregate queries 252–61, 495
aggregation associations 374–5, 620–1
aliases 192–3, 235, 243
alternate keys 95
ambiguous names 561
ancestor axis in XPath 604
ancestors of nodes 116
AND operator 199–200
ANSI/SPARC architecture 46–52, 105–7, 128, 507, 519
Apache 556
apostrophe problem 289–90
application glossaries 361
application partitioning 30
application programming interface (API) 53, 62–4
application servers 58, 61–2, 502
application software 2
application states, maintenance of 319–20
architecture
    of client-server systems 31–2
    of web database applications 71
arithmetic operators 84–5
arrays 82–3
assertions 104
Associated Board of the Royal Schools of Music 5
association multiplicity constraints 375–90
association qualifiers 539–43, 548–9
associations 368–75
    logical design for 440
    navigability and name direction of 536
    redundant 395, 399–400
    see also aggregation associations; build associations; composition
    associations; link associations; submit associations
atomic data types 82
atomization in XPath 609
attribute axis in XPath 604
attribute patterns in RELAX NG 572–3
attributes 66, 115–16, 119, 122, 348–61, 561–2, 611
    of associations 372
    constraints on 359, 366
    derived 359, 446, 475–81
    missing 502
    multivalued 357, 414–21, 431
    naming of 355
authentication of users 521
authorization identifiers 102, 555
authorization of users 521–2
auto increment facility 492
AVG function 254
axis in XPath 604–6

bags 82
base tables 105
binary associations 368–73, 390, 613, 621, 629
binary data, storage of 494
binary operators 85
blocks of data 507
boiler-plate text 18
Boolean data and expressions 81–7, 91–2, 195, 199–201, 296, 474
boundaries of information systems 6
Boyce-Codd normal form 455
browsers 14, 60, 65–6, 72
buffering 308
build associations 545–8