Data Management
Databases and Organizations
Richard T. Watson
Department of MIS
Terry College of Business
The University of Georgia
6th edition
Copyright © 2013 Richard T. Watson
The book’s Web site provides
• overhead slides in PowerPoint format;
• all relational tables in the book in electronic format;
• code for the R, Java, and XML examples in the book;
• answers to many of the exercises;
• additional exercises;
• revisions and errata.
New in the sixth edition
This edition has the following improvements and additions.
• Redesigned to be a digital edition in ePub format
• MySQL Workbench for data modeling and SQL querying
• Integration of XML and MySQL
• New chapters on R, data visualization, text mining, and the Hadoop distributed
file system and MapReduce
Acknowledgments
I thank my son, Ned, for help with the typesetting and my wife, Clare, for converting
the 5th edition to Pages format and redoing the figures to support this first electronic
edition. I would like to thank the reviewers of this and prior editions for their many
excellent suggestions and ideas for improving the quality of the content and
presentation of the book.
Richard T. Watson
Athens, Georgia
Section 1: The Managerial Perspective
People only see what they are prepared to see.
Ralph Waldo Emerson, Journals, 1863
Organizations are accumulating vast volumes of data because of the ease with which
data can be captured and distributed in digital format (e.g., smartphone cameras and
tweets). The world’s data are estimated to be doubling every 1-2 years, and large
companies now routinely manage petabytes (10^15 bytes) of data every day.
Consider a simple model of a human information processing system, which shows
that humans receive input, process it, and produce
output. The processing is done by a processor, which is linked to a memory that
stores both data and processes. The processor component retrieves both data and
processes from memory.
To understand this model, consider a person arriving on a flight who has agreed to
meet a fellow traveler in the terminal. She receives a text message to meet her friend
at B27. The message is input to her human information processing system. She
retrieves the process for interpreting the message (i.e., decoding that B27 means
terminal B and gate 27) and finds the terminal and gate. The person then walks to the
terminal and gate, the output. Sometimes these processes are so well ingrained in
our memory that we never think about retrieving them. We just do them
automatically.
Human information processing systems are easily overloaded. Our memory is
limited, and our ability to process data is restricted; thus we use a variety of external
tools to extend and augment our capacities. A contacts app is an instance of external
data memory. A recipe, a description of the process for preparing food, is an
example of external process memory. Smartphones are now the external processor
of choice that we use to augment our limited processing capacity.
The original model of human information processing can be extended to include
external memory, for storing data and processes, and external processors, for
executing processes.
Calendars come in many shapes and forms, but they are all based on the same
organizing principle. A set amount of space is allocated for each day of the year,
and the spaces are organized in date order, which supports rapid retrieval of any
date. Some calendars have added features to speed up access. For example,
electronic calendars usually have a button to select today’s date.
Address books also have a standard format. They typically contain predefined
spaces for storing address details (e.g., name, company, phone, and email). Access
is often supported by a set of buttons for each letter of the alphabet and a search
engine.
An address book
The structure of to-do lists tends to be fairly standard. They are often set up in list
format with a small left-hand margin. The idea is to enter each item to be done on
the right side of the screen. The left side is used to check (√) or mark those tasks that
have been completed. The beauty of the check method is that you can quickly scan
the left side to identify incomplete tasks.
A to-do or reminder list
Many people use some form of the individual memory systems just described. They
are frequently marketed as “time management systems” and are typically included in
the suite of standard applications for a smart phone.
These three examples of individual memory systems illustrate some features
common to all data management systems:
• There is a storage medium. Data are stored electronically in each case.
• There is a structure for storing data. For instance, the address book has
labeled spaces for entering pertinent data.
• The storage device is organized for rapid data entry and retrieval. A
calendar is stored in date and time sequence so that the data space for any
appointment for a particular day can be found quickly.
• The selection of a data management system frequently requires a trade-off
decision. In these examples, the trade-off is screen dimensions versus the
amount of data that can be seen without scrolling. For example, you will notice
the address book screen is truncated and will need to be scrolled to see full
address details.
Skill builder
Smart phones have dramatically changed individual data management in the last few
years. We now have calendars, address books, to-do lists, and many more apps
literally in our hands. What individual data is still difficult to manage? What might
be the characteristics of an app for that data?
There are differences between internal and external memories. Our internal
memory is small, fast, and convenient (our brain is always with us—well, most of
the time). External memory is often slower to reference and not always as
convenient. The two systems are interconnected. We rely on our internal memory to
access external memory. Our internal memory and our brain’s processing skills
manage the use of external memories. For example, we depend on our internal
memory to recall how to use our smartphone and its apps. Again, we see some
trade-offs. Ideally, we would like to store everything in our fast and convenient
internal memory, but its limited capacity means that we are forced to use external
memory for many items.
Organizational data management
Organizations, like people, need to remember many things. If you look around any
office, you will see examples of the apparatus of organizational memory: people,
filing cabinets, policy manuals, planning boards, and computers. The same
principles found in individual memory systems apply to an organization’s data
management systems.
There is a storage medium. In the case of computers, the storage medium varies.
Small files might be stored on a USB drive and large, archival files on a magnetic
disk. In Chapter 11, we discuss electronic storage media in more detail.
A table is a common structure for storing data. For example, if we want to store
details of customers, we can set up a table with each row containing individual
details of a customer and each column containing data on a particular feature (e.g.,
customer code).
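As a sketch, the customer table just described could be defined in SQL, the language used with MySQL later in this book; the table and column names here are illustrative, not taken from the text:

-- Hypothetical customer table; each row describes one customer
CREATE TABLE customer (
  custcode  CHAR(7),      -- customer code, one column per feature
  custname  VARCHAR(50),  -- customer name
  custcity  VARCHAR(30),  -- customer city
  PRIMARY KEY (custcode)
);
-- Each INSERT adds a row for a single customer
INSERT INTO customer VALUES ('C000001', 'Example Traders', 'Athens');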
Storage devices are organized for rapid data entry and retrieval. Time is the
manager’s enemy: too many things to be done in too little time. Customers expect
rapid responses to their questions and quick processing of their transactions. Rapid
data access is a key goal of nearly all data management systems, but it always comes
at a price. Fast access memories cost more, so there is nearly always a trade-off
between access speed and cost.
As you will see, selecting how and where to store organizational data frequently
involves a trade-off. Data managers need to know and understand what the
compromises entail. They must know the key questions to ask when evaluating
choices.
When we move from individual to organizational memory, some other factors
come into play. To understand these factors, we need to review the different types of
information systems. The automation of routine business transactions was the
earliest application of information technology to business. A transaction processing
system (TPS) handles common business tasks such as accounting, inventory,
purchasing, and sales. The realization that the data collected by these systems could
be sorted, summarized, and rearranged gave birth to the notion of a management
information system (MIS). Furthermore, it was recognized that when internal data
captured by a TPS is combined with appropriate external data, the raw material is
available for a decision support system (DSS) or executive information system
(EIS). Recently, online analytical processing (OLAP), data mining, and business
intelligence (BI) have emerged as advanced data analysis techniques for data
captured by business transactions and gathered from other sources (these systems
are covered in detail in Chapter 14). The purpose of each of these systems is
described in the following table and their interrelationship can be understood by
examining the information systems cycle.
Types of information systems
Type of information system System’s purpose
TPS Transaction processing system Collect and store data from routine transactions
MIS Management information system Convert data from a TPS into information for planning and managing the organization
DSS Decision support system Support managerial decision making by providing models for analyzing data
EIS Executive information system Provide senior management with the information necessary for planning and monitoring
OLAP Online analytical processing Present a multidimensional, logical view of data
Data mining Use statistical analysis and artificial intelligence techniques to identify relationships in data
BI Business intelligence Gather, store, and analyze data to improve decision making
The information systems cycle
The various systems and technologies found in an organization are linked in a
cycle. The routine ongoing business of the organization is processed by TPSs, the
systems that handle the present. Data collected by TPSs are stored in databases, a
record of the past, the history of the organization and its interaction with those with
whom it conducts business. These data are converted into information by analysts
using a variety of software (e.g., a DSS). These technologies are used by the
organization to prepare for the future (e.g., sales in Finland have expanded, so we
will build a new service center in Helsinki). The business systems created to prepare
for the future determine the transactions the company will process and the data that
will be collected, and the process continues. The entire cycle is driven by people
using technology (e.g., a customer booking a hotel room via a Web browser).
Decision making, or preparing for the future, is the central activity of modern
organizations. Today’s organizations are busy turning out goods, services, and
decisions. Knowledge and information workers, over half of the U.S. labor force,
produce the bulk of GNP. Many of these people are decision makers. Their success,
and their organization’s as well, depends on the quality of their decisions.
The information systems cycle
Industrial society is a producer of goods, and the hallmark of success is product
quality. Japanese manufacturers convincingly demonstrated that focusing on product
quality is the key to market leadership and profitability. The methods and the
philosophy of quality gurus, such as W. Edwards Deming, have been internationally
recognized and adopted by many providers of goods and services. We are now in
the information age as is evidenced by the hot products of the times, such as
portable music players, smart phones, tablets, video recorders, digital cameras, and
streaming video players. These are all information appliances, and they are
supported by a host of information services. For example, consider how Apple
connects its various devices and services through iTunes: an iPad owner can buy a
book through iTunes to read with the iBooks app.
In the information society, which is based on innovation, knowledge, and services,
the key determinant of success has shifted from product quality to decision quality.
In the turbulent environment of global business, successful organizations are those
able to quickly make high-quality decisions about what customers will buy, how
much they will pay, and how to deliver a high-quality experience with a minimum of
fuss. Companies are very dependent on information systems to create value for their
customers.
Attributes of data
Once we realize the critical importance of data to organizations, we can recognize
some desirable attributes of data.
Desirable attributes of data
Shareable Readily accessed by more than one person at a time
Transportable Easily moved to a decision maker
Secure Protected from destruction and unauthorized use
Accurate Reliable, precise records
Timely Current and up-to-date
Relevant Appropriate to the decision
Shareable
Organizations contain many decision makers. There are occasions when more than
one person will require access to the same data at the same time. For example, in a
large bank it would not be uncommon for two customer representatives
simultaneously to want data on the latest rate for a three-year certificate of deposit.
As data become more volatile, shareability becomes more of a problem. Consider a
restaurant. The permanent menu is printed, today’s special might be displayed on a
blackboard, and the waiter tells you what is no longer available.
Transportable
Data should be transportable from their storage location to the decision maker.
Technologies that transport data have a long history. Homing pigeons were used to
relay messages by the Egyptians and Persians 3,000 years ago. The telephone
revolutionized business and social life because it rapidly transmitted voice data. Fax
machines accelerated organizational correspondence because they transported both
text and visual data. Computers have changed the nature of many aspects of business
because they enable the transport of text, visuals, and voice data.
Today, transportability is more than just getting data to a decision maker’s desk. It
means getting product availability data to a salesperson in a client’s office or
advising a delivery driver, en route, of address details for an urgent parcel pickup.
The general notion is that decision makers should have access to relevant data
whenever and wherever required, although many organizations are still some way
from reaching this target.
Secure
In an information society, organizations value data as a resource. As you have
already learned, data support day-to-day business transactions and decision making.
Because the forgetful organization will soon be out of business, organizations are
very vigilant in protecting their data. There are a number of actions that
organizations take to protect data against loss, sabotage, and theft. A common
approach is to duplicate data and store the copy, or copies, at other locations. This
technique is popular for data stored in computer systems. Access to data is often
restricted through the use of physical barriers (e.g., a vault) or electronic barriers
(e.g., a password). Another approach, which is popular with firms that employ
knowledge workers, is a noncompete contract. For example, some software
companies legally restrain computer programmers from working for a competitor
for two years after they leave, hoping to prevent the transfer of valuable data, in the
form of the programmer’s knowledge of software, to competitors.
Accurate
You probably recall friends who excel in exams because of their good memories.
Similarly, organizations with an accurate memory will do better than their less
precise competitors. Organizations need to remember many details precisely. For
example, an airline needs accurate data to predict the demand for each of its flights.
The quality of decision making will drop dramatically if managers use a data
management system riddled with errors.
Polluted data threaten a firm’s profitability. One study suggests that missing,
wrong, and otherwise bad data cost U.S. firms billions of dollars annually. The
consequences of bad data include improper billing, cost overruns, delivery delays,
and product recalls. Because data accuracy is so critical, organizations need to be
watchful when capturing data—the point at which data accuracy is most vulnerable.
Timely
The value of a collection of data is often determined by its age. You can fantasize
about how rich you would be if you knew tomorrow’s stock prices. Although
decision makers are most interested in current data, the required currency of data
can vary with the task. Operational managers often want real-time data. They want to
tap the pulse of the production line so that they can react quickly to machine
breakdowns or quality slippages. In contrast, strategic planners might be content
with data that are months old because they are more concerned with detecting long-
term trends.
Relevant
Organizations must maintain data that are relevant to transaction processing and
decision making. In processing a credit card application, the most relevant data
might be the customer’s credit history, current employment status, and income level.
Hair color would be irrelevant. When assessing the success of a new product line, a
marketing manager probably wants an aggregate report of sales by marketing
region. A voluminous report detailing every sale would be irrelevant. Data are
relevant when they pertain directly to the decision and are aggregated appropriately.
Relevance is a key concern in designing a data management system. Clients have to
decide what should be stored because it is pertinent now or could have future
relevance. Of course, identifying data that might be relevant in the future is difficult,
and there is a tendency to accumulate too much. Relevance is also an important
consideration when extracting and processing data from a data management system.
Provided the pertinent data are available, query languages can be used to aggregate
data appropriately.
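For example, a query language such as SQL can produce the aggregate report a marketing manager wants; the following is a minimal sketch that assumes a hypothetical sales table with region and amount columns:

-- Aggregate sales to the level relevant to the decision
SELECT region, SUM(amount) AS total_sales
FROM   sales
GROUP  BY region;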
In the final years of the twentieth century, organizations started to share much of
their data, both high and low volatility, via the Web. This move has increased
shareability, timeliness, and availability, and it has lowered the cost of distributing
data.
In summary, a data management system for maintaining an organization’s memory
supports transaction processing, remembering the past, and decision making. Its
contents must be shareable, secure, and accurate. Ideally, the clients of a data
management system must be able to get timely and relevant data when and where
required. A major challenge for data management professionals is to create data
management systems that meet these criteria. Unfortunately, many existing systems
fail in this regard, though we can understand some of the reasons why by reviewing
the components of existing organizational memory systems.
Components of organizational memory
An organization’s memory resides on a variety of media in a variety of ways. It is in
people, standard operating procedures, roles, organizational culture, physical
storage equipment, and electronic devices. It is scattered around the organization
like pieces of a jigsaw puzzle designed by a berserk artist. The pieces don’t fit
together, they sometimes overlap, there are gaps, and there are no edge pieces to
define the boundaries. Organizations struggle to design structures and use data
management technology to link some of the pieces. To understand the complexity of
this wicked puzzle, we need to examine some of the pieces. Data managers have a
particular need to understand the different forms of organizational memory because
their activities often influence a number of the components.
Components of organizational memory
People
People are the linchpin of an organization’s memory. They recall prior decisions
and business actions. They create, maintain, evolve, and use data management
systems. They are the major component of an organization’s memory because they
know how to use all the other components. People extract data from the various
components of organizational memory to provide as complete a picture of a
situation as possible.
Each person in an organization has a role and a position in the hierarchy. Role and
position are both devices for remembering how the organization functions and how
to process data. By labeling people (e.g., Chief Information Officer) and placing
their names on an organizational chart, the organization creates another form of
organizational memory.
Organizational culture is the shared beliefs, values, attitudes, and norms that
influence the behavior and expectations of each person in an organization. As a
long-lived and stable memory system, culture determines acceptable behavior and
influences decision making.
People develop skills for doing their particular job—learning what to do, how to do
it, and who can help them get things done. For example, they might discover
someone in employee benefits who can handle personnel problems or a contact in a
software company who can answer questions promptly. These social networks,
which often take years to develop, are used to make things happen and to learn about
the business environment. Despite their high value, they are rarely documented, at
least not beyond an address book, and they are typically lost when a person leaves
an organization.
Conversations are an important method for knowledge workers to create, modify,
and share organizational memory and to build relationships and social networks.
Discussions with customers are a key device for learning how to improve an
organization’s products and services and learning about competitors. The
conversational company can detect change faster and react more rapidly. The
telephone, instant message, e-mail, coffee machine, cocktail hour, and cafeteria are
all devices for promoting conversation and creating networks. Some firms
deliberately create structures for supporting dialog to make the people component
of organizational memory more effective.
Standard operating procedures exist for many organizational tasks. Processing a
credit application, selecting a marketing trainee, and preparing a departmental
budget are typical procedures that are clearly defined by many organizations. They
are described on Web pages, computer programs, and job specifications. They are
the way an organization remembers how to perform routine activities.
Successful people learn how to use organizational memory. They learn what data
are stored where, how to retrieve them, and how to put them together. In promoting
a new product, a salesperson might send the prospect a package containing some
brochures and an email of a product review in a trade journal, and supply the phone
number and e-mail address of the firm’s technical expert for that product. People’s
recall of how to use organizational memory is the core component of
organizational memory. Academics call this metamemory; people in business call it
learning the ropes. New employees spend a great deal of time building their
metamemory so that they can use organizational memory effectively. Without this
knowledge, organizational memory has little value.
Tables
A table is a common form of storing organizational data. The following table
shows a price list in tabular form. Often, the first row defines the meaning of data in
subsequent rows.
A price list
Product Price
Pocket knife–Nile 4.50
Compass 10.00
Geopositioning system 500.00
Map measure 4.90
A table is a general form that describes a variety of other structures used to store
data. Computer-based files are tables or can be transformed into tables; the same is
true for general ledgers, worksheets, and spreadsheets. Accounting systems make
frequent use of tables. As you will discover in the next section, the table is the
central structure of the relational database model.
Data stored in tables typically have certain characteristics:
• Data in one column are of the same type. For example, each cell of the
column headed “Price” contains a number. (Of course, the exception is the first
row of each column, which contains the title of the column.)
• Data are limited by the width of the available space.
Rapid searching is one of the prime advantages of a table. For example, if the price
list is sorted by product name, you can quickly find the price of any product.
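For instance, a hedged SQL sketch of such a search, assuming the price list is stored in a table named pricelist (an illustrative name):

CREATE TABLE pricelist (
  product VARCHAR(40),
  price   DECIMAL(7,2),
  PRIMARY KEY (product)  -- organizing by product name supports rapid retrieval
);
-- Find the price of one product
SELECT price
FROM   pricelist
WHERE  product = 'Compass';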
Tables are a common form of storing organizational data because their structure is
readily understood. People learn to read and build tables in the early years of their
schooling. Also, a great deal of the data that organizations want to remember can be
stored in tabular form.
Documents
The document—of which reports, manuals, brochures, and memos are examples—
is a common medium for storing organizational data. Although documents may be
subdivided into chapters, sections, paragraphs, and sentences, they lack the
regularity and discipline of a table. Each row of a table has the same number of
columns, but each paragraph of a document does not have the same number of
sentences.
Most documents are now stored electronically. Because of the widespread use of
word processing, text files are a common means of storing documents. Typically,
such files are read sequentially like a book. Although there is support for limited
searching of the text, such as finding the next occurrence of a specified text string,
text files are usually processed linearly.
Hypertext, the familiar linking technology of the Web, supports nonlinear document
processing. A hypertext document has built-in linkages between sections of text that
permit the reader to jump quickly from one part to another. As a result, readers can
find data they require more rapidly.
Although hypertext is certainly more reader-friendly than a flat, sequential text file,
it takes time and expertise to establish the links between the various parts of the text.
Someone familiar with the topic has to decide what should be linked and then
establish these links. While it takes the author more time to prepare a document this
way, the payoff is the speed at which readers of the document can find what they
want.
Multimedia
Many Web sites display multimedia objects, such as sound and video clips.
Automotive company Web sites have video clips of cars, music outlets provide
sound clips of new releases, and clothing companies have online catalogs
displaying photos of their latest products. Maintaining a Web site, because of the
many multimedia objects that some sites contain, has become a significant data
management problem for some organizations. Consider the different types of data
that a news outfit such as the British Broadcasting Corporation (BBC) has to store to
provide a timely, informative, and engaging Web site.
Images
Images are visual data: photographs and sketches. Image banks are maintained for
several reasons. First, images are widely used for identification and security. Police
departments keep fingerprints and mug shots. Second, images are used as evidence.
Highly valuable items such as paintings and jewelry often are photographed for
insurance records. Third, images are used for advertising and promotional
campaigns, and organizations need to maintain records of material used in these
ventures. Image archiving and retrieval are essential for mail-order companies,
which often produce several photo-laden catalogs every year. Fourth, some
organizations specialize in selling images and maintain extensive libraries of clip
art and photographs.
Graphics
Maps and engineering drawings are examples of electronically stored graphics. An
organization might maintain a map of sales territories and customers.
Manufacturers have extensive libraries of engineering drawings that define the
products they produce. Graphics often contain a high level of detail. An engineering
drawing will define the dimensions of all parts and may refer to other drawings for
finer detail about any components.
A graphic differs from an image in that it contains explicitly embedded data.
Consider the difference between an engineering plan for a widget and a photograph
of the same item. An engineering plan shows dimensional data and may describe the
composition of the various components. The embedded data are used to
manufacture the widget. A photograph of a widget does not have embedded data and
contains insufficient data to manufacture the product. An industrial spy will receive
far more for an engineering plan than for a photograph of a widget.
A geographic information system (GIS) is a specialized graphical storage system
for geographic data. The underlying structure of a GIS is a map on which data are
displayed. A power company can use a GIS to store and display data about its
electricity grid and the location of transformers. Using a pointing device such as a
mouse, an engineer can click on a transformer’s location to display a window of
data about the transformer (e.g., type, capacity, installation date, and repair history).
GISs have found widespread use in governments and organizations that have
geographically dispersed resources.
Audio
Music stores such as iTunes enable prospective purchasers to hear audio samples.
News organizations, such as National Public Radio (NPR), provide audio versions
of their news stories for replay.
Some firms conduct a great deal of their business by phone. In many cases, it is
important to maintain a record of the conversation between the customer and the
firm’s representative. The Royal Hong Kong Jockey Club, which covers horse
racing gambling in Hong Kong, records all conversations between its operators and
customers. Phone calls are stored on a highly specialized voice recorder, which
records the time of the call and other data necessary for rapid retrieval. In the case
of a customer dispute, an operator can play back the original conversation.
Video
A video clip can give a potential customer additional detail that cannot be readily
conveyed by text or a still image. Consequently, some auto companies use video and
virtual reality to promote their cars. On a visit to Toyota’s Web site, you can view a
video clip of the Prius or rotate an image to view the car from multiple angles.
Models
Organizations build mathematical models to describe their business. These models,
usually placed in the broader category of DSS, are then used to analyze existing
problems and forecast future business conditions. A mathematical model can often
produce substantial benefits to the organization.
Knowledge
Organizations build systems to capture the knowledge of their experienced decision
makers and problem solvers. This expertise is typically represented as a set of rules,
semantic nets, and frames in a knowledge base, another form of organizational
memory.
Decisions
Decision making is the central activity of modern organizations. Very few
organizations, however, have a formal system for recording decisions. Most keep
the minutes of meetings, but these are often very brief and record only a meeting’s
outcome. Because they do not record details such as the objectives, criteria,
assumptions, and alternatives that were considered prior to making a decision, there
is no formal audit trail for decision making. As a result, most organizations rely on
humans to remember the circumstances and details of prior decisions.
Specialized memories
Because of the particular nature of their business, some organizations maintain
memories rarely found elsewhere. Perfume companies, for instance, maintain a
library of scents, and paint manufacturers and dye makers catalog colors.
External memories
Organizations are not limited to their own memory stores. There are firms whose
business is to store data for resale to other organizations. Such businesses have
existed for many years and are growing as the importance of data in a postindustrial
society expands. Lawyers using Mead Data Central’s LEXIS® can access the laws
and court decisions of all 50 American states and the U.S. federal government.
Similar legal data services exist in many other nations. There is a range of other
services that provide news, financial, business, scientific, and medical data.
Problems with data management systems
Successful management of data is a critical skill for nearly every organization. Yet
few have gained complete mastery, and there are a variety of problems that typically
afflict data management in most firms.
Problems with organizational data management systems
Redundancy Same data are stored in different systems
Lack of data control Data are poorly managed
Poor interface Data are difficult to access
Delays There are frequently delays following requests for reports
Lack of reality Data management systems do not reflect the complexity of the real world
Lack of data integration Data are dispersed across different systems
Redundancy
In many cases, data management systems have grown haphazardly. As a result, it is
often the situation that the same data are stored in several different memories. The
classic example is a customer’s address, which might be stored in the sales
reporting system, accounts receivable system, and the salesperson’s address book.
The danger is that when the customer changes address, the alteration is not recorded
in all systems. Data redundancy causes additional work because the same item must
be entered several times. Redundancy causes confusion when what is supposedly the
same item has different values.
Lack of data control
Allied with the redundancy problem is poor data control. Although data are an
important organizational resource, they frequently do not receive the same degree
of management attention as other important organizational resources, such as
people and money. Organizations have a personnel department to manage human
resources and a treasury to handle cash. The IS department looks after data captured
by the computer systems it operates, but there are many other data stores scattered
around the organization. Data are stored everywhere in the organization (e.g., on
personal computers and in departmental filing systems), but there is a general lack
of data management. This lack is particularly surprising, since many pundits claim
that data are a key competitive resource.
Poor interface
Too frequently, potential clients of data management systems have been deterred by
an unfriendly interface. The computer interface for accessing a data store is
sometimes difficult to remember for the occasional inquirer. People become
frustrated and give up because their queries are rejected and error messages are
unintelligible.
Delays
Globalization and technology have accelerated the pace of business in recent years.
Managers must make more decisions more rapidly. They cannot afford to wait for
programmers to write special-purpose programs to retrieve data and format
reports. They expect their questions to be answered rapidly, often within a day and
sometimes more quickly. Managers, or their support personnel, need query
languages that provide rapid access to the data they need, in a format that they want.
Lack of reality
Organizational data stores must reflect the reality and complexity of the real world.
Consider a typical bank customer who might have a personal checking account,
mortgage account, credit card account, and some certificates of deposit. When a
customer requests an overdraft extension, the bank officer needs full details of the
customer’s relationship with the bank to make an informed decision. If customer
data are scattered across unrelated data stores, then these data are not easily found,
and in some cases important data might be overlooked. The request for full
customer details is reasonable and realistic, and the bank officer should expect to be
able to enter a single, simple query to obtain it. Unfortunately, this is not always the
case, because data management systems do not always reflect reality.
In this example, the reality is that the personal checking, mortgage, and credit card
accounts, and certificates of deposit all belong to one customer. If the bank’s data
management system does not record this relationship, then it does not mimic reality.
This makes it impossible to retrieve a single customer’s data with a single, simple
query.
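A minimal sketch of such a query, assuming the bank records the relationship through a shared customer code across hypothetical account tables (customer, checking, mortgage, creditcard):

-- Assemble one customer's full relationship with the bank in one query
SELECT c.custname,
       ch.balance AS checking_balance,
       m.balance  AS mortgage_balance,
       cc.balance AS card_balance
FROM   customer c
       LEFT JOIN checking   ch ON ch.custcode = c.custcode
       LEFT JOIN mortgage   m  ON m.custcode  = c.custcode
       LEFT JOIN creditcard cc ON cc.custcode = c.custcode
WHERE  c.custcode = 'C000001';

Because the tables share a customer code, a single simple query can report the whole relationship; without that recorded relationship, no such query is possible.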
A data management system must meet the decision making needs of managers, who
must be able to request both routine and ad hoc reports. In order to do so effectively,
a data management system must reflect the complexity of the real world. If it does
not store required organizational data or record a real-world relationship between
data elements, then many managerial queries cannot be answered quickly and
efficiently.
Lack of data integration
There is a general lack of data integration in most organizations. Not only are data
dispersed in different forms of organizational memory (e.g., files and image
stores), but even within one storage format there is often a lack of integration. For
example, many organizations maintain file systems that are not integrated.
Appropriate files in the accounting system may not be linked to the production
system.
This lack of integration will be a continuing problem for most organizations for
two important reasons. First, earlier computer systems might not have been
integrated because of the limitations of available technology. Organizations created
simple file systems to support a particular function. Many of these systems are still
in use. Second, integration is a long-term goal. As new systems are developed and
old ones rewritten, organizations can evolve integrated systems. It would be too
costly and disruptive to try to solve the data integration problem in one step.
Many data management problems can be solved with present technology. Data
modeling and relational database technology, topics covered in Section 2 of this
book, help overcome many of the current problems.
A brief history of data management systems
Data management is not a new organizational concern. It is an old problem that has
become more significant because of the emergence of data as a critical resource for
effective performance in the modern economy.
Organizations have always needed to manage their data so that they could remember
a wide variety of facts necessary to conduct their affairs.
The recent history of computer-based data management systems is depicted in the
following figure. File systems were the earliest form of data management. Limited
by the sequential nature of magnetic tape technology, it was very difficult to
integrate data from different files. The advent of magnetic disk technology in the
mid-1950s stimulated development of integrated file systems, and the hierarchical
database management system (DBMS) emerged in the 1960s, followed some years
later by the network DBMS. The spatial database, or geographic information system
(GIS), appeared around 1970. Until the mid-1990s, the hierarchical DBMS, mainly
in the form of IBM’s DL/I product, was the predominant technology for managing
data. It has now been replaced by the relational DBMS, a concept first discussed by
Edgar Frank Codd in an academic paper in 1970 but not commercially available
until the mid-1970s. In the late 1980s, the notion of an object-oriented DBMS,
primed by the ideas of object-oriented programming, emerged as a solution to
situations not handled well by the relational DBMS. More recently, the Hadoop
distributed file system (HDFS) has emerged as a popular alternative model for data
management, and it will be covered in the latter part of this book.
Data management systems timeline
This book concentrates on the relational model, currently the most widely used data
management system. As mentioned, Section 2 is devoted to the development of the
necessary skills for designing and using a relational database.
Data, information, and knowledge
Often the terms data and information are used interchangeably, but they are
distinctly different. Data are raw, unsummarized, and unanalyzed facts. Information
is data that have been processed into a meaningful form.
A list of a supermarket’s daily receipts is data, but it is not information, because it is
too detailed to be very useful. A summary of the data that gives daily departmental
totals is information, because the store manager can use the report to monitor store
performance. The same report would be only data for a regional manager, because
it is too detailed for meaningful decision making at the regional level. Information
for a regional manager might be a weekly report of sales by department for each
supermarket.
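To make the distinction concrete, here is a hedged SQL sketch that assumes a hypothetical receipts table with saledate, store, department, and amount columns:

-- Data: every individual receipt (too detailed for the store manager)
SELECT * FROM receipts WHERE saledate = '2013-03-01';

-- Information for the store manager: daily totals by department
SELECT department, SUM(amount) AS daily_total
FROM   receipts
WHERE  saledate = '2013-03-01'
GROUP  BY department;

-- Information for the regional manager: weekly sales by department for each store
SELECT store, department, SUM(amount) AS weekly_total
FROM   receipts
WHERE  saledate BETWEEN '2013-02-25' AND '2013-03-03'
GROUP  BY store, department;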
Data are always data, but one person’s information can be another person’s data.
Information that is meaningful to one person can be too detailed for another. A
manager’s notion of information can change quickly, however. When a problem is
identified, a manager might request finer levels of detail to diagnose the problem’s
cause. Thus, what was previously data suddenly becomes information because it
helps solve the problem. When the problem is solved, the information reverts to
data. There is a need for information systems that let managers customize the
processing of data so that they always get information. As their needs change, they
need to be able to adjust the detail of the reports they receive.
Knowledge is the capacity to use information. The education and experience that
managers accumulate provide them with the expertise to make sense of the
information they receive. Knowledge means that managers can interpret
information and use it in decision making. In addition, knowledge is the capacity to
recognize what information would be useful for making decisions. For example, a
sales manager knows that requesting a report of profitability by product line is
useful when she has to decide whether to employ a new product manager. Thus,
when a new information system is delivered, managers need to be taught what
information the system can deliver and what that information means.
The relationship between data, information, and knowledge is depicted in the
following figure. A knowledgeable person requests information to support decision
making. To fulfill the request, data are converted into information. Personal
knowledge is then applied to interpret the requested information and reach a
conclusion. Of course, the cycle can be repeated several times if more information
is needed before a decision can be made. Notice how knowledge is essential for
grasping what information to request and interpreting that information in terms of
the required decision.
The relationship between data, information, and knowledge
The challenge
A major challenge facing organizations is to make effective use of the data
currently stored in their diverse data management systems. This challenge exists
because these various systems are not integrated and many potential clients not only
lack the training to access the systems but often are unaware what data exist. Before
data managers can begin to address this problem, however, they must understand
how organizational memories are used. In particular, they need to understand the
relationship between information and managerial decision making. Data
management is not a new problem. It has existed since the early days of civilization
and will be an enduring problem for organizations and societies.
Key terms and concepts
Data
Database management system (DBMS)
Data management
Data management system
Data mining
Data security
Decision making
Decision quality
Decision support system (DSS)
Documents
Executive information system (EIS)
External memory
Geographic information system (GIS)
Graphics
Hypertext
Images
Information
Internal memory
Knowledge
Management information system (MIS)
Metamemory
Online analytical processing (OLAP)
Organizational culture
Organizational memory
Shareable data
Social networks
Standard operating procedures
Storage device
Storage medium
Storage structure
Tables
Transaction processing
Transaction processing system (TPS)
Voice data
References and additional readings
Davenport, T. H. (1998). Putting the enterprise into the enterprise system. Harvard
Business Review, 76(4), 121-131.
Exercises
1. What are the major differences between internal and external memory?
2. What is the difference between the things you remember and the things you
record on your computer?
3. What features are common to most individual memory systems?
4. What do you think organizations did before computers were invented?
5. Discuss the memory systems you use. How do they improve your performance?
What are the shortcomings of your existing systems? How can you improve
them?
6. Describe the most “organized” person you know. Why is that person so
organized? Why haven’t you adopted some of the same methods? Why do you
think people differ in the extent to which they are organized?
7. Think about the last time you enrolled in a class. What data do you think were
recorded for this transaction?
8. What roles do people play in organizational memory?
9. What do you think is the most important attribute of organizational memory?
Justify your answer.
10. What is the difference between transaction processing and decision making?
11. When are data relevant?
12. Give some examples of specialized memories.
13. How can you measure the quality of a decision?
14. What is organizational culture? Can you name some organizations that have a
distinctive culture?
15. What is hypertext? How does it differ from linear text? Why might hypertext be
useful in an organizational memory system?
16. What is imaging? What are the characteristics of applications well suited for
imaging?
17. What is an external memory? Why do organizations use external memories
instead of building internal memories?
18. What is the common name used to refer to systems that help organizations
remember knowledge?
19. What is a DSS? What is its role in organizational memory?
20. What are the major shortcomings of many data management systems? Which
do you think is the most significant shortcoming?
21. What is the relationship between data, information, and knowledge?
22. Consider the following questions about The Expeditioner:
a. What sort of data management systems would you expect to find at The
Expeditioner?
b. For each type of data management system that you identify, discuss it
using the themes of data being shareable, transportable, secure, accurate,
timely, and relevant.
c. An organization’s culture is the set of key values, beliefs,
understandings, and norms that its members share. Discuss your ideas of
the likely organizational culture of The Expeditioner.
d. How difficult will it be to change the organizational culture of The
Expeditioner? Do you anticipate that Alice will have problems making
changes to the firm? If so, why?
23. Using the Web, find some stories about firms using data management systems.
You might enter keywords such as “database” and “business analytics” and access
the Web sites of publications such as Computerworld. Identify the purpose of each
system. How does the system improve organizational performance? What are the
attributes of the technology that make it useful? Describe any trade-offs the
organization might have made. Identify other organizations in which the same
technology might be applied.
24. Make a list of the organizational memory systems identified in this chapter.
Interview several people working in organizations. Ask them to indicate which
organizational memory systems they use. Ask which system is most important
and why. Write up your findings and your conclusion.
2. Information
Effective information management must begin by thinking about how people use
information—not with how people use machines.
Davenport, T. H. (1994). Saving IT’s soul: human-centered information
management. Harvard Business Review, 72(2), 119-131.
Learning objectives
Students completing this chapter will
✤ understand the importance of information to society and organizations;
✤ be able to describe the various roles of information in organizational change;
✤ be able to distinguish between soft and hard information;
✤ know how managers use information;
✤ be able to describe the characteristics of common information delivery systems;
✤ distinguish the different types of knowledge.
Introduction
There are three characteristics of the early years of the twenty-first century: global
warming, high-velocity global change, and the emerging power of information
organizations. The need to create a sustainable civilization, globalization of
business, and the rise of China as a major economic power are major forces
contributing to mass change. Organizations are undergoing large-scale
restructuring as they attempt to reposition themselves to survive the threats and
exploit the opportunities presented by these changes.
In the last few years, some very powerful and highly profitable information-based
organizations have emerged. Apple is the world’s largest company in terms of
market value. With its combination of iPod, iPhone, iPad, and iTunes, it has shown
the power of a new information service to change an industry. Google has become a
well-known global brand as it fulfills its mission “to organize the world’s
information and make it universally accessible and useful.” Founded in 1995, eBay
rightly calls itself “The World’s Online Marketplace®.” The German software
firm, SAP, markets the enterprise resource planning (ERP) software that is used by
many of the world’s largest organizations. Information products and information
systems have become a major driver of organizational growth. We gain further
insights into the value of information by considering its role in civilization.
A historical perspective
Humans have been collecting data to manage their affairs for several thousand
years. In 3000 BCE, Mesopotamians recorded inventory details in cuneiform, an
early script. Today's society transmits exabytes of data every day as billions of
people and millions of organizations manage their affairs. The dominant issue
facing each type of economy has changed over time, and the collection and use of
data has changed to reflect these concerns, as shown in the following table.
Societal focus
Economy  Question  Dominant issue  Key information systems
Subsistence  How to survive?  Survival  Gesturing, speech
Agricultural  How to farm?  Production  Writing, calendar, money, measuring
Industrial  How to manage resources?  Resource management  Accounting, ERP, project management
Service  How to create customers?  Customer service  BPM, CRM, analytics
Sustainable  How to reduce environmental impact?  Sustainability  Simulation, optimization, design
Agrarian society was concerned with productive farming, and an important issue
was when to plant crops. Ancient Egypt, for example, based its calendar on the
flooding of the Nile. The focus shifted during the industrial era to management of
resources (e.g., raw materials, labor, logistics). Accounting, ERP, and project
management became key information systems for managing resources. In the
current service economy, the focus has shifted to customer creation. There is an
oversupply of many consumer products (e.g., cars) and companies compete to
identify services and product features that will attract customers. They are
concerned with determining what types of customers to recruit and finding out what
they want. As a result, we have seen the rise of business analytics and customer
relationship management (CRM) to address this dominant issue. As well as creating
customers, firms need to serve those already recruited. High quality service often
requires frequent and reliable execution of multiple processes during manifold
encounters with customers by a firm’s many customer facing employees (e.g., a fast
food store, an airline, a hospital). Consequently, business process management
(BPM) has grown in importance since the mid 1990s when there was a surge of
interest in business process reengineering.
We are in transition to a new era, sustainability, where attention shifts to assessing
environmental impact because, after several centuries of industrialization,
atmospheric CO2 levels have become alarmingly high. We are also reaching the
limits of the planet’s resources as its population now exceeds seven billion people.
As a result, a new class of application is emerging, such as environmental
management systems and UPS's telematics project. These new systems will also
The information richness of different media, from richest to leanest: face-to-face, telephone, personal documents (letters and memos), and impersonal written documents
Managers seek rich information when they are trying to resolve equivocality or
ambiguity. Equivocality means that managers cannot make sense of a situation because they
arrive at multiple, conflicting interpretations of the information. An example of an
equivocal situation is a class assignment where some of the instructions are missing
and others are contradictory (of course, this example is an extremely rare event).
Equivocal situations cannot be resolved by collecting more information, because
managers are uncertain about what questions to ask and often a clear answer is not
possible. Managers reduce equivocality by sharing their interpretations of the
available information and reaching a collective understanding of what the
information means. By exchanging opinions and recalling their experiences, they
try to make sense of an ambiguous situation.
Many of the situations that managers face each day involve a high degree of
equivocality. Formal organizational memories, such as databases, are not much
help, because the information they provide is lean. Many managers rely far more on
talking with colleagues and using informal organizational memories such as social
networks.
Data management is almost exclusively concerned with administering the formal
information systems that deliver lean information. Although this is their proper
role, data managers must realize that they can deliver only a portion of the data
required by decision makers.
Information classes
Information can be grouped into four classes: content, form, behavior, and action.
Until recently, most organizational information fell into the first category.
Information classes
Class Description
Content Quantity, location, and types of items
Form Shape and composition of an object
Behavior Simulation of a physical object
Action Creation of action (e.g., industrial robots)
Content information records details about quantity, location, and types of items. It
tends to be historical in nature and is traditionally the class of information collected
and stored by organizations. The content information of a car would describe its
model number, color, price, options, horsepower, and so forth. Hundreds of bytes
of data may be required to record the full content information of a car. Typically,
content data are captured by a TPS.
Form information describes the shape and composition of an object. For example,
the form information of a car would define the dimensions and composition of
every component in the car. Millions of bytes of data are needed to store the form of
a car. CAD/CAM systems are used to create and store form information.
Behavior information is used to predict the behavior of a physical object using
simulation techniques, which typically require form information as input. Massive
numbers of calculations per second are required to simulate behavior. For example,
simulating the flight of a new aircraft design may require trillions of computations.
Behavior information is often presented visually because the vast volume of data
generated cannot be easily processed by humans in other formats.
Action information enables the instantaneous creation of sophisticated action.
Industrial robots take action information and manufacture parts, weld car bodies, or
transport items. Antilock brakes are an example of action information in use.
A lifetime of information—every day
It is estimated that a single weekday issue of the New York Times contains more
information than the average person in seventeenth-century England came across in
a lifetime. Information, once rare, is now abundant and overflowing.
Roszak, T. 1986. The cult of information: The folklore of computers and the true art of thinking. New York, NY: Pantheon. p. 3
Information and organizational change
Organizations are goal-directed. They undergo continual change as they use their
resources, people, technology, and financial assets to reach some future desired
outcome. Goals are often clearly stated, such as to make a profit of $100 million,
win a football championship, or decrease the government deficit by 25 percent in
five years. Goals are often not easily achieved, however, and organizations
continually seek information that supports goal attainment. The information they
seek falls into three categories: goal setting, gap, and change.
Organizational information categories
The emergence of an information society also means that information provides dual
possibilities for change. Information is used to plan change, and information is a
medium for change.
Goal-setting information
Organizations set goals or levels of desired performance. Managers need
information to establish goals that are challenging but realistic. A common
approach is to take the previous goal and stretch it. For example, a company that had
a 15 percent return on investment (ROI) might set the new goal at 17 percent ROI.
This technique is known as “anchoring and adjusting.” Prior performance is used as
a basis for setting the new performance standards. The problem with anchoring and
adjusting is that it promotes incremental improvement rather than radical change
because internal information is used to set performance standards. Some
organizations have turned to external information and are using benchmarking as a
source of information for goal setting.
Planning
Planning is an important task for senior managers. To set the direction for the
company, they need information about consumers’ potential demands and social,
economic, technical, and political conditions. They use this information to
determine the opportunities and threats facing the organization, thus permitting
them to take advantage of opportunities and avoid threats.
Most of the information for long-term planning comes from sources external to the
company. There are think tanks that analyze trends and publish reports on future
conditions. Journal articles and books can be important sources of information
about future events. There also will be a demand for some internal information to
identify trends in costs and revenues. Major planning decisions, such as building a
new plant, will be based on an analysis of internal data (use of existing capacity) and
external data (projected customer demand).
Benchmarking
Benchmarking establishes goals based on best industry practices. It is founded on
the Japanese concept of dantotsu, striving to be the best of the best. Benchmarking is
externally directed. Information is sought on those companies, regardless of
industry, that demonstrate outstanding performance in the areas to be benchmarked.
Their methods are studied, documented, and used to set goals and redesign existing
practices for superior performance.
Other forms of external information, such as demographic trends, economic
forecasts, and competitors’ actions can be used in goal setting. External information
is valuable because it can force an organization to go beyond incremental goal
setting.
Organizations need information to identify feasible, motivating, and challenging
goals. Once these goals have been established, they need information on the extent
to which these goals have been attained.
Gap information
Because goals are meant to be challenging, there is often a gap between actual and
desired performance. Organizations use a number of mechanisms to detect a gap
and gain some idea of its size. Problem identification and scorekeeping are two
principal methods of providing gap information.
Problem identification
Business conditions are continually changing because of competitors’ actions,
trends in consumer demand, and government actions. Often these changes are
reflected in a gap between expectations and present performance. This gap is known
as a problem.
To identify problems, managers use exception reports, which are generated only
when conditions vary from the established standard. Once a potential problem has
been identified, managers collect additional information to confirm that a problem
really exists. Problem identification information can be delivered in a printed
report, on a computer screen, or on a tablet. Once management has been alerted, the
information delivery system needs to shift into high gear. Managers will request
rapid delivery of ad hoc reports from a variety of sources. The ideal organizational
memory system can adapt smoothly to deliver appropriate information quickly.
Scorekeeping
Keeping track of the score provides gap information. Managers ask many questions:
How many items did we make yesterday? What were the sales last week? Has our
market share increased in the last year? They establish measurement systems to
track variables that indicate whether organizational performance is on target.
Keeping score is important; managers need to measure in order to manage. Also,
measurement lets people know what is important. Once a manager starts to measure
something, subordinates surmise that this variable must be important and pay more
attention to the factors that influence it.
There are many aspects of the score that a manager can measure. The
overwhelming variety of potentially available information is illustrated by the sales
output information that a sales manager could track. Sales input information (e.g.,
number of service calls) can also be measured, and there is qualitative information
to be considered. Because of time constraints, most managers are forced to limit
their attention to 10 or fewer key variables singled out by the critical success factors
(CSF) method. Scorekeeping information is usually fairly stable.
Sales output tracking information
Category        Example
Orders          Number of current customers
                Average order size
                Batting average (orders to calls)
Sales volume    Dollar sales volume
                Unit sales volume
                By customer type
                By product category
                Translated to market share
                Quota achieved
Margins         Gross margin
                Net profit
                By customer type
                By product
Customers       Number of new accounts
                Number of lost accounts
                Percentage of accounts sold
                Number of accounts overdue
                Dollar value of receivables
                Collections of receivables
Change information
Once a gap has been detected, managers take action to close it. Change information
helps them determine which actions might successfully close the gap. Accurate
change information is very valuable because it enables managers to predict the
outcome of various actions with some certainty. Unfortunately, change information
is usually not very precise, and there are many variables that can influence the effect
of any planned change. Nevertheless, organizations spend a great deal of time
collecting information to support problem solving and planning.
Problem solution
Managers collect information to support problem solution. Once a concern has been
identified, a manager seeks to find its cause. A decrease in sales could be the result
of competitors introducing a new product, an economic downturn, an ineffective
advertising campaign, or many other reasons. Data can be collected to test each of
these possible causes. Additional data are usually required to support analysis of
each alternative. For example, if the sales decrease has been caused by an economic
recession, the manager might use a decision support system (DSS) to analyze the
effect of a price decrease or a range of promotional activities.
Information as a means of change
The emergence of an information society means that information can be used as a
means of changing an organization’s performance. Corporations are creating
information-based products and services, adding information to products, and using
information to enhance existing performance or gain a competitive advantage.
Further insights into the use of information as a change agent are gained by
examining marketing, customer service, and empowerment.
Marketing
Marketing is a key strategy for changing organizational performance by increasing
sales. Information has become an important component in marketing. Airlines and
retailers have established frequent flyer and buyer programs to encourage customer
loyalty and gain more information about customers. These systems are heavily
dependent on database technology because of the massive volume of data that must
be stored. Indeed, without database technology, some marketing strategies could
never be implemented.
Database technology offers the opportunity to change the very nature of
communications with customers. Broadcast media have been the traditional
approach to communication. Advertisements are aired on television, placed in
magazines, or displayed on a Web site. Database technology can be used to address
customers directly. No longer just a mailing list, today’s database is a memory of
customer relationships, a record of every message and response between the firm
and a customer. Some companies keep track of customers’ preferences and
customize advertisements to their needs. Leading online retailers now send
customers only those emails for products for which they estimate there is a high
probability of a purchase. For example, a customer with a history of buying jazz
music is sent an email promoting jazz recordings instead of one featuring classical
music.
The Web has significantly enhanced the value of database technology. Many firms
now use a combination of a Web site and a DBMS to market products and service
customers.
Customer Service
Customer service is a key competitive issue for nearly every organization. Many
American business leaders rank customer service as their most important goal for
organizational success. Many of the developed economies are service economies. In
the United States, services account for nearly 80 percent of the gross national
product and most of the new jobs. Many companies now fight for customers by
offering superior customer service. Information is frequently a key to this
improved service.
Skill Builder
For many businesses, information is the key to high-quality customer service. Thus,
some firms use information, and thus customer service, as a key differentiating
factor, while others might compete on price. Compare the electronics sections of the
Amazon and TigerDirect Web sites. How do they use price and information to
compete? What are the implications for data management if a firm uses information
to compete?
Empowerment
Empowerment means giving employees greater freedom to make decisions. More
precisely, it is sharing with frontline employees
• information about the organization’s performance;
• rewards based on the organization’s performance;
• knowledge and information that enable employees to understand and
contribute to organizational performance;
• power to make decisions that influence organizational direction and
performance.
Notice that information features prominently in the process. A critical component is
giving employees access to the information they need to perform their tasks with a
high degree of independence and discretion. Information is empowerment. An
important task for data managers is to develop and install systems that give
employees ready access to an organization’s memory. By linking employees to
organizational memory, data managers play a pivotal role in empowering people.
Organizations believe that empowerment contributes to organizational performance
by increasing the quality of products and services. Together, empowerment and
information are mechanisms of planned change.
Information and managerial work
Managers are a key device for implementing organizational change. Because they
frequently use data management systems in their normal activities as a source of
information about change and as a means of implementing change, it is crucial for
data management systems designers to understand how managers work. Failure to
take account of managerial behavior can result in a system that is technically sound
but rarely used because it does not fit the social system.
Studies over several decades reveal a very consistent pattern: Managerial work is
very fragmented. Managers spend an average of 10 minutes on any task, and their
day is organized into brief periods of concentrated attention to a variety of tasks.
They work unrelentingly and are frequently interrupted by unexpected disturbances.
Managers are action oriented and rely on intuition and judgment far more than
contemplative analysis of information.
Managers strongly prefer oral communication. They spend a great deal of time
conversing directly or by telephone. Managers use interpersonal communication to
establish networks, which they later use as a source of information and a way to
make things happen. The continual flow of verbal information helps them make
sense of events and lets them feel the organizational pulse.
Managers rarely use formal reporting systems. They do not spend their time
analyzing computer reports or querying databases but resort to formal reports to
confirm impressions, should interpersonal communications suggest there is a
problem. Even when managers are provided with a purpose-built, ultra-friendly
executive information system, their behavior changes very little. They may access a
few screens during the day, but oral communication is still their preferred method
of data gathering and dissemination.
Managers’ information requirements
Managers have certain requirements of the information they receive. These
expectations should shape a data management system’s content and how data are
processed.
Managers expect to receive information that is useful for their current task under
existing business conditions. Unfortunately, managerial tasks can change rapidly.
The interlinked, global, economic environment is highly turbulent. Since managers’
expectations are not stable, the information delivery system must be sufficiently
flexible to meet changing requirements.
Managers’ demands for information vary with their perception of its hardness; they
require only one source that scores 10 on the information hardness scale. The
Nikkei Index at the close of today’s market is the same whether you read it in the
Asian Wall Street Journal or The Western Australian or hear it on CNN. As
perceived hardness decreases, managers demand more information, hoping to
resolve uncertainty and gain a more accurate assessment of the situation. When the
reliability of a source is questionable, managers seek confirmation from other
sources. If a number of different sources provide consistent information, a manager
gains confidence in the information’s accuracy.
Relationship of perceived information hardness to volume of information requested
Information satisficing
Because managers must make many decisions in a short period, most do not have
the time or resources to collect and interpret all the information they need to make
the best decision. Consequently, they are often forced to satisfice. That is, they
accept the first satisfactory decision they discover. They also satisfice in their
information search, collecting only enough information to make a satisfactory
decision.
Ultimately, information satisficing leads to lower-quality decision making. If,
however, information systems can be used to accelerate delivery and processing of
the right information, then managers should be able to move from selecting the first
satisfactory decision to selecting the best of several satisfactory decisions.
Information delivery systems
Most organizations have a variety of delivery systems to provide information to
managers. Developed over many years, these systems are integrated, usually via a
Web site, to give managers better access to information. There are two aspects to
information delivery. First, there is a need for software that accesses an
organizational memory, extracts the required data, and formats it for presentation.
We can use the categories of organizational memories introduced in Chapter 1 to
describe the software side of information delivery systems. The second aspect of
delivery is the hardware that gets information from a computer to the manager.
Information delivery systems software
Organizational memory   Delivery systems
People                  Conversation
                        E-mail
                        Meeting
                        Report
                        Groupware
Files                   Management information system (MIS)
Documents               Web browser
                        E-mail attachment
Images                  Image processing system (IPS)
Graphics                Computer-aided design (CAD)
                        Geographic information system (GIS)
Voice                   Voice mail
                        Voice recording system
Mathematical model      Decision support system (DSS)
Knowledge               Expert system (ES)
Decisions               Conversation
                        E-mail
                        Meeting
                        Report
                        Groupware
Software is used to move data to and from organizational memory. There is usually
tight coupling between software and the format of an organizational memory. For
example, a relational database management system can access tables but not
decision-support models. This tight coupling is particularly frustrating for
managers who often want integrated information from several organizational
memories. For example, a sales manager might expect a monthly report to include
details of recent sales (from a relational database) to be combined with customer
comments (from email messages to a customer service system). A quick glance at
the preceding table shows that there are many different information delivery
systems. We will discuss each of these briefly to illustrate the lack of integration of
organizational memories.
Verbal exchange
Conversations, meetings, and oral reporting are commonly used methods of
information delivery. Indeed, managers show a strong preference for verbal
exchange as a method for gathering information. This is not surprising because we
are accustomed to oral information. This is how humans shared information for
thousands of years as preliterate cultures. Only recently have we learned to make decisions using
spreadsheets and computer reports.
Voice mail
Voice mail is useful for people who do not want to be interrupted. It supports
asynchronous communication; that is, the two parties to the conversation are not
simultaneously connected by a communication line. Voice-mail systems also can
store many prerecorded messages that can be selectively replayed using a phone’s
keypad. Organizations use this feature to support standard customer queries.
Electronic mail
Email is an important system of information delivery. It too supports asynchronous
messaging, and it is less costly than voice mail for communication. Many
documents are exchanged as attachments to email.
Written report
Formal, written reports have a long history in organizations. Before electronic
communication, they were the main form of information delivery. They still have a
role in organizations because they are an effective method of integrating
information of varying hardness and from a variety of organizational memories.
For example, a report can contain text, tables, graphics, and images.
Written reports are often supplemented by a verbal presentation of the key points in
the report. Such a presentation enhances the information delivered, because the
audience has an opportunity to ask questions and get an immediate response.
Meetings
Because managers spend between 30 and 80 percent of their time in meetings, these
become a key source of information.
Groupware
Since meetings occupy so much managerial time and in many cases are poorly
planned and managed, organizations are looking for improvements. Groupware is a
general term applied to a range of software systems designed to improve some
aspect of group work. It is excellent for tapping soft information and the informal
side of organizational memory.
Management information system
Management information systems are a common method of delivering information
from data management systems. A preplanned query is often used to extract the data.
Managers who have developed some skills in using a query language might create
their own custom reports as query languages become more powerful and easier to
use.
Preplanned reports often contain too much information, or information in too much detail,
because they are written in anticipation of a manager’s needs. Customized reports do not have
these shortcomings, but they are often more expensive and time consuming because
they need to be prepared by an analyst.
Web
Word processing, desktop publishing, or HTML editors are used for preparing
documents that are disseminated by placing them on a Web server for convenient
access and sharing.
Image processing system
An image processing system (IPS) captures data using a scanner to digitize an
image. Images in the form of letters, reports, illustrations, and graphics have always
been an important type of organizational memory. An IPS permits these forms of
information to be captured electronically and disseminated.
Computer-aided design
Computer-aided design (CAD) is used extensively to create graphics. For example,
engineers use CAD in product design, and architects use it for building design.
These plans are a part of organizational memory for designers and manufacturers.
Geographic information system
A geographic information system (GIS) stores graphical data about a geographic
region. Many cities use a GIS to record details of roads, utilities, and services. This
graphical information is another form of organizational memory. Again, special-
purpose software is required to access and present the information stored in this
memory.
Decision support system
A decision support system (DSS) is frequently a computer-based mathematical
model of a problem. DSS software, available in a range of packages, permits the
model and data to be retrieved and executed. Model parameters can be varied to
investigate alternatives.
Expert system
An expert system (ES) has the captured knowledge of someone who is particularly
skillful at solving a certain type of problem. It is convenient to think of an ES as a
set of rules. An ES is typically used interactively, as it asks the decision-maker
questions and uses the responses to determine the action to recommend.
Information integration
You can think of organizational memory as a vast, disorganized data dump. A
fundamental problem for most organizations is that their memory is fragmented
across a wide variety of formats and technologies. Too frequently, there is a one-to-
one correspondence between an organizational memory for a particular functional
area and the software delivery system. For example, sales information is delivered
by the sales system and production information by the production system. This is
not a very desirable situation because managers want all the information they need,
regardless of its source or format, amalgamated into a single report.
A response to providing access to the disorganized data dump has been the
development of software, such as a Web application, that integrates information
from a variety of delivery systems. An important task of this software is to integrate
and present information from multiple organizational memories. Recently, some
organizations have created vast integrated data stores—data warehouses— that are
organized repositories of organizational data.
Information integration—the present situation
There are two types of knowledge: explicit and tacit. Explicit knowledge is codified
and transferable. This textbook is an example of explicit knowledge. Knowledge
about how to design databases has been formalized and communicated with the
intention of transferring it to you, the reader. Tacit knowledge is personal
knowledge, experience, and judgment that is difficult to codify. It is more difficult to
transfer tacit knowledge because it resides in people’s minds. Usually, the transfer
of tacit knowledge requires the sharing of experiences. In learning to model data,
the subject of the next section, you will quickly learn how to represent an entity,
because this knowledge is made explicit. However, you will find it much harder to
model data, because this skill comes with practice. Ideally, you should develop
several models under the guidance of an experienced modeler, such as your
instructor, who can pass on his or her tacit knowledge.
Summary
Information has become a key foundation for organizational growth. The
information society is founded on computer and communications technology. The
accumulation of knowledge requires a capacity to encode and share information.
Hard information is very exact. Soft information is extremely imprecise. Rich
information exchange occurs in face-to-face conversation. Numeric reports are an
example of lean information. Organizations use information to set goals, determine
the gap between goals and achievements, determine actions to reach goals, and
create new products and services to enhance organizational performance. Managers
depend more on informal communication systems than on formal reporting
systems. They expect to receive information that meets their current, ever changing
needs. Operational managers need short-term information. Senior executives
require mainly long-term information but still have a need for both short- and
medium-term information. When managers face time constraints, they collect only
enough information to make a satisfactory decision. Most organizations have a wide
variety of poorly integrated information delivery systems. Organizational memory
should be integrated to provide managers with one interface to an organization’s
information stores.
Key terms and concepts
Skill builder
A ship has a name, registration code, gross tonnage, and a year of construction.
Ships are classified as cargo or passenger. Draw a data model for a ship.
Creating a single-table database
The next stage is to translate the data model into a relational database. The
translation rules are very direct:
• Each entity becomes a table.
• The entity name becomes the table name.
• Each attribute becomes a column.
• The identifier becomes the primary key.
The American National Standards Institute’s (ANSI) recommended language for
relational database definition and manipulation is SQL, which serves as a data
definition language (DDL) to define a database, a data manipulation language
(DML) to query and maintain a database, and a data control language (DCL) to
control access. SQL is a common standard for describing and querying databases
and is available with many commercial relational database products, including DB2,
Oracle, and Microsoft SQL Server, and open source products such as MySQL and
PostgreSQL.
In this book, MySQL is the relational database for teaching SQL. Because SQL is a
standard, it does not matter which implementation of the relational model you use,
as the SQL language is common across both the proprietary and open source variants.
SQL uses the CREATE statement to define a table. It is not a particularly friendly
command, and most products have friendlier interfaces for defining tables.
However, it is important to learn the standard, because this is the command that SQL
understands. Also, a table definition interface will generate a CREATE statement for
execution by SQL. Your interface interactions ultimately translate into a standard
SQL command.
It is common practice to abbreviate attribute names, as is done in the following
example.
Defining a table
The CREATE command to establish a table called share is as follows:
CREATE TABLE share (
shrcode CHAR(3),
shrfirm VARCHAR(20) NOT NULL,
shrprice DECIMAL(6,2),
shrqty DECIMAL(8),
shrdiv DECIMAL(5,2),
shrpe DECIMAL(2),
PRIMARY KEY (shrcode));
The first line of the command names the table; subsequent lines describe each of the
columns in it. The first component is the name of the column (e.g., shrcode). The
second component is the data type (e.g., CHAR), and its length is shown in
parentheses. shrfirm is a variable-length character field of length 20, which means it
can store up to 20 characters, including spaces. The column shrdiv stores a decimal
number that can be as large as 999.99 because its total length is 5 digits and there are
2 digits to the right of the decimal point. Some examples of allowable data types are
shown in the following table. The third component (e.g., NOT NULL), which is
optional, indicates that the column cannot have null values. A column might have a
null value when it is either unknown or not applicable. In the case of the share table,
we specify that shrfirm must be defined for each instance in the database.
The final line of the CREATE statement defines shrcode as the primary key, the unique
identifier for share. When a primary key is defined, the relational database
management system (RDBMS) will enforce the requirement that the primary key is
unique and not null. Before any row is added to the table share, the RDBMS will
check that the value of shrcode is not null and that there does not already exist a
duplicate value for shrcode in an existing row of share. If either of these constraints is
violated, the RDBMS will not permit the new row to be inserted. This constraint, the
entity integrity rule, ensures that every row has a unique, non-null primary key.
Allowing the primary key to take a null value would mean there could be a row of
share that is not uniquely identified. Note that an SQL statement is terminated by a
semicolon.
SQL statements can be written in any mix of valid upper- and lowercase characters.
To make it easier for you to learn the syntax, this book adopts the following
conventions:
• SQL keywords are in uppercase.
• Table and column names are in lowercase.
There are more elaborate layout styles, but we will bypass those because it is more
important at this stage to learn SQL. You should lay out your SQL statements so that
they are easily read by you and others.
The following table shows some of the data types supported by most relational
databases. Other implementations of the relational model may support some of these
data types and additional ones. It is a good idea to review the available data types in
your RDBMS before defining your first table.
Some allowable data types
Category    Data type                  Description
Numeric     SMALLINT                   A 15-bit signed binary value
            INTEGER                    A 31-bit signed binary value
            FLOAT(P)                   A scientific format number of p binary digits precision
            DECIMAL(P,Q)               A packed decimal number of p digits total length, with q digits to the right of the decimal point
String      CHAR(N)                    A fixed-length string of n characters
            VARCHAR(N)                 A variable-length string of up to n characters
            TEXT                       A variable-length string of up to 65,535 characters
Date/time   DATE                       Date in the form yyyymmdd
            TIME                       Time in the form hhmmss
            TIMESTAMP                  A combination of date and time to the nearest microsecond
            TIME WITH TIME ZONE        Same as TIME, with the addition of an offset from universal time coordinated (UTC)
            TIMESTAMP WITH TIME ZONE   Same as TIMESTAMP, with the addition of an offset from UTC of the specified time zone
Logical     BOOLEAN                    A set of truth values: TRUE, FALSE, or UNKNOWN
The CHAR and VARCHAR data types are similar but differ in the way character strings
are stored and retrieved. Both can be up to 255 characters long. The length of a
CHAR column is fixed to the declared length. When values are stored, they are right-
padded with spaces to the specified length. When CHAR values are retrieved, trailing
spaces are removed. VARCHAR columns store variable-length strings and use only as
many characters as are needed to store the string. Values are not padded; instead,
trailing spaces are removed. In this book, we use VARCHAR to define most character
strings, unless they are short (less than five characters is a good rule-of-thumb).
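To make the padding behavior concrete, here is a minimal sketch; the table padding_demo and its columns are hypothetical and not part of the share example:
CREATE TABLE padding_demo (
fixed_col CHAR(10),
var_col VARCHAR(10));
INSERT INTO padding_demo VALUES ('abc','abc');
-- fixed_col is stored padded with spaces to 10 characters, and the trailing
-- spaces are removed when the value is retrieved; var_col stores only the
-- three characters entered.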
Data modeling with MySQL Workbench
MySQL Workbench is a professional quality, open source, cross-platform tool for
data modeling and SQL querying. In this text, you will also learn some of the
features of Workbench that support data modeling and using SQL. You will find it
helpful to complete the MySQL Workbench tutorial on creating a data model prior to proceeding.
You will notice some differences from the data model we have created previously.
Workbench automatically generates the SQL code to create the table, so when
modeling you establish the names you want for tables and columns. A key symbol is
used to indicate the identifier, which becomes the primary key. An open diamond
indicates that a column can be null, whereas a blue diamond indicates the column
must have a value, as with shrfirm in this case.
When specifying columns in Workbench you must also indicate the datatype. We opt
to turn off the display of a column’s datatype in a model to maintain focus on the entity.
A major advantage of using a tool such as Workbench is that it will automatically
generate the CREATE statement code (Database > Forward Engineer …) and
execute the code to create the database. The Workbench tutorial will have shown you
how to do this, and you should try this out for yourself by creating a database with
the single SHARE table.
Inserting rows into a table
The rows of a table store instances of an entity. A particular shareholding (e.g.,
Freedonia Copper) is an example of an instance of the entity SHARE. The SQL
statement INSERT is used to add rows to a table. Although most implementations of
the relational model use a spreadsheet for row insertion and you can also usually
import a structured text file, the INSERT command is defined for completeness.
The following command adds one row to the table share:
INSERT INTO share
(shrcode,shrfirm,shrprice,shrqty,shrdiv,shrpe)
VALUES ('FC','Freedonia Copper',27.5,10529,1.84,16);
There is a one-to-one correspondence between a column name in the first set of
parentheses and a value in the second set of parentheses. That is, shrcode has the value
“FC,” shrfirm the value “Freedonia Copper,” and so on. Notice that the value of a
column that stores a character string (e.g., shrfirm) is contained within straight quotes.
The list of field names can be omitted when values are inserted in all columns of the
table in the same order as that specified in the CREATE statement, so the preceding
expression could be written
INSERT INTO share
VALUES ('FC','Freedonia Copper',27.5,10529,1.84,16);
The data for the share table will be used in subsequent examples. If you have ready
access to a relational database, it is a good idea to now create a table and enter the
data. Then you will be able to use these data to practice querying the table.
Data for share
share
shrcode shrfirm shrprice shrqty shrdiv shrpe
FC Freedonia Copper 27.50 10,529 1.84 16
PT Patagonian Tea 55.25 12,635 2.50 10
AR Abyssinian Ruby 31.82 22,010 1.32 13
SLG Sri Lankan Gold 50.37 32,868 2.68 16
ILZ Indian Lead & Zinc 37.75 6,390 3.00 12
BE Burmese Elephant 0.07 154,713 0.01 3
BS Bolivian Sheep 12.75 231,678 1.78 11
NG Nigerian Geese 35.00 12,323 1.68 10
CS Canadian Sugar 52.78 4,716 2.50 15
ROF Royal Ostrich Farms 33.75 1,234,923 3.00 6
Notice that shrcode is underlined in the preceding table. This is a way we remind you
it is a primary key. In the data model it has an asterisk prefix to indicate it is an
identifier. In the relational model it becomes a primary key, a column that
guarantees that each row of the table can be uniquely addressed.
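To see the entity integrity rule at work, consider the following attempted insertion; the firm name and values are invented purely for illustration:
INSERT INTO share
(shrcode,shrfirm,shrprice,shrqty,shrdiv,shrpe)
VALUES ('FC','Freedonia Tin',15.00,500,0.75,8);
-- Rejected: a row with shrcode 'FC' already exists, so the RDBMS reports a
-- duplicate primary key error and does not insert the row. An INSERT that
-- supplied a null shrcode would likewise be rejected.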
MySQL Workbench offers a spreadsheet interface for entering data, as explained in
the tutorial on adding data to a database.
Inserting rows with MySQL Workbench
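As an alternative to the spreadsheet interface, most relational systems, including MySQL, accept a multi-row INSERT. The following sketch simply restates the data from the preceding table; omit the FC row if you already inserted it with the earlier example:
INSERT INTO share
(shrcode,shrfirm,shrprice,shrqty,shrdiv,shrpe)
VALUES
('FC','Freedonia Copper',27.50,10529,1.84,16),
('PT','Patagonian Tea',55.25,12635,2.50,10),
('AR','Abyssinian Ruby',31.82,22010,1.32,13),
('SLG','Sri Lankan Gold',50.37,32868,2.68,16),
('ILZ','Indian Lead & Zinc',37.75,6390,3.00,12),
('BE','Burmese Elephant',0.07,154713,0.01,3),
('BS','Bolivian Sheep',12.75,231678,1.78,11),
('NG','Nigerian Geese',35.00,12323,1.68,10),
('CS','Canadian Sugar',52.78,4716,2.50,15),
('ROF','Royal Ostrich Farms',33.75,1234923,3.00,6);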
Skill builder
Create a relational database for the ship entity you modeled previously. Insert some
rows.
Querying a database
The objective of developing a database is to make it easier to use the stored data to
solve problems. Typically, a manager raises a question (e.g., How many shares have
a PE ratio greater than 12?). A question or request for information, usually called a
query, is then translated into a specific data manipulation or query language. The
most widely used query language for relational databases is SQL. After the query
has been executed, the resulting data are displayed. In the case of a relational
database, the answer to a query is always a table.
There is also a query language called relational algebra, which describes a set of
operations on tables. Sometimes it is useful to think of queries in terms of these
operations. Where appropriate, we will introduce the corresponding relational
algebra operation.
Generally we use a four-phase format for describing queries:
1. A brief explanation of the query’s purpose
2. The query, in italics and prefixed by •, using phrasing you might expect from a
manager
3. The SQL version of the query
4. The results of the query.
Displaying an entire table
All the data in a table can be displayed using the SELECT statement. In SQL, the all
part is indicated by an asterisk (*).
• List all data in the share table.
SELECT * FROM share;
Project—choosing columns
The relational algebra operation project creates a new table from the columns of an
existing table. Project takes a vertical slice through a table by selecting all the values
in specified columns. The projection of share on columns shrfirm and shrpe produces a
new table with 10 rows and 2 columns. These columns are shaded in the table.
Projection of shrfirm and shrpe
share
shrcode shrfirm shrprice shrqty shrdiv shrpe
FC Freedonia Copper 27.50 10,529 1.84 16
PT Patagonian Tea 55.25 12,635 2.50 10
AR Abyssinian Ruby 31.82 22,010 1.32 13
SLG Sri Lankan Gold 50.37 32,868 2.68 16
ILZ Indian Lead & Zinc 37.75 6,390 3.00 12
BE Burmese Elephant 0.07 154,713 0.01 3
BS Bolivian Sheep 12.75 231,678 1.78 11
NG Nigerian Geese 35.00 12,323 1.68 10
CS Canadian Sugar 52.78 4,716 2.50 15
ROF Royal Ostrich Farms 33.75 1,234,923 3.00 6
The SQL syntax for the project operation simply lists the columns to be displayed.
• Report a firm’s name and price-earnings ratio.
SELECT shrfirm, shrpe FROM share;
shrfirm shrpe
Freedonia Copper 16
Patagonian Tea 10
Abyssinian Ruby 13
Sri Lankan Gold 16
Indian Lead & Zinc 12
Burmese Elephant 3
Bolivian Sheep 11
Nigerian Geese 10
Canadian Sugar 15
Royal Ostrich Farms 6
Restrict—choosing rows
The relational algebra operation restrict creates a new table from the rows of an
existing table. The operation restricts the new table to those rows that satisfy a
specified condition. Restrict takes all columns of an existing table but only those
rows that meet the specified condition. The restriction of share to those rows where
the PE ratio is less than 12 will give a new table with five rows and six columns.
These rows are shaded.
Restriction of share
share
shrcode shrfirm shrprice shrqty shrdiv shrpe
FC Freedonia Copper 27.50 10,529 1.84 16
PT Patagonian Tea 55.25 12,635 2.50 10
AR Abyssinian Ruby 31.82 22,010 1.32 13
SLG Sri Lankan Gold 50.37 32,868 2.68 16
ILZ Indian Lead & Zinc 37.75 6,390 3.00 12
BE Burmese Elephant 0.07 154,713 0.01 3
BS Bolivian Sheep 12.75 231,678 1.78 11
NG Nigerian Geese 35.00 12,323 1.68 10
CS Canadian Sugar 52.78 4,716 2.50 15
ROF Royal Ostrich Farms 33.75 1,234,923 3.00 6
Restrict is implemented in SQL using the WHERE clause to specify the condition on
which rows are restricted.
• Get all firms with a price-earnings ratio less than 12.
SELECT * FROM share WHERE shrpe < 12;
The IN crowd
The keyword IN is used with a list to specify a set of values. IN is always paired with
a column name. All rows for which a value in the specified column has a match in
the list are selected. It is a simpler way of writing a series of OR statements.
• Report data on firms with codes of FC, AR, or SLG.
SELECT * FROM share WHERE shrcode IN ('FC','AR','SLG');
The foregoing query could have also been written as
SELECT * FROM share
WHERE shrcode = 'FC' OR shrcode = 'AR' OR shrcode = 'SLG';
Ordering columns
The order of reporting columns is identical to their order in the SQL command. For
instance, compare the output of the following queries.
SELECT shrcode, shrfirm FROM share WHERE shrpe = 10;
shrcode shrfirm
PT Patagonian Tea
NG Nigerian Geese
SELECT shrfirm, shrcode FROM share WHERE shrpe = 10;
shrfirm shrcode
Patagonian Tea PT
Nigerian Geese NG
Ordering rows
People can generally process an ordered (e.g., sorted alphabetically) report faster
than an unordered one. In SQL, the ORDER BY clause specifies the row order in a
report. The default ordering sequence is ascending (A before B, 1 before 2).
Descending is specified by adding DESC after the column name.
• List all firms where PE is at least 10, and order the report in descending PE.
Where PE ratios are identical, list firms in alphabetical order.
SELECT * FROM share WHERE shrpe >= 10
ORDER BY shrpe DESC, shrfirm;
COUNT—counting
COUNT returns the number of rows that satisfy the conditions of a query.
• How many investments are in the portfolio?
SELECT COUNT(*) AS investments FROM share;
investments
10
• How many firms have a holding greater than 50,000?
SELECT COUNT(*) AS bigholdings FROM share WHERE shrqty > 50000;
bigholdings
3
AVG—averaging
AVG computes the average of the values in a column of numeric data. Null values in
the column are not included in the calculation.
• Find the average dividend.
SELECT AVG(shrdiv) AS avgdiv FROM share;
avgdiv
2.03
• What is the average yield for the portfolio?
SELECT AVG(shrdiv/shrprice*100) AS avgyield FROM share;
avgyield
7.53
Subqueries
Sometimes we need the answer to another query before we can write the query of
ultimate interest. For example, to list all shares with a PE ratio greater than the
portfolio average, you first must find the average PE ratio for the portfolio. You
could do the query in two stages:
a. SELECT AVG(shrpe) FROM share;
b. SELECT shrfirm, shrpe FROM share WHERE shrpe > x;
where x is the value returned from the first query.
Unfortunately, the two-stage method introduces the possibility of errors. You might
forget the value returned by the first query or enter it incorrectly. It also takes
longer to get the results of the query. We can solve these problems by using
parentheses to indicate the first query is nested within the second one. As a result, the
value returned by the inner or nested subquery, the one in parentheses, is used in the
outer query. In the following example, the nested query returns 11.20, which is then
automatically substituted in the outer query.
• Report all firms with a PE ratio greater than the average for the portfolio.
SELECT shrfirm, shrpe FROM share
WHERE shrpe > (SELECT AVG(shrpe) FROM share);
shrfirm shrpe
Freedonia Copper 16
Abyssinian Ruby 13
Sri Lankan Gold 16
Indian Lead & Zinc 12
Canadian Sugar 15
Warning: The preceding query is often mistakenly written as
SELECT shrfirm, shrpe FROM share
WHERE shrpe > AVG(shrpe);
This form is invalid because an aggregate function, such as AVG, cannot be used directly in a WHERE clause; the subquery is needed to compute the average first.
Skill builder
Find the name of the firm for which the value of the holding is greatest.
Regular expression—pattern matching
Regular expression is a concise and powerful method for searching for a specified
pattern in a nominated column. Regular expression processing is supported by
programming languages such as Java and PHP. In this chapter, we introduce a few
typical regular expressions and will continue developing your knowledge of this
feature in the next chapter.
Search for a string
• List all firms containing 'Ruby' in their name.
SELECT shrfirm FROM share WHERE shrfirm REGEXP 'Ruby';
shrfirm
Abyssinian Ruby
Search for one of several strings
The | symbol separates alternative patterns; a row is selected if any one of them matches.
• List all firms containing 'Gold' or 'Zinc' in their name.
SELECT shrfirm FROM share WHERE shrfirm REGEXP 'Gold|Zinc';
shrfirm
Sri Lankan Gold
Indian Lead & Zinc
Search for a string at the beginning
The ^ symbol anchors a pattern to the start of the string.
• List all firms whose name starts with 'S'.
SELECT shrfirm FROM share WHERE shrfirm REGEXP '^S';
shrfirm
Sri Lankan Gold
Search for a string at the end
The $ symbol anchors a pattern to the end of the string.
• List all firms whose name ends in 'e'.
SELECT shrfirm FROM share WHERE shrfirm REGEXP 'e$';
shrfirm
Nigerian Geese
Skill builder
List the firms containing “ian” in their name.
DISTINCT—eliminating duplicate rows
The DISTINCT clause is used to eliminate duplicate rows. It can be used with
column functions or before a column name. When used with a column function, it
ignores duplicate values.
• Report the different values of the PE ratio.
SELECT DISTINCT shrpe FROM share;
shrpe
3
6
10
11
12
13
15
16
• Find the number of different PE ratios.
SELECT COUNT(DISTINCT shrpe) as 'Different PEs' FROM share;
Different PEs
8
When used before a column name, DISTINCT prevents the selection of duplicate rows.
Notice a slightly different use of the keyword AS. In this case, because the alias
includes a space, the entire alias is enclosed in straight quotes.
DELETE
Rows in a table can be deleted using the DELETE clause in an SQL statement. DELETE
is typically used with a WHERE clause to specify the rows to be deleted. If there is no
WHERE clause, all rows are deleted.
• Erase the data for Burmese Elephant. All the shares have been sold.
DELETE FROM share WHERE shrfirm = 'Burmese Elephant';
In the preceding statement, shrfirm is used to indicate the row to be deleted.
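As a caution about the earlier point on omitting the WHERE clause, the following statement, shown only as an illustration, would remove every row:
DELETE FROM share;
-- With no WHERE clause, all rows of share are deleted; only the empty table remains.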
UPDATE
Rows can be modified using SQL’s UPDATE clause, which is used with a WHERE
clause to specify the rows to be updated.
• Change the share price of FC to 31.50.
UPDATE share SET shrprice = 31.50 WHERE shrcode = 'FC';
• Increase the total number of shares for Nigerian Geese by 10% because of the
recent bonus issue.
UPDATE share SET shrqty = shrqty*1.1 WHERE shrfirm = 'Nigerian Geese';
Debriefing
Now that you have learned how to model a single entity, create a table, and specify
queries, you are on the way to mastering the fundamental skills of database design,
implementation, and use. Remember, planning occurs before action. A data model is
a plan for a database. The action side of a database is inserting rows and running
queries.
Summary
The relational database model is an effective means of representing real-world
relationships. Data modeling is used to determine what data must be stored and how
data are related. An entity is something in the environment. An entity has attributes,
which describe it, and an identifier, which uniquely distinguishes an instance of an
entity. Every entity must have a unique identifier. A relational database consists of
tables with rows and columns. A data model is readily translated into a relational
database. The SQL statement CREATE is used to define a table. Rows are added to a
table using INSERT. In SQL, queries are written using the SELECT statement. Project
(choosing columns) and restrict (choosing rows) are common table operations. The
WHERE clause is used to specify row selection criteria. WHERE can be combined with
IN and NOT IN, which specify values for a single column. The rows of a report are
sorted using the ORDER BY clause. Arithmetic expressions can appear in SQL
statements, and SQL has built-in functions for common arithmetic operations. A
subquery is a query within a query. Regular expressions are used to find string
patterns within character strings. Duplicate rows are eliminated with the DISTINCT
clause. Rows can be erased using DELETE or modified with UPDATE.
Key terms and concepts
Alias
AS
Attribute
AVG
Column
COUNT
CREATE
Data modeling
Data type
Database
DELETE
DISTINCT
Entity
Entity integrity rule
Identifier
IN
INSERT
Instance
MAX
MIN
NOT IN
ORDER BY
Primary key
Project
Relational database
Restrict
Row
SELECT
SQL
Subquery
SUM
Table
UPDATE
WHERE
Exercises
1. Draw data models for the following entities. In each case, make certain that you
show the attributes and identifiers. Also, create a relational database and insert
some rows for each of the models.
a. Aircraft: An aircraft has a manufacturer, model number, call sign (e.g.,
N123D), payload, and a year of construction. Aircraft are classified as civilian
or military.
b. Car: A car has a manufacturer, range name, and style code (e.g., a Honda
Accord DX, where Honda is the manufacturer, Accord is the range, and DX is
the style). A car also has a vehicle identification code, registration code, and
color.
c. Restaurant: A restaurant has an address, seating capacity, phone number, and
style of food (e.g., French, Russian, Chinese).
d. Cow: A dairy cow has a name, date of birth, breed (e.g., Holstein), and a
numbered plastic ear tag.
2. Do the following queries using SQL:
a. List a share’s name and its code.
b. List full details for all shares with a price less than $1.
c. List the names and prices of all shares with a price of at least $10.
d. Create a report showing firm name, share price, share holding, and total
value of shares held. (Value of shares held is price times quantity.)
e. List the names of all shares with a yield exceeding 5 percent.
f. Report the total dividend payment of Patagonian Tea. (The total dividend
payment is dividend times quantity.)
g. Find all shares where the price is less than 20 times the dividend.
h. Find the share(s) with the minimum yield.
i. Find the total value of all shares with a PE ratio > 10.
j. Find the share(s) with the maximum total dividend payment.
k. Find the value of the holdings in Abyssinian Ruby and Sri Lankan Gold.
l. Find the yield of all firms except Bolivian Sheep and Canadian Sugar.
m. Find the total value of the portfolio.
n. List firm name and value in descending order of value.
o. List shares with a firm name containing “Gold.”
p. Find shares with a code starting with “B.”
3. Run the following queries and explain the differences in output. Write each
query as a manager might state it.
a. SELECT shrfirm FROM share WHERE shrfirm NOT REGEXP 's';
b. SELECT shrfirm FROM share WHERE shrfirm NOT REGEXP 'S';
c. SELECT shrfirm FROM share WHERE shrfirm NOT REGEXP 's|S';
d. SELECT shrfirm FROM share WHERE shrfirm NOT REGEXP '^S';
e. SELECT shrfirm FROM share WHERE shrfirm NOT REGEXP 's$';
4. A weekly newspaper, sold at supermarket checkouts, frequently reports stories
of aliens visiting Earth and taking humans on short trips. Sometimes a captured
human sees Elvis commanding the spaceship. To keep track of all these reports,
the newspaper has created the following data model.
The paper has also supplied some data for the last few sightings and asked you
to create the database, add details of these aliens, and answer the following
queries:
a. What’s the average number of heads of an alien?
b. Which alien has the most heads?
c. Are there any aliens with a double o in their names?
d. How many aliens are chartreuse?
e. Report details of all aliens sorted by smell and color.
5. Eduardo, a bibliophile, has a collection of several hundred books. Being a
little disorganized, he has his books scattered around his den. They are piled on
the floor, some are in bookcases, and others sit on top of his desk. Because he
has so many books, he finds it difficult to remember what he has, and
sometimes he cannot find the book he wants. Eduardo has a simple personal
computer file system that is fine for a single entity or file. He has decided that
he would like to list each book by author(s)’ name and type of book (e.g.,
literature, travel, reference). Draw a data model for this problem, create a
single entity table, and write some SQL queries.
a. How do you identify each instance of a book? (It might help to look at a few
books.)
b. How should Eduardo physically organize his books to permit fast retrieval
of a particular one?
c. Are there any shortcomings with the data model you have created?
6. What is an identifier? Why does a data model have an identifier?
7. What are entities?
8. What is the entity integrity rule?
CD library case
Ajay, a DJ in the student-operated radio station at his university, is impressed by the
station’s large collection of CDs but is frustrated by the time he spends searching
for a particular piece of music. Recently, when preparing his weekly jazz session, he
spent more than half an hour searching for a recording of Duke Ellington’s “Creole
Blues.” An IS major, Ajay had recently commenced a data management class. He
quickly realized that a relational database, storing details of all the CDs owned by
the radio station, would enable him to find quickly any piece of music.
Having just learned how to model a single entity, Ajay decided to start building a
CD database. He took one of his favorite CDs, John Coltrane’s Giant Steps, and
drew a model to record details of the tracks on this CD.
CD library V1.0
Ajay pondered using track number as the identifier since each track on a CD is
uniquely numbered, but he quickly realized that track number was not suitable
because it uniquely identifies only those tracks on a specific CD. So, he introduced
trackid to identify uniquely each track irrespective of the CD on which it is found.
He also recorded the name of the track and its length. He had originally thought that
he would record the track’s length in minutes and seconds (e.g., 4:43) but soon
discovered that the TIME data type of his DBMS is designed to store time of day
rather than the length of time of an event. Thus, he concluded that he should store
the length of a track in minutes as a decimal (i.e., 4:43 is stored as 4.72, because 43 seconds is 43/60 ≈ 0.72 of a minute).
trkid trknum trktitle trklength
1 1 Giant Steps 4.72
2 2 Cousin Mary 5.75
3 3 Countdown 2.35
4 4 Spiral 5.93
5 5 Syeeda’s Song Flute 7.00
6 6 Naima 4.35
7 7 Mr. P.C. 6.95
8 8 Giant Steps 3.67
9 9 Naima 4.45
10 10 Cousin Mary 5.90
11 11 Countdown 4.55
12 12 Syeeda’s Song Flute 7.03
Skill builder
Create a single table database using the data in the preceding table.
Justify your choice of data type for each of the attributes.
Ajay ran a few queries on his single table database.
• On what track is “Giant Steps”?
SELECT trknum, trktitle FROM track
WHERE trktitle = 'Giant Steps';
trknum trktitle
1 Giant Steps
8 Giant Steps
Observe that there are two tracks with the title “Giant Steps.”
• Report all recordings longer than 5 minutes.
SELECT trknum, trktitle, trklength FROM track
WHERE trklength > 5;
trknum trktitle trklength
2 Cousin Mary 5.75
4 Spiral 5.93
5 Syeeda's Song Flute 7.00
7 Mr. P.C. 6.95
10 Cousin Mary 5.90
12 Syeeda's Song Flute 7.03
• What is the total length of recordings on the CD?
This query makes use of the function SUM to total the length of tracks on the CD.
SELECT SUM(trklength) AS sumtrklen FROM track;
sumtrklen
62.65
When you run the preceding query, the value reported may not be exactly 62.65.
Why?
• Find the longest track on the CD.
SELECT trknum, trktitle, trklength FROM track
WHERE trklength = (SELECT MAX(trklength) FROM track);
This query follows the model described previously in this chapter. First determine
the longest track (i.e., 7.03 minutes) and then find the track, or tracks, that are this
length.
trknum trktitle trklength
12 Syeeda’s Song Flute 7.03
• How many tracks are there on the CD?
SELECT COUNT(*) AS tracks FROM track;
tracks
12
• Sort the different titles on the CD by their name.
SELECT DISTINCT(trktitle) FROM track
ORDER BY trktitle;
trktitle
Countdown
Cousin Mary
Giant Steps
Mr. P.C.
Naima
Spiral
Syeeda’s Song Flute
• List the shortest track containing “song” in its title.
SELECT trknum, trktitle, trklength FROM track
WHERE trktitle LIKE '%song%'
AND trklength = (SELECT MIN(trklength) FROM track
WHERE trktitle LIKE '%song%');
The subquery determines the length of the shortest track with “song” in its title.
Then any track with a length equal to the minimum and also containing “song” in its
title (just in case there is another without “song” in its title that is the same length) is
reported.
trknum trktitle trklength
5 Syeeda’s Song Flute 7.00
• Report details of tracks 1 and 5.
SELECT trknum, trktitle, trklength FROM track
WHERE trknum IN (1,5);
trknum trktitle trklength
1 Giant Steps 4.72
5 Syeeda's Song Flute 7.00
The exact same nation name and exchange rate data pair occurs 10 times for stocks
listed in the United Kingdom. Redundancy presents problems when we want to
insert, delete, or update data. These problems, generally known as update
anomalies, occur with these three basic operations.
Insert anomalies
We cannot insert a fact about a nation’s exchange rate unless we first buy a stock that
is listed in that nation. Consider the case where we want to keep a record of France’s
exchange rate and we have no French stocks. We cannot skirt this problem by
putting in a null entry for stock details because stkcode, the primary key, would be
null, and this is not allowed. If we have a separate table for facts about a nation, then
we can easily add new nations without having to buy stocks. This is particularly
useful when other parts of the organization, say International Trading, also need
access to exchange rates for many nations.
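For example, once nation is a separate table (as defined later in this section), recording France requires nothing more than the following; the exchange rate shown is purely illustrative:
INSERT INTO nation (natcode, natname, exchrate)
VALUES ('FR','France',1.10);
-- No French stock is needed; the fact about France stands on its own.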
Delete anomalies
If we delete data about a particular stock, we might also lose a fact about exchange
rates. For example, if we delete details of Bombay Duck, we also erase the Indian
exchange rate.
Update anomalies
Exchange rates are volatile. Most companies need to update them every day. What
happens when the Australian exchange rate changes? Every row in stock with nation =
'Australia' will have to be updated. In a large portfolio, many rows will be changed.
There is also the danger of someone forgetting to update all the instances of the
nation and exchange rate pair. As a result, there could be two exchange rates for the
one nation. If exchange rate is stored in nation, however, only one change is
necessary, there is no redundancy, and there is no danger of inconsistent exchange
rates.
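A sketch of the contrast follows; the new rate and the column names used for the single-table design are assumed for illustration:
-- Single-table design: every Australian stock row must be changed
UPDATE stock SET exchrate = 0.67 WHERE natname = 'Australia';
-- Two-table design: a single row in nation is changed
UPDATE nation SET exchrate = 0.67 WHERE natname = 'Australia';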
Creating a database with a 1:m relationship
As before, each entity becomes a table in a relational database, the entity name
becomes the table name, each attribute becomes a column, and each identifier
becomes a primary key. The 1:m relationship is mapped by adding a column to the
entity at the many end of the relationship. The additional column contains the
identifier of the one end of the relationship.
Consider the relationship between the entities STOCK and NATION. The database
has two tables: stock and nation. The table stock has an additional column, natcode, which
contains the identifier of nation. If natcode is not stored in stock, then there is no way of
knowing the identity of the nation where the stock is listed.
A relational database with tables nation and stock
nation
natcode natname exchrate
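A sketch of the table definitions discussed below, with column sizes and the non-key stock columns assumed, is:
CREATE TABLE nation (
natcode CHAR(3),
natname VARCHAR(20),
exchrate DECIMAL(9,5),
PRIMARY KEY (natcode));
CREATE TABLE stock (
stkcode CHAR(3),
stkfirm VARCHAR(20),
-- other stock columns (e.g., price, quantity, dividend, PE) omitted
natcode CHAR(3),
PRIMARY KEY (stkcode),
CONSTRAINT fk_has_nation FOREIGN KEY (natcode)
REFERENCES nation(natcode) ON DELETE RESTRICT);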
Notice that the definition of stock includes an additional phrase to specify the foreign
key and the referential integrity constraint. The CONSTRAINT clause defines the
column or columns in the table being created that constitute the foreign key. A
referential integrity constraint can be named, and in this case, the constraint’s name
is fk_has_nation. The foreign key is the column natcode in stock, and it references the
primary key of nation, which is natcode.
The ON DELETE clause specifies what processing should occur if an attempt is made
to delete a row in nation with a primary key that is a foreign key in stock. In this case,
the ON DELETE clause specifies that it is not permissible (the meaning of RESTRICT) to
delete a primary key row in nation while a corresponding foreign key in stock exists.
In other words, the system will not execute the delete. You must first delete all
corresponding rows in stock before attempting to delete the row containing the
primary key. ON DELETE RESTRICT is the default behavior for most RDBMSs, so we will dispense with specifying it for future foreign key constraints.
Observe that both the primary and foreign keys are defined as CHAR(3). The relational model requires that a primary key–foreign key pair have the same data type and length.
Skill Builder
The university architect has asked you to develop a data model to record details of
the campus buildings. A building can have many rooms, but a room can be in only
one building. Buildings have names, and rooms have a size and purpose (e.g.,
lecture, laboratory, seminar). Draw a data model for this situation and create the
matching relational database.
MySQL Workbench
In Workbench, a 1:m relationship is represented in a similar manner to the method
you have just learned. Also, note that the foreign key is shown in the entity at the
many end with a red diamond. We omit the foreign key when data modeling because
it can be inferred. You will observe some additional symbols on the line between the
two entities, and these will be explained later, but take note of the crow’s foot
indicating the 1:m relationship between nation and stock. Because Workbench can automatically generate the SQL to create the tables, we use lowercase table names and abbreviated column names.
Specifying a 1:m relationship in MySQL Workbench
Notice that the nations reported have a space in their name, which is a character not
in the range a-z and not in A-Z.
Search for string containing a repeated pattern or repetition
A pair of curly brackets is used to denote the repetition factor for a pattern. For
example, {n} means repeat a specified pattern n times.
• List the names of firms with a double ‘e’.
SELECT stkfirm FROM stock WHERE stkfirm REGEXP '[e]{2}';
stkfirm
Bolivian Sheep
Freedonia Copper
Narembeen Emu
Nigerian Geese
shrfirm
Abyssinian Ruby
Freedonia Copper
Patagonian Tea
Web sites devoted to regular expressions provide libraries of ready-made expressions and a feature for finding expressions to solve specific problems. Check
out the regular expression for checking whether a character string is a valid email
address.
Subqueries
A subquery, or nested SELECT, is a SELECT nested within another SELECT. A subquery
can be used to return a list of values subsequently searched with an IN clause.
• Report the names of all Australian stocks.
SELECT stkfirm FROM stock
WHERE natcode IN
(SELECT natcode FROM nation
WHERE natname = 'Australia');
stkfirm
Narembeen Emu
Queensland Diamond
Indooroopilly Ruby
Conceptually, the subquery is evaluated first. It returns a list of natcodes ('AUS') so that
the query then is the same as:
SELECT stkfirm FROM stock
WHERE natcode IN ('AUS');
A subquery is sometimes called an inner query, and the term outer query is applied to the SQL preceding the inner query. In this case, the outer and inner queries are:
Outer query:
SELECT stkfirm FROM stock
WHERE natcode IN
Inner query:
(SELECT natcode FROM nation
WHERE natname = 'Australia');
Note that in this case we do not have to qualify natcode. There is no identity crisis,
because natcode in the inner query is implicitly qualified as nation.natcode and natcode in
the outer query is understood to be stock.natcode.
This query also can be run as a join by writing:
SELECT stkfirm FROM stock, nation
WHERE stock.natcode = nation.natcode
AND natname = 'Australia';
Correlated subquery
In a correlated subquery, the subquery cannot be evaluated independently of the
outer query. It depends on the outer query for the values it needs to resolve the inner
query. The subquery is evaluated for each value passed to it by the outer query. An
example illustrates when you might use a correlated subquery and how it operates.
• Find those stocks where the quantity is greater than the average for that country.
An approach to this query is to examine the rows of stock one at a time, and each time
compare the quantity of stock to the average for that country. This means that for
each row, the subquery must receive the outer query’s country code so it can
compute the average for that country.
SELECT natname, stkfirm, stkqty FROM stock, nation
WHERE stock.natcode = nation.natcode
AND stkqty >
(SELECT AVG(stkqty) FROM stock
WHERE stock.natcode = nation.natcode);
natname stkfirm stkqty
Australia Queensland Diamond 89251
United Kingdom Bolivian Sheep 231678
United Kingdom Royal Ostrich Farms 1234923
United States Minnesota Gold 816122
Conceptually, think of this query as stepping through the join of stock and nation one
row at a time and executing the subquery each time. The first row has natcode = 'AUS'
so the subquery becomes
SELECT AVG(stkqty) FROM stock
WHERE stock.natcode = 'AUS';
Since the average stock quantity for Australian stocks is 63,672.33, the first row in
the join, Narembeen Emu, is not reported. Neither is the second row reported, but
the third is.
The term correlated subquery is used because the inner query’s execution depends
on receiving a value for a variable (nation.natcode in this instance) from the outer
query. Thus, the inner query of a correlated subquery cannot be evaluated once and for all. It must be evaluated repeatedly—once for each value of the variable received from the outer query. In this respect, a correlated subquery is different from a simple subquery, where the inner query needs to be evaluated only once. The requirement
to compare each row of a table against a function (e.g., average or count) for some
rows of a column is usually a clue that you need to write a correlated subquery.
Skill builder
Why are no Indian stocks reported in the correlated subquery example? How would
you change the query to report an Indian stock?
Report only the three stocks with the largest quantities (i.e., do the query without
using ORDER BY).
Views—virtual tables
You might have noticed that in these examples we repeated the join and stock value
calculation for each query. Ideally, we should do this once, store the result, and be
able to use it with other queries. We can do so if we create a view, a virtual table. A
view does not physically exist as stored data; it is an imaginary table constructed
from existing tables as required. You can treat a view as if it were a table and write
SQL to query it.
A view contains selected columns from one or more tables. The selected columns
can be renamed and rearranged. New columns based on arithmetic expressions can
be created. GROUP BY can also be used to create a view. Remember, a view contains
no actual data. It is a virtual table.
This SQL command does the join, calculates stock value, and saves the result as a
view:
CREATE VIEW stkvalue
(nation, firm, price, qty, exchrate, value)
AS SELECT natname, stkfirm, stkprice, stkqty, exchrate,
stkprice*stkqty*exchrate
FROM stock, nation
WHERE stock.natcode = nation.natcode;
There are several things to notice about creating a view:
• The six names enclosed by parentheses are the column names for the view.
• There is a one-to-one correspondence between the names in parentheses and
the names or expressions in the SELECT clause. Thus the view column named value contains the result of the arithmetic expression stkprice*stkqty*exchrate.
A view can be used in a query, such as:
• Find stocks with a value greater than £100,000.
SELECT nation, firm, value FROM stkvalue WHERE value > 100000;
nation firm value
United Kingdom Freedonia Copper 289548
United Kingdom Patagonian Tea 698084
United Kingdom Abyssinian Ruby 700358
United Kingdom Sri Lankan Gold 1655561
United Kingdom Indian Lead & Zinc 241223
United Kingdom Bolivian Sheep 2953895
United Kingdom Nigerian Geese 431305
United Kingdom Canadian Sugar 248910
United Kingdom Royal Ostrich Farms 41678651
United States Minnesota Gold 29456210
United States Georgia Peach 609856
Australia Narembeen Emu 258952
Australia Queensland Diamond 276303
Australia Indooroopilly Ruby 411176
There are two main reasons for creating a view. First, as we have seen, query
writing can be simplified. If you find that you are frequently writing the same
section of code for a variety of queries, then isolate the common section and put it
in a view. This means that you will usually create a view when a fact, such as stock
value, is derived from other facts in the table.
The second reason is to restrict access to certain columns or rows. For example, the
person who updates stock could be given a view that excludes stkqty. In this case,
changes in stock prices could be updated without revealing confidential
information, such as the value of the stock portfolio.
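For example, a minimal sketch of such a restricted view (the view name and the choice of columns are illustrative) is:
CREATE VIEW stkprices AS
SELECT stkcode, stkfirm, stkprice FROM stock;
Because this view is a simple projection of a single table, most RDBMSs allow prices to be updated through it while stkqty remains hidden.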
Skill builder
How could you use a view to solve the following query that was used when
discussing the correlated subquery?
Find those stocks where the quantity is greater than the average for that country.
Summary
Entities are related to other entities by relationships. The 1:m (one-to-many)
relationship occurs frequently in data models. An additional entity is required to
represent a 1:m relationship to avoid update anomalies. In a relational database, a
1:m relationship is represented by an additional column, the foreign key, in the table
at the many end of the relationship. The referential integrity constraint insists that a
foreign key must always exist as a primary key in a table. A foreign key constraint
is specified in a CREATE statement.
Join creates a new table from two existing tables by matching on a column common
to both tables. Often the common column is a primary key–foreign key
combination. A theta-join can have matching conditions of =, <>, <=, <, >=, and >.
An equijoin describes the situation where the matching condition is equality. The
GROUP BY clause is used to create an elementary control break report. The HAVING
clause of GROUP BY is like the WHERE clause of SELECT. A subquery, which has a
SELECT statement within another SELECT statement, causes two SELECT statements to be
executed—one for the inner query and one for the outer query. A correlated
subquery is executed as many times as there are rows selected by the outer query. A
view is a virtual table that is created when required. Views can simplify report
writing and restrict access to specified columns or rows.
Key terms and concepts
Constraint
Control break reporting
Correlated subquery
Delete anomalies
Equijoin
Foreign key
GROUP BY
HAVING
Insert anomalies
Join
One-to-many (1:m) relationship
Referential integrity
Relationship
Theta-join
Update anomalies
Views
Virtual table
Exercises
1. Draw data models for the following situations. In each case, make certain that
you show the attributes and feasible identifiers:
a. A farmer can have many cows, but a cow belongs to only one farmer.
b. A university has many students, and a student can attend at most one
university.
c. An aircraft can have many passengers, but a passenger can be on only one
flight at a time.
d. A nation can have many states and a state many cities.
e. An art researcher has asked you to design a database to record details of
artists and the museums in which their paintings are displayed. For each
painting, the researcher wants to know the size of the canvas, year painted, title,
and style. The nationality, date of birth, and death of each artist must be
recorded. For each museum, record details of its location and specialty, if it has
one.
2. Report all values in British pounds:
a. Report the value of stocks listed in Australia.
b. Report the dividend payment of all stocks.
c. Report the total dividend payment by nation.
d. Create a view containing nation, firm, price, quantity, exchange rate, value,
and yield.
e. Report the average yield by nation.
f. Report the minimum and maximum yield for each nation.
g. Report the nations where the average yield of stocks exceeds the average
yield of all stocks.
3. How would you change the queries in exercise 4-2 if you were required to
report the values in American dollars, Australian dollars, or Indian rupees?
4. What is a foreign key and what role does it serve?
5. What is the referential integrity constraint? Why should it be enforced?
6. Kisha, against the advice of her friends, is simultaneously studying data
management and Shakespearean drama. She thought the two subjects would be an
interesting contrast. However, the classes are very demanding and often enter her
midsummer dreams. Last night, she dreamed that William Shakespeare wanted
her to draw a data model. He explained, before she woke up in a cold sweat, that a
play had many characters but the same character never appeared in more than one
play. “Methinks,” he said, “the same name may have appeareth more than the
once, but ‘twas always a person of a different ilk.” He then, she hazily recollects,
went on to spout about the quality of data dropping like the gentle rain.
Draw a data model to keep old Bill quiet and help Kisha get some sleep.
7. An orchestra has four broad classes of instruments (strings, woodwinds, brass,
and percussion). Each class contains musicians who play different instruments.
For example, the strings section of a full symphony orchestra contains 2 harps,
16 to 18 first violins, 14 to 16 second violins, 12 violas, 10 cellos, and 8 double
basses. A city has asked you to develop a database to store details of the musicians
in its three orchestras. All the musicians are specialists and play only one
instrument for one orchestra.
8. Answer the following queries based on the following database for a car dealer:
CD library case
Giant Steps and Swing are both on the Atlantic label; the data are
lbltitle lblstreet lblcity lblstate lblpostcode lblnation
Atlantic 75 Rockefeller Plaza New York NY 10019 USA
The cd data, where lbltitle is the foreign key, are
cdid cdlblid cdtitle cdyear lbltitle
1 A2 1311 Giant Steps 1960 Atlantic
2 83012-2 Swing 1977 Atlantic
The track data, where cdid is the foreign key, are
trkid trknum trktitle trklength cdid
1 1 Giant Steps 4.72 1
2 2 Cousin Mary 5.75 1
3 3 Countdown 2.35 1
4 4 Spiral 5.93 1
5 5 Syeeda’s Song Flute 7.00 1
6 6 Naima 4.35 1
7 7 Mr. P.C. 6.95 1
8 8 Giant Steps 3.67 1
9 9 Naima 4.45 1
10 10 Cousin Mary 5.90 1
11 11 Countdown 4.55 1
12 12 Syeeda’s Song Flute 7.03 1
13 1 Stomp of King Porter 3.20 2
14 2 Sing a Study in Brown 2.85 2
15 3 Sing Moten’s Swing 3.60 2
16 4 A-tisket, A-tasket 2.95 2
17 5 I Know Why 3.57 2
18 6 Sing You Sinners 2.75 2
19 7 Java Jive 2.85 2
20 8 Down South Camp Meetin’ 3.25 2
21 9 Topsy 3.23 2
22 10 Clouds 7.20 2
23 11 Skyliner 3.18 2
24 12 It’s Good Enough to Keep 3.18 2
25 13 Choo Choo Ch’ Boogie 3.00 2
Skill builder
Create tables to store the label and CD data. Add a column to track to store the
foreign key. Then, either enter the new data or download it from the Web site.
Define lbltitle as a foreign key in cd and cdid as a foreign key in track.
Ajay ran a few queries on his revised database.
• What are the tracks on Swing?
This query requires joining track and cd because the name of a CD (Swing in this
case) is stored in cd and track data are stored in track. The foreign key of track is
matched to the primary key of cd (i.e., track.cdid = cd.cdid).
SELECT trknum, trktitle, trklength FROM track, cd
WHERE track.cdid = cd.cdid
AND cdtitle = 'Swing';
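• How many tracks are there on each CD?
A query of this form joins track and cd and uses GROUP BY to count the tracks for each title; a sketch is:
SELECT cdtitle, COUNT(*) AS tracks FROM track, cd
WHERE track.cdid = cd.cdid
GROUP BY cdtitle;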
cdtitle tracks
Giant Steps 12
Swing 13
• Report, in ascending order, the total length of tracks on each CD.
This is another example of GROUP BY. In this case, SUM is used to accumulate track
length, and ORDER BY is used for sorting in ascending order, the default sort order.
SELECT cdtitle, SUM(trklength) AS sumtrklen FROM track, cd
WHERE track.cdid = cd.cdid
GROUP BY cdtitle
ORDER BY SUM(trklength);
cdtitle sumtrklen
Swing 44.81
Giant Steps 62.65
• Does either CD have more than 60 minutes of music?
This query uses the HAVING clause to limit the CDs reported based on their total
length.
SELECT cdtitle, SUM(trklength) AS sumtrklen FROM track, cd
WHERE track.cdid = cd.cdid
GROUP BY cdtitle HAVING SUM(trklength) > 60;
cdtitle sumtrklen
Giant Steps 62.65
Skill builder
1. Write SQL for the following queries:
a. List the tracks by CD in order of track length.
b. What is the longest track on each CD?
2. What is wrong with the current data model?
3. Could cdlblid be used as an identifier for CD?
5. The Many-to-Many Relationship
Fearful concatenation of circumstances.
Daniel Webster
Learning objectives
Students completing this chapter will be able to
✤ model a many-to-many relationship between two entities;
✤ define a database with a many-to-many relationship;
✤ write queries for a database with a many-to-many relationship.
Alice was deeply involved in uncovering the workings of her firm. She spent hou
Ned, the marketing manager, was a Jekyll and Hyde character. By day he was a
charming, savvy, marketing executive. Walking to work in his three-piece suit,
bowler hat, and furled umbrella, he was the epitome of the conservative English
businessman. By night he became a computer nerd with ragged jeans, a T-shirt,
black-framed glasses, and unruly hair. Ned had been desperate to introduce
computers to The Expeditioner for years, but he knew that he could never reveal his
second nature. The other staff at The Expeditioner just would not accept such a
radical change in technology. Why, they had not even switched to fountain pens until
their personal stock of quill pens was finally depleted. Ned was just not prepared to
face the indignant silence and contempt that would greet his mere mention of
computers. It was better to continue a secret double life than admit to being a cyber
punk.
But times were changing, and Ned saw the opportunity. He furtively suggested to
Lady Alice that perhaps she needed a database of sales facts. Her response was
instantaneous: “Yes, and do it as soon as possible.” Ned was ecstatic. This was truly
a wonderful woman.
The many-to-many relationship
Consider the case when items are sold. We can immediately identify two entities:
SALE and ITEM. A sale can contain many items, and an item can appear in many
sales. We are not saying the same item can be sold many times, but the particular
type of item (e.g., a compass) can be sold many times; thus we have a many-to-many
(m:m) relationship between SALE and ITEM. When we have an m:m relationship,
we create a third entity to link the entities through two 1:m relationships. Usually, it
is fairly easy to name this third entity. In this case, this third entity, typically known
as an associative entity, is called LINE ITEM. A typical sales form lists the items
purchased by a customer. Each of the lines appearing on the order form is generally
known in retailing as a line item, which links an item and a sale.
A sales form
The Expeditioner
Sale of Goods
Sale# Date:
Item# Description Quantity Unit price Total
1
2
3
4
5
6
Grand total
The representation of this m:m relationship is shown. We say many-to-many
because there are two relationships—an ITEM is related to many SALEs, and a
SALE is related to many ITEMs. This data model can also be read as: “a sale has
many line items, but a line item refers to only one sale. Similarly, an item can
appear as many line items, but a line item references only one item.”
An m:m relationship between SALE and ITEM
The entity SALE is identified by saleno and has the attributes saledate and saletext (a
brief comment on the customer—soft information). LINEITEM is partially
identified by lineno and has attributes lineqty (the number of units sold) and
lineprice (the unit selling price for this sale). ITEM is identified by itemno and has
attributes itemname, itemtype (e.g., clothing, equipment, navigation aids, furniture,
and so on), and itemcolor.
If you look carefully at the m:m relationship figure, you will notice that there is a
plus sign (+) above the crow’s foot at the “many” end of the 1:m relationship
between SALE and LINEITEM. This plus sign provides information about the
identifier of LINEITEM. As you know, every entity must have a unique identifier. A
sales order is a series of rows or lines, and lineno is unique only within a particular
order. If we just use lineno as the identifier, we cannot guarantee that every instance
of LINEITEM is unique. If we use saleno and lineno together, however, we have a
unique identifier for every instance of LINEITEM. Identifier saleno is unique for
every sale, and lineno is unique within any sale. The plus indicates that LINEITEM’s
unique identifier is the concatenation of saleno and lineno. The order of
concatenation does not matter.
LINEITEM is termed a weak entity because it relies on another entity for its
existence and identification.
MySQL Workbench
Workbench automatically creates an associative entity for an m:m relationship and
populates it with a composite primary key based on concatenating the primary keys
of the two entities forming the m:m relationship. First, draw the two tables and enter
their respective primary keys and columns. Second, select the m:m symbol and
connect the two tables through clicking on one and dragging to the second and
releasing. You can then modify the associative entity as required, such as changing
its primary key. The capability to automatically create an associative entity for an
m:m relationship is a very useful Workbench feature.
An m:m relationship with Workbench
Workbench distinguishes between two types of relationships. An identifying
relationship, shown by a solid line, is used when the entity at the many end of the
relationship is a weak entity and needs the identifier of the one end of the
relationship to uniquely identify an instance of the relationship, as in LINEITEM. An
identifying relationship corresponds to the + sign associated with a crow’s foot. The
other type of relationship, shown by a dashed line, is known as a non-identifying
relationship. The mapping between the type of relationship and the representation
(i.e., dashed or solid line) is arbitrary and thus not always easily recalled. We think
that using a + on the crow’s foot is a better way of denoting weak entities.
When the relationship between SALE and ITEM is drawn in Workbench, as shown
in the following figure, there are two things to notice. First, the table, lineitem, maps
the associative entity generated for the m:m relationship. Second, lineitem has an
identifying relationship with sale and a non-identifying relationship with item.
An m:m relationship between SALE and ITEM in MySQL Workbench
Why did we create a third entity?
When we have an m:m relationship, we create an associative entity to store data
about the relationship. In this case, we have to store data about the items sold. We
cannot store the data with SALE because a sale can be many items, and an instance
of an entity stores only single-value facts. Similarly, we cannot store data with ITEM
because an item can appear in many sales. Since we cannot store data in SALE or
ITEM, we must create another entity to store data about the m:m relationship.
You might find it useful to think of the m:m relationship as two 1:m relationships.
An item can appear on many line item listings, and a line item entry refers to only
one item. A sale has many line items, and each line item entry refers to only one
sale.
Social Security number is not unique!
Two girls named Sarah Lee Ferguson were born on May 3, 1959. The U.S.
government considered them one and the same and issued both the same Social
Security number (SSN), a nine-digit identifier of U.S. residents. Now Sarah Lee
Ferguson Boles and Sarah Lee Ferguson Johnson share the same SSN.
Mrs. Boles became aware of her SSN twin in 1987 when the Internal Revenue
Service claimed there was a discrepancy in her reported income. Because SSN is
widely used as an attribute or identifier in many computer systems, Mrs. Boles
encountered other incidents of mistaken identity. Some of Mrs. Johnson’s purchases
appeared on Mrs. Boles’ credit reports.
In late 1989, the Social Security Administration notified Mrs. Boles that her original
number was given to her in error and she had to provide evidence of her age,
identity, and citizenship to get a new number. When Mrs. Boles got her new SSN, it
is likely she had to also get a new driver's license and establish a new credit history.
Creating a relational database with an m:m relationship
As before, each entity becomes a table in a relational database, the entity name
becomes the table name, each attribute becomes a column, and each identifier
becomes a primary key. Remember, a 1:m relationship is mapped by adding a
column to the entity of the many end of the relationship. The new column contains
the identifier of the one end of the relationship.
Conversion of the foregoing data model results in the three tables following. Note
the one-to-one correspondence between attributes and columns for sale and item.
Observe that lineitem has two additional columns, saleno and itemno. Both of these
columns are foreign keys in lineitem (remember the use of italics to signify foreign
keys). Two foreign keys are required to record the two 1:m relationships. Notice in
lineitem that saleno is both part of the primary key and a foreign key.
Tables sale, lineitem, and item
sale
saleno saledate saletext
1 2003-01-15 Scruffy Australian—called himself Bruce.
2 2003-01-15 Man. Rather fond of hats.
3 2003-01-15 Woman. Planning to row Atlantic—lengthwise!
4 2003-01-15 Man. Trip to New York—thinks NY is a jungle!
5 2003-01-16 Expedition leader for African safari.
lineitem
lineno lineqty lineprice saleno itemno
1 1 4.50 1 2
1 1 25.00 2 6
2 1 20.00 2 16
3 1 25.00 2 19
4 1 2.25 2 2
1 1 500.00 3 4
2 1 2.25 3 2
1 1 500.00 4 4
2 1 65.00 4 9
3 1 60.00 4 13
4 1 75.00 4 14
5 1 10.00 4 3
6 1 2.25 4 2
1 50 36.00 5 10
2 50 40.50 5 11
3 8 153.00 5 12
4 1 60.00 5 13
5 1 0.00 5 2
item
itemno itemname itemtype itemcolor
1 Pocket knife—Nile E Brown
2 Pocket knife—Avon E Brown
3 Compass N —
4 Geopositioning system N —
5 Map measure N —
6 Hat—Polar Explorer C Red
7 Hat—Polar Explorer C White
8 Boots—snake proof C Green
9 Boots—snake proof C Black
10 Safari chair F Khaki
11 Hammock F Khaki
12 Tent—8 person F Khaki
13 Tent—2 person F Khaki
14 Safari cooking kit E —
15 Pith helmet C Khaki
16 Pith helmet C White
17 Map case N Brown
18 Sextant N —
19 Stetson C Black
20 Stetson C Brown
The SQL commands to create the three tables are as follows:
CREATE TABLE sale (
saleno INTEGER,
saledate DATE NOT NULL,
saletext VARCHAR(50),
PRIMARY KEY(saleno));
CREATE TABLE item (
itemno INTEGER,
itemname VARCHAR(30) NOT NULL,
itemtype CHAR(1) NOT NULL,
itemcolor VARCHAR(10),
PRIMARY KEY(itemno));
CREATE TABLE lineitem (
lineno INTEGER,
lineqty INTEGER NOT NULL,
lineprice DECIMAL(7,2) NOT NULL,
saleno INTEGER,
itemno INTEGER,
PRIMARY KEY(lineno,saleno),
CONSTRAINT fk_has_sale FOREIGN KEY(saleno)
REFERENCES sale(saleno),
CONSTRAINT fk_has_item FOREIGN KEY(itemno)
REFERENCES item(itemno));
Although the sale and item tables are created in a similar fashion to previous
examples, there are two things to note about the definition of lineitem. First, the
primary key is a composite of lineno and saleno, because together they uniquely
identify an instance of lineitem. Second, there are two foreign keys, because lineitem is at the "many" end of two 1:m relationships.
Skill builder
A hamburger shop makes several types of hamburgers, and the same type of
ingredient can be used with several types of hamburgers. This does not literally
mean the same piece of lettuce is used many times, but lettuce is used with several
types of hamburgers. Draw the data model for this situation. What is a good name
for the associative entity?
Querying an m:m relationship
A three-table join
The join operation can be easily extended from two tables to three or more merely
by specifying the tables to be joined and the matching conditions. For example:
SELECT * FROM sale, lineitem, item
WHERE sale.saleno = lineitem.saleno
AND item.itemno = lineitem.itemno;
The three tables to be joined are listed after the FROM clause. There are two matching conditions: one for sale and lineitem (sale.saleno = lineitem.saleno) and one for the item and lineitem tables (item.itemno = lineitem.itemno). The table lineitem is the link between sale and item and must be referenced in both matching conditions.
You can tailor the join to be more precise and report some columns rather than all.
• List the name, quantity, price, and value of items sold on January 16, 2011.
SELECT itemname, lineqty, lineprice, lineqty*lineprice AS total
FROM sale, lineitem, item
WHERE lineitem.saleno = sale.saleno
AND item.itemno = lineitem.itemno
AND saledate = '2011-01-16';
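EXISTS—does a value exist
EXISTS is used in a WHERE clause to test whether a table contains at least one row satisfying a specified condition. It returns true if there is at least one such row; otherwise it returns false.
• Report all clothing items that have been sold.
A sketch of this query, consistent with the explanation that follows, is:
SELECT itemname, itemcolor FROM item
WHERE itemtype = 'C'
AND EXISTS
(SELECT * FROM lineitem
WHERE item.itemno = lineitem.itemno);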
itemname itemcolor
Hat—Polar Explorer Red
Boots—snake proof Black
Pith helmet White
Stetson Black
Conceptually, we can think of this query as evaluating the subquery for each row of
item. The first item with itemtype = 'C', Hat—Polar Explorer (red), in item has itemno = 6.
Thus, the query becomes
SELECT itemname, itemcolor FROM item
WHERE itemtype = 'C'
AND EXISTS (SELECT * FROM lineitem
WHERE lineitem.itemno = 6);
Because there is at least one row in lineitem with itemno = 6, the subquery returns true.
The item has been sold and should be reported. The second clothing item, Hat
—Polar Explorer (white), in item has itemno = 7. There are no rows in lineitem with itemno = 7, so the subquery returns false. That item has not been sold and should not be
reported.
You can also think of the query as, “Select clothing items for which a sale exists.”
Remember, for EXISTS to return true, there needs to be only one row for which the
condition is true.
NOT EXISTS—select a value if it does not exist
NOT EXISTS is the negative of EXISTS. It is used in a WHERE clause to test whether all
rows in a table fail to satisfy a specified condition. It returns the value true if there
are no rows satisfying the condition; otherwise it returns false.
• Report all clothing items that have not been sold.
SELECT itemname, itemcolor FROM item
WHERE itemtype = 'C'
AND NOT EXISTS
(SELECT * FROM lineitem
WHERE item.itemno = lineitem.itemno);
If we consider this query as the opposite of that used to illustrate EXISTS, it seems
logical to use NOT EXISTS. Conceptually, we can also think of this query as evaluating
the subquery for each row of item. The first item with itemtype = 'C', Hat—Polar
Explorer (red), in item has itemno = 6. Thus, the query becomes
SELECT itemname, itemcolor FROM item
WHERE itemtype = 'C'
AND NOT EXISTS
(SELECT * FROM lineitem
WHERE lineitem.itemno = 6);
There is at least one row in lineitem with itemno = 6, so the subquery returns true. The
NOT before EXISTS then negates the true to give false; the item will not be reported
because it has been sold.
The second item with itemtype = 'C', Hat—Polar Explorer (white), in item has itemno = 7.
The query becomes
SELECT itemname, itemcolor FROM item
WHERE itemtype = 'C'
AND NOT EXISTS
(SELECT * FROM lineitem
WHERE lineitem.itemno = 7);
Because there are no rows in lineitem with itemno = 7, the subquery returns false, and this
is negated by the NOT before EXISTS to give true. The item has not been sold and
should be reported.
itemname itemcolor
Hat—Polar Explorer White
Boots—snake proof Green
Pith helmet Khaki
Stetson Brown
You can also think of the query as, “Select clothing items for which no sales exist.”
Also remember, for NOT EXISTS to return true, no rows should satisfy the condition.
Skill builder
Report all red items that have not been sold. Write the query twice, once using EXISTS
and once without EXISTS.
Divide (and be conquered)
In addition to the existential quantifier that you have already encountered, formal
logic has a universal quantifier known as forall that is necessary for queries such
as
• Find the items that have appeared in all sales.
If a universal quantifier were supported by SQL, this query could be phrased as,
“Select item names where forall sales there exists a lineitem row recording that this
item was sold.” A quick inspection of the first set of tables shows that one item
satisfies this condition (itemno = 2).
While SQL does not directly support the universal quantifier, formal logic shows that
forall can be expressed using EXISTS. The query becomes, “Find items such that there
does not exist a sale in which this item does not appear.” The equivalent SQL
expression is
SELECT itemno, itemname FROM item
WHERE NOT EXISTS
(SELECT * FROM sale
WHERE NOT EXISTS
(SELECT * FROM lineitem
WHERE lineitem.itemno = item.itemno
AND lineitem.saleno = sale.saleno));
itemno itemname
2 Pocket knife–Avon
If you are interested in learning the inner workings of the preceding SQL for divide,
see the additional material for Chapter 5 on the book’s Web site.
Relational algebra (Chapter 9) has the divide operation, which makes divide queries
easy to write. Be careful: Not all queries containing the word all are divides. With
experience, you will learn to recognize and conquer divide.
To save the tedium of formulating this query from scratch, we have developed a
template for dealing with these sorts of queries. Divide queries typically occur with
m:m relationships.
A template for divide
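In outline, the template generalizes the preceding query. A sketch, using the placeholder names target, source, and target_source for the two entities and their associative table, is:
SELECT target.targetid FROM target
WHERE NOT EXISTS
(SELECT * FROM source
WHERE NOT EXISTS
(SELECT * FROM target_source
WHERE target_source.targetid = target.targetid
AND target_source.sourceid = source.sourceid));
To apply the template, substitute the actual table and column names, as in the item, sale, and lineitem query above.
UNION combines the results of two SELECT statements and removes duplicates (i.e., or).
• List the items that were sold on January 16, 2011, or are brown.
A sketch of this query is:
SELECT itemname FROM item, lineitem, sale
WHERE item.itemno = lineitem.itemno
AND lineitem.saleno = sale.saleno
AND saledate = '2011-01-16'
UNION
SELECT itemname FROM item
WHERE itemcolor = 'Brown';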
itemname
Hammock
Map case
Pocket knife—Avon
Pocket knife—Nile
Safari chair
Stetson
Tent—2 person
Tent—8 person
• List items that were sold on January 16, 2011, and are brown.
This query uses the same two tables as the previous query. In this case, INTERSECT
(i.e., and) then combines the results of the tables including only rows in both tables
and excluding duplicates.
Key terms and concepts
Associative entity
Divide
Existential quantifier
EXISTS
INTERSECT
Many-to-many (m:m) relationship
NOT EXISTS
UNION
Universal quantifier
Exercises
1. Draw data models for the following situations. In each case, think about the
names you give each entity:
a. Farmers can own cows or share cows with other farmers.
b. A track and field meet can have many competitors, and a competitor can
participate in more than one event.
c. A patient can have many physicians, and a physician can have many patients.
d. A student can attend more than one class, and the same class can have many
students.
e. The Marathoner, a monthly magazine, regularly reports the performance of
professional marathon runners. It has asked you to design a database to record
the details of all major marathons (e.g., Boston, London, and Paris).
Professional marathon runners compete in several races each year. A race may
have thousands of competitors, but only about 200 or so are professional
runners, the ones The Marathoner tracks. For each race, the magazine reports a
runner's time and finishing position and some personal details such as name,
gender, and age.
2. The data model shown was designed by a golf statistician. Write SQL statements
to create the corresponding relational database.
3. Write the following SQL queries for the database described in this chapter:
a. List the names of items for which the quantity sold is greater than one for
any sale.
b. Compute the total value of sales for each item by date.
c. Report all items of type “F” that have been sold.
d. List all items of type “F” that have not been sold.
e. Compute the total value of each sale.
4. Why do you have to create a third entity when you have an m:m relationship?
5. What does a plus sign near a relationship arc mean?
6. How does EXISTS differ from other clauses in an SQL statement?
7. How is the relational algebra divide statement implemented in SQL? What do
you think of this approach?
8. Answer the following queries based on the described relational database.
a. List the phone numbers of donors Hays and Jefts.
b. How many donors are there in the donor table?
c. How many people made donations in 1999?
d. What is the name of the person who made the largest donation in 1999?
e. What was the total amount donated in 2000?
f. List the donors who have made a donation every year.
g. List the donors who give more than twice the average.
h. List the total amount given by each person across all years; sort the report by
the donor's name.
i. Report the total donations in 2001 by state.
j. In which years did the total donated exceed the goal for the year?
9. The following table records data found on the side of a breakfast cereal carton.
Use these data as a guide to develop a data model to record nutrition facts for a
meal. In this case, a meal is a cup of cereal and 1/2 cup of skim milk.
Nutrition facts
Serving size 1 cup (30g)
Servings per container about 17
Amount per serving (two columns: cereal; cereal with 1/2 cup of skim milk)
Calories 110 150
Calories from Fat 10 10
Total Fat 1g 1% 2%
Saturated Fat 0g 0% 0%
Polyunsaturated Fat 0g
Monounsaturated Fat 0g
Cholesterol 0mg 0% 1%
Sodium 220mg 9% 12%
Potassium 105 mg 3% 9%
Total Carbohydrate 24g 8% 10%
Dietary Fiber 3g 13% 13%
Sugars 4g
Other Carbohydrate 17g
Protein 3g
Vitamin A 10% 15%
Vitamin C 10% 10%
Calcium 2% 15%
Iron 45% 45%
Vitamin D 10% 25%
Thiamin 50% 50%
Riboflavin 50% 50%
Niacin 50% 50%
Vitamin B12 50% 60%
Phosphorus 10% 20%
Magnesium 8% 10%
Zinc 50% 50%
Copper 4% 4%
CD library case
After learning how to model m:m relationships, Ajay realized he could now model
a number of situations that had bothered him. He had been puzzling over some
relationships that he had recognized but was unsure how to model:
• A person can appear on many tracks, and a track can have many persons
(e.g., The Manhattan Transfer, a quartet, has recorded many tracks).
• A person can compose many songs, and a song can have multiple
composers (e.g., Rodgers and Hammerstein collaborated many times with
Rodgers composing the music and Hammerstein writing the lyrics).
• An artist can release many CDs, and a CD can feature several artists (e.g.,
John Coltrane, who has many CDs, has a CD, Miles & Coltrane, with Miles
Davis).
Ajay revised his data model to include the m:m relationships he had recognized. He
wrestled with what to call the entity that stored details of people. Sometimes these
people are musicians; other times singers, composers, and so forth. He decided that
the most general approach was to call the entity PERSON. The extended data model
is shown in the following figure.
CD library V3.0
In version 3.0, Ajay introduced several new entities, including COMPOSITION to
record details of a composition. He initially linked COMPOSITION to TRACK with
a 1:m relationship. In other words, a composition can have many tracks, but a track
is of one composition. This did not seem right.
After further thought, he realized there was a missing entity—RECORDING. A
composition has many recordings (e.g., multiple recordings of John Coltrane’s
composition “Giant Steps”), and a recording can appear as a track on different CDs
(typically those whose title begins with The Best of … feature tracks on earlier CDs).
Also, because a recording is a performance of a composition, it is the composition that has a title, not the recording or the track. So, he revised version 3.0 of the data model to produce version 4.0.
CD library V4.0
Version 4.0 of the model captures:
• the m:m between PERSON and RECORDING with the associative entity
PERSON-RECORDING. The identifier for PERSON-RECORDING is a
composite key (psnid, rcdid, psnrcdrole). The identifiers of PERSON and
RECORDING are required to identify uniquely an instance of PERSON-
RECORDING, which is why there are plus signs on the crow’s feet attached to
PERSON-RECORDING. Also, because a person can have multiple roles on a recording (e.g., play the piano and sing), psnrcdrole is required to uniquely identify these situations (see the table sketch after this list).
• the m:m between PERSON and COMPOSITION with the associative entity
PERSON-COMPOSITION. The identifier is a composite key (psnid, compid,
psncomprole). The attribute psncomporder is for remembering the correct
sequence in those cases where there are multiple people involved in a
composition. That is, the database needs to record that it is Rodgers and
Hammerstein, not Hammerstein and Rodgers.
• the m:m between PERSON and CD with the associative entity PERSON-CD.
Again the identifier is a composite key (psnid, cdid) of the identifiers of entities
in the m:m relationship, and there is an attribute (psncdorder) to record the
order of the people featured on the CD.
• the 1:m between RECORDING and TRACK. A recording can appear on
many tracks, but a track has only one recording.
• the 1:m between CD and TRACK. A CD can have many tracks, but a track
appears on only one CD. You can also think of TRACK as the name for the
associative entity of the m:m relationship between RECORDING and CD. A
recording can appear on many CDs, and a CD has many recordings.
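To make the mapping concrete, here is a sketch of the table for one of these associative entities, person-recording. The table name person_recording, the data types, and the constraint names are illustrative assumptions; the composite primary key follows from the model.
CREATE TABLE person_recording (
psnid INTEGER,
rcdid INTEGER,
psnrcdrole VARCHAR(20),
PRIMARY KEY(psnid, rcdid, psnrcdrole),
CONSTRAINT fk_psnrcd_person FOREIGN KEY(psnid)
REFERENCES person(psnid),
CONSTRAINT fk_psnrcd_recording FOREIGN KEY(rcdid)
REFERENCES recording(rcdid));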
After creating the new tables, Ajay was ready to add some data. He decided first to
correct the errors introduced by his initial, incorrect modeling of TRACK by
inserting data for RECORDING and revising TRACK. He realized that because
RECORDING contains a foreign key to record the 1:m relationship with
COMPOSITION, he needed to start by entering the data for COMPOSITION. Note
that some data are missing for the year of composition.
The table composition
compid comptitle compyear
1 Giant Steps
2 Cousin Mary
3 Countdown
4 Spiral
5 Syeeda’s Song Flute
6 Naima
7 Mr. P.C.
8 Stomp of King Porter 1924
9 Sing a Study in Brown 1937
10 Sing Moten’s Swing 1997
11 A-tisket, A-tasket 1938
12 I Know Why 1941
13 Sing You Sinners 1930
14 Java Jive 1940
15 Down South Camp Meetin’ 1997
16 Topsy 1936
17 Clouds
18 Skyliner 1944
19 It’s Good Enough to Keep 1997
20 Choo Choo Ch’ Boogie 1945
Once composition was entered, Ajay moved on to the recording table. Again, some data
are missing.
The table recording
rcdid compid rcdlength rcddate
1 1 4.72 1959-May-04
2 2 5.75 1959-May-04
3 3 2.35 1959-May-04
4 4 5.93 1959-May-04
5 5 7.00 1959-May-04
6 6 4.35 1959-Dec-02
7 7 2.95 1959-May-04
8 1 5.93 1959-Apr-01
9 6 7.00 1959-Apr-01
10 2 6.95 1959-May-04
11 3 3.67 1959-May-04
12 2 4.45 1959-May-04
13 8 3.20
14 9 2.85
15 10 3.60
16 11 2.95
17 12 3.57
18 13 2.75
19 14 2.85
20 15 3.25
21 16 3.23
22 17 7.20
23 18 3.18
24 19 3.18
25 20 3.00
The last row of the recording table indicates a recording of the composition “Choo
Choo Ch’ Boogie,” which has compid = 20.
Next, the data for track could be entered.
The table track
cdid trknum rcdid
1 1 1
1 2 2
1 3 3
1 4 4
1 5 5
1 6 6
1 7 7
1 8 1
1 9 6
1 10 2
1 11 3
1 12 5
2 1 13
2 2 14
2 3 15
2 4 16
2 5 17
2 6 18
2 7 19
2 8 20
2 9 21
2 10 22
2 11 23
2 12 24
2 13 25
The last row of the track data indicates that track 13 of Swing (cdid = 2) is a recording
of “Choo Choo Ch’ Boogie” (rcdid = 25 and compid = 20).
The table person-recording stores details of the people involved in each recording.
Referring to the accompanying notes for the CD Giant Steps, Ajay learned that the
composition “Giant Steps,” recorded on May 4, 1959, included the following
personnel: John Coltrane on tenor sax, Tommy Flanagan on piano, Paul Chambers
on bass, and Art Taylor on drums. Using the following data, he inserted four rows
in the person table.
The table person
psnid psnfname psnlname
1 John Coltrane
2 Tommy Flanagan
3 Paul Chambers
4 Art Taylor
Then, he linked these people to the particular recording of “Giant Steps” by entering
the data in the table, person-recording.
The table person-recording
psnid rcdid psnrcdrole
1 1 tenor sax
2 1 piano
3 1 bass
4 1 drums
Skill builder
1. What data will you enter in person-cd to relate John Coltrane to the Giant Steps
CD?
2. Enter appropriate values in person-composition to relate John Coltrane to
compositions with compid 1 through 7. You should indicate he wrote the music
by entering a value of “music” for the person’s role in the composition.
3. Use SQL to solve the following queries:
a. List the tracks on “Swing.”
b. Who composed the music for “Spiral”?
c. Who played which instruments for the May 4, 1959 recording of “Giant
Steps”?
d. List the composers who write music and play the tenor sax.
4. What is the data model missing?
6. One-to-One and Recursive Relationships
Self-reflection is the school of wisdom.
Baltasar Gracián, The Art of Worldly Wisdom, 1647
Learning objectives
Students completing this chapter will be able to
✤ model one-to-one and recursive relationships;
✤ define a database with one-to-one and recursive relationships;
✤ write queries for a database with one-to-one and recursive relationships.
Alice was convinced that a database of business facts would speed up decision m
She divided the firm into four departments and appointed a boss for each
department. The first person listed in each department was its boss; of course, Alice
was paramount.
Ned had finished the sales database, so the time was right to create an employee
database. Ned seemed to be enjoying his new role, though his dress standard had
slipped. Maybe Alice would have to move him out of Marketing. In fact, he
sometimes reminded her of fellows who used to hang around the computer lab all
day playing weird games. She wasn’t quite ready to sound a “nerd alert,” but it was
getting close.
Modeling a one-to-one relationship
Initially, the organization chart appears to record two relationships. First, a
department has one or more employees, and an employee belongs to one
department. Second, a department has one boss, and a person is boss of only one
department. That is, boss is a 1:1 relationship between DEPT and EMP. The data
model for this situation is shown.
A data model illustrating a 1:1 relationship
As a general rule, the 1:1 relationship is labeled to avoid confusion because the
meaning of such a relationship cannot always be inferred. This label is called a
relationship descriptor. The 1:m relationship between DEPT and EMP is not
labeled because its meaning is readily understood by reading the model. Use a
relationship descriptor when there is more than one relationship between entities or
when the meaning of the relationship is not readily inferred from the model.
If we think about this problem, we realize there is more to boss than just a
department. People also have a boss. Thus, Alice is the boss of all the other
employees. In this case, we are mainly interested in who directly bosses someone
else. So, Alice is the direct boss of Ned, Todd, Brier, and Sophie. We need to record
the person-boss relationship as well as the department-boss relationship.
The person-boss relationship is a recursive 1:m relationship because it is a
relationship between employees—an employee has one boss and a boss can have
many employees. The data model is shown in the following figure.
A data model illustrating a recursive 1:m relationship
MySQL Workbench version of a recursive 1:m relationship
It is a good idea to label the recursive relationship, because its meaning is often not
obvious from the data model.
Mapping a one-to-one relationship
Since mapping a 1:1 relationship follows the same rules as for any other data
model, the major consideration is where to place the foreign key(s). There are three
alternatives:
1. Put the foreign key in dept.
Doing so means that every instance of dept will record the empno of the employee
who is boss. Because all departments in this case have a boss, the foreign key will
always be non-null.
2. Put the foreign key in emp.
Choosing this alternative means that every instance of emp should record deptname
of the department this employee bosses. Since many employees are not bosses,
the value of the foreign key column will generally be null.
3. Put a foreign key in both dept and emp.
The consequence of putting a foreign key in both tables in the 1:1 relationship is
the combination of points 1 and 2.
A sound approach is to select the entity that results in the fewest nulls, since this
tends to be less confusing to clients. In this case, the simplest approach is to put the
foreign key in dept.
Mapping a recursive one-to-many relationship
A recursive 1:m relationship is mapped like a standard 1:m relationship. An
additional column, for the foreign key, is created for the entity at the “many” end of
the relationship. Of course, in this case the “one” and “many” ends are the same
entity, so an additional column is added to emp. This column contains the key empno of
the “one” end of the relationship. Since empno is already used as a column name, a
different name needs to be selected. In this case, it makes sense to call the foreign
key column bossno because it stores the boss’s employee number.
The mapping of the data model is shown in the following table. Note that deptname
becomes a column in emp, the “many” end of the 1:m relationship, and empno becomes
a foreign key in dept, an end of the 1:1 relationship.
The tables dept and emp
dept
deptname deptfloor deptphone empno
Management 5 2001 1
Marketing 1 2002 2
Accounting 4 2003 5
Purchasing 4 2004 7
Personnel & PR 1 2005 9
emp
empno empfname empsalary deptname bossno
1 Alice 75000 Management
2 Ned 45000 Marketing 1
3 Andrew 25000 Marketing 2
4 Clare 22000 Marketing 2
5 Todd 38000 Accounting 1
6 Nancy 22000 Accounting 5
7 Brier 43000 Purchasing 1
8 Sarah 56000 Purchasing 7
9 Sophie 35000 Personnel & PR 1
If you examine emp, you will see that the boss of the employee with empno = 2 (Ned)
has bossno = 1. You can then look up the row in emp with empno = 1 to find that Ned’s
boss is Alice. This “double lookup” is frequently used when manually interrogating
a table that represents a recursive relationship. Soon you will discover how this is
handled with SQL.
Here is the SQL to create the two tables:
CREATE TABLE dept (
deptname VARCHAR(15),
deptfloor SMALLINT NOT NULL,
deptphone SMALLINT NOT NULL,
empno SMALLINT NOT NULL,
PRIMARY KEY(deptname));
CREATE TABLE emp (
empno SMALLINT,
empfname VARCHAR(10),
empsalary DECIMAL(7,0),
deptname VARCHAR(15),
bossno SMALLINT,
PRIMARY KEY(empno),
CONSTRAINT fk_belong_dept FOREIGN KEY(deptname)
REFERENCES dept(deptname),
CONSTRAINT fk_has_boss FOREIGN KEY(bossno)
REFERENCES emp(empno));
You will notice that there is no foreign key definition for empno in dept and for bossno
in emp (the recursive boss relationship). Why? Observe that deptname is a foreign key
in emp. If we make empno a foreign key in dept, then we have a deadly embrace. A new
department cannot be added to the dept table until there is a boss for that department
(i.e., there is a person in the emp table with the empno of the boss); however, the other
constraint states that an employee cannot be added to the emp table unless there is a
department to which that person is assigned. If we have both foreign key constraints,
we cannot add a new department until we have added a boss, and we cannot add a
boss until we have added a department for that person. Nothing, under these
circumstances, can happen if both foreign key constraints are in place. Thus, only
one of them is specified.
In the case of the recursive employee relationship, we can create a constraint to
ensure that bossno exists for each employee, except of course the person, Alice, who
is top of the pyramid. This form of constraint is known as a self-referential foreign
key. However, we must make certain that the first person inserted into emp is Alice.
The following statements illustrate that we must always insert a person’s boss
before we insert the person.
INSERT INTO emp (empno, empfname, empsalary, deptname)
VALUES (1,'Alice',75000,'Management');
INSERT INTO emp VALUES (2,'Ned',45000,'Marketing',1);
INSERT INTO emp VALUES (3,'Andrew',25000,'Marketing',2);
INSERT INTO emp VALUES (4,'Clare',22000,'Marketing',2);
INSERT INTO emp VALUES (5,'Todd',38000,'Accounting',1);
INSERT INTO emp VALUES (6,'Nancy',22000,'Accounting',5);
INSERT INTO emp VALUES (7,'Brier',43000,'Purchasing',1);
INSERT INTO emp VALUES (8,'Sarah',56000,'Purchasing',7);
INSERT INTO emp VALUES (9,'Sophie',35000,'Personnel & PR',1);
In more complex modeling situations, such as when there are multiple relationships
between a pair of entities, use of a FOREIGN KEY clause may result in a deadlock.
Always consider the consequences of using a FOREIGN KEY clause before applying it.
Skill builder
A consulting company has assigned each of its employees to a specialist group (e.g.,
database management). Each specialist group has a team leader. When employees
join the company, they are assigned a mentor for the first year. One person might
mentor several employees, but an employee has at most one mentor.
Querying a one-to-one relationship
Querying a 1:1 relationship presents no special difficulties but does allow us to see
additional SQL features.
• List the salary of each department’s boss.
SELECT empfname, deptname, empsalary FROM emp
WHERE empno IN (SELECT empno FROM dept);
or
SELECT empfname, dept.deptname, empsalary
FROM emp, dept
WHERE dept.empno = emp.empno;
empfname deptname empsalary
Alice Management 75000
Ned Marketing 45000
Todd Accounting 38000
Brier Purchasing 43000
Sophie Personnel & PR 35000
Querying a recursive 1:m relationship
Querying a recursive relationship is puzzling until you realize that you can join a
table to itself by creating two copies of the table. In SQL, you create a temporary
copy, a table alias, by following the table’s name with the alias (e.g., emp wrk creates
a temporary copy, called wrk, of the permanent table emp). Table aliases are always
required so that SQL can distinguish the copy of the table being referenced. To
demonstrate:
• Find the salary of Nancy’s boss.
SELECT wrk.empfname, wrk.empsalary, boss.empfname, boss.empsalary
FROM emp wrk, emp boss
WHERE wrk.empfname = 'Nancy'
AND wrk.bossno = boss.empno;
Many queries are solved by getting all the data you need to answer the request in
one row. In this case, the query is easy to answer once the data for Nancy and her
boss are in the one row. Thus, think of this query as joining two copies of the table
emp to get the worker and her boss’s data in one row. Notice that there is a qualifier
(wrk and boss) for each copy of the table to distinguish between them. It helps to use a
qualifier that makes sense. In this case, the wrk and boss qualifiers can be thought of
as referring to the worker and boss tables, respectively. You can understand how the
query works by examining the following table illustrating the self-join.
wrk boss
empno empfname empsalary deptname bossno empno empfname empsalary deptname bossno
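• Find the names of employees who earn more than their boss.
A sketch of this query, using the same wrk and boss aliases, is:
SELECT wrk.empfname FROM emp wrk, emp boss
WHERE wrk.bossno = boss.empno
AND wrk.empsalary > boss.empsalary;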
This would be very easy if the employee and boss data were in the same row. We
could simply compare the salaries of the two people. To get the data in the one row,
we repeat the self-join with a different WHERE condition. The result is as follows:
empfname
Sarah
Skill builder
Find the name of Sophie’s boss.
Alice has found several histories of The Expeditioner from a variety of eras. Beca
Modeling a recursive one-to-one relationship
The British monarchy can be represented by a simple one-entity model. A monarch
has one direct successor and one direct predecessor. The sequencing of monarchs
can be modeled by a recursive 1:1 relationship, shown in the following figure.
A data model illustrating a recursive 1:1 relationship
MySQL Workbench version of a recursive 1:1 relationship
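A sketch of the corresponding table definition follows. The data types and the constraint name are illustrative assumptions; the composite primary key (monname, monnum) and the self-referential foreign key follow from the model and the INSERT statements below.
CREATE TABLE monarch (
montype CHAR(5) NOT NULL,
monname VARCHAR(15),
monnum VARCHAR(5),
rgnbeg DATE,
premonname VARCHAR(15),
premonnum VARCHAR(5),
PRIMARY KEY(monname, monnum),
CONSTRAINT fk_has_predecessor FOREIGN KEY(premonname, premonnum)
REFERENCES monarch(monname, monnum));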
Because the 1:1 relationship is recursive, you cannot insert Queen Victoria without
first inserting King William IV. What you can do is first insert King William,
without any reference to the preceding monarch (i.e., a null foreign key). The
following code illustrates the order of record insertion so that the referential
integrity constraint is obeyed.
INSERT INTO MONARCH (montype, monname, monnum, rgnbeg)
VALUES ('King','William','IV','1830-06-26');
INSERT INTO MONARCH
VALUES ('Queen','Victoria','I','1837-06-20','William','IV');
INSERT INTO MONARCH
VALUES ('King','Edward','VII','1901-01-22','Victoria','I');
INSERT INTO MONARCH
VALUES ('King','George','V','1910-05-06','Edward','VII');
INSERT INTO MONARCH
VALUES ('King','Edward','VIII','1936-01-20','George','V');
INSERT INTO MONARCH
VALUES('King','George','VI','1936-12-11','Edward','VIII');
INSERT INTO MONARCH
VALUES('Queen','Elizabeth','II','1952-02-06','George','VI');
Skill builder
In a competitive bridge competition, the same pair of players play together for the
entire tournament. Draw a data model to record details of all the players and the
pairs of players.
Querying a recursive one-to-one relationship
Some queries on the MONARCH table demonstrate querying a recursive 1:1
relationship.
• Who preceded Elizabeth II?
SELECT premonname, premonnum FROM monarch
WHERE monname = 'Elizabeth' and monnum = 'II';
premonname premonnum
George VI
This is simple because all the data are in one row. A more complex query is:
• Was Elizabeth II’s predecessor a king or queen?
SELECT pre.montype FROM monarch cur, monarch pre
WHERE cur.premonname = pre.monname
AND cur.premonnum = pre.monnum
AND cur.monname = 'Elizabeth' AND cur.monnum = 'II';
montype
King
This is very similar to the query to find the salary of Nancy’s boss. The monarch table
is joined with itself to create a row that contains all the details to answer the query.
• List the kings and queens of England in ascending chronological order.
SELECT montype, monname, monnum, rgnbeg
FROM monarch ORDER BY rgnbeg;
montype monname monnum rgnbeg
Queen Victoria I 1837-06-20
King Edward VII 1901-01-22
King George V 1910-05-06
King Edward VIII 1936-01-20
King George VI 1936-12-11
Queen Elizabeth II 1952-02-06
This is a simple query because rgnbeg is like a ranking column. It would not be
enough to store just the year in rgnbeg, because two kings started their reigns in 1936;
hence, the full date is required.
Skill builder
Who succeeded Queen Victoria?
Of course, Alice soon had another project for Ned. The Expeditioner keeps a wid
Modeling a recursive many-to-many relationship
The assembly of products to create other products is very common in business.
Manufacturing even has a special term to describe it: a bill of materials. The data
model is relatively simple once you realize that a product can appear as part of
many other products and can be composed of many other products; that is, we have
a recursive many-to-many (m:m) relationship for product. As usual, we turn an
m:m relationship into two one-to-many (1:m) relationships. Thus, we get the data
model displayed in the following figure.
A data model illustrating a recursive m:m relationship
MySQL Workbench version of a recursive m:m relationship
assembly
quantity prodid subprodid
1 1000 101
1 1000 102
1 1000 103
1 1000 104
1 1000 105
2 1000 106
1 1000 107
10 1000 108
Mapping a recursive many-to-many relationship
Mapping follows the same procedure described previously, producing the two
tables shown below. The SQL statements to create the tables are shown next.
Observe that assembly has a composite key, and there are two foreign key constraints.
CREATE TABLE product (
prodid INTEGER,
proddesc VARCHAR(30),
prodcost DECIMAL(9,2),
prodprice DECIMAL(9,2),
PRIMARY KEY(prodid));
CREATE TABLE assembly (
quantity INTEGER NOT NULL,
prodid INTEGER,
subprodid INTEGER,
PRIMARY KEY(prodid, subprodid),
CONSTRAINT fk_assembly_product FOREIGN KEY(prodid)
REFERENCES product(prodid),
CONSTRAINT fk_assembly_subproduct FOREIGN KEY(subprodid)
REFERENCES product(prodid));
Skill builder
An army is broken up into many administrative units (e.g., army, brigade, platoon).
A unit can contain many other units (e.g., a regiment contains two or more
battalions), and a unit can be part of a larger unit (e.g., a squad is a member of a
platoon). Draw a data model for this situation.
Querying a recursive many-to-many relationship
• List the product identifier of each component of the animal photography kit.
SELECT subprodid FROM product, assembly
WHERE proddesc = 'Animal photography kit'
AND product.prodid = assembly.prodid;
subprodid
101
106
107
105
104
103
102
108
Why are the values for subprodid listed in no apparent order? Remember, there is no
implied ordering of rows in a table, and it is quite possible, as this example
illustrates, for the rows to have what appears to be an unusual ordering. If you want
to order rows, use the ORDER BY clause.
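For example, a sketch of the same query with an ORDER BY clause added lists the component identifiers in ascending order:
SELECT subprodid FROM product, assembly
WHERE proddesc = 'Animal photography kit'
AND product.prodid = assembly.prodid
ORDER BY subprodid;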
• List the product description and cost of each component of the animal photography
kit.
SELECT proddesc, prodcost FROM product
WHERE prodid IN
(SELECT subprodid FROM product, assembly
WHERE proddesc = 'Animal photography kit'
AND product.prodid = assembly.prodid);
In this case, first determine the prodid of those products in the animal photography kit
(the inner query), and then report the description of these products. Alternatively, a
three-way join can be done using two copies of product.
SELECT b.proddesc, b.prodcost FROM product a, assembly, product b
WHERE a.proddesc = 'Animal photography kit'
AND a.prodid = assembly.prodid
AND assembly.subprodid = b.prodid;
proddesc prodcost
Camera 150.00
Camera case 10.00
70–210 zoom lens 125.00
28–85 zoom lens 115.00
Photographer’s vest 25.00
Lens cleaning cloth 1.00
Tripod 35.00
16 GB SDHC memory card 0.85
Skill builder
How many lens cleaning cloths are there in the animal photography kit?
Summary
Relationships can be one-to-one and recursive. A recursive relationship is within a
single entity rather than between entities. Recursive relationships are mapped to the
relational model in the same way as other relationships. A self-referential foreign
key constraint permits a foreign key reference to a key within the same table.
Resolution of queries involving recursive relationships often requires a table to be
joined with itself. Recursive many-to-many relationships occur in business in the
form of a bill of materials.
Key terms and concepts
Exercises
1. Draw data models for the following two problems:
a. (i) A dairy farmer, who is also a part-time cartoonist, has several herds of
cows. He has assigned each cow to a particular herd. In each herd, the farmer
has one cow that is his favorite—often that cow is featured in a cartoon.
(ii) A few malcontents in each herd, mainly those who feel they should have
appeared in the cartoon, disagree with the farmer’s choice of a favorite cow,
whom they disparagingly refer to as the sacred cow. As a result, each herd now
has elected a herd leader.
b. The originator of a pyramid marketing scheme has a system for selling
ethnic jewelry. The pyramid has three levels—gold, silver, and bronze. New
associates join the pyramid at the bronze level. They contribute 30 percent of
the revenue of their sales of jewelry to the silver chief in charge of their clan.
In turn, silver chiefs contribute 30 percent of what they receive from bronze
associates to the gold master in command of their tribe. Finally, gold masters
pass on 30 percent of what they receive to the originator of the scheme.
c. The legion, the basic combat unit of the ancient Roman army, contained
3,000 to 6,000 men, consisting primarily of heavy infantry (hoplites),
supported by light infantry (velites), and sometimes by cavalry. The hoplites
were drawn up in three lines. The hastati (youngest men) were in the first, the
principes (seasoned troops) in the second, and the triarii (oldest men) behind
them, reinforced by velites. Each line was divided into 10 maniples, consisting
of two centuries (60 to 80 men per century) each. Each legion had a
commander, and a century was commanded by a centurion. Julius Caesar,
through one of his Californian channelers, has asked you to design a database
to maintain details of soldiers. Of course, Julius is a little forgetful at times,
and he has not supplied the titles of the officers who command maniples, lines,
and hoplites, but he expects that you can handle this lack of fine detail.
d. A travel agency is frequently asked questions about tourist destinations. For
example, customers want to know details of the climate for a particular month,
the population of the city, and other geographic facts. Sometimes they request
the flying time and distance between two cities. The manager has asked you to
create a database to maintain these facts.
e. The Center for the Study of World Trade keeps track of trade treaties
between nations. For each treaty, it records details of the countries signing the
treaty and where and when it was signed.
f. Design a database to store details about U.S. presidents and their terms in
office. Also, record details of their date and place of birth, gender, and political
party affiliation (e.g., Caluthumpian Progress Party). You are required to
record the sequence of presidents so that the predecessor and successor of any
president can be identified. How will you model the case of Grover Cleveland,
who served nonconsecutive terms as president? Is it feasible that political party
affiliation may change? If so, how will you handle it?
g. The IS department of a large organization makes extensive use of software
modules. New applications are built, where possible, from existing modules.
Software modules can also contain other modules. The IS manager realizes that
she now needs a database to keep track of which modules are used in which
applications or other modules. (Hint: It is helpful to think of an application as a
module.)
h. Data modeling is finally getting to you. Last night you dreamed you were
asked by Noah to design a database to store data about the animals on the ark.
All you can remember from Sunday school is the bit about the animals entering
the ark two-by-two, so you thought you should check the real thing.
Take with you seven pairs of every kind of clean animal, a male and its mate,
and two of every kind of unclean animal, a male and its mate, and also seven
pair of every kind of bird, male and female. Genesis 7:2
Next time Noah disturbs your sleep, you want to be ready. So, draw a data
model and make certain you record the two-by-two relationship.
2. Write SQL to answer the following queries using the DEPT and EMP tables
described in this chapter:
a. Find the departments where all the employees earn less than their boss.
b. Find the names of employees who are in the same department as their boss
(as an employee).
c. List the departments having an average salary greater than $25,000.
d. List the departments where the average salary of the employees, excluding
the boss, is greater than $25,000.
e. List the names and manager of the employees of the Marketing department
who have a salary greater than $25,000.
f. List the names of the employees who earn more than any employee in the
Marketing department.
3. Write SQL to answer the following queries using the monarch table described in
this chapter:
a. Who succeeded Victoria I?
b. How many days did Victoria I reign?
c. How many kings are there in the table?
d. Which monarch had the shortest reign?
4. Write SQL to answer the following queries using the product and assembly tables:
a. How many different items are there in the animal photography kit?
b. What is the most expensive item in the animal photography kit?
c. What is the total cost of the components of the animal photography kit?
d. Compute the total quantity for each of the items required to assemble 15
animal photography kits.
CD Library case
Ajay had some more time to work on this CD Library database. He picked up The
Manhattan Transfer’s Swing CD to enter its data. Very quickly he recognized the
need for yet another entity, if he were going to store details of the people in a group.
A group can contain many people (there are four artists in The Manhattan Transfer),
and also it is feasible that over time a person could be a member of multiple groups.
So, there is an m:m between PERSON and GROUP. Building on his understanding
of the relationship between PERSON and RECORDING and CD, Ajay realized there
are two other required relationships: an m:m between GROUP and RECORDING,
and between GROUP and CD. This led to version 5.0, shown in the following figure.
CD library V5.0
Exercises
1. The members of The Manhattan Transfer are Cheryl Bentyne, Janis Siegel, Tim
Hauser, and Alan Paul. Update the database with this information, and write SQL
to answer the following:
a. Who are the members of The Manhattan Transfer?
b. What CDs have been released by The Manhattan Transfer?
c. What CDs feature Cheryl Bentyne as an individual or member of a group?
2. The group known as Asleep at the Wheel is featured on tracks 3, 4, and 7 of
Swing. Record these data and write SQL to answer the following:
a. What are the titles of the recordings on which Asleep at the Wheel appear?
b. List all CDs that have any tracks featuring Asleep at the Wheel.
3. Record the following facts. The music of “Sing a Song of Brown,” composed in
1937, is by Count Basie and Larry Clinton and lyrics by Jon Hendricks. “Sing
Moten’s Swing,” composed in 1932, features music by Buster and Benny Moten,
and lyrics by Jon Hendricks. Write SQL to answer the following:
a. For what songs has Jon Hendricks written the lyrics?
b. Report all compositions and their composers where more than one person
was involved in composing the music. Make certain you report them in the
correct order.
4. Test your SQL skills with the following:
a. List the tracks on which Alan Paul appears as an individual or as a member
of The Manhattan Transfer.
b. List all CDs featuring a group, and report the names of the group members.
c. Report all tracks that feature more than one group.
d. List the composers appearing on each CD.
5. How might you extend the data model?
7. Data Modeling
Man is a knot, a web, a mesh into which relationships are tied. Only those
relationships matter.
Antoine de Saint-Exupéry in Flight to Arras
Learning objectives
Students completing this chapter will be able to create a well-formed, high-fidelity
data model.
Modeling
Modeling is widely used within business to learn about organizational problems and
design solutions. To understand where data modeling fits within the broader context,
it is useful to review the full range of modeling activities.
A broad perspective on modeling
                   Scope           Model               Technology
Motivation (Why)   Goals           Business plan       Groupware
People (Who)       Business units  Organization chart  System interface
Time (When)        Key events      PERT chart          Scheduling
Data (What)        Key entities    Data model          Relational database
Function (How)     Key processes   Process model       Application software
Network (Where)    Locations       Logistics network   System architecture
Modeling occurs at multiple levels. At the highest level, an organization needs to
determine the scope of its business by identifying the major elements of its
environment, such as its goals, business units, where it operates, and critical events,
entities, and processes. At the top level, textual models are frequently used. For
example, an organization might list its major goals and business processes. A map
will typically be used to display where the business operates.
Once senior managers have clarified the scope of a business, models can be
constructed for each of the major elements. Goals will be converted into a business
plan, business units will be shown on an organizational chart, and so on. The key
elements, from an IS perspective, are data and process modeling, which are
typically covered in the core courses of an IS program.
Technology, the final stage, converts models into operational systems to support the
organization. Many organizations rely extensively on e-mail and Web technology
for communicating business decisions. People connect to systems through an
interface, such as a Web browser. Ensuring that events occur at the right time is
managed by scheduling software. This could be implemented with operating system
procedures that schedule the execution of applications. In some database
management systems (DBMSs), triggers can be established. A trigger is a database
procedure that is automatically executed when some event is recognized. For
example, U.S. banks are required to report all deposits exceeding USD 10,000, and a
trigger could be coded for this event.
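As an illustration only, such a trigger might be sketched in MySQL as follows; the deposit and deposit_report tables are hypothetical and are not part of any database used in this book.
CREATE TABLE deposit (
depositid INTEGER AUTO_INCREMENT,
account VARCHAR(20),
amount DECIMAL(12,2),
PRIMARY KEY(depositid));
CREATE TABLE deposit_report (
depositid INTEGER,
reported DATE,
PRIMARY KEY(depositid));
DELIMITER //
CREATE TRIGGER deposit_watch AFTER INSERT ON deposit
FOR EACH ROW
BEGIN
-- Log any deposit exceeding USD 10,000 for later reporting
IF NEW.amount > 10000 THEN
INSERT INTO deposit_report (depositid, reported)
VALUES (NEW.depositid, CURDATE());
END IF;
END//
DELIMITER ;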
Data models are typically converted into relational databases, and process models
become computer programs. Thus, you can see that data modeling is one element of
a comprehensive modeling activity that is often required to design business systems.
When a business undergoes a major change, such as a reengineering project, many
dimensions can be altered, and it may be appropriate to rethink many elements of
the business, starting with its goals. Because such major change is very disruptive
and costly, it occurs less frequently. It is more likely that data modeling is conducted
as a stand-alone activity or part of process modeling to create a new business
application.
Data modeling
You were introduced to the basic building blocks of data modeling in the earlier
chapters of this section. Now it is time to learn how to assemble blocks to build a
data model. Data modeling is a method for determining what data and relationships
should be stored in the database. It is also a way of communicating a database
design.
The goal of data modeling is to identify the facts that must be stored in a database. A
data model is not concerned with how the data will be stored. This is the concern of
those who implement the database. A data model is not concerned with how the data
will be processed. This is the province of process modeling. The goal is to create a
data model that is an accurate representation of data needs and real-world data
relationships.
Building a data model is a partnership between a client, a representative of the
eventual owners of the database, and a designer. Of course, there can be a team of
clients and designers. For simplicity, we assume there is one client and one designer.
Drawing a data model is an iterative process of trial and revision. A data model is a
working document that will change as you learn more about the client’s needs and
world. Your early versions are likely to be quite different from the final product.
Use a product such as MySQL Workbench for drawing and revising a data model.
The building blocks
The purpose of a database is to store data about things, which can include facts (e.g.,
an exchange rate), plans (e.g., scheduled production of two-person tents for June),
estimates (e.g., forecast of demand for geopositioning systems), and a variety of
other data. A data model describes these things and their relationships with other
things using four components: entity, attribute, relationship, and identifier.
Entity
The entity is the basic building block of a data model. An entity is a thing about
which data should be stored, something we need to describe. Each entity in a data
model has a unique name that we write in singular form. Why singular? Because we
want to emphasize that an entity describes an instance of a thing. Thus, we
previously used the word SHARE to define an entity because it describes each
instance of a share rather than shares in general.
We have already introduced the convention that an entity is represented by a
rectangle, and the name of the entity is shown in uppercase letters, as follows.
The entity SHARE
A large data model might contain over 1,000 entities. A database can easily contain
hundreds of millions of instances (rows) for any one entity (table). Imagine the
number of instances in a national tax department’s database.
How do you begin to identify entities? One approach is to underline any nouns in
the system proposal or provided documentation. Most nouns are possible entities,
and underlining ensures that you do not overlook any potential entities. Start by
selecting an entity that seems central to the problem. If you were designing a student
database, you might start with the student. Once you have picked a central entity,
describe it. Then move to the others.
Attribute
An attribute describes an entity. When an entity has been identified, the next step is to
determine its attributes, that is, the data that should be kept to describe fully the
entity. An attribute name is singular and unique within the data model. You may need
to use a modifier to make an attribute name unique (e.g., “hire date” and “sale date”
rather than just “date”).
Our convention is that the name of an attribute is recorded in lowercase letters
within the entity rectangle. The following figure illustrates that SHARE has
attributes share code, share name, share price, share quantity, share dividend, and
share PE. Notice the frequent use of the modifier “share”. It is possible that other
entities in the database might also have a price and quantity, so we use a modifier to
create unique attribute names. Note also that share code is the identifier.
The entity SHARE with its attributes
Defining attributes generally takes considerable discussion with the client. You
should include any attribute that is likely to be required for present or future
decision making, but don’t get carried away. Avoid storing unnecessary data. For
example, to describe the entity STUDENT you usually record date of birth, but it is
unlikely that you would store height. If you were describing an entity PATIENT, on
the other hand, it might be necessary to record height, because that is sometimes
relevant to medical decision making.
An attribute has a single value, which may be null. Multiple values are not allowed.
If you need to store multiple values, it is a signal that you have a one-to-many (1:m)
relationship and need to define another entity.
Relationship
Entities are related to other entities. If there were no relationships between entities,
there would be no need for a data model and no need for a relational database. A
simple, flat file (a single-entity database) would be sufficient. A relationship is
binary. It describes a linkage between two entities and is represented by a line
between them.
Because a relationship is binary, strictly speaking it has two relationship
descriptors, one for each entity. Each relationship descriptor has a degree stating
how many instances of the other entity may be related to each instance of the
described entity. Consider the entities STOCK and NATION and their 1:m
relationship (see the following figure). We have two relationship descriptors: stocks
of nation for NATION and nation of stock for STOCK. The relationship descriptor
stocks of nation has a degree of m because a nation may have zero or more listed
stocks. The relationship descriptor nation of stock has a degree of 1 because a stock
is listed in at most one nation. To reduce data model clutter, where necessary for
clarification, we will use a single label for the pair of relationship descriptors.
Experience shows this approach captures the meaning of the relationship and
improves readability.
A 1:m relationship between STOCK and NATION
Because the meaning of relationships frequently can be inferred, there is no need to
label every one; however, each additional relationship between two entities should
be labeled to clarify meaning. Also, it is a good idea to label one-to-one (1:1)
relationships, because the meaning of the relationship is not always obvious.
Consider the fragment in the following figure. The descriptors of the 1:m
relationship can be inferred as a firm has employees and an employee belongs to a
firm. The 1:1 relationship is clarified by labeling it as firm’s boss.
Relationship labeling
Identifier
An identifier uniquely distinguishes an instance of an entity. An identifier can be one
or more attributes and may include the identifier of a related entity. When a related
entity’s identifier is part of an identifier, a plus sign is placed on the line closest to
the entity being identified. In the following figure, an instance of the entity
LINEITEM is identified by the composite of lineno and saleno, the identifier of
SALE. The value lineno does not uniquely identify an instance of LINEITEM,
because it is simply a number that appears on the sales form. Thus, the identifier
must include saleno (the identifier of SALE). So, any instance of LINEITEM is
uniquely identified by the composite saleno and lineno.
A related entity’s identifier as part of the identifier
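When this fragment is mapped to SQL, the composite identifier becomes a composite primary key that includes the key of SALE. A minimal sketch, with illustrative column names and types:
CREATE TABLE sale (
saleno INTEGER,
saledate DATE,
PRIMARY KEY(saleno));
CREATE TABLE lineitem (
lineno INTEGER,
saleno INTEGER,
lineqty INTEGER,
PRIMARY KEY(lineno, saleno),
CONSTRAINT fk_lineitem_sale FOREIGN KEY(saleno)
REFERENCES sale(saleno));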
Occasionally, there will be several possible identifiers, and these are each given a
different symbol. Our convention is to prefix an identifier with an asterisk (*). If
there are multiple identifiers, use other symbols such as #, !, or &. Be careful: If you
have too many possible identifiers, your model will look like comic book swearing.
In most cases, you will have only one identifier.
No part of an identifier can be null. If this were permitted, there would be no
guarantee that the other parts of the identifier were sufficiently unique to distinguish
all instances in the database.
Data model quality
There are two criteria for judging the quality of a data model. It must be well-
formed and have high fidelity.
A well-formed data model
A well-formed data model clearly communicates information to the client. Being
well-formed means the construction rules have been obeyed. There is no ambiguity;
all entities are named, and all entities have identifiers. If identifiers are missing, the
client may make incorrect inferences about the data model. All relationships are
recorded using the proper notation and labeled whenever there is a possibility of
confusion.
Characteristics of a well-formed data model
• All construction rules are obeyed.
• There is no ambiguity.
• All entities are named.
• Every entity has an identifier.
• All relationships are represented, using the correct notation.
• Relationships are labeled to avoid misunderstanding.
• All attributes of each entity are listed.
• All attribute names are meaningful and unique.
All the attributes of an entity are listed because missing attributes create two types of
problems. First, it is unclear what data will be stored about each instance. Second,
the data model may be missing some relationships. Attributes that can have multiple
values become entities. It is only by listing all of an entity’s attributes during data
modeling that these additional entities are recognized.
In a well-formed data model, all attribute names are meaningful and unique. The
names of entities, identifiers, attributes, and relationships must be meaningful to the
client because they are meant to describe the client’s world. Indeed, in nearly all
cases they are the client’s everyday names. Take care in selecting words because
they are critical to communicating meaning. The acid test for comprehension is to
get the client to read the data model to other potential users. Names need to be
unique to avoid confusion.
A high-fidelity image
Music lovers aspire to own a high-fidelity stereo system—one that faithfully
reproduces the original performance with minimal or no distortion. A data model is
a high-fidelity image when it faithfully describes the world it is supposed to
represent. All relationships are recorded and are of the correct degree. There are no
compromises or distortions. If the real-world relationship is many-to-many (m:m),
then so is the relationship shown in the data model. A well-formed, high-fidelity data
model is complete, understandable, accurate, and syntactically correct.
Quality improvement
A data model is an evolving representation. Each change should be an incremental
improvement in quality. Occasionally, you will find a major quality problem at the
data model’s core and have to change the model dramatically.
Detail and context
The quality of a data model can be determined only by understanding the context in
which it will be used. Consider the data model (see the following figure) used in
Chapter 4 to discuss the 1:m relationship. This fragment says a nation has many
stocks. Is that really what we want to represent? Stocks are listed on a stock
exchange, and a nation may have several stock exchanges. For example, the U.S. has
the New York Stock Exchange and NASDAQ. Furthermore, some foreign stocks are
listed on the New York Stock Exchange. If this is the world we have to describe, the
data model is likely to differ from that shown in the figure. Try drawing the revised
data model.
A 1:m relationship between STOCK and NATION
As you can see, the data model in the following figure is quite different from the
initial data model of preceding figure. Which one is better? They both can be valid
models; it just depends on the world you are trying to represent and possible future
queries. In the first case, the purpose was to determine the value of a portfolio.
There was no interest in the exchange on which stocks were listed or in their home
exchange. So the first model has high fidelity for the described situation.
The second data model would be appropriate if the client needed to know the home
exchange of stocks and their price on the various exchanges where they were listed.
The second data model is an improvement on the first if it incorporates the
additional facts and relationships required. If it does not, then the additional detail is
not worth the extra cost.
Revised NATION-STOCK data model
Before looking more closely at the data model, let’s clarify the meaning of unittype.
Because most countries are divided into administrative units that are variously
called states, provinces, territories, and so forth, unittype indicates the type of
administrative unit (e.g., state). How many errors can you find in the initial data
model?
Problems and solutions for the initial world’s cities data model
Problem
1. City names are not unique. There is an Athens in Greece, and the U.S. has an Athens in Georgia.
2. Administrative unit names are not necessarily unique. There used to be a Georgia in the old USSR.
3. There is an unlabeled 1:1 relationship between CITY and ADMIN UNIT and between CITY and NATION.
4. The assumption is that a nation has only one capital, but there are exceptions. South Africa has three capitals.
5. The assumption is that an administrative unit has only one capital, but there are exceptions. C
6. Some values can be derived. National population (natpop) and area (natarea) are the sums of the populations and areas of the nation’s administrative units.
A revised world’s cities data model
This geography lesson demonstrates how you often start with a simple model of
low fidelity. By additional thinking about relationships and consideration of
exceptions, you gradually create a high-fidelity data model.
Skill builder
Write SQL to determine which administrative units have two capitals and which have
a shared capital.
Family matters
Families can be very complicated. It can be tricky to keep track of all those relations
—maybe you need a relational database (this is the worst joke in the book; they
improve after this). We start with a very limited view of marriage and gradually
ease the restrictions to demonstrate how any and all aspects of a relationship can be
modeled. An initial fragment of the data model is shown in the following figure.
There are several things to notice about it.
A man-woman data model
1. There is a 1:1 relationship between man and woman. The labels indicate that this
relationship is marriage, but there is no indication of the marriage date. We are
left to infer that the data model records the current marriage.
2. MAN and WOMAN have the same attributes. A prefix is used to make them
unique. Both have an identifier (id, in the United States this would probably be a
Social Security number), date of birth (dob), first name (fname), other names
(oname), and last name (name).
3. Other names (oname) looks like a multivalued attribute, which is not allowed.
Should oname be single or multivalued? This is a tricky decision. It depends on
how these data will be used. If queries such as “find all men whose other names
include Herbert” are likely, then a person’s other names—kept as a separate entity
that has a 1:m relationship with MAN and WOMAN—must be established. That is,
a man or woman can have one or more other names. However, if you just want to
store the data for completeness and retrieve it in its entirety, then oname is fine. It
is just a text string. The key question to ask is, “What is the lowest level of detail
possibly required for future queries?” Also, remember that SQL’s REGEXP
clause can be used to search for values within a text string.
The use of a prefix to distinguish attribute names in different entities suggests you
should consider combining the entities. In this case, we could combine the entities
MAN and WOMAN to create an entity called PERSON. Usually when entities are
combined, you have to create a new attribute to distinguish between the different
types. In this example, the attribute gender is added. We can also generalize the
relationship label to spouse. The revised data model appears in the following figure.
A PERSON data model
Marriage is an m:m relationship between two persons. It has attributes begindate and
enddate. An instance of marriage is uniquely identified by a composite identifier:
the two spouse identifiers and begindate. This means any marriage is uniquely
identified by the composite of two person identifiers and the beginning date of the
marriage. We need begindate as part of the identifier because the same couple might
have more than one marriage (e.g., get divorced and remarry each other later).
Furthermore, we can safely assume that it is impossible for a couple to get married,
divorced, and remarried all on the one day. Begindate and enddate can be used to
determine the current state of a marriage. If enddate is null, the marriage is current;
otherwise, the couple has divorced.
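A sketch of how this version of the model might be mapped, with illustrative names and types; the composite primary key of marriage is the two spouse identifiers plus begindate, as described above:
CREATE TABLE person (
persid INTEGER,
persdob DATE,
persfname VARCHAR(20),
personame VARCHAR(50),
perslname VARCHAR(20),
persgender CHAR(1),
PRIMARY KEY(persid));
CREATE TABLE marriage (
spouse1id INTEGER,
spouse2id INTEGER,
begindate DATE,
enddate DATE,
PRIMARY KEY(spouse1id, spouse2id, begindate),
CONSTRAINT fk_marriage_person1 FOREIGN KEY(spouse1id) REFERENCES person(persid),
CONSTRAINT fk_marriage_person2 FOREIGN KEY(spouse2id) REFERENCES person(persid));
Note that begindate cannot be null here because it is part of the primary key, which is precisely the limitation discussed next.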
This data model assumes a couple goes through some formal process to get
married or divorced, and there is an official date for both of these events. What
happens if they just gradually drift into cohabitation, and there is no official
beginning date? Think about it. (The data model problem, that is—not cohabitation!)
Many countries recognize this situation as a common-law marriage, so the data
model needs to recognize it. The present data model cannot handle this situation
because begindate cannot be null—it is an identifier. Instead, a new identifier is
needed, and begindate should become an attribute.
Two new attributes can handle a common-law marriage. Marriageno can count the
number of times a couple has been married to each other. In the majority of cases,
marriageno will be 1. Marriagestatus can record whether a marriage is current or
ended. Now we have a data model that can also handle common-law marriages. This
is also a high-quality data model in that the client does not have to remember to
examine enddate to determine a marriage’s current status. It is easier to remember
to examine marriagestatus to check status. Also, we can allow a couple to be
married, divorced, and remarried as many times as they like on the one day—which
means we can now use the database in Las Vegas.
A revised marriage data model
All right, now that we have the couple successfully married, we need to start
thinking about children. A marriage has zero or more children, and let’s start with
the assumption a child belongs to only one marriage. Therefore, we have a 1:m
relationship between marriage and person to represent the children of a marriage.
A marriage with children data model
You might want to consider how the model would change to handle single-parent
families, adopted children, and other aspects of human relationships.
Skill builder
The International Commission for Border Resolution Disputes requires a database
to record details of which countries have common borders. Design the database.
Incidentally, which country borders the most other countries?
When’s a book not a book?
Sometimes we have to rethink our ideas of physical objects. Consider a data model
for a library.
A library data model fragment
The preceding fragment assumes that a person borrows a book. What happens if the
library has two copies of the book? Do we add an attribute to BOOK called copy
number? No, because we would introduce redundancy by repeating the same
information for each book. What you need to recognize is that in the realm of data
modeling, a book is not really a physical thing, but the copy is, because it is what
you borrow.
A revised library data model fragment
As you can see, a book has many copies, and a copy is of one book. A book can b
The International Standard Book Number (ISBN) uniquely identifies any instance of
a book (not copy). Although it sounds like a potential identifier for BOOK, it is not.
ISBNs were introduced in the second half of the twentieth century, and books
published before then do not have an ISBN.
A history lesson
Many organizations maintain historical data (e.g., a person’s job history or a
student’s enrollment record). A data model can depict historical data relationships
just as readily as current data relationships.
Consider the case of an employee who works for a firm that consists of divisions
(e.g., production) and departments (e.g., quality control). The firm contains many
departments, but a department belongs to only one division. At any one time, an
employee belongs to only one department.
Employment history—take 1
The fragment in the figure can be amended to keep track of the divisions and
departments in which a person works. While employed with a firm, a person can
work in more than one department. Since a department can have many employees,
we have an m:m relationship between DEPARTMENT and EMPLOYEE. We might
call the resulting associative entity POSITION.
Employment history—take 2
The revised fragment records employee work history. Note that any instance of P
People who work get paid. How do we keep track of an employee’s pay data? An
employee has many pay slips, but a pay slip belongs to one employee. When you
look at a pay slip, you will find it contains many items: gross pay and a series of
deductions for tax, medical insurance, and so on. Think of a pay slip as containing
many lines, analogous to a sales form. Now look at the revised data model.
Employment history—take 3
Typically, an amount shown on a pay slip is identified by a short text field (e.g., g
Employment history—take 4
A ménage à trois for entities
Consider aircraft leasing. A plane is leased to an airline for a specific period. When
a lease expires, the plane can be leased to another airline. So, an aircraft can be
leased many times, and an airline can lease many aircraft. Furthermore, there is an
agent responsible for handling each deal. An agent can lease many aircraft and deal
with many airlines. Over time, an airline will deal with many agents. When an agent
reaches a deal with an airline to lease a particular aircraft, you have a transaction. If
you analyze the aircraft leasing business, you discover there are three m:m
relationships. You might think of these as three separate relationships.
An aircraft-airline-agent data model
The problem with three separate m:m relationships is that it is unclear where to
store data about the lease. Is it stored in AIRLINE-AIRCRAFT? If you store the
information there, what do you do about recording the agent who closed the deal?
After you read the fine print and do some more thinking, you discover that a lease is
the association of these three entities in an m:m relationship.
A revised aircraft-airline-agent data model
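A sketch of how LEASE might be mapped, assuming a single-column identifier for each of the three related entities and the lease begin date as part of LEASE's identifier (other attributes omitted for brevity):
CREATE TABLE aircraft (
aircraftid INTEGER,
PRIMARY KEY(aircraftid));
CREATE TABLE airline (
airlineid INTEGER,
PRIMARY KEY(airlineid));
CREATE TABLE agent (
agentid INTEGER,
PRIMARY KEY(agentid));
CREATE TABLE lease (
aircraftid INTEGER,
airlineid INTEGER,
agentid INTEGER,
leasebegin DATE,
leaseend DATE,
PRIMARY KEY(aircraftid, airlineid, agentid, leasebegin),
CONSTRAINT fk_lease_aircraft FOREIGN KEY(aircraftid) REFERENCES aircraft(aircraftid),
CONSTRAINT fk_lease_airline FOREIGN KEY(airlineid) REFERENCES airline(airlineid),
CONSTRAINT fk_lease_agent FOREIGN KEY(agentid) REFERENCES agent(agentid));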
Skill builder
Horse racing is a popular sport in some parts of the world. A horse competes in at
most one race on a course at a particular date. Over time, a horse can compete in
many races on many courses. A horse’s rider is called a jockey, and a jockey can
ride many horses and a horse can have many jockeys. Of course, there is only ever
one jockey riding a horse at a particular time. Courses vary in their features, such as
the length of the course and the type of surface (e.g., dirt or grass). Design a
database to keep track of the results of horse races.
Project management—planning and doing
Project management involves both planned and actual data. A project is divided into
a number of activities that use resources. Planning includes estimation of the
resources to be consumed. When a project is being executed, managers keep track
of the resources used to monitor progress and keep the project on budget. Planning
data may not be as detailed as actual data and is typically fairly broad, such as an
estimate of the number of hours that an activity will take. Actual data will be more
detailed because they are usually collected by getting those assigned to the project to
log a daily account of how they spent their time. Also, resources used on a project
are allocated to a particular activity. For the purposes of this data model, we will
focus only on recording time and nothing else.
A project management data model
Notice that planned hours is an attribute of ACTIVITY, and actual hours is an
attribute of DAILY WORK. The hours spent on an activity are derived by summing
actual hours in the associated DAILY WORK entity. This is a high-fidelity data
model if planning is done at the activity level and employees submit daily
worksheets. Planning can be done in greater detail, however, if planners indicate
how many hours of each day each employee should spend on a project.
A revised project management data model
Now you see that planned hours and actual hours are both attributes of DAILY
WORK. The message of this example, therefore, is not to assume that planned and
actual data have the same level of detail, but do not be surprised if they do.
Cardinality
Some data modeling languages are quite precise in specifying the cardinality, or
multiplicity, of a relationship. Thus, in a 1:m relationship, you might see additional
information on the diagram. For example, the “many” end of the relationship might
contain notation (such as 0,n) to indicate zero or more instances of the entity at the
“many” end of the relationship.
Cardinality and modality options
Cardinality  Modality   Meaning
0,1          Optional   There can be zero or one instance of the entity relative to the other entity.
0,n          Optional   There can be zero or many instances of the entity relative to the other entity.
1,1          Mandatory  There is exactly one instance of the entity relative to the other entity.
1,n          Mandatory  The entity must have at least one and can have many instances relative to the other entity.
The data modeling method of this text, as you now realize, has taken a broad
approach to cardinality (is it 1:1 or 1:m?). We have not been concerned with greater
precision (is it 0,1 or 1,1?), because the focus has been on acquiring basic data
modeling skills. Once you know how to model, you can add more detail. When
learning to model, too much detail and too much terminology can get in the way of
mastering this difficult skill. Also, it is far easier to sketch models on a whiteboard
or paper when you have less to draw. Once you switch to a data modeling tool, then
adding cardinality precision is easier and appropriate.
Modality
Modality, also known as optionality, specifies whether an instance of an entity must
participate in a relationship. Cardinality indicates the range of instances of an entity
participating in a relationship, while modality defines the minimum number of
instances. Cardinality and modality are linked, as shown in the table above. If an
entity is optional, the minimum cardinality will be 0, and if mandatory, the
minimum cardinality is 1.
By asking the question, “Does an occurrence of this entity require an occurrence of
the other entity?” you can assess whether an entity is mandatory. The usual practice
is to use “O” for optional and place it near the entity that is optional. Mandatory
relationships are often indicated by a short line (or bar) at right angles to the
relationship between the entities.
In the following figure, it is mandatory for an instance of STOCK to have an
instance of NATION, and optional for an instance of NATION to have an instance of
STOCK, which means we can record details of a nation for which stocks have not
yet been purchased. During conversion of the data model to a relational database,
the mandatory requirement (i.e., each stock must have a nation) is enforced by the
foreign key constraint.
1:m relationship showing modality
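In SQL terms, the mandatory side can be enforced by declaring the foreign key NOT NULL, while the optional side needs no extra constraint. A sketch in the style of the investment database used earlier (the column names here are abbreviated and illustrative):
CREATE TABLE nation (
natcode CHAR(3),
natname VARCHAR(20),
PRIMARY KEY(natcode));
CREATE TABLE stock (
stkcode CHAR(3),
stkfirm VARCHAR(20),
natcode CHAR(3) NOT NULL, -- every stock must have a nation
PRIMARY KEY(stkcode),
CONSTRAINT fk_stock_nation FOREIGN KEY(natcode)
REFERENCES nation(natcode));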
Now consider an example of an m:m relationship. A SALE can have many
LINEITEMs, and a LINEITEM has only one SALE. An instance of LINEITEM must
be related to an instance of SALE, otherwise it can’t exist. Similarly, it is mandatory
for an instance of SALE to have an instance of LINEITEM, because it makes no
sense to have a sale without any items being sold.
In a relational database, the foreign key constraint handles the mandatory
relationship between instances of LINEITEM and SALE. The mandatory
requirement between SALE and LINEITEM has to be built into the processing logic
of the application that processes a sales transaction. Basic business logic requires
that a sale include some items, so a sales transaction creates one instance of SALE
and multiple instances of LINEITEM. It does not make sense to have a sale that does
not sell something.
An m:m relationship showing modality
We rely again on the British monarchy. This time we use it to illustrate modality
with a 1:1 recursive relationship. As the following figure shows, it is optional for a
monarch to have a successor. We are not talking about the end of a monarchy, but
rather how we address the data modeling issue of not knowing who succeeds the
current monarch. Similarly, there must be a monarch who did not succeed someone
(i.e., the first king or queen). Thus, both ends of the succession relationship have
optional modality.
1:1 recursive with modality
Finally, in our exploration of modality, we examine the m:m recursive. The model
indicates that it is optional for a product to have components and optional for a
product to be a component in other products. However, every assembly must have
associated products.
m:m recursive with modality
Summary
Modality adds additional information to a data model, and once you have grasped
the fundamental ideas of entities and relationships, it is quite straightforward to
consider whether a relationship is optional or mandatory. When a relationship is
mandatory, you need to consider what constraint you can add to a table’s definition
to enforce the relationship. As we have seen, in some cases the foreign key
constraint handles mandatory relationships. In other cases, the logic for enforcing a
requirement will be built into the application.
Entity types
A data model contains different kinds of entities, which are distinguished by the
format of their identifiers. Labeling each of these entities by type will help you to
determine what questions to ask.
Independent entity
An independent entity is often central to a data model and foremost in the client’s
mind. Independent entities are frequently the starting points of a data model.
Remember that in the investment database, the independent entities are STOCK and
NATION. Independent entities typically have clearly distinguishable names because
they occur so frequently in the client’s world. In addition, they usually have a single
identifier, such as stock code or nation code.
Independent entities are often connected to other independent entities in a 1:m or
m:m relationship. The data model in the following figure shows two independent
entities linked in a 1:m relationship.
NATION and STOCK—independent entities
Associative entity
Associative entities are by-products of m:m relationships. They are typically found
between independent entities. Associative entities sometimes have obvious names
because they occur in the real world. For instance, the associative entity for the m:m
relationship between DEPARTMENT and EMPLOYEE is usually called POSITION.
If the associative entity does not have a common name, the two entity names are
generally hyphenated (e.g., DEPARTMENT-EMPLOYEE). Always search for the
appropriate name, because it will improve the quality of the data model. Hyphenated
names are a last resort.
POSITION is an associative entity
Associative entities can show either the current or the historic relationship between
two entities. If an associative entity’s only identifiers are the two related entities’
identifiers, then it records the current relationship between the entities. If the
associative entity has some time measure (e.g., date or hour) as a partial identifier,
then it records the history of the relationship. Whenever you find an associative
entity, ask the client whether the history of the relationship should be recorded.
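To make the distinction concrete, here are two sketches of a position table (names and types are illustrative; foreign keys to the related tables are omitted for brevity). The first records only the current assignment; the second adds a date to the identifier and so records history:
-- Current relationship only
CREATE TABLE position_current (
empno INTEGER,
deptname VARCHAR(15),
PRIMARY KEY(empno, deptname));
-- History of the relationship: the begin date is part of the identifier
CREATE TABLE position_history (
empno INTEGER,
deptname VARCHAR(15),
begdate DATE,
PRIMARY KEY(empno, deptname, begdate));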
Creating a single, arbitrary identifier for an associative entity will change it to an
independent one. This is likely to happen if the associative entity becomes central to
an application. For example, if a personnel department does a lot of work with
POSITION, it may find it more expedient to give it a separate identifier (e.g.,
position number).
Aggregate entity
An aggregate entity is created when several different entities have similar attributes
that are distinguished by a preceding or following modifier to keep their names
unique. For example, because components of an address might occur in several
entities (e.g., CUSTOMER and SUPPLIER), an aggregate address entity can be
created to store details of all addresses. Aggregate entities usually become
independent entities. In this case, we could use address number to identify an
instance of address uniquely.
Subordinate entity
A subordinate entity stores data about an entity that can vary among instances. A
subordinate entity is useful when an entity consists of mutually exclusive classes that
have different descriptions. The farm animal database shown in the following figure
indicates that the farmer requires different data for sheep and horses. Notice that
each of the subordinate entities is identified by the related entity’s identifier. You
could avoid subordinate entities by placing all the attributes in ANIMAL, but then
you make it incumbent on the client to remember which attributes apply to horses
and which to sheep. This becomes an important issue for null fields. For example, is
the attribute hay consumption null because it does not apply to sheep, or is it null
because the value is unknown?
SHEEP and HORSE are subordinate entities
If a subordinate entity becomes important, it is likely to evolve into an independent
entity. The framing of a problem often determines whether subordinate entities are
created. Stating the problem as, “a farmer has many animals, and these animals can
be horses or sheep,” might lead to the creation of subordinate entities. Alternatively,
saying that “a farmer has many sheep and horses” might lead to setting up SHEEP
and HORSE as independent entities.
Generalization and aggregation
Generalization and aggregation are common ideas in many modeling languages,
particularly object-oriented modeling languages such as the Unified Modeling
Language (UML), which we will use in this section.
Generalization
A generalization is a relationship between a more general element and a more
specific element. In the following figure, the general element is animal, and the
specific elements are sheep and horse. A horse is a subtype of animal. A
generalization is often called an “is a” relationship because the subtype element is a
member of the generalization.
A generalization is directly mapped to the relational model with one table for each
entity. For each of the subtype entities (i.e., SHEEP and HORSE), the primary key is
that of the supertype entity (i.e., ANIMAL). You must also make this column a
foreign key so that a subtype cannot be inserted without the presence of the
matching supertype. A generalization is represented by a series of 1:1 relationships
to weak entities.
A generalization hierarchy in UML
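A sketch of this mapping for the ANIMAL hierarchy, with illustrative attributes (hay consumption for HORSE, as mentioned earlier, and an assumed wool yield for SHEEP):
CREATE TABLE animal (
animalid INTEGER,
animalname VARCHAR(20),
PRIMARY KEY(animalid));
CREATE TABLE sheep (
animalid INTEGER,
woolyield DECIMAL(6,2),
PRIMARY KEY(animalid),
CONSTRAINT fk_sheep_animal FOREIGN KEY(animalid)
REFERENCES animal(animalid));
CREATE TABLE horse (
animalid INTEGER,
hayconsumption DECIMAL(6,2),
PRIMARY KEY(animalid),
CONSTRAINT fk_horse_animal FOREIGN KEY(animalid)
REFERENCES animal(animalid));
A sheep or horse row cannot be inserted unless a matching animal row already exists, which is how the 1:1 relationships to the weak entities are enforced.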
Aggregation
Aggregation is a part-whole relationship between two entities. Common phrases
used to describe an aggregation are “consists of,” “contains,” and “is part of.” The
following figure shows an aggregation where a herd consists of many cows. Cows
can be removed or added, but it is still a herd. An open diamond next to the whole
denotes an aggregation. In data modeling terms, there is a 1:m relationship between
herd and cow.
An aggregation in UML and data modeling
There are two special types of aggregation: shared and composition. With shared
aggregation, one entity owns another entity, but other entities can own that entity as
well. A sample shared aggregation is displayed in the following figure, which
shows that a class has many students, and a student can be a member of many
classes. Students, the parts in this case, can be part of many classes. An open
diamond next to the whole denotes a shared aggregation, which we represent in data
modeling as an m:m relationship.
Shared aggregation in UML and data modeling
This fragment is typical of the models produced by novice data modelers. It can be
reduced by recognizing that PAINTING, SCULPTURE, and CERAMIC are all types
of art. A more general entity, ART, could be used to represent the same data. Of
course, it would need an additional attribute to distinguish the different types of art
recorded.
The shrunken data model
As you discover new entities, your data model will grow. As you generalize, your
data model will shrink.
Identifier
If there is no obvious simple identifier, invent one. The simplest is a meaningless
code (e.g., person number or order number). If you create an identifier, you can also
guarantee its uniqueness.
Don’t overwork an identifier. The worst case we have seen is a 22-character product
code used by a plastics company that supposedly not only uniquely identified an
item but told you its color, its manufacturing process, and type of plastic used to
make it! Color, manufacturing process, and type of plastic are all attributes. This
product code was unwieldy. Data entry error rates were extremely high, and very
few people could remember how to decipher the code.
An identifier has to do only one thing: uniquely identify every instance of the entity.
Sometimes trying to make it do double, or even triple, duty only creates more work
for the client.
Position and order
There is no ordering in a data model. Entities can appear anywhere. You will usually
find a central entity near the middle of the data model only because you tend to start
with prominent entities (e.g., starting with STUDENT when modeling a student
information system). What really matters is that you identify all relevant entities.
Attributes are in no order. You will find that you tend to list them as they are
identified. For readability, it is a good idea to list some attributes sequentially. For
example, first name, other names, and last name are usually together. This is not
necessary, but it speeds up verification of a data model’s completeness.
Instances are also assumed to have no ordering. There is no first instance, next
instance, or last instance. If you must recognize a particular order, create an attribute
to record the order. For example, if ranking of potential investment projects must be
recorded, include an attribute (projrank) to store this data. This does not mean the
instances will be stored in projrank order, but it does allow you to use the ORDER
BY clause of SQL to report the projects in rank order.
If you need to store details of a precedence relationship (i.e., there is an ordering of
instances and a need to know the successor and predecessor of any instance), then
use a 1:1 recursive relationship.
Using an attribute to record an ordering of instances
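For example, assuming a project table with projname and projrank columns, reporting projects in rank order needs only an ORDER BY clause:
SELECT projname, projrank FROM project
ORDER BY projrank;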
Attributes and consistency
Attributes must be consistent, retaining the same meaning for every instance. An
example of an inconsistent attribute would be an attribute stock info that contains a
stock’s PE ratio or its ROI (return on investment). Another attribute, say stock info
code, is required to decipher which meaning applies to any particular instance. The
code could be “1” for PE and “2” for ROI. If one attribute determines the meaning
of another, you have inconsistency. Writing SQL queries will be extremely
challenging because inconsistent data increases query complexity. It is better to
create separate attributes for PE and ROI.
Names and addresses
An attribute is the smallest piece of data that will conceivably form part of a query.
If an attribute has segments (e.g., a person’s name), determine whether these could
form part of a query. If so, then make them separate attributes and reapply the query
test. When you apply the query test to person name, you will usually decide to create
three attributes: first name, other names, and last name.
What about titles? There are two sorts of modifier titles: preceding titles (e.g., Mr.,
Mrs., Ms., and Dr.) and following titles (e.g., Jr. and III). These should be separate
attributes, especially if you have divided name into separate attributes.
Addresses seem to cause more concern than names because they are more variant.
Business addresses tend to be the most complicated, and foreign addresses add a few
more twists. The data model fragment in the following figure works in most cases.
Handling addresses
Common features of every address are city, state (province in Canada, county in
Eire), postal code (ZIP code in the United States), and nation. Some of these can be
null. For example, Singapore does not have any units corresponding to states. In the
United States, city and state can be derived from the zip code, but this is not
necessarily true for other countries. Furthermore, even if this were true for every
nation, you would need a different derivation rule for each country, which is a
complexity you might want to avoid. If there is a lifetime guarantee that every
address in the database will be for one country, then examine the possibility of
reducing redundancy by just storing post code.
Notice that the problem of multiple address lines is represented by a 1:m
relationship. An address line is a text string that appears as one line of an address.
There is often a set sequence in which these are displayed. The attribute address rank
records this order.
How do you address students? First names may be all right for class, but they are not
adequate for student records. Students typically have multiple addresses: a
home address, a school address, and maybe a summer address. The following data
model fragment shows how to handle this situation. Any instance of ADDRESS
LINE is identified by the composite of studentid, addresstype, and address rank.
Students with multiple addresses
The identifier address type is used to distinguish among the different types of
addresses. The same data model fragment works in situations where a business has
different mailing and shipping addresses.
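One possible relational mapping of this fragment is sketched below. It assumes an existing student table keyed on studentid; the other column names are illustrative.
CREATE TABLE addressline (
studentid INTEGER,
addresstype VARCHAR(10), -- e.g., 'home', 'school', 'summer'
addressrank INTEGER, -- position of this line within the address
addresstext VARCHAR(50), -- one line of the address
PRIMARY KEY(studentid, addresstype, addressrank),
CONSTRAINT fk_addressline_student
FOREIGN KEY(studentid) REFERENCES student(studentid));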
When data modeling, can you take a shortcut with names and addresses? It is time
consuming to write out all the components of name and address. It creates clutter
and does not add much fidelity. Our practical advice is to do two things. First, create
a policy for all names and addresses (e.g., all names will be stored as three parts:
first name, other names, last name). Second, use shorthand forms of attributes for
names and addresses when they obey the policy. So from now on, we will
sometimes use name and address as attributes with the understanding that when the
database is created, the parts will become separate columns.
Single-instance entities
Do not be afraid of creating an entity with a single instance. Consider the data model
fragment in the following figure which describes a single fast-food chain. Because
the data model describes only one firm, the entity FIRM will have only one instance.
The inclusion of this single instance entity permits facts about the firm to be
maintained. Furthermore, it provides flexibility for expansion. If two fast-food
chains combine, the data model needs no amendment.
FIRM—a single-instance entity
Picking words
Words are very important. They are all we have to make ourselves understood. Let
the client choose the words, because the data model belongs to the client and
describes the client’s world. If you try to impose your words on the client, you are
likely to create a misunderstanding.
Synonyms
Synonyms are words that have the same meaning. For example, task, assignment,
and project might all refer to the same real-world object. One of these terms, maybe
the one in most common use, needs to be selected for naming the entity. Clients need
to be told the official name for the entity and encouraged to adopt this term for
describing it. Alternatively, create views that enable clients to use their own term.
Synonyms are not a technical problem; the problem is a social one of getting clients to
agree on one word.
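For example, if task is adopted as the official entity name, a view can let one group keep using its own word. This is only a sketch; the table and column names are assumptions.
CREATE VIEW assignment AS
SELECT taskid, taskdesc, taskduedate
FROM task;
Queries written against assignment then behave as if the synonym were a table of its own.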
Homonyms
Homonyms are words that sound the same but have different meanings. Homonyms
can create real confusion. Sale date is a classic example. In business it can have
several meanings. To the salespeople, it might be the day that the customer shakes
hands and says, “We have a deal.” For the lawyers, it could be the day when the
contract is signed, and for production, it might be the day the customer takes
delivery. In this case, the solution is reasonably obvious; redefine sale date to be
separate terms for each area (e.g., sales date, contract date, and delivery date for
sales, legal, and production departments, respectively).
Homonyms cause real confusion when clients do not realize that they are using the
same word for different entities or attributes. Hunt down homonyms by asking lots
of questions and querying a range of clients.
Exception hunting
Go hunting for exceptions. Keep asking the client questions such as
• Is it always like this?
• Would there be any situations where this could be an m:m relationship?
• Have there ever been any exceptions?
• Are things likely to change in the future?
Always probe for exceptions and look for them in the examples the client uses.
Redesigning the data model to handle exceptions increases fidelity.
Relationship labeling
Relationship labeling clutters a data model. In most cases, labels can be correctly
inferred, so use labels only when there is a possibility of ambiguity.
Keeping the data model in shape
As you add to a data model, maintain fidelity. Do not add entities without also
completing details of the identifier and attributes. By keeping the data model well
formed, you avoid ambiguity, which frequently leads to miscommunication.
Used entities
Would you buy a used data model? Certainly, because a used data model is more
likely to have higher fidelity than a brand new one. A used data model has been
subjected to much scrutiny and revision and should be a more accurate
representation of the real world.
Meaningful identifiers
An identifier is said to be meaningful when some attributes of the entity can be
inferred from the identifier’s value (e.g., a code of B78 for an item informs a clerk
that the item is black). The word “meaningful” in everyday speech usually connotes
something desirable, but many data managers agree that meaningful identifiers or
codes create problems. While avoiding meaningful identifiers is generally accepted
as good data management practice, they are still widely used. After a short
description of some examples of meaningful identifiers, this section discusses the
pros and cons of meaningful identifiers.
The invoice number “12dec0001” is a meaningful identifier. The first five positions
indicate in which period the invoice was generated. The last four digits of the
invoice number have a meaning. The 0001 indicates it was the first invoice sent in
December. In addition, e-mail addresses, like [email protected], have a meaning.
When coding a person’s gender, some systems use the codes “m” and “f” and others
“1” and “2.” The pair (m, f) is meaningful, but (1, 2) is not and puts the onus on the
user to remember the meaning. The codes “m” and “f” are derived from the reality
that they represent. They are the first letters of the words male and female. The “1”
and “2” are not abstracted from reality, and therefore they do not have a meaning.
This observation is reiterated by the following quote from the International
Organization for Standardization (ISO) information interchange standard for the
representation of human sexes (ISO 5218):
No significance is to be placed on the fact that “Male” is coded “1” and “Female”
is coded “2.” This standard was developed based upon predominant practices of the
countries involved and does not convey any meaning of importance, ranking or any
other basis that could imply discrimination.
In determining whether an identifier is meaningful, one can look at the way new
values of that identifier are generated. If the creation takes place on the basis of
characteristics that appear in reality, the identifier is meaningful. That is why the
identifier of the invoice “12dec0002,” which is the identifier assigned to the second
invoice generated in December 2012, is meaningful. It is also why the e-mail
address [email protected] is meaningful.
Meaningful identifiers clearly have some advantages and disadvantages, and these
are now considered.
Advantages and disadvantages of meaningful identifiers
Advantages: Recognizable and rememberable; Administrative simplicity
Disadvantages: Identifier exhaustion; Reality changes; Loss of meaningfulness
Advantages of meaningful identifiers
Recognizable and rememberable
People can often recognize and remember the significance of meaningful
identifiers. If, for example, someone receives an e-mail from
[email protected], this person can probably deduce the sender’s name and
workplace. If a manufacturer, for instance, uses the last two digits of its furniture
code to indicate the color (e.g., 08 = beech), employees can then quickly determine
the color of a packed piece of furniture. Of course, the firm could also show the
color of the furniture on the label, in which case customers, who are most unlikely
to know the coding scheme, can quickly determine the color of the enclosed
product.
Administrative simplicity
By using meaningful identifiers, administration can be relatively easily
decentralized. Administering e-mail addresses, for instance, can be handled at the
domain level. This way, one can avoid the process of synchronizing with other
domains whenever a new e-mail address is created.
A second example is the EAN (European article number), the bar code on European
products. Issuing EANs can be administered per country, since each country has a
unique number in the EAN. When issuing a new number, a country need only issue a
code that is not yet in use in that country. The method of contacting all countries to
ask whether an EAN has already been issued is time consuming and open to error.
On the other hand, creating a central database of issued EAN codes is an option. If
every issuer of EAN codes can access such a database, alignment among countries
is created and duplicates are prevented.
Disadvantages of meaningful identifiers
Identifier exhaustion
Suppose a company with a six-digit item identifier decides to use the first three
digits to specify the product group (e.g., stationery) uniquely. Within this group, the
remaining three digits are used to define the item (e.g., yellow lined pad). As a
consequence, only one thousand items can be specified per product group. This
limit applies even when many numbers within a product group, or even entire
product groups, are unused.
Consider the case when product group “010” is exhausted and is supplemented by
the still unused product group code “940.” What seems to be a simple quick fix
causes problems, because a particular product group is not uniquely identified by a
single identifier. Clients have to remember there is an exception.
A real-world example of identifier exhaustion is a holding company that identifies
its subsidiaries by using a code, where the first character is the first letter of the
subsidiary’s name. When the holding company acquired its seventh subsidiary, the
meaningfulness of the coding system failed. The newly acquired subsidiary name
started with the same letter as one of the existing subsidiaries. The system had
already run out of identifiers.
When existing identifiers have to be converted because of exhaustion, printed codes
on labels, packages, shelves, and catalogs will have to be redone. Recoding is an
expensive process and should be avoided.
Exhaustion is not only a problem with meaningful identifiers. It can happen with
non-meaningful identifiers. The problem is that meaningful identifier systems tend
to exhaust sooner. To avoid identifier exhaustion and maintain meaningful
identifiers, some designers increase identifier lengths (e.g., four digits for product
group).
Reality changes
The second problem of meaningful identifiers is that the reality they record
changes. For example, a company changes its name. The College of Business at the
University of Georgia was renamed the Terry College of Business. E-mail
addresses changed (e.g., [email protected] became [email protected]). An
identifier can remain meaningful over a long period only if the meaningful part
rarely or never changes.
Non-meaningful identifiers avoid reality changes. Consider the case of most
telephone billing systems, which use an account ID to identify a customer rather
than a telephone number. This means a customer can change telephone numbers
without causing any coding problems because telephone number is not the unique
identifier of each customer.
A meaningful identifier can lose its meaningfulness
Consider the problems that can occur with the identifier that records both the color
and the material of an article. Chipboard with a beech wood veneer is coded BW. A
black product, on the other hand, is coded BL. Problems arise whenever there is an
overlap between these characteristics. What happens with a product that is finished
with a black veneer on beech wood? Should the code be BW or BL? What about a
black-painted product that is made out of beech wood? Maybe a new code, BB for
black beech wood, is required. There will always be employees and customers who
do not know the meaning of these not-so-meaningful identifiers and attribute the
wrong meaning to them.
The solution—non-meaningful identifiers
Most people are initially inclined to try to make identifiers meaningful. There is a
sense that something is lost by using a simple numeric to identify a product or
customer. Nothing, however, is lost and much is gained. Non-meaningful identifiers
serve their sole purpose well—to identify an entity uniquely. Attributes are used to
describe the characteristics of the entity (e.g., color, type of wood). A clear
distinction between the role of identifiers and attributes creates fewer data
management problems now and in the future.
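The following sketch applies this advice to the furniture example. The identifier is a meaningless number, and color and material are ordinary attributes; all names and types are assumptions.
CREATE TABLE article (
articleid INTEGER AUTO_INCREMENT, -- non-meaningful identifier
articlecolor VARCHAR(15), -- characteristics belong in attributes
articlematerial VARCHAR(15),
PRIMARY KEY(articleid));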
Vehicle identification
In most countries, every road vehicle is uniquely identified. The Vehicle
Identification Number (VIN) was originally described in ISO Standard 3779 in
February 1977 and last revised in 1983. It is designed to identify motor vehicles,
trailers, motorcycles, and mopeds. A VIN is 17 characters, A through Z and 0
through 9.
The European Union and the United States have different implementations of the
standard. The U.S. VIN is divided into four parts
• World Manufacturer's Identification (WMI) - three characters
• Vehicle Description Section (VDS) - five characters
• Check digit
• Vehicle Identification Section (VIS) - eight characters
When decoded, a VIN specifies the country and year of manufacture; make, model,
and serial number; assembly plant; and even some equipment specifications.
VINs are normally located in several locations on a car, but the most common
places are in the door frame of the front doors, on the engine itself, around the
steering wheel, or on the dash near the window.
What do you think of the VIN as a method of vehicle identification? How do you
reconcile it with the recommendation to use meaningless identifiers? If you were
consulted on a new VIN system, what would you advise?
The seven habits of highly effective data modelers
There is often a large gap between the performance of average and expert data
modelers. An insight into the characteristics that make some data modelers more
skillful than others should improve your data modeling capabilities.
Immerse
Find out what the client wants by immersing yourself in the task environment. Spend
some time following the client around and participate in daily business. Firsthand
experience of the problem will give you a greater understanding of the client’s
requirements. Observe, ask questions, reflect, and talk to a wide variety of people
(e.g., managers, operations personnel, customers, and suppliers). The more you
learn about the problem, the better equipped you are to create a high-fidelity data
model.
Challenge
Challenge existing assumptions; dig out the exceptions. Test the boundaries of the
data model. Try to think about the business problem from different perspectives
(e.g., how might the industry leader tackle this problem?). Run a brainstorming
session with the client to stimulate the search for breakthrough solutions.
Generalize
Reduce the number of entities whenever possible by using generalized structures
(remember the Art Collection model) to simplify the data model. Simpler data
models are usually easier to understand and less costly to implement. Expert data
modelers can see beyond surface differences to discern the underlying similarities
of seemingly different entities.
Test
Test the data model by reading it to yourself and several people intimately familiar
with the problem. Test both directions of every relationship (e.g., a farmer has many
cows and a cow belongs to only one farmer). Build a prototype so that the client can
experiment and learn with a concrete model. Testing is very important because it
costs very little to fix a data model but a great deal to repair a system based on an
incorrect data model.
Limit
Set reasonable limits to the time and scope of data modeling. Don’t let the data
modeling phase continue forever. Discover the boundaries early in the project and
stick to these unless there are compelling reasons to extend the project. Too many
projects are allowed to expand because it is easier to say yes than no. In the long run,
however, you do the client a disservice by promising too much and extending the
life of the project. Determine the core entities and attributes that will solve most of
the problems, and confine the data model to this core. Keep sight of the time and
budget constraints of the project.
Integrate
Step back and reflect on how your project fits with the organization’s information
architecture. Integrate with existing systems where feasible and avoid duplication.
How does your data model fit with the corporate data model and those of other
projects? Can you use part of an existing data model? A skilled data modeler has to
see both the fine-grained detail of a project data model and the big picture of the
corporate data resource.
Complete
Good data modelers don’t leave data models ill-defined. All entities, attributes, and
relationships are carefully defined, ambiguities are resolved, and exceptions are
handled. Because the full value of a data model is realized only when the system is
complete, the data modeler should stay involved with the project until the system is
implemented. The data modeler, who generally gets involved in the project from its
earliest days, can provide continuity through the various phases and ensure the
system solves the problem.
Summary
Data modeling is both a technique for modeling data and its relationships and a
graphical representation of a database. It communicates a database’s design. The
goal is to identify the facts that must be stored in a database. Building a data model
is a partnership between a client, a representative of the eventual owners of the
database, and a designer. The building blocks are an entity, attribute, identifier, and
relationship. A well-formed data model, which means the construction rules have
been obeyed, clearly communicates information to the client. A high-fidelity data
model faithfully describes the world it represents.
A data model is an evolving representation. Each change should be an incremental
improvement in quality. The quality of a data model can be determined only by
understanding the context in which it will be used. A data model can model
historical data relationships just as readily as current ones. Cardinality specifies the
precise multiplicity of a relationship. Modality indicates whether a relationship is
optional or mandatory. The five different types of entities are independent,
dependent, associative, aggregate, and subordinate. Expect a data model to expand
and contract. A data model has no ordering. Introduce an attribute if ordering is
required. An attribute must have the same meaning for every instance. An attribute is
the smallest piece of data that will conceivably form part of a query. Synonyms are
words that have the same meaning; homonyms are words that sound the same but
have different meanings. Identifiers should generally have no meaning. Highly
effective data modelers immerse, challenge, generalize, test, limit, integrate, and
complete.
Key terms and concepts
Aggregate entity
Aggregation
Associative entity
Attribute
Cardinality
Composite aggregation
Data model
Data model quality
Dependent entity
Determinant
Domain
Entity
Generalization
High-fidelity image
Homonym
Identifier
Independent entity
Instance
Mandatory
Modality
Modeling
Optional
Relationship
Relationship descriptor
Relationship label
Shared aggregation
Synonym
References and additional readings
Carlis, J. V. 1991. Logical data structures. Minneapolis, MN: University of
Minnesota.
Hammer, M. 1990. Reengineering work: Don’t automate, obliterate. Harvard
Business Review 68 (4):104–112.
Exercises
Short answers
1. What is data modeling?
2. What is a useful technique for identifying entities in a written description of a
data modeling problem?
3. When do you label relationships?
4. When is a data model well formed, and when is it high-fidelity?
5. How do you handle exceptions when data modeling?
6. Describe the different types of entities.
7. Why might a data model grow?
8. Why might a data model contract?
9. How do you indicate ordering of instances in a data model?
10. What is the difference between a synonym and a homonym?
Data modeling
Create a data model from the following narratives, which are sometimes
intentionally incomplete. You will have to make some assumptions. Make certain
you state these alongside your data model. Define the identifier(s) and attributes of
each entity.
1. The president of a book wholesaler has told you that she wants information
about publishers, authors, and books.
2. A university has many subject areas (e.g., MIS, Romance languages). Professors
teach in only one subject area, but the same subject area can have many
professors. Professors can teach many different courses in their subject area. An
offering of a course (e.g., Data Management 457, French 101) is taught by only
one professor at a particular time.
3. Kids’n’Vans retails minivans for a number of manufacturers. Each manufacturer
offers several models of its minivan (e.g., SE, LE, GT). Each model comes with a
standard set of equipment (e.g., the Acme SE comes with wheels, seats, and an
engine). Minivans can have a variety of additional equipment or accessories
(radio, air-conditioning, automatic transmission, airbag, etc.), but not all
accessories are available for all minivans (e.g., not all manufacturers offer a
driver’s side airbag). Some sets of accessories are sold as packages (e.g., the
luxury package might include stereo, six speakers, cocktail bar, and twin
overhead foxtails).
4. Steve operates a cinema chain and has given you the following information:
“I have many cinemas. Each cinema can have multiple theaters. Movies are shown
throughout the day starting at 11 a.m. and finishing at 1 a.m. Each movie is given
a two-hour time slot. We never show a movie in more than one theater at a time,
but we do shift movies among theaters because seating capacity varies. I am
interested in knowing how many people, classified by adults and children,
attended each showing of a movie. I vary ticket prices by movie and time slot. For
instance, Lassie Get Lost is 50 cents for everyone at 11 a.m. but is 75 cents at 11
p.m.”
5. A university gymnastics team can have as many as 10 gymnasts. The team
competes many times during the season. A meet can have one or more opponents
and consists of four events: vault, uneven bars, beam, and floor routine. A
gymnast can participate in all or some of these events though the team is limited
to five participants in any event.
6. A famous Greek shipping magnate, Stell, owns many container ships. Containers
are collected at one port and delivered to another port. Customers pay a
negotiated fee for the delivery of each container. Each ship has a sailing schedule
that lists the ports the ship will visit over the next six months. The schedule shows
the expected arrival and departure dates. The daily charge for use of each port is
also recorded.
7. A medical center employs several physicians. A physician can see many patients,
and a patient can be seen by many physicians, though not always on the one visit.
On any particular visit, a patient may be diagnosed to have one or more illnesses.
8. A telephone company offers a 10 percent discount to any customer who phones
another person who is also a customer of the company. To be eligible for the
discount, the pairing of the two phone numbers must be registered with the
telephone company. Furthermore, for billing purposes, the company records both
phone numbers, start time, end time, and date of call.
9. Global Trading (GT), Inc. is a conglomerate. It buys and sells businesses
frequently and has difficulty keeping track of what strategic business units (SBUs)
it owns, in what nations it operates, and what markets it serves. For example, the
CEO was recently surprised to find that GT owns 25 percent of Dundee’s Wild
Adventures, headquartered in Zaire, that has subsidiaries operating tours of
Australia, Zaire, and New York. You have been commissioned to design a
database to keep track of GT’s businesses. The CEO has provided you with the
following information:
SBUs are headquartered in one country, not necessarily the United States. Each
SBU has subsidiaries or foreign agents, depending on local legal requirements,
in a number of countries. Each subsidiary or foreign agent operates in only one
country but can operate in more than one market. GT uses the standard industrial
code (SIC) to identify a market (e.g., newspaper publishing). The SIC is a unique
four-digit code.
While foreign agents operate as separate legal entities, GT needs to know in what
countries and markets they operate. On the other hand, subsidiaries are fully or
partly owned by GT, and it is important for GT to know who are the other owners
of any subsidiary and what percentage of the subsidiary they own. It is not unusual
for a corporation to have shares in several of GT’s subsidiary companies and for
several corporations to own a portion of a subsidiary. Multiple ownership can
also occur at the SBU level.
10. A real estate investment company owns many shopping malls. Each mall
contains many shops. To encourage rental of its shops, the company gives a
negotiated discount to retailers who have shops in more than one mall. Each shop
generates an income stream that can vary from month to month because rental is
based on a flat rental charge and a negotiated percentage of sales revenue. Also,
each shop has monthly expenses for scheduled and unscheduled maintenance. The
company uses the data to compute its monthly net income per square meter for
each shop and for ad hoc querying.
11. Draw a data model for the following table taken from a magazine that evaluates
consumer goods. The reports follow a standard fashion of listing a brand and
model, price, overall score, and then an evaluation of a series of attributes, which
can vary with the product. For example, the sample table evaluates stereo systems.
A table for evaluating microwave ovens would have a similar layout, but different
features would be reported (e.g., cooking quality).
Brand and model  Price  Overall score  Sound quality  Taping quality  FM tuning  CD handling  Ease of use
Phillips SC-AK103  140  62  Very good  Good  Very good  Excellent  Fair
Panasonic MC-50  215  55  Good  Good  Very good  Very good  Good
Rio G300  165  38  Good  Good  Fair  Very good  Poor
12. Draw a data model for the following freight table taken from a mail order
catalog.
Merchandise Regular delivery Rush delivery Express delivery
subtotals 7–10 days 4–5 business days 1–2 business days
Up to $30.00 $4.95 $9.95 $12.45
$30.01–$65.00 $6.95 $11.95 $15.45
$65.01–$125.00 $8.95 $13.95 $20.45
$125.01+ $9.95 $15.95 $25.45
Reference 1: Basic Structures
Few things are harder to put up with than the annoyance of a good example.
Mark Twain, Pudd’nhead Wilson, 1894
Every data model is composed of the same basic structures. This is a major
advantage because you can focus on a small part of a full data model without being
concerned about the rest of it. As a result, translation to a relational database is very
easy because you systematically translate each basic structure. This section
describes each of the basic structures and shows how they are mapped to a relational
database. Because the mapping is shown as a diagram and SQL CREATE statements,
you will use this section frequently.
One entity
No relationships
The unrelated entity was introduced in Chapter 3. This is simply a flat file, and the
mapping is very simple. Although it is unlikely that you will have a data model with
a single entity, the single entity is covered for completeness.
It is possible to have more than one 1:m recursive relationship. For example, details
of a mother-child relationship would be represented in the same manner and result
in the data model having a second 1:m recursive relationship. The mapping to the
relational model would result in an additional column to contain a foreign key
mother, the personid of a person’s mother.
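A sketch of a person table with such a foreign key (the other columns of the original model are omitted, and names are assumptions):
CREATE TABLE person (
personid INTEGER,
persname VARCHAR(30),
mother INTEGER, -- personid of this person's mother
PRIMARY KEY(personid),
CONSTRAINT fk_person_mother
FOREIGN KEY(mother) REFERENCES person(personid));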
A single entity can have multiple m:m recursive relationships. Relationships such as
enmity (not enemyship) and siblinghood are m:m recursive on person. The approach
to recording these relationships is the same as that outlined previously.
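One way to map an m:m recursive relationship, continuing the assumed person table from the preceding sketch, is a separate table whose composite key pairs two personid values.
CREATE TABLE sibling (
personid1 INTEGER,
personid2 INTEGER,
PRIMARY KEY(personid1, personid2),
CONSTRAINT fk_sibling_person1
FOREIGN KEY(personid1) REFERENCES person(personid),
CONSTRAINT fk_sibling_person2
FOREIGN KEY(personid2) REFERENCES person(personid));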
Two entities
No relationship
When there is no relationship between two entities, the client has decided there is no
need to record a relationship between the two entities. When you are reading the
data model with the client, be sure that you check whether this assumption is correct
both now and for the foreseeable future. When there is no relationship between two
entities, map them each as you would a single entity with no relationships.
A 1:1 relationship
A 1:1 relationship sometimes occurs in parallel with a 1:m relationship between two
entities. It signifies some instances of an entity that have an additional role. For
example, a department has many employees (the 1:m relationship), and a department
has one boss (the 1:1). The data model fragment shown in the following figure
represents the 1:1 relationship.
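One way to map this fragment is sketched below, with assumed names. The 1:m relationship places deptname in employee, and the 1:1 boss relationship places the boss's empid in department; the boss constraint is added after employee exists to avoid a circular reference.
CREATE TABLE department (
deptname VARCHAR(15),
bossid INTEGER, -- empid of the department's boss (the 1:1 relationship)
PRIMARY KEY(deptname));
CREATE TABLE employee (
empid INTEGER,
deptname VARCHAR(15), -- the 1:m relationship: an employee belongs to one department
PRIMARY KEY(empid),
CONSTRAINT fk_employee_department
FOREIGN KEY(deptname) REFERENCES department(deptname));
ALTER TABLE department ADD CONSTRAINT fk_department_boss
FOREIGN KEY(bossid) REFERENCES employee(empid);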
A 1:m relationship
The 1:m relationship is possibly the easiest to understand and map. The mapping to
the relational model is very simple. The primary key of the “one” end becomes a
foreign key in the “many” end.
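A sketch of the rule with two assumed tables: a farmer has many cows, so farmerid, the primary key of farmer, becomes a foreign key in cow.
CREATE TABLE farmer (
farmerid INTEGER,
PRIMARY KEY(farmerid));
CREATE TABLE cow (
cowid INTEGER,
farmerid INTEGER, -- primary key of the "one" end stored as a foreign key
PRIMARY KEY(cowid),
CONSTRAINT fk_cow_farmer
FOREIGN KEY(farmerid) REFERENCES farmer(farmerid));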
An m:m relationship
An m:m relationship is transformed into two 1:m relationships. The mapping is then
a twofold application of the 1:m rule.
The book and borrower tables must be created first because copy contains foreign key
constraints that refer to book and borrower. The column borrowerid can be null because a
book need not be borrowed; if it’s sitting on the shelf, there is no borrower.
CREATE TABLE book (
callno VARCHAR(12),
isbn … ,
booktitle … ,
…
PRIMARY KEY(callno));
CREATE TABLE borrower (
borrowerid INTEGER,
…
PRIMARY KEY(borrowerid));
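The statement for copy is not reproduced above. A sketch of what it might look like follows; it assumes each copy has its own identifier, copyid, which is not given in the text.
CREATE TABLE copy (
copyid INTEGER, -- assumed identifier for an individual copy
callno VARCHAR(12), -- the book of which this is a copy
borrowerid INTEGER, -- null when the copy sits on the shelf
PRIMARY KEY(copyid),
CONSTRAINT fk_copy_book
FOREIGN KEY(callno) REFERENCES book(callno),
CONSTRAINT fk_copy_borrower
FOREIGN KEY(borrowerid) REFERENCES borrower(borrowerid));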
An associative entity
In the following figure, observe that cityname and firmname are both part of the primary
key (signified by the plus near the crow’s foot) and foreign keys (because of the two
1:m relationships) of store.
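A sketch of the mapping of store, assuming city and firm tables keyed on cityname and firmname; other columns are omitted.
CREATE TABLE store (
cityname VARCHAR(20),
firmname VARCHAR(15),
PRIMARY KEY(cityname, firmname), -- composite primary key
CONSTRAINT fk_store_city
FOREIGN KEY(cityname) REFERENCES city(cityname),
CONSTRAINT fk_store_firm
FOREIGN KEY(firmname) REFERENCES firm(firmname));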
A tree structure
The interesting feature of the following figure is the primary key. Notice that the
primary key of a lower level of the tree is a composite of its partial identifier and
the primary key of the immediate higher level. The primary key of department is a
composite of deptname, divname, and firmname. Novice modelers often forget to make this
translation.
CREATE TABLE firm (
firmname VARCHAR(15),
…
PRIMARY KEY(firmname));
CREATE TABLE division (
divname VARCHAR(15),
…
firmname VARCHAR(15),
PRIMARY KEY(firmname,divname),
CONSTRAINT fk_division_firm
FOREIGN KEY(firmname) REFERENCES firm(firmname));
CREATE TABLE department (
deptname VARCHAR(15),
…
divname VARCHAR(15),
firmname VARCHAR(15),
PRIMARY KEY(firmname,divname,deptname),
CONSTRAINT fk_department_division
FOREIGN KEY (firmname,divname)
REFERENCES division(firmname,divname));
CREATE TABLE section (
sectionid VARCHAR(15),
…
divname VARCHAR(15),
firmname VARCHAR(15),
deptname VARCHAR(15),
PRIMARY KEY(firmname,divname,deptname,sectionid),
CONSTRAINT fk_section_department
FOREIGN KEY (firmname,divname,deptname)
REFERENCES department(firmname,divname,deptname));
q. Under what circumstances might you choose a fixed tree over a 1:m recursive
data model?
8. Normalization and Other Data Modeling Methods
There are many paths to the top of the mountain, but the view is always the same.
Chinese Proverb
Learning objectives
Students completing this chapter will
✤ understand the process of normalization;
✤ be able to distinguish between different normal forms;
✤ recognize that different data modeling approaches, while they differ in
representation methods, are essentially identical.
Introduction
There are often many ways to solve a problem, and different methods frequently
produce essentially the same solution. When this is the case, the goal is to find the
most efficient path. We believe that data modeling, as you have learned in the
preceding chapters, is an efficient and easily learned approach to database design.
There are, however, other paths to consider. One of these methods, normalization,
was the initial approach to database design. It was developed as part of the
theoretical foundation of the relational data model. It is useful to understand the key
concepts of normalization because they advance your understanding of the
relational model. Experience and research indicate, however, that normalization is
more difficult to master than data modeling.
Data modeling first emerged as entity-relationship (E-R) modeling in a paper by
Peter Pin-Shan Chen. He introduced two key database design concepts:
• Identify entities and the relationships between them.
• A graphical representation improves design communication.
Chen’s core concepts spawned many species of data modeling. To give you an
appreciation of the variation in data modeling approaches, we briefly review, later
in this chapter, Chen’s E-R approach and IDEF1X, an approach used by the
Department of Defense.
Normalization
Normalization is a method for increasing the quality of database design. It is also a
theoretical base for defining the properties of relations. The theory gradually
developed to create an understanding of the desirable properties of a relation. The
goal of normalization is identical to that of data modeling—a high-fidelity design.
The need for normalization seems to have arisen from the conversion of prior file
systems into database format. Often, analysts started with the old file design and
used normalization to design the new database. Now, designers are more likely to
start with a clean slate and use data modeling.
Normal forms can be arrived at in several ways. The recommended approach is data
modeling, as experience strongly indicates that people find it an easier path to
high-quality database design. If the principles of data modeling are followed faithfully,
then the outcome should be a high-fidelity model and a normalized database. In
other words, if you model data correctly, you create a normalized design.
Nevertheless, modeling mistakes can occur, and normalization is a useful
crosscheck for ensuring the soundness of a data model. Normalization also
provides a theoretical underpinning to data modeling.
Normalization gradually converts a file design into normal form by the successive
application of rules to move the design from first to fifth normal form. Before we
look at these steps, it is useful to learn about functional dependency.
Functional dependency
A functional dependency is a relationship between attributes in an entity. It simply
means that one or more attributes determine the value of another. For example,
given a stock’s code, you can determine its current PE ratio. In other words, PE
ratio is functionally dependent on stock code. In addition, stock name, stock price,
stock quantity, and stock dividend are functionally dependent on stock code. The
notation for indicating that stock code functionally determines stock name is
stock code → stock name
An identifier functionally determines all the attributes in an entity. That is, if we
know the value of stock code, then we can determine the value of stock name, stock
price, and so on.
Formulae, such as yield = stock dividend/stock price*100, are a form of functional
dependency. In this case, we have
(stock dividend, stock price) → yield
This is an example of full functional dependency because yield can be determined
only from both attributes.
An attribute, or set of attributes, that fully functionally determines another attribute
is called a determinant. Thus, stock code is a determinant because it fully
functionally determines stock PE. An identifier, usually called a key when
discussing normalization, is a determinant. Unlike a key, a determinant need not be
unique. For example, a university could have a simple fee structure where
undergraduate courses are $5000 and graduate courses are $7500. Thus, course type
→ fee. Since there are many undergraduate and graduate courses, course type is not
unique for all records.
There are situations where a given value determines multiple values. This
multidetermination property is denoted as A → → B and reads “A multidetermines
B.” For instance, a department multidetermines a course. If you know the
department, you can determine the set of courses it offers. Thus, a multivalued
dependency exists when an attribute determines a set of values rather than a single value.
Functional dependency is a property of a relation’s underlying meaning, not of the
particular rows it happens to contain. We cannot determine functional dependency
from the names of attributes or from their current values. At best, examination of a
relation’s data can show that a functional dependency does not exist; it is by
understanding the relationships between data elements that we determine functional
dependency.
Functional dependency is a theoretical avenue for understanding relationships
between attributes. If we have two attributes, say A and B, then three relations are
possible, as shown in the following table.
Functional dependencies of two attributes
Relationship Functional dependency Relationship type
They determine each other A → B and B → A 1:1
One determines the other A → B 1:m
They do not determine each other A not→ B and B not→A m:m
One-to-one attribute relationship
Consider two attributes that determine each other (A → B and B → A), for instance a
country’s code and its name. Using the example of Switzerland, there is a one-to-
one (1:1) relationship between CH and Switzerland: CH → Switzerland and
Switzerland → CH. When two attributes have a 1:1 relationship, they must occur
together in at least one table in a database so that their equivalence is a recorded fact.
One-to-many attribute relationship
Examine the situation where one attribute determines another (i.e., A → B), but the
reverse is not true (i.e., B not→ A), as is the case with country name and its currency
unit. If you know a country’s name, you can determine its currency, but if you know
the currency unit (e.g., the euro), you cannot always determine the country (e.g.,
both Italy and Portugal use the euro). As a result, if A and B occur in the same table,
then A must be the key. In our example, country name would be the key, and
currency unit would be a nonkey column.
Many-to-many attribute relationship
The final case to investigate is when neither attribute determines the other (i.e., A
not→ B and B not→A). The relationship between country name and language is
many-to-many (m:m). For example, Belgium has two languages (French and
Flemish), and French is spoken in many countries. To record the m:m relationship
between these attributes, a table containing both attributes as a composite key is
required. This is essentially the associative entity created during data modeling
when there is an m:m relationship between entities.
As you can understand from the preceding discussion, functional dependency is an
explicit form of presenting some of the ideas you gained implicitly in earlier
chapters on data modeling. The next step is to delve into another theoretical concept
underlying the relational model: normal forms.
Normal forms
Normal forms describe a classification of relations. Initial work by Codd identified
first (1NF), second (2NF), and third (3NF) normal forms. Later researchers added
Boyce-Codd (BCNF), fourth (4NF), and fifth (5NF) normal forms. Normal forms
are nested like a set of Russian dolls, with the innermost doll, 1NF, contained within
all other normal forms. The hierarchy of normal forms is 5NF, 4NF, BCNF, 3NF,
2NF, and 1NF. Thus, 5NF is the outermost doll.
A new normal form, domain-key normal form (DK/NF), appeared in 1981. When a
relation is in DK/NF, there are no modification anomalies. Conversely, any relation
that is free of anomalies must be in DK/NF. The difficult part is discovering how to
convert a relation to DK/NF.
First normal form
A relation is in first normal form if and only if all columns are single-valued. In
other words, 1NF states that all occurrences of a row must have the same number of
columns. In data modeling terms, this means that an attribute must have a single
value. An attribute that can have multiple values must be represented as a one-to-
many (1:m) relationship. At a minimum, a data model will be in 1NF because all
attributes of an entity are required to be single-valued.
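As an illustration, a design that stores several phone numbers in one column of an employee row violates 1NF; the repeating value becomes a table related 1:m to employee. A sketch, with assumed names:
CREATE TABLE employee (
empid INTEGER,
PRIMARY KEY(empid));
CREATE TABLE empphone (
empid INTEGER,
phone VARCHAR(15), -- one row per phone number
PRIMARY KEY(empid, phone),
CONSTRAINT fk_empphone_employee
FOREIGN KEY(empid) REFERENCES employee(empid));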
Second normal form
Second normal form is violated when a nonkey column is dependent on only a
component of the primary key. This can also be stated as a relation is in second
normal form if and only if it is in first normal form, and all nonkey columns are
dependent on the key.
Consider the following table. The primary key of order is a composite of itemno and
customerid. The problem is that customer-credit is a fact about customerid (part of the
composite key) rather than the full key (itemno+customerid), or in other words, it is not
fully functionally dependent on the primary key. An insert anomaly arises when you
try to add a new customer to order. You cannot add a customer until that person
places an order, because until then you have no value for item number, and part of
the primary key will be null. Clearly, this is neither an acceptable business practice
nor an acceptable data management procedure.
Analyzing this problem, you realize an item can be in many orders, and a customer
can order many items—an m:m relationship. By drawing the data model in the
following figure, you realize that customer-credit is an attribute of CUSTOMER,
and you get the correct relational mapping.
Second normal form violation
order
itemno customerid quantity customer-credit
12 57 25 OK
34 679 3 POOR
Resolving second normal form violation
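A sketch of the relational mapping of the corrected model: customer-credit becomes a column of a customer table, and the m:m between customer and item is resolved by an associative table. Names beyond those in the example are assumptions.
CREATE TABLE customer (
customerid INTEGER,
customercredit VARCHAR(4), -- a fact about the customer alone
PRIMARY KEY(customerid));
CREATE TABLE item (
itemno INTEGER,
PRIMARY KEY(itemno));
CREATE TABLE orderline (
itemno INTEGER,
customerid INTEGER,
quantity INTEGER, -- a fact about the full key
PRIMARY KEY(itemno, customerid),
CONSTRAINT fk_orderline_item
FOREIGN KEY(itemno) REFERENCES item(itemno),
CONSTRAINT fk_orderline_customer
FOREIGN KEY(customerid) REFERENCES customer(customerid));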
The relational mapping of the three-way relationships results in four tables, with the
associative entity ASSIGNMENT mapping shown in the following table.
Relation table assignment
assignment
consultid firmid skilldesc
Tan IBM Database
Tan Apple Data comm
The table is in 5NF because the combination of all three columns is required to
identify which consultants supply which firms with which skills. For example, we
see that Tan advises IBM on database and Apple on data communications. The data
in this table cannot be reconstructed from other tables.
The preceding table is not in 5NF if there is a rule of the following form: If a
consultant has a certain skill (e.g., database) and has a contract with a firm that
requires that skill (e.g., IBM), then the consultant advises that firm on that skill (i.e.,
he advises IBM on database). Notice that this rule means we can infer a relationship,
and we no longer need the combination of the three columns. As a result, we break
the single three-entity m:m relationship into three two-entity m:m relationships.
Revised data model after the introduction of the rule
developer, Codd believed that a sound theoretical model would solve most practical
problems that could potentially arise.
In another article, Codd expounds the case for adopting the relational over earlier
database models. There is a threefold thrust to his argument. First, earlier models
force the programmer to code at a low level of structural detail. As a result,
application programs are more complex and take longer to write and debug.
Second, no commands are provided for processing multiple records at one time.
Earlier models do not provide the set-processing capability of the relational model.
The set-processing feature means that queries can be more concisely expressed.
Third, the relational model, through a query language such as structured query
language (SQL), recognizes the clients’ need to make ad hoc queries. The relational
model and SQL permit an IS department to respond rapidly to unanticipated
requests. It can also mean that analysts can write their own queries. Thus, Codd’s
assertion that the relational model is a practical tool for increasing the productivity
of IS departments is well founded.
The productivity increase arises from three of the objectives that drove Codd’s
research. The first was to provide a clearly delineated boundary between the logical
and physical aspects of database management. Programming should be divorced
from considerations of the physical representation of data. Codd labels this the data
independence objective. The second objective was to create a simple model that is
readily understood by a wide range of analysts and programmers. This
communicability objective promotes effective and efficient communication
between clients and IS personnel. The third objective was to increase processing
capabilities from record-at-a-time to multiple-records-at-a-time—the set-
processing objective. Achievement of these objectives means fewer lines of code
are required to write an application program, and there is less ambiguity in client-
analyst communication.
The relational model has three major components:
• Data structures
• Integrity rules
• Operators used to retrieve, derive, or modify data
Data structures
Like most theories, the relational model is based on some key structures or
concepts. We need to understand these in order to understand the theory.
Domain
A domain is a set of values all of the same data type. For example, the domain of
nation name is the set of all possible nation names. The domain of all stock prices is
the set of all currency values in, say, the range $0 to $10,000,000. You can think of a
domain as all the legal values of an attribute.
In specifying a domain, you need to think about the smallest unit of data for an
attribute defined on that domain. In Chapter 8, we discussed how a candidate
attribute should be examined to see whether it should be segmented (e.g., we divide
name into first name, other name, and last name, and maybe more). While it is
unlikely that name will be a domain, it is likely that there will be a domain for first
name, last name, and so on. Thus, a domain contains values that are in their atomic
state; they cannot be decomposed further.
The practical value of a domain is to define what comparisons are permissible.
Only attributes drawn from the same domain should be compared; otherwise it is a
bit like comparing bananas and strawberries. For example, it makes no sense to
compare a stock’s PE ratio to its price. They do not measure the same thing; they
belong to different domains. Although the domain concept is useful, it is rarely
supported by relational model implementations.
Relations
A relation is a table of n columns (or attributes) and m rows (or tuples). Each
column has a unique name, and all the values in a column are drawn from the same
domain. Each row of the relation is uniquely identified. The order of columns and
rows is immaterial.
The cardinality of a relation is its number of rows. The degree of a relation is the
number of columns. For example, the relation nation (see the following figure) is of
degree 3 and has a cardinality of 4. Because the cardinality of a relation changes
every time a row is added or deleted, you can expect cardinality to change
frequently. The degree changes if a column is added to a relation, but in terms of
relational theory, it is considered to be a new relation. So, only a relation’s
cardinality changes.
A relational database with tables nation and stock (the nation relation has columns natcode, natname, and exchrate)
Relational operator product
A        B
V  W     X  Y  Z
v1 w1    x1 y1 z1
v2 w2    x2 y2 z2
v3 w3

A TIMES B
V  W  X  Y  Z
v1 w1 x1 y1 z1
v1 w1 x2 y2 z2
v2 w2 x1 y1 z1
v2 w2 x2 y2 z2
v3 w3 x1 y1 z1
v3 w3 x2 y2 z2
Union
The union of two relations is a new relation containing all rows appearing in one or
both relations. The two relations must be union compatible, which means they have
the same column names, in the same order, and drawn on the same domains.
Duplicate rows are automatically eliminated—they must be, or the relational model
is no longer satisfied. The relational command to create the union of two tables is
tablename1 UNION tablename2
Union is illustrated in the following figure. Notice that corresponding columns in
tables A and B have the same names. While the sum of the number of rows in
relations A and B is five, the union contains four rows, because one row (x2, y2) is
common to both relations.
A        B
X  Y     X  Y
x1 y1    x2 y2
x2 y2    x4 y4
x3 y3

A UNION B
X  Y
x1 y1
x2 y2
x3 y3
x4 y4
Intersect
The intersection of two relations is a new relation containing all rows appearing in
both relations. The two relations must be union compatible. The relational
command to create the intersection of two tables is
tablename1 INTERSECT tablename2
The result of A INTERSECT B is one row, because only one row (x2, y2) is common to
both relations.
A        B
X  Y     X  Y
x1 y1    x2 y2
x2 y2    x4 y4
x3 y3

A INTERSECT B
X  Y
x2 y2
Difference
The difference between two relations is a new relation containing all rows
appearing in the first relation but not in the second. The two relations must be union
compatible. The relational command to create the difference between two tables is
tablename1 MINUS tablename2
The result of A MINUS B is two rows (see the following figure). Both of these rows
are in relation A but not in relation B. The row containing (x2, y2) appears in both A
and B and thus is not in A MINUS B.
Relational operator difference
A        B
X  Y     X  Y
x1 y1    x2 y2
x2 y2    x4 y4
x3 y3

A MINUS B
X  Y
x1 y1
x3 y3
Join
Join creates a new relation from two relations for all combinations of rows
satisfying the join condition. The general format of join is
tablename1 JOIN tablename2 WHERE tablename1.columnname1 theta
tablename2.columnname2
where theta can be =, <>, >, >=, <, or <=.
The following figure illustrates A JOIN B WHERE W = Z, which is an equijoin because
theta is an equals sign. Tables A and B are matched when values in columns W and Z
in each relation are equal. The matching columns should be drawn from the same
domain. You can also think of join as a product followed by restrict on the resulting
relation. So the join can be written
(A TIMES B) WHERE W theta Z
Relational operator join
A          B
V  W       X  Y  Z
v1 wz1     x1 y1 wz1
v2 wz2     x2 y2 wz3
v3 wz3

A EQUIJOIN B
V  W    X  Y  Z
v1 wz1  x1 y1 wz1
v3 wz3  x2 y2 wz3
Divide
Divide is the hardest relational operator to understand. Divide requires that A and B
have a set of attributes, in this case Y, that are common to both relations.
Conceptually, A divided by B asks the question, “Is there a value in the X column of
A (e.g., x1) that has a value in the Y column of A for every value of y in the Y
column of B?” Look first at B, where the Y column has values y1 and y2. When you
examine the X column of A, you find there are rows (x1, y1) and (x1, y2). That is, for
x1 there is a value in the Y column of A for every value of y in the Y column of B.
Thus, the result of the division is a new relation with a single row and column
containing the value x1.
A        B
X  Y     Y
x1 y1    y1
x1 y3    y2
x1 y2
x2 y1
x3 y3

A DIVIDE B
X
x1
Querying with relational algebra
A few queries will give you a taste of how you might use relational algebra. To
assist in understanding these queries, each answer is expressed first in relational
algebra and then in SQL.
• List all data in SHARE.
A very simple report of all columns in the table share.
share
SELECT * FROM share
• Report a firm’s name and price-earnings ratio.
A projection of two columns from share.
share [shrfirm, shrpe]
SELECT shrfirm, shrpe FROM share
• Get all shares with a price-earnings ratio less than 12.
A restriction of share on column shrpe.
share WHERE shrpe < 12
SELECT * FROM share WHERE shrpe < 12
• List details of firms where the share holding is at least 100,000.
A restriction and projection combined. Notice that the restriction is expressed within
parentheses.
(share WHERE shrqty >= 100000) [shrfirm, shrprice, shrqty, shrdiv]
SELECT shrfirm, shrprice, shrqty, shrdiv FROM share
WHERE shrqty >= 100000
• Find all shares where the PE is 12 or higher and the share holding is less than
10,000.
Intersect is used with two restrictions to identify shares satisfying both conditions.
(share WHERE shrpe >= 12) INTERSECT (share WHERE shrqty < 10000)
SELECT * FROM share WHERE shrpe >= 12 AND shrqty < 10000
• Report all shares other than those with the code CS or PT.
Difference is used to subtract those shares with the specified codes. Notice the use of
union to identify shares satisfying either condition.
share MINUS ((share WHERE shrcode = 'CS') UNION (share WHERE shrcode = 'PT'))
SELECT * FROM share WHERE shrcode NOT IN ('CS', 'PT')
• Report the value of each stock holding in UK pounds.
A join of STOCK and NATION and projection of specified attributes.
(stock JOIN nation WHERE stock.natcode = nation.natcode)
[natname, stkfirm, stkprice, stkqty, exchrate, stkprice*stkqty*exchrate]
SELECT natname, stkfirm, stkprice, stkqty, exchrate, stkprice*stkqty*exchrate
FROM stock, nation
WHERE stock.natcode = nation.natcode
• Find the items that have appeared in all sales.
Divide reports the ITEMNO of items appearing in all sales, and this result is then
joined with ITEM to report the items.
((lineitem[itemno, saleno] DIVIDEBY sale[saleno]) JOIN item) [itemno, itemname]
SELECT itemno, itemname FROM item
WHERE NOT EXISTS
(SELECT * FROM sale
WHERE NOT EXISTS
(SELECT * FROM lineitem
WHERE lineitem.itemno = item.itemno
AND lineitem.saleno = sale.saleno))
Skill builder
Write relational algebra statements for the following queries:
Report a firm’s name and dividend.
Find the names of firms where the holding is greater than 10,000.
A primitive set of relational operators
The full set of eight relational operators is not required. As you have already seen,
JOIN can be defined in terms of product and restrict. Intersection and divide can
also be defined in terms of other commands. Indeed, only five operators are
required: restrict, project, product, union, and difference. These five are known as
primitives because these are the minimal set of relational operators. None of the
primitive operators can be defined in terms of the other operators. The following
table illustrates how each primitive can be expressed as an SQL command, which
implies that SQL is relationally complete.
Comparison of relational algebra primitive operators and SQL
Operator Relational algebra SQL
Restrict A WHERE condition SELECT * FROM A WHERE condition
Project A [X] SELECT X FROM A
Product A TIMES B SELECT * FROM A, B
Union A UNION B SELECT * FROM A UNION SELECT * FROM B
Difference A MINUS B SELECT * FROM A WHERE NOT EXISTS (SELECT * FROM B WHERE A.X = B.X AND A.Y = B.Y AND …)*
* Essentially where all columns of A are equal to all columns of B
A fully relational database
The three components of a relational database system are structures (domains and
relations), integrity rules (primary and foreign keys), and a manipulation language
(relational algebra). A fully relational database system provides complete support
for each of these components. Many commercial systems support SQL but do not
provide support for domains. Such systems are not fully relational but are
relationally complete.
In 1985, Codd established the 12 commandments of relational database systems (see
the following table). In addition to providing some insights into Codd’s thinking
about the management of data, these rules can also be used to judge how well a
database fits the relational model. The major impetus for these rules was uncertainty
in the marketplace about the meaning of “relational DBMS.” Codd’s rules are a
checklist for establishing the completeness of a relational DBMS. They also provide
a short summary of the major features of the relational DBMS.
Codd’s rules for a relational DBMS
Rules
The information rule
The guaranteed access rule
Systematic treatment of null values
Active online catalog of the relational model
The comprehensive data sublanguage rule
The view updating rule
High-level insert, update, and delete
Physical data independence
Logical data independence
Integrity independence
Distribution independence
The nonsubversion rule
The information rule
There is only one logical representation of data in a database. All data must appear
to be stored as values in a table.
The guaranteed access rule
Every value in a database must be addressable by specifying its table name, column
name, and the primary key of the row in which it is stored.
Systematic treatment of null values
There must be a distinct representation for unknown or inappropriate data. This
must be unique and independent of data type. The DBMS should handle null data in a
systematic way. For example, a zero or a blank cannot be used to represent a null.
This is one of the more troublesome areas because null can have several meanings
(e.g., missing or inappropriate).
Active online catalog of the relational model
There should be an online catalog that describes the relational model. Authorized
users should be able to access this catalog using the DBMS’s query language (e.g.,
SQL).
The comprehensive data sublanguage rule
There must be a relational language that supports data definition, data manipulation,
security and integrity constraints, and transaction processing operations.
Furthermore, this language must support both interactive querying and application
programming and be expressible in text format. SQL fits these requirements.
The view updating rule
The DBMS must be able to update any view that is theoretically updatable.
High-level insert, update, and delete
The system must support set-at-a-time operations. For example, multiple rows must
be updatable with a single command.
Physical data independence
Changes to storage representation or access methods will not affect application
programs. For example, application programs should remain unimpaired even if a
database is moved to a different storage device or an index is created for a table.
Logical data independence
Information-preserving changes to base tables will not affect application programs.
For instance, no applications should be affected when a new table is added to a
database.
Integrity independence
Integrity constraints should be part of a database’s definition rather than embedded
within application programs. It must be possible to change these constraints without
affecting any existing application programs.
Distribution independence
Introduction of a distributed DBMS or redistribution of existing distributed data
should have no impact on existing applications.
The nonsubversion rule
It must not be possible to use a record-at-a-time interface to subvert security or
integrity constraints. You should not be able, for example, to write a Java program
with embedded SQL commands to bypass security features.
Codd issued an additional higher-level rule, rule 0, which states that a relational
DBMS must be able to manage databases entirely through its relational capacities. In
other words, a DBMS is either totally relational or it is not relational.
Summary
The relational model developed as a result of recognized shortcomings of
hierarchical and network DBMSs. Codd created a strong theoretical base for the
relational model. Three objectives drove relational database research: data
independence, communicability, and set processing. The relational model has
data structures, integrity rules, and operators used to retrieve, derive, or modify
data. A domain is a set of values all of the same data type. The practical value of a
domain is to define what comparisons are permissible. A relation is a table of n
columns and m rows. The cardinality of a relation is its number of rows. The degree
of a relation is the number of columns.
A relational database is a collection of relations. The distinguishing feature of the
relational model is that there are no explicit linkages between tables. A relation’s
primary key is its unique identifier. When there are multiple candidates for the
primary key, one is chosen as the primary key and the remainder are known as
alternate keys. The foreign key is the way relationships are represented and can be
thought of as the glue that binds a set of tables together to form a relational
database. The purpose of the entity integrity rule is to ensure that each entity
described by a relation is identifiable in some way. The referential integrity rule
ensures that you cannot define a foreign key without first defining its matching
primary key.
The relational model includes a set of operations, known as relational algebra, for
manipulating relations. There are eight operations that can be used with either one
or two relations to create a new relation. These operations are restrict, project,
product, union, intersect, difference, join, and divide. Only five operators, known as
primitives, are required to define all eight relational operations. An SQL statement
can be translated to relational algebra and vice versa. If a retrieval language can be
used to express every relational algebra operator, it is said to be relationally
complete. A fully relational database system provides complete support for
domains, integrity rules, and a manipulation language. Codd set forth 12 rules that
can be used to judge how well a database fits the relational model. He also added
rule 0—a DBMS is either totally relational or it is not relational.
Key terms and concepts
Alternate key
Candidate key
Cardinality
Catalog
Communicability objective
Data independence
Data structures
Data sublanguage rule
Degree
Difference
Distribution independence
Divide
Domain
Entity integrity
Foreign key
Fully relational database
Guaranteed access rule
Information rule
Integrity independence
Integrity rules
Intersect
Join
Logical data independence
Nonsubversion rule
Null
Operators
Physical data independence
Primary key
Product
Project
Query-by-example (QBE)
Referential integrity
Relational algebra
Relational calculus
Relational database
Relational model
Relationally complete
Relations
Restrict
Set processing objective
Union
View updating rule
References and additional readings
Codd, E. F. 1982. Relational database: A practical foundation for productivity.
Communications of the ACM 25 (2):109–117.
Exercises
1. What reasons does Codd give for adopting the relational model?
2. What are the major components of the relational model?
3. What is a domain? Why is it useful?
4. What are the meanings of cardinality and degree?
5. What is a simple relational database?
6. What is the difference between a primary key and an alternate key?
7. Why do we need an entity integrity rule and a referential integrity rule?
8. What is the meaning of “union compatible”? Which relational operations
require union compatibility?
9. What is meant by the term “a primitive set of operations”?
10. What is a fully relational database?
11. How well does OpenOffice’s database satisfy Codd’s rules?
12. Use relational algebra with the given data model to solve the following queries:
Column definition
The column definition block defines the columns in a table. Each column definition
consists of a column name, data type, and optionally the specification that the
column cannot contain null values. The general form is
(column-definition [, …])
where column-definition is of the form
column-name data-type [NOT NULL]
The NOT NULL clause specifies that the particular column must have a value
whenever a new row is inserted.
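For example, the column definition block for a stock table along the lines of the one used later in this chapter might be written as follows (a sketch; the data types and widths are illustrative assumptions):
CREATE TABLE stock (
	stkcode CHAR(3) NOT NULL,
	stkfirm VARCHAR(20),
	stkprice DECIMAL(6,2),
	stkqty DECIMAL(8),
	stkdiv DECIMAL(5,2),
	stkpe DECIMAL(5),
	PRIMARY KEY(stkcode));
Here stkcode must always have a value, while the remaining columns may be left null.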
Constraints
A constraint is a rule defined as part of CREATE TABLE that defines valid sets of values
for a base table by placing limits on INSERT, UPDATE, and DELETE operations.
Constraints can be named (e.g., fk_stock_nation) so that they can be turned on or off and
modified. The four constraint variations apply to primary key, foreign key, unique
values, and range checks.
Primary key constraint
The primary key constraint block specifies a set of columns that constitute the
primary key. Once a primary key is defined, the system enforces its uniqueness by
checking that the primary key of any new row does not already exist in the table. A
table can have only one primary key. While it is not mandatory to define a primary
key, it is good practice always to define a table’s primary key, though it is not that
common to name the constraint. The general form of the constraint is
[primary-key-name] PRIMARY KEY(column-name [ASC|DESC] [, …])
The optional ASC or DESC clause specifies whether the values from this key are
arranged in ascending or descending order, respectively.
For example:
pk_stock PRIMARY KEY(stkcode)
Foreign key constraint
The referential constraint block defines a foreign key, which consists of one or
more columns in the table that together must match a primary key of the specified
table (or else be null). A foreign key value is null when any one of the columns in
the row constituting the foreign key is null. Once the foreign key constraint is
defined, the DBMS will check every insert and update to ensure that the constraint is
observed. The general form is
CONSTRAINT constraint-name FOREIGN KEY(column-name [,…])
REFERENCES table-name(column-name [,…])
[ON DELETE (RESTRICT | CASCADE | SET NULL)]
The constraint-name defines a referential constraint. You cannot use the same
constraint-name more than once in the same table. Column-name identifies the
column or columns that comprise the foreign key. The data type and length of
foreign key columns must match exactly the data type and length of the primary key
columns. The clause REFERENCES table-name specifies the name of an existing table
that contains the primary key to be matched; this cannot be the name of the
table being created.
The ON DELETE clause defines the action taken when a row is deleted from the table
containing the primary key. There are three options:
1. RESTRICT prevents deletion of the primary key row until all corresponding rows
in the related table, the one containing the foreign key, have been deleted.
RESTRICT is the default and the cautious approach for preserving data integrity.
2. CASCADE causes all the corresponding rows in the related table also to be deleted.
3. SET NULL sets the foreign key to null for all corresponding rows in the related
table.
For example:
CONSTRAINT fk_stock_nation FOREIGN KEY(natcode)
REFERENCES nation(natcode)
Unique constraint
A unique constraint creates a unique index for the specified column or columns. A
unique key is constrained so that no two of its values are equal. Columns appearing
in a unique constraint must be defined as NOT NULL. Also, these columns should not
be the same as those of the table’s primary key, whose uniqueness is already
guaranteed by its primary key definition. The constraint is enforced by the DBMS during execution
of INSERT and UPDATE statements. The general format is
CONSTRAINT constraint-name UNIQUE(column-name [ASC|DESC] [, …])
An example follows:
CONSTRAINT unq_stock_stkname UNIQUE(stkname)
Check constraint
A check constraint defines a set of valid values and comes in three forms: table,
column, or domain constraint.
Table constraints are defined in CREATE TABLE and ALTER TABLE statements. They can
be set for one or more columns. A table constraint, for example, might limit itemcode
to values less than 500.
CREATE TABLE item (
itemcode integer,
CONSTRAINT chk_item_itemcode check(itemcode <500));
A column constraint is defined in a CREATE TABLE statement and must be for a single
column. Once created, a column constraint is, in effect, a table constraint. The
following example shows that there is little difference between defining a table and
column constraint. In the following case, itemcode is again limited to values less than
500.
CREATE TABLE item (
itemcode INTEGER CONSTRAINT chk_item_itemcode CHECK(itemcode <500),
itemcolor VARCHAR(10));
A domain constraint defines the legal values for a type of object (e.g., an item’s
color) that can be defined in one or more tables. The following example sets legal
values of color.
CREATE domain valid_color AS char(10)
CONSTRAINT chk_qitem_color check(
VALUE IN ('Bamboo','Black','Brown','Green','Khaki','White'));
In a CREATE TABLE statement, instead of defining a column’s data type, you specify
the domain on which the column is defined. The following example shows that
itemcolor is defined on the domain VALID_COLOR.
CREATE table item (
itemcode INTEGER,
itemcolor VALID_COLOR);
Data types
Some of the variety of data types that can be used are depicted in the following
figure and described in more detail in the following pages.
Data types
BOOLEAN
Boolean data types can have the values true, false, or unknown.
SMALLINT and INTEGER
Most commercial computers have a 32-bit word, where a word is a unit of storage.
An integer can be stored in a full word or half a word. If it is stored in a full word
(INTEGER), then it can be 31 binary digits in length. If half-word storage is used
(SMALLINT), then it can be 15 binary digits long. In each case, one bit is used for the
sign of the number. A column defined as INTEGER can store a number in the range
–2^31 to 2^31–1, or –2,147,483,648 to 2,147,483,647. A column defined as SMALLINT can
store a number in the range –2^15 to 2^15–1, or –32,768 to 32,767.
FLOAT
Scientists often deal with numbers that are very large (e.g., Avogadro’s number is
6.02252 × 10^23) or very small (e.g., Planck’s constant is 6.6262 × 10^–34 joule sec). The
FLOAT data type is used for storing such numbers, often referred to as floating-point
numbers. A single-precision floating-point number requires 32 bits of storage and
can represent numbers in the range –7.2 × 10^75 to –5.4 × 10^–79, 0, and 5.4 × 10^–79 to
7.2 × 10^75. A double-precision floating-point number requires 64 bits. The range is the
same as for a single-precision floating-
point number. The extra 32 bits are used to increase precision to about 15 decimal
digits.
In the specification FLOAT(n), if n is between 1 and 21 inclusive, single-precision
floating-point is selected. If n is between 22 and 53 inclusive, the storage format is
double-precision floating-point. If n is not specified, double-precision floating-
point is assumed.
DECIMAL
Binary is the most convenient form of storing data from a computer’s perspective.
People, however, work with a decimal number system. The DECIMAL data type is
convenient for business applications because data storage requirements are defined
in terms of the maximum number of places to the left and right of the decimal point.
To store the current value of an ounce of gold, you would possibly use DECIMAL(6,2)
because this would permit a maximum value of $9,999.99. Notice that the general
form is DECIMAL(p,q), where p is the total number of digits in the column, and q is the
number of digits to the right of the decimal point.
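For example, a table recording the price of an ounce of each precious metal might declare the price column as DECIMAL(6,2) (the table and column names are illustrative):
CREATE TABLE metal (
	metalname VARCHAR(20) NOT NULL,
	ounceprice DECIMAL(6,2),
	PRIMARY KEY(metalname));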
CHAR and VARCHAR
Nonnumeric columns are stored as character strings. A person’s family name is an
example of a column that is stored as a character string. CHAR(n) defines a column
that has a fixed length of n characters, where n can be a maximum of 255.
When a column’s length can vary greatly, it makes sense to define the field as
VARCHAR or TEXT. A column defined as VARCHAR consists of two parts: a header
indicating the length of the character string and the string. If a table contains a
column that occasionally stores a long string of text (e.g., a message field), then
defining it as VARCHAR makes sense. TEXT can store strings up to 65,535 characters
long.
Why not store all character columns as VARCHAR and save space? There is a price
for using VARCHAR with some relational systems. First, additional space is required
for the header to indicate the length of the string. Second, additional processing time
is required to handle a variable-length string compared to a fixed-length string.
Depending on the DBMS and processor speed, these might be important
considerations, and some systems will automatically make an appropriate choice.
For example, if you use both data types in the same table, MySQL will
automatically change CHAR into VARCHAR for compatibility reasons.
There are some columns where there is no trade-off decision because all possible
entries are always the same length. U.S. Social Security numbers are always nine
characters.
Data compression is another approach to the space wars problem. A database can be
defined with generous allowances for fixed-length character columns so that few
values are truncated. Data compression can be used to compress the file to remove
wasted space. Data compression, however, is slow and will increase the time to
process queries. You save space at the cost of time, and save time at the cost of
space. When dealing with character fields, the database designer has to decide
whether time or space is more important.
Times and dates
Columns that have a data type of DATE are stored as yyyymmdd (e.g., 2012-11-04 for
November 4, 2012). There are two reasons for this format. First, it is convenient for
sorting in chronological order. The common American way of writing dates
(mmddyy) requires processing before chronological sorting. Second, the full form
of the year should be recorded for exactness.
For similar reasons, it makes sense to store times in the form hhmmss with the
understanding that this is 24-hour time (also known as European time and military
time). This is the format used for data type TIME.
Some applications require precise recording of events. For example, transaction
processing systems typically record the time a transaction was processed by the
system. Because computers operate at high speed, the TIMESTAMP data type records
date and time with microsecond accuracy. A timestamp has seven parts—year,
month, day, hour, minute, second, and microsecond. Date and time are defined as
previously described (i.e., yyyymmdd and hhmmss, respectively). The range of the
microsecond part is 000000 to 999999.
Although times and dates are stored in a particular format, the formatting facilities
that generally come with a DBMS usually allow tailoring of time and date output to
suit local standards. Thus for a U.S. firm, date might appear on a report in the form
mm/dd/yy; for a European firm following the ISO standard, date would appear as
yyyy-mm-dd.
SQL-99 introduced the INTERVAL data type, which is a single value expressed in
some unit or units of time (e.g., 6 years, 5 days, 7 hours).
BLOB (binary large object)
BLOB is a large-object data type that stores any kind of binary data. Binary data
typically consists of a saved spreadsheet, graph, audio file, satellite image, voice
pattern, or any digitized data. The BLOB data type has no maximum size.
CLOB (character large object)
CLOB is a large-object data type that stores any kind of character data. Text data
typically consists of reports, correspondence, chapters of a manual, or contracts.
The CLOB data type has no maximum size.
Skill builder
What data types would you recommend for the following?
1. A book’s ISBN
2. A photo of a product
3. The speed of light (2.9979 × 10^8 meters per second)
A vendor’s additional functions can be very useful. Remember, though, that use of a
vendor’s extensions might limit portability.
• How many days’ sales are stored in the sale table?
This sounds like a simple query, but you have to do a self-join and also know that
there is a function, DATEDIFF, to determine the number of days between any two dates.
Consult your DBMS manual to learn about other functions for dealing with dates
and times.
SELECT DATEDIFF(late.saledate,early.saledate) AS difference
FROM sale late, sale early
WHERE late.saledate =
(SELECT MAX(saledate) FROM sale)
AND early.saledate =
(SELECT MIN(saledate) FROM sale);
difference
1
The preceding query is based on the idea of joining sale with a copy of itself. The
matching column from late is the latest sale’s date (or MAX), and the matching
column from early is the earliest sale’s date (or MIN). As a result of the join, each
row of the new table has both the earliest and latest dates.
Formatting
You will likely have noticed that some queries report numeric values with a varying
number of decimal places. The FORMAT function gives you control over the number
of decimal places reported, as illustrated in the following example where yield is
reported with two decimal places.
SELECT shrfirm, shrprice, shrqty, FORMAT(shrdiv/shrprice*100,2) AS yield
FROM share;
Altering a table
The ALTER TABLE statement has two purposes. First, it can add a single column to an
existing table. Second, it can add, drop, activate, or deactivate primary and foreign
key constraints. A base table can be altered by adding one new column, which
appears to the right of existing columns. The format of the command is
ALTER TABLE base-table ADD column data-type;
Notice that there is no optional NOT NULL clause for column-definition with ALTER
TABLE. It is not allowed because the ALTER TABLE statement automatically fills the
additional column with null in every case. If you want to add multiple columns, you
repeat the command. ALTER TABLE does not permit changing the width of a column
or amending a column’s data type. It can be used for deleting an unwanted column.
ALTER TABLE stock ADD stkrating CHAR(3);
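Where a DBMS supports removing a column, the statement typically takes the following form (a sketch; consult your DBMS manual for the exact syntax supported):
ALTER TABLE stock DROP COLUMN stkrating;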
ALTER TABLE is also used to change the status of referential constraints. You can
deactivate constraints on a table’s primary key or any of its foreign keys.
Deactivation also makes the relevant tables unavailable to all users except the table’s
owner or someone possessing database management authority. After the data are
loaded, referential constraints must be reactivated before they can be automatically
enforced again. Activating the constraints enables the DBMS to validate the
references in the data.
Dropping a table
A base table can be deleted at any time by using the DROP statement. The format is
DROP TABLE base-table;
The table is deleted, and any views or indexes defined on the table are also deleted.
Creating a view
A view is a virtual table. It has no physical counterpart but appears to the client as if
it really exists. A view is defined in terms of other tables that exist in the database.
The syntax is
CREATE VIEW view [(column [,column] …)]
AS subquery;
There are several reasons for creating a view. First, a view can be used to restrict
access to certain rows or columns. This is particularly important for sensitive data.
An organization’s person table can contain both private data (e.g., annual salary) and
public data (e.g., office phone number). A view consisting of public data (e.g.,
person’s name, department, and office telephone number) might be provided to
many people. Access to all columns in the table, however, might be confined to a
small number of people. Here is a sample view that restricts access to a table.
CREATE VIEW stklist
AS SELECT stkfirm, stkprice FROM stock;
Handling derived data is a second reason for creating a view. A column that can be
computed from one or more other columns should always be defined by a view.
Thus, a stock’s yield would be computed by a view rather than defined as a column
in a base table.
CREATE VIEW stk
(stkfirm, stkprice, stkqty, stkyield)
AS SELECT stkfirm, stkprice, stkqty, stkdiv/stkprice*100
FROM stock;
A third reason for defining a view is to avoid writing common SQL queries. For
example, there may be some joins that are frequently part of an SQL query.
Productivity can be increased by defining these joins as views. Here is an example:
CREATE VIEW stkvalue
(nation, firm, price, qty, value)
AS SELECT natname, stkfirm, stkprice*exchrate, stkqty,
stkprice*exchrate*stkqty FROM stock, nation
WHERE stock.natcode = nation.natcode;
The preceding example demonstrates how CREATE VIEW can be used to rename
columns, create new columns, and involve more than one table. The column nation
corresponds to natname, firm to stkfirm, and so forth. A new column, price, is created that
converts all share prices from the local currency to British pounds.
Data conversion is a fourth useful reason for a view. The United States is one of the
few countries that does not use the metric system, and reports for American
managers often display weights and measures in pounds and feet, respectively. The
database of an international company could record all measurements in metric
format (e.g., weight in kilograms) and use a view to convert these measures for
American reports.
When a CREATE VIEW statement is executed, the definition of the view is entered in the
system catalog. The subquery following AS, the view definition, is executed only
when the view is referenced in an SQL command. For example, the following
command causes the subquery to be executed and the virtual table created:
SELECT * FROM stkvalue WHERE price > 10;
In effect, the following query is executed:
SELECT natname, stkfirm, stkprice*exchrate, stkqty, stkprice*exchrate*stkqty
FROM stock, nation
WHERE stock.natcode = nation.natcode
AND stkprice*exchrate > 10;
Any table that can be defined with a SELECT statement is a potential view. Thus, it is
possible to have a view that is defined by another view.
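For example, a view could be defined on the stkvalue view created earlier to report only the larger holdings (the view name and threshold are illustrative):
CREATE VIEW stkbigvalue
	AS SELECT * FROM stkvalue WHERE value > 100000;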
Dropping a view
DROP VIEW is used to delete a view from the system catalog. A view might be
dropped because it needs to be redefined or is no longer used. It must be dropped
before a revised version of the view is created. The syntax is
DROP VIEW view;
Remember, if a base table is dropped, all views based on that table are also dropped.
Creating an index
An index helps speed up retrieval (a more detailed discussion of indexing is
covered later in this book). A column that is frequently referred to in a WHERE
clause is a possible candidate for indexing. For example, if data on stocks were
frequently retrieved using stkfirm, then this column should be considered for an
index. The format for CREATE INDEX is
CREATE [UNIQUE] INDEX indexname
ON base-table (column [order] [,column, [order]] …)
[CLUSTER];
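For example, an index on stkfirm could be created as follows (the index name is illustrative):
CREATE INDEX stkfirmindx ON stock (stkfirm);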
Dropping an index
Indexes can be dropped at any time by using the DROP INDEX statement. The general
form of this statement is
DROP INDEX index;
Data manipulation
SQL supports four DML statements—SELECT, INSERT, UPDATE, and DELETE. Each of
these will be discussed in turn, with most attention focusing on SELECT because of the
variety of ways in which it can be used. First, we need to understand why we must
qualify column names and temporary names.
Qualifying column names
Ambiguous references to column names are avoided by qualifying a column name
with its table name, especially when the same column name is used in several tables.
Clarity is maintained by prefixing the column name with the table name. The
following example demonstrates qualification of the natcode, which appears in both
stock and nation.
SELECT * FROM stock, nation
WHERE stock.natcode = nation.natcode;
Temporary names
A table or view can be given a temporary name, or alias, that remains current for a
query. Temporary names are used in a self-join to distinguish the copies of the table.
SELECT wrk.empfname
FROM emp wrk, emp boss
WHERE wrk.bossno = boss.empno;
A temporary name also can be used as a shortened form of a long table name. For
example, l might be used merely to avoid having to enter lineitem more than once. If a
temporary name is specified for a table or view, any qualified reference to a column
of the table or view must also use that temporary name.
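For example, the following query uses l as a shorthand for lineitem, so every qualified column reference also uses l (a sketch based on the lineitem columns used elsewhere in this chapter):
SELECT l.itemno, l.saleno, l.lineqty
	FROM lineitem l
	WHERE l.lineqty > 1;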
SELECT
The SELECT statement is by far the most interesting and challenging of the four DML
statements. It reveals a major benefit of the relational model—powerful
interrogation capabilities. It is challenging because mastering the power of SELECT
requires considerable practice with a wide range of queries. The major varieties of
SELECT are presented in this section. The SQL Playbook, which follows this chapter,
reveals the full power of the command.
The general format of SELECT is
SELECT [DISTINCT] item(s) FROM table(s)
[WHERE condition]
[GROUP BY column(s)] [HAVING condition]
[ORDER BY column(s)];
Product
Product, or more strictly Cartesian product, is a fundamental operation of relational
algebra. It is rarely used by itself in a query; however, understanding its effect helps
in comprehending join. The product of two tables is a new table consisting of all
rows of the first table concatenated with all possible rows of the second table. For
example:
• Form the product of stock and nation.
SELECT * FROM stock, nation;
The new table contains 64 rows (16*4), where stock has 16 rows and nation has 4 rows.
It has 10 columns (7 + 3), where stock has 7 columns and nation has 3 columns. The
result of the operation is shown in the following table. Note that each row in stock is
concatenated with each row in nation.
Product of stock and nation
stkcode stkfirm stkprice stkqty stkdiv stkpe natcode natcode natname exchrate
FC Freedonia Copper 27.50 10529 1.84 16 UK UK United Kingdom 1.0000
FC Freedonia Copper 27.50 10529 1.84 16 UK US United States 0.6700
FC Freedonia Copper 27.50 10529 1.84 16 UK AUS Australia 0.4600
FC Freedonia Copper 27.50 10529 1.84 16 UK IND India 0.0228
PT Patagonian Tea 55.25 12635 2.50 10 UK UK United Kingdom 1.0000
PT Patagonian Tea 55.25 12635 2.50 10 UK US United States 0.6700
PT Patagonian Tea 55.25 12635 2.50 10 UK AUS Australia 0.4600
PT Patagonian Tea 55.25 12635 2.50 10 UK IND India 0.0228
AR Abyssinian Ruby 31.82 22010 1.32 13 UK UK United Kingdom 1.0000
AR Abyssinian Ruby 31.82 22010 1.32 13 UK US United States 0.6700
AR Abyssinian Ruby 31.82 22010 1.32 13 UK AUS Australia 0.4600
AR Abyssinian Ruby 31.82 22010 1.32 13 UK IND India 0.0228
SLG Sri Lankan Gold 50.37 32868 2.68 16 UK UK United Kingdom 1.0000
SLG Sri Lankan Gold 50.37 32868 2.68 16 UK US United States 0.6700
SLG Sri Lankan Gold 50.37 32868 2.68 16 UK AUS Australia 0.4600
SLG Sri Lankan Gold 50.37 32868 2.68 16 UK IND India 0.0228
ILZ Indian Lead & Zinc 37.75 6390 3.00 12 UK UK United Kingdom 1.0000
ILZ Indian Lead & Zinc 37.75 6390 3.00 12 UK US United States 0.6700
ILZ Indian Lead & Zinc 37.75 6390 3.00 12 UK AUS Australia 0.4600
ILZ Indian Lead & Zinc 37.75 6390 3.00 12 UK IND India 0.0228
BE Burmese Elephant 0.07 154713 0.01 3 UK UK United Kingdom 1.0000
BE Burmese Elephant 0.07 154713 0.01 3 UK US United States 0.6700
BE Burmese Elephant 0.07 154713 0.01 3 UK AUS Australia 0.4600
BE Burmese Elephant 0.07 154713 0.01 3 UK IND India 0.0228
BS Bolivian Sheep 12.75 231678 1.78 11 UK UK United Kingdom 1.0000
BS Bolivian Sheep 12.75 231678 1.78 11 UK US United States 0.6700
BS Bolivian Sheep 12.75 231678 1.78 11 UK AUS Australia 0.4600
BS Bolivian Sheep 12.75 231678 1.78 11 UK IND India 0.0228
NG Nigerian Geese 35.00 12323 1.68 10 UK UK United Kingdom 1.0000
NG Nigerian Geese 35.00 12323 1.68 10 UK US United States 0.6700
NG Nigerian Geese 35.00 12323 1.68 10 UK AUS Australia 0.4600
NG Nigerian Geese 35.00 12323 1.68 10 UK IND India 0.0228
CS Canadian Sugar 52.78 4716 2.50 15 UK UK United Kingdom 1.0000
CS Canadian Sugar 52.78 4716 2.50 15 UK US United States 0.6700
CS Canadian Sugar 52.78 4716 2.50 15 UK AUS Australia 0.4600
CS Canadian Sugar 52.78 4716 2.50 15 UK IND India 0.0228
ROF Royal Ostrich Farms 33.75 1234923 3.00 6 UK UK United Kingdom 1.0000
ROF Royal Ostrich Farms 33.75 1234923 3.00 6 UK US United States 0.6700
ROF Royal Ostrich Farms 33.75 1234923 3.00 6 UK AUS Australia 0.4600
ROF Royal Ostrich Farms 33.75 1234923 3.00 6 UK IND India 0.0228
MG Minnesota Gold 53.87 816122 1.00 25 US UK United Kingdom 1.0000
MG Minnesota Gold 53.87 816122 1.00 25 US US United States 0.6700
MG Minnesota Gold 53.87 816122 1.00 25 US AUS Australia 0.4600
MG Minnesota Gold 53.87 816122 1.00 25 US IND India 0.0228
GP Georgia Peach 2.35 387333 0.20 5 US UK United Kingdom 1.0000
GP Georgia Peach 2.35 387333 0.20 5 US US United States 0.6700
GP Georgia Peach 2.35 387333 0.20 5 US AUS Australia 0.4600
GP Georgia Peach 2.35 387333 0.20 5 US IND India 0.0228
NE Narembeen Emu 12.34 45619 1.00 8 AUS UK United Kingdom 1.0000
NE Narembeen Emu 12.34 45619 1.00 8 AUS US United States 0.6700
NE Narembeen Emu 12.34 45619 1.00 8 AUS AUS Australia 0.4600
NE Narembeen Emu 12.34 45619 1.00 8 AUS IND India 0.0228
QD Queensland Diamond 6.73 89251 0.50 7 AUS UK United Kingdom 1.0000
QD Queensland Diamond 6.73 89251 0.50 7 AUS US United States 0.6700
QD Queensland Diamond 6.73 89251 0.50 7 AUS AUS Australia 0.4600
QD Queensland Diamond 6.73 89251 0.50 7 AUS IND India 0.0228
IR Indooroopilly Ruby 15.92 56147 0.50 20 AUS UK United Kingdom 1.0000
IR Indooroopilly Ruby 15.92 56147 0.50 20 AUS US United States 0.6700
IR Indooroopilly Ruby 15.92 56147 0.50 20 AUS AUS Australia 0.4600
IR Indooroopilly Ruby 15.92 56147 0.50 20 AUS IND India 0.0228
BD Bombay Duck 25.55 167382 1.00 12 IND UK United Kingdom 1.0000
BD Bombay Duck 25.55 167382 1.00 12 IND US United States 0.6700
BD Bombay Duck 25.55 167382 1.00 12 IND AUS Australia 0.4600
BD Bombay Duck 25.55 167382 1.00 12 IND IND India 0.0228
• Find the percentage of Australian stocks in the portfolio.
To answer this query, you need to count the number of Australian stocks, count the
total number of stocks in the portfolio, and then compute the percentage. Computing
each of the totals is a straightforward application of count. If we save the results of
the two counts as views, then we have the necessary data to compute the percentage.
The two views each consist of a single-cell table (i.e., one row and one column). We
create the product of these two views to get the data needed for computing the
percentage in one row. The SQL is
CREATE VIEW austotal (auscount) AS
SELECT COUNT(*) FROM stock WHERE natcode = 'AUS';
CREATE VIEW total (totalcount) AS
SELECT COUNT(*) FROM stock;
SELECT auscount*100/totalcount as Percentage
FROM austotal, total;
The result of a COUNT is always an integer, and SQL will typically create an
integer data type in which to store the results. When two variables have a data type
of integer, SQL will likely use integer arithmetic for all computations, and all
results will be integer. To get around the issue of integer arithmetic, we first
multiply the number of Australian stocks by 100 before dividing by the total number
of stocks. Because of integer arithmetic, you might get a different answer if you
used the following SQL.
SELECT auscount/totalcount*100 as Percentage
FROM austotal, total;
The preceding example was used to show when you might find product useful. You
can also write the query as
SELECT (SELECT COUNT(*) FROM stock WHERE natcode = 'AUS')*100/
(SELECT COUNT(*) FROM stock) as Percentage;
Join
Join, a very powerful and frequently used operation, creates a new table from two
existing tables by matching on a column common to both tables. An equijoin is the
simplest form of join; in this case, columns are matched on equality. The shaded
rows in the preceding table indicate the result of the following join example.
SELECT * FROM stock, nation
WHERE stock.natcode = nation.natcode;
There are other ways of expressing join that are more concise. For example, we can
write
SELECT * FROM stock INNER JOIN nation USING (natcode);
The preceding syntax makes the join easier to detect when reading SQL and
implicitly recognizes the frequent use of the same column name for matching
primary and foreign keys.
A further simplification is to rely on the primary and foreign key definitions to
determine the join condition, so we can write
SELECT * FROM stock natural JOIN nation;
An equijoin creates a new table that contains two identical columns. If one of these is
dropped, then the remaining table is called a natural join.
As you now realize, join is really product with a condition clause. There is no
reason why this condition needs to be restricted to equality. There could easily be
another comparison operator between the two columns. This general version is
called a theta-join because theta is a variable that can take any value from the set [=,
<>, >, >=, <, <=].
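For example, a theta-join with a greater-than condition could pair each stock with every stock that has a lower price (an illustrative sketch, which also happens to be a self-join of the kind discussed next):
SELECT a.stkfirm, a.stkprice, b.stkfirm, b.stkprice
	FROM stock a, stock b
	WHERE a.stkprice > b.stkprice;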
As you discovered earlier, there are occasions when you need to join a table to
itself. To do this, make two copies of the table first and give each of these copies a
unique name.
• Find the names of employees who earn more than their boss.
SELECT wrk.empfname
FROM emp wrk, emp boss
WHERE wrk.bossno = boss.empno
AND wrk.empsalary > boss.empsalary;
Outer join
A traditional join, more formally known as an inner join, reports those rows where
the primary and foreign keys match. An outer join reports these matching rows and
others depending on which form is used, as the following examples illustrate for the
sample table.
t1 t2
id col1 id col2
1 a 1 x
2 b 3 y
3 c 5 z
A left outer join is an inner join plus those rows from t1 not included in the inner
join.
SELECT * FROM t1 LEFT JOIN t2 USING (id);
t1.id col1 t2.id col2
1 a 1 x
2 b null null
3 c 3 y
Here is an example to illustrate the use of a left join.
• For all brown items, report each sale. Include in the report those brown items that
have appeared in no sales.
SELECT itemname, saleno, lineqty FROM item
LEFT JOIN lineitem USING ( itemno )
WHERE itemcolor = 'Brown'
ORDER BY itemname;
A right outer join is an inner join plus those rows from t2 not included in the inner
join.
SELECT * FROM t1 RIGHT JOIN t2 USING (id);
t1.id col1 t2.id col2
1 a 1 x
3 c 3 y
null null 5 z
A full outer join is an inner join plus those rows from t1 and t2 not included in the
inner join.
SELECT * FROM t1 FULL JOIN t2 USING (id);
Simple subquery
A subquery is a query within a query. There is a SELECT statement nested inside
another SELECT statement. Simple subqueries were used extensively in earlier
chapters. For reference, here is a simple subquery used earlier:
SELECT stkfirm FROM stock
WHERE natcode IN
(SELECT natcode FROM nation
WHERE natname = 'Australia');
Correlated subquery
A correlated subquery differs from a simple subquery in that the inner query must
be evaluated more than once. Consider the following example described previously:
• Find those stocks where the quantity is greater than the average for that country.
SELECT natname, stkfirm, stkqty FROM stock, nation
WHERE stock.natcode = nation.natcode
AND stkqty >
(SELECT AVG(stkqty) FROM stock
WHERE stock.natcode = nation.natcode);
The HAVING clause is often associated with GROUP BY. It can be thought of as the
WHERE clause of GROUP BY because it is used to eliminate rows for a GROUP BY
condition. Both GROUP BY and HAVING are dealt with in-depth in Chapter 4.
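For example, the following query reports only those nations with more than two stocks in the portfolio (an illustrative sketch; see Chapter 4 for full coverage):
SELECT natcode, COUNT(*) AS stocks
	FROM stock
	GROUP BY natcode
	HAVING COUNT(*) > 2;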
REGEXP
The REGEXP clause supports pattern matching to find a defined set of strings in a
character column (CHAR or VARCHAR). Refer to Chapters 3 and 4 for more details.
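For example, the following query lists firms whose name starts with the letter F (the pattern is illustrative):
SELECT stkfirm FROM stock
	WHERE stkfirm REGEXP '^F';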
INSERT
There are two formats for INSERT. The first format is used to insert one row into a
table.
Inserting a single record
The general form is
INSERT INTO table [(column [,column] …)]
VALUES (literal [,literal] …);
For example,
INSERT INTO stock
(stkcode,stkfirm,stkprice,stkqty,stkdiv,stkpe)
VALUES ('FC','Freedonia Copper',27.5,10529,1.84,16);
In this example, stkcode is given the value “FC,” stkfirm is “Freedonia Copper,” and so
on. In general, the nth column in the table is the nth value in the list.
When the value list refers to all field names in the left-to-right order in which they
appear in the table, then the columns list can be omitted. So, it is possible to write the
following:
INSERT INTO stock
VALUES ('FC','Freedonia Copper',27.5,10529,1.84,16);
If some values are unknown, then the INSERT can omit these from the list. Undefined
columns will have nulls. For example, if a new stock is to be added for which the
price, dividend, and PE ratio are unknown, the following INSERT statement would be
used:
INSERT INTO stock
(stkcode, stkfirm, stkqty)
VALUES ('EE','Elysian Emeralds',0);
Think of INSERT with a subquery as a way of copying a table. You can select the rows
and columns of a particular table that you want to copy into an existing or new table.
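A sketch of this form, assuming a previously created table ukstock with the same columns as stock, is:
INSERT INTO ukstock
	SELECT * FROM stock
	WHERE natcode = 'UK';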
UPDATE
The UPDATE command is used to modify values in a table. The general format is
UPDATE table
SET column = scalar expression
[, column = scalar expression] …
[WHERE condition];
Permissible scalar expressions involve columns, scalar functions (see the section on
scalar functions in this chapter), or constants. No aggregate functions are allowable.
Updating a single row
UPDATE can be used to modify a single row in a table. Suppose you need to revise
your data after 200,000 shares of Minnesota Gold are sold. You would code the
following:
UPDATE stock
SET stkqty = stkqty - 200000
WHERE stkcode = 'MG';
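A single UPDATE can also change many rows at once. For example (an illustrative sketch), to increase the price of every Australian stock by 10 percent:
UPDATE stock
	SET stkprice = stkprice*1.10
	WHERE natcode = 'AUS';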
DELETE
The DELETE statement erases one or more rows in a table. The general format is
DELETE FROM table
[WHERE condition];
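For example, to remove all Australian stocks from the portfolio (an illustrative sketch):
DELETE FROM stock
	WHERE natcode = 'AUS';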
SQL routines
SQL provides two types of routines—functions and procedures—that are created,
altered, and dropped using standard SQL. Routines add flexibility, improve
programmer productivity, and facilitate the enforcement of business rules and
standard operating procedures across applications.
SQL function
A function is SQL code that returns a value when invoked within an SQL statement.
It is used in a similar fashion to SQL’s built-in functions. Consider the case of an
Austrian firm with a database in which all measurements are in SI units (e.g.,
meters). Because its U.S. staff is not familiar with SI, it decides to implement a
series of user-defined functions to handle the conversion. Here is the function for
converting from kilometers to miles.
CREATE FUNCTION km_to_miles(km REAL)
RETURNS REAL
RETURN 0.6213712*km;
The preceding function can be used within any SQL statement to make the
conversion. For example:
SELECT km_to_miles(100);
Skill builder
Create a table containing the average daily temperature in Tromsø, Norway, then
write a function to convert Celsius to Fahrenheit (F = C*1.8 + 32), and test the
function by reporting temperatures in C and F.
Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
ºC -4.7 -4.1 -1.9 1.1 5.6 10.1 12.7 11.8 7.7 2.9 -1.5 -3.7
SQL procedure
A procedure is SQL code that is dynamically loaded and executed by a CALL
statement, usually within a database application. We use an accounting system to
demonstrate the features of a stored procedure, in which a single accounting
transaction results in two entries (one debit and one credit). In other words, a
transaction has multiple entries, but an entry is related to only one transaction. An
account (e.g., your bank account) has multiple entries, but an entry is for only one
account. Considering this situation results in the following data model.
A simple accounting system
The following are a set of steps for processing a transaction (e.g., transferring
money from a checking account to a money market account):
1. Write the transaction to the transaction table so you have a record of the transaction.
2. Update the account to be credited by incrementing its balance in the account table.
3. Insert a row in the entry table to record the credit.
4. Update the account to be debited by decrementing its balance in the account table.
5. Insert a row in the entry table to record the debit.
Here is the code for a stored procedure to execute these steps. Note that the first line
sets the delimiter to // because the default delimiter for SQL is a semicolon (;),
which we need to use to delimit the multiple SQL commands in the procedure. The
last statement in the procedure is thus END // to indicate the end of the procedure.
DELIMITER //
CREATE PROCEDURE transfer (
IN cracct INTEGER,
IN dbacct INTEGER,
IN amt DECIMAL(9,2),
IN transno INTEGER)
LANGUAGE SQL
BEGIN
INSERT INTO transaction VALUES (transno, amt, current_date);
UPDATE account
SET acctbalance = acctbalance + amt
WHERE acctno = cracct;
INSERT INTO entry values(transno, cracct, 'cr');
UPDATE account
SET acctbalance = acctbalance - amt
WHERE acctno = dbacct;
INSERT INTO entry VALUES (transno, dbacct, 'db');
END //
A CALL statement executes a stored procedure. The generic CALL statement for the
preceding procedure is
CALL transfer(cracct, dbacct, amt, transno);
Thus, imagine that transaction 1005 transfers $100 to account 1 (the credit account)
from account 2 (the debit account). The specific call is
CALL transfer(1,2,100,1005);
Skill builder
1. After verifying that the system you use for learning SQL supports stored
procedures, create the tables for the preceding data model and enter the code for
the stored procedure. Now, test the stored procedure and query the tables to verify
that the procedure has worked.
2. Write a stored procedure to add details of a gift to the donation database (see
exercises in Chapter 5).
Triggers
Triggers are a form of stored procedure that execute automatically when a table’s
rows are modified. Triggers can be defined to execute either before or after rows
are inserted into a table, when rows are deleted from a table, and when columns are
updated in the rows of a table. Triggers can include virtual tables that reflect the row
image before and after the operation, as appropriate. Triggers can be used to
enforce business rules or requirements, integrity checking, and automatic
transaction logging.
Consider the case of recording all updates to the stock table (see Chapter 4). First,
you must define a table in which to record details of the change.
CREATE TABLE stock_log (
stkcode CHAR(3),
old_stkprice DECIMAL(6,2),
new_stkprice DECIMAL(6,2),
old_stkqty DECIMAL(8),
new_stkqty DECIMAL(8),
update_stktime TIMESTAMP NOT NULL,
PRIMARY KEY(update_stktime));
The trigger writes a record to stock_log every time an update is made to stock. Use is
made of two virtual tables (old and new) to access the prior and current values of
stock price (OLD.stkprice and NEW.stkprice) and stock quantity (OLD.stkqty and
NEW.stkqty). The INSERT statement also writes the stock’s identifying code and the
time of the transaction.
CREATE TRIGGER stock_update
AFTER UPDATE ON stock
FOR EACH ROW BEGIN
INSERT INTO stock_log VALUES
(OLD.stkcode, OLD.stkprice, NEW.stkprice, OLD.stkqty, NEW.stkqty, CURRENT_TIMESTAMP);
END
Skill builder
Why is the primary key of stock_log not the same as that of stock?
Nulls—much ado about missing information
Nulls are overworked in SQL because they can represent several situations. Null can
represent unknown information. For example, you might add a new stock to the
database, but lacking details of its latest dividend, you leave the field null. Null can
be used to represent a value that is inapplicable. For instance, the employee table
contains a null value in bossno for Alice because she has no boss. The value is not
unknown; it is not applicable for that field. In other cases, null might mean “no
value supplied” or “value undefined.” Because null can have multiple meanings, the
user must infer which meaning is appropriate to the circumstances.
Do not confuse null with blank or zero, which are values. In fact, null is a marker
that specifies that the value for the particular column is null. Thus, null represents
no value.
The well-known database expert Chris Date has been outspoken in his concern about
the confusion caused by nulls. His advice is that nulls should be explicitly avoided
by specifying NOT NULL for all columns and by using codes to make the meaning of
a value clear (e.g., “U” means “unknown,” “I” means “inapplicable,” “N” means
“not supplied”).
Security
Data are a valuable resource for nearly every organization. Just as an organization
takes measures to protect its physical assets, it also needs to safeguard its electronic
assets—its organizational memory, including databases. Furthermore, it often wants
to limit the access of authorized users to particular parts of a database and restrict
their actions to particular operations.
Two SQL features are used to administer security procedures. A view, discussed
earlier in this chapter, can restrict a client’s access to specified columns or rows in a
table, and authorization commands can establish a user’s privileges.
The authorization subsystem is based on the concept of a privilege—the authority to
perform an operation. For example, a person cannot update a table unless she has
been granted the appropriate update privilege. The database administrator (DBA) is
a master of the universe and has the highest privilege. The DBA can perform any
legal operation. The creator of an object, say a base table, has full privileges for that
object. Those with privileges can then use GRANT and REVOKE, commands included
in SQL’s data control language (DCL), to extend privileges to or rescind them from
other users.
GRANT
The GRANT command defines a user’s privileges. The general format of the
statement is
GRANT privileges ON object TO users [WITH GRANT OPTION];
where “privileges” can be a list of privileges or the keyword ALL PRIVILEGES, and
“users” is a list of user identifiers or the keyword PUBLIC. An “object” can be a base
table or a view.
The following privileges can be granted for tables and views: SELECT, UPDATE,
DELETE, and INSERT.
The UPDATE privilege specifies the particular columns in a base table or view that
may be updated. Some privileges apply only to base tables. These are ALTER and
INDEX.
The following examples illustrate the use of GRANT:
• Give Alice all rights to the stock table.
GRANT ALL PRIVILEGES ON stock TO alice;
• Permit the accounting staff, Todd and Nancy, to update the price of a stock.
GRANT UPDATE (stkprice) ON stock TO todd, nancy;
• Give all staff the privilege to select rows from item.
GRANT SELECT ON item TO PUBLIC;
• Give Alice all rights to view stk.
GRANT SELECT, UPDATE, DELETE, INSERT ON stk TO alice;
The WITH GRANT OPTION clause
The WITH GRANT OPTION clause allows a client to transfer his privileges to
another client, as this next example illustrates:
• Give Ned all privileges for the item table and permit him to grant any of these to
other staff members who may need to work with item.
GRANT ALL PRIVILEGES ON item TO ned WITH GRANT OPTION;
This means that Ned can now use the GRANT command to give other staff privileges.
To give Andrew permission for select and insert on item, for example, Ned would
enter
GRANT SELECT, INSERT ON item TO andrew;
REVOKE
What GRANT granteth, REVOKE revoketh. Privileges are removed using the REVOKE
statement. The general format of this statement is
REVOKE privileges ON object FROM users;
These examples illustrate the use of REVOKE.
• Remove Sophie’s ability to select from item.
REVOKE SELECT ON item FROM sophie;
• Nancy is no longer permitted to update stock prices.
REVOKE UPDATE ON stock FROM nancy;
Cascading revoke
When a REVOKE statement removes a privilege, it can result in more than one
revocation. An earlier example illustrated how Ned used his WITH GRANT OPTION
right to authorize Andrew to select and insert rows on item. The following REVOKE
command
REVOKE INSERT ON item FROM ned;
automatically revokes Andrew’s insert privilege.
The system catalog
The system catalog describes a relational database. It contains the definitions of base
tables, views, indexes, and so on. The catalog itself is a relational database and can
be interrogated using SQL. Tables in the catalog are called system tables to
distinguish them from base tables, though conceptually these tables are the same. In
MySQL, the system catalog is called information_schema. Some important system tables
in this schema are TABLES and COLUMNS, which are used in the following
examples. Note that the names of the system catalog tables vary with DBMS
implementations, so while the following examples illustrate use of system catalog
tables, it is likely that you will have to change the table names for other DBMSs.
The table TABLES contains details of all tables in the database. There is one row for
each table in the database.
• Find the table(s) with the most rows.
SELECT TABLE_NAME, TABLE_ROWS
FROM Information_schema.TABLES
WHERE TABLE_ROWS = (SELECT MAX(TABLE_ROWS)
FROM Information_schema.TABLES);
The COLUMNS table stores details about each column in the database.
• What columns in what tables store dates?
SELECT TABLE_NAME, COLUMN_NAME
FROM Information_schema.COLUMNS
WHERE DATA_TYPE = 'date'
ORDER BY TABLE_NAME, COLUMN_NAME;
As you can see, querying the catalog is the same as querying a database. This is a
useful feature because you can use SQL queries on the catalog to find out more
about a database.
Natural language processing
Infrequent inquirers of a relational database may be reluctant to use SQL because
they don’t use it often enough to remain familiar with the language. While the QBE
approach can make querying easier, a more natural approach is to use standard
English. In this case, natural language processing (NLP) is used to convert ordinary
English into SQL so the query can be passed to the relational database. The example
in the table below shows the successful translation of a query to SQL. A natural
language processor must translate a request to SQL and request clarification where
necessary.
An example of natural language processing
English: Which movies have won best foreign film sorted by year?
SQL generated for MS Access: SELECT DISTINCT [Year], [Title] FROM [Awards] INNER JOIN [Movie ID] = [Awards].[Movie ID] WHERE [Category]='Best Foreign Film' and [Status]='W
Connectivity and ODBC
Over time and because of differing needs, an organization is likely to purchase
DBMS software from a variety of vendors. Also, in some situations, mergers and
acquisitions can create a multivendor DBMS environment. Consequently, the SQL
Access Group developed SQL Call-Level Interface (CLI), a unified standard for
remote database access. The intention of CLI is to provide programmers with a
generic approach for writing software that accesses a database. With the appropriate
CLI database driver, any DBMS server can provide access to client programs that
use the CLI. On the server side, the DBMS CLI driver is responsible for translating
the CLI call into the server’s access language. On the client side, there must be a CLI
driver for each database to which it connects. CLI is not a query language but a way
of wrapping SQL so it can be understood by a DBMS. In 1996, CLI was adopted as
an international standard and renamed X/Open CLI.
Open database connectivity (ODBC)
The de facto standard for database connectivity is Open Database Connectivity
(ODBC), an extended implementation of CLI developed by Microsoft. This
application programming interface (API) is cross-platform and can be used to
access any DBMS or DBMS server that has an ODBC driver. This enables a software
developer to build and distribute an application without targeting a specific DBMS.
Database drivers are then added to link the application to the client’s choice of
DBMS. For example, a microcomputer running under Windows can use ODBC to
access an Oracle DBMS running on a Unix box.
There is considerable support for ODBC. Application vendors like it because they
do not have to write and maintain code for each DBMS; they can write one API.
DBMS vendors support ODBC because they do not have to convince application
vendors to support their product. For database systems managers, ODBC provides
vendor and platform independence. Although the ODBC API was originally
developed to provide database access from MS Windows products, many ODBC
driver vendors support Linux and Macintosh clients.
Most vendors also have their own SQL APIs. The problem is that most vendors, as a
means of differentiating their DBMS, have a more extensive native API protocol
and also add extensions to standard ODBC. The developer who is tempted to use
these extensions threatens the portability of the database.
ODBC introduces greater complexity and a processing overhead because it adds
two layers of software. As the following figure illustrates, an ODBC-compliant
application has additional layers for the ODBC API and ODBC driver. As a result,
ODBC APIs can never be as fast as native APIs.
ODBC layers
Application
ODBC API
ODBC driver manager
Service provider API
Driver for DBMS server
DBMS server
Embedded SQL
SQL can be used in two modes. First, SQL is an interactive query language and
database programming language. SELECT defines queries; INSERT, UPDATE, and
DELETE maintain a database. Second, any interactive SQL statement can be embedded
in an application program.
This dual-mode principle is a very useful feature. It means that programmers need
to learn only one database query language, because the same SQL statements apply
for both interactive queries and application statements. Programmers can also
interactively examine SQL commands before embedding them in a program, a
feature that can substantially reduce the time to write an application program.
Because SQL is not a complete programming language, however, it must be used
with a traditional programming language to create applications. Common complete
programming languages, such as PHP and Java, support embedded SQL. If you
need to write application programs using embedded SQL, you will need training in
both the application language and the details of how it communicates with SQL.
User-defined types
Versions of SQL prior to the SQL-99 specification had predefined data types, and
programmers were limited to selecting the data type and defining the length of
character strings. One of the basic ideas behind the object extensions of the SQL
standard is that, in addition to the normal built-in data types defined by SQL, user-
defined data types (UDTs) are available. A UDT is used like a predefined type, but
it must be set up before it can be used.
The future of SQL
Since 1986, developers of database applications have benefited from an SQL
standard, one of the most successful standardization stories in the software industry.
Although most database vendors have implemented proprietary extensions of SQL,
standardization has kept the language consistent, and SQL code is highly portable.
Standardization was relatively easy when focused on the storage and retrieval of
numbers and characters. Objects have made standardization more difficult.
Summary
Structured Query Language (SQL), a widely used relational database language, has
been adopted as a standard by ANSI and ISO. It is a data definition language (DDL),
data manipulation language (DML), and data control language (DCL). A base table
is an autonomous, named table. A view is a virtual table. A key is one or more
columns identified as such in the description of a table, an index, or a referential
constraint. SQL supports primary, foreign, and unique keys. Indexes accelerate data
access and ensure uniqueness. CREATE TABLE defines a new base table and specifies
primary, foreign, and unique key constraints. Numeric, string, date, or graphic data
can be stored in a column. BLOB and CLOB are data types for large fields. ALTER TABLE
adds one new column to a table or changes the status of a constraint. DROP TABLE
removes a table from a database. CREATE VIEW defines a view, which can be used to
restrict access to data, report derived data, store commonly executed queries, and
convert data. A view is created dynamically. DROP VIEW deletes a view. CREATE INDEX
defines an index, and DROP INDEX deletes one.
Ambiguous references to column names are avoided by qualifying a column name
with its table name. A table or view can be given a temporary name that remains
current for a query. SQL has four data manipulation statements—SELECT, INSERT,
UPDATE, and DELETE. INSERT adds one or more rows to a table. UPDATE modifies a
table by changing one or more rows. DELETE removes one or more rows from a
table. SELECT provides powerful interrogation facilities. The product of two tables is
a new table consisting of all rows of the first table concatenated with all possible
rows of the second table. Join creates a new table from two existing tables by
matching on a column common to both tables. A subquery is a query within a query.
A correlated subquery differs from a simple subquery in that the inner query is
evaluated multiple times rather than once.
SQL’s aggregate functions increase its retrieval power. GROUP BY supports grouping
of rows that have the same value for a specified column. The REGEXP clause
supports pattern matching. SQL includes scalar functions that can be used in
arithmetic expressions, data conversion, or data extraction. Nulls cause problems
because they can represent several situations—unknown information, inapplicable
information, no value supplied, or value undefined. Remember, a null is not a blank
or zero. The SQL commands GRANT and REVOKE support data security. GRANT
authorizes a user to perform certain SQL operations, and REVOKE removes a user’s
authority. The system catalog, which describes a relational database, can be queried
using SELECT. SQL can be used as an interactive query language and as embedded
commands within an application programming language. Natural language
processing (NLP) and open database connectivity (ODBC) are extensions to
relational technology that enhance its usefulness.
Key terms and concepts
Aggregate functions
ALTER TABLE
ANSI
Base table
Complete database language
Complete programming language
Composite key
Connectivity
Correlated subquery
CREATE FUNCTION
CREATE INDEX
CREATE PROCEDURE
CREATE TABLE
CREATE TRIGGER
CREATE VIEW
Cursor
Data control language (DCL)
Data definition language (DDL)
Data manipulation language (DML)
Data types
DELETE
DROP INDEX
DROP TABLE
DROP VIEW
Embedded SQL
Foreign key
GRANT
GROUP BY
Index
INSERT
ISO
Join
Key
Natural language processing (NLP)
Null
Open database connectivity (ODBC)
Primary key
Product
Qualified name
Referential integrity rule
REVOKE
Routine
Scalar functions
Security
SELECT
Special registers
SQL
Subquery
Synonym
System catalog
Temporary names
Unique key
UPDATE
View
References and additional readings
Date, C. J. 2003. An introduction to database systems. 8th ed. Reading, MA: Addison-
Wesley.
Exercises
1. Why is it important that SQL was adopted as a standard by ANSI and ISO?
2. What does it mean to say “SQL is a complete database language”?
3. Is SQL a complete programming language? What are the implications of your
answer?
4. List some operational advantages of a DBMS.
5. What is the difference between a base table and a view?
6. What is the difference between a primary key and a unique key?
7. What is the purpose of an index?
8. Consider the three choices for the ON DELETE clause associated with the foreign
key constraint. What are the pros and cons of each option?
9. Specify the data type (e.g., DECIMAL(6,2)) you would use for the following
columns:
a. The selling price of a house
b. A telephone number with area code
c. Hourly temperatures in Antarctica
d. A numeric customer code
e. A credit card number
f. The distance between two cities
g. A sentence using Chinese characters
h. The number of kilometers from the Earth to a given star
i. The text of an advertisement in the classified section of a newspaper
j. A basketball score
k. The title of a CD
l. The X-ray of a patient
m. A U.S. zip code
n. A British or Canadian postal code
o. The photo of a customer
p. The date a person purchased a car
q. The time of arrival of an e-mail message
r. The number of full-time employees in a small business
s. The text of a speech
t. The thickness of a layer on a silicon chip
10. How could you add two columns to an existing table? Precisely describe the
steps you would take.
11. What is the difference between DROP TABLE and deleting all the rows in a table?
12. Give some reasons for creating a view.
13. When is a view created?
14. Write SQL code to create a unique index on firm name for the share table
defined in Chapter 3. Would it make sense to create a unique index for PE ratio in
the same table?
15. What is the difference between product and join?
16. What is the difference between an equijoin and a natural join?
17. You have a choice between executing two queries that will both give the same
result. One is written as a simple subquery and the other as a correlated subquery.
Which one would you use and why?
18. What function would you use for the following situations?
a. Computing the total value of a column
b. Finding the minimum value of a column
c. Counting the number of customers in the customer table
d. Displaying a number with specified precision
e. Reporting the month part of a date
f. Displaying the second part of a time
g. Retrieving the first five characters of a city’s name
h. Reporting the distance to the sun in feet
19. Write SQL statements for the following:
a. Let Hui-Tze query and add to the nation table.
b. Give Lana permission to update the phone number column in the customer
table.
c. Remove all of William’s privileges.
d. Give Chris permission to grant other users authority to select from the
address table.
e. Find the name of all tables that include the word sale.
f. List all the tables created last year.
g. What is the maximum length of the column city in the ClassicModels
database? Why do you get two rows in the response?
h. Find all columns that have a data type of SMALLINT.
20. What are the two modes in which you can use SQL?
21. How do procedural programming languages and SQL differ in the way they
process data? How is this difference handled in an application program? What is
embedded SQL?
Reference 2: SQL Playbook
Play so that you may be serious.
Anacharsis (c. 600 b.c.e.)
The power of SQL
SQL is a very powerful retrieval language, and most novices underestimate its
capability. Furthermore, mastery of SQL takes considerable practice and exposure
to a wide variety of problems. This reference serves the dual purpose of revealing
the full power of SQL and providing an extensive set of diverse queries. To make
this reference easier to use, we have named each of the queries to make them easy to
remember.
Lacroix and Pirotte have defined 66 queries for testing a query language’s power,
and their work is the foundation of this chapter. Not all their queries are included;
some have been kept for the end-of-chapter exercises. Also, many of the queries
have been reworded to various extents. Because these queries are more difficult than
those normally used in business decision making, they provide a good test of your
SQL skills. If you can answer all of them, then you are a proficient SQL
programmer.
You are encouraged to formulate and test your answer to a query before reading the
suggested solution. A q has been prefixed to all entity and table names to distinguish
them from entities and tables used in earlier problems. The going gets tough after
the first few queries, and you can expect to spend some time formulating and testing
each one, but remember, this is still considerably faster than using a procedural
programming language, such as Java, to write the same queries.
The data model accompanying Lacroix and Pirotte’s queries is shown in the
following figure. You will notice that the complexity of the real world has been
reduced; a sale and a delivery are assumed to have only one item. The
corresponding relational model is shown in the following table, and some data that
can be used to test your SQL queries appear at the end of this chapter.
A slow full toss
In cricket, a slow full toss is the easiest ball to hit. The same applies to this simple
query.
• Find the names of employees in the Marketing department.
SELECT empfname FROM qemp WHERE deptname = 'Marketing';
Skinning a cat
Queries often can be solved in several ways. For this query, four possible solutions
are presented.
• Find the items sold by a department on the second floor.
A join of qsale and qdept is possibly the most obvious way to answer this query.
SELECT DISTINCT itemname FROM qsale, qdept
WHERE qdept.deptname = qsale.deptname
AND deptfloor = 2;
Another simple approach is to use an IN clause. First find all the departments on the
second floor, and then find a match in qsale. The subquery returns a list of all
departments on the second floor. Then use the IN clause to find a match on
department name in qsale. Notice that this is the most efficient form of the query.
SELECT DISTINCT itemname FROM qsale
WHERE deptname IN
(SELECT deptname FROM qdept WHERE deptfloor = 2);
A third approach is a correlated subquery. Conceptually, you can think of this correlated query as stepping through qsale one
row at a time. The subquery is executed for each row of qsale, and if there is a match
for the current value of qsale.deptname, the current value of itemname is listed.
SELECT DISTINCT itemname FROM qsale
WHERE deptname IN
(SELECT deptname FROM qdept
WHERE qdept.deptname = qsale.deptname
AND deptfloor = 2);
A fourth approach is an existence test. Think of this query as stepping through each
row of qsale and evaluating whether the existence test is true. Remember, EXISTS
returns true if there is at least one row in the inner query for which the condition is true.
The difference between a correlated subquery and existence testing is that a
correlated subquery returns either the value of a column or an empty table, while
EXISTS returns either true or false. It may help to think of this version of the query
as, “Select item names from sale such that there exists a department relating them to
the second floor.”
SELECT DISTINCT itemname FROM qsale
WHERE EXISTS
(SELECT * FROM qdept
WHERE qsale.deptname = qdept.deptname
AND deptfloor = 2);
Dividing
• Find the items sold by all departments on the second floor.
This is the relational algebra divide or the SQL double NOT EXISTS encountered in
Chapter 5. You can think of this query as
Select items from sales such that there does not exist a department on the second
floor that does not sell this item.
However, it is easier to apply the SQL template described in Chapter 5, which is
SELECT target1 FROM target
WHERE NOT EXISTS
(SELECT * FROM source
WHERE NOT EXISTS
(SELECT * FROM target-source
WHERE target-source.target# = target.target#
AND target-source.source# = source.source#));
Some additional code needs to be added to handle the restriction to items on the
second floor. The query becomes
SELECT DISTINCT itemname FROM qitem
WHERE NOT EXISTS
(SELECT * FROM qdept WHERE deptfloor = 2
AND NOT EXISTS
(SELECT * FROM qsale
WHERE qsale.itemname = qitem.itemname
AND qsale.deptname = qdept.deptname));
A double divide!
• List the items delivered by every supplier that delivers all items of type N.
SQL programmers who tackle this query without blinking an eye are superheroes.
This is a genuinely true-blue, tough query because it is two divides. There is an
inner divide that determines the suppliers that deliver all items of type N. The SQL
for this query is
SELECT * FROM qspl
WHERE NOT EXISTS
(SELECT * FROM qitem WHERE itemtype = 'N'
AND NOT EXISTS
(SELECT * FROM qdel
WHERE qdel.itemname = qitem.itemname
AND qdel.splno = qspl.splno));
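To make this intermediate result available to the next step, it can be saved as a view with the name qspn used below; a sketch:
CREATE VIEW qspn AS
SELECT * FROM qspl
WHERE NOT EXISTS
	(SELECT * FROM qitem WHERE itemtype = 'N'
	AND NOT EXISTS
		(SELECT * FROM qdel
		WHERE qdel.itemname = qitem.itemname
		AND qdel.splno = qspl.splno));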
Then there is an outer divide that determines which of the suppliers (returned by the
inner divide) provide all these items. If we call the result of the inner query qspn, the
SQL for the outer query is
SELECT DISTINCT itemname FROM qdel del
WHERE NOT EXISTS
(SELECT * FROM qspn
WHERE NOT EXISTS
(SELECT * FROM qdel
WHERE qdel.itemname = del.itemname
AND qdel.splno = qspn.splno));
A slam dunk
• Find the suppliers that deliver compasses.
A simple query to recover from the double divide. It is a good idea to include
supplier number since splname could possibly be nonunique.
SELECT DISTINCT qspl.splno, splname FROM qspl, qdel
WHERE qspl.splno = qdel.splno AND itemname = 'Compass';
A more general approach is to find suppliers that have delivered compasses and
more than one item (i.e., COUNT(DISTINCT itemname) > 1). This means they deliver at
least one item other than compasses. Note that the GROUP BY clause includes supplier
number to cope with the situation where two suppliers have the same name. The
DISTINCT itemname clause must be used to guard against multiple deliveries of
compasses from the same supplier.
SELECT DISTINCT qdel.splno, splname FROM qspl, qdel
WHERE qdel.splno = qspl.splno
AND qdel.splno IN
(SELECT splno FROM qdel WHERE itemname = 'Compass')
GROUP BY qdel.splno, splname HAVING COUNT(DISTINCT itemname) > 1;
The more general approach enables you to solve queries, such as Find suppliers
that deliver three items other than compasses, by changing the HAVING clause to
COUNT(DISTINCT itemname) > 3.
Because DISTINCT colname is not supported by MS Access, for that DBMS you must
first create a view containing distinct splno, itemname pairs and then substitute the name
of the view for qdel and drop the DISTINCT clause in the COUNT statement.
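A sketch of that workaround follows; the view name delpair is illustrative, and in Access the view can equally be defined as a saved query.
CREATE VIEW delpair AS
SELECT DISTINCT splno, itemname FROM qdel;

SELECT delpair.splno, splname FROM qspl, delpair
WHERE delpair.splno = qspl.splno
AND delpair.splno IN
	(SELECT splno FROM delpair WHERE itemname = 'Compass')
GROUP BY delpair.splno, splname HAVING COUNT(itemname) > 1;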
Minus and divide
• List the departments that have not recorded a sale for all the items of type N.
This query has two parts: an inner divide and an outer minus. The inner query finds
departments that have sold all items of type N. These are then subtracted from all
departments to leave only those that have not sold all items of type N.
SELECT deptname FROM qdept WHERE deptname NOT IN
(SELECT deptname FROM qdept
WHERE NOT EXISTS
(SELECT * FROM qitem WHERE itemtype = 'N'
AND NOT EXISTS
(SELECT * FROM qsale
WHERE qsale.deptname = qdept.deptname AND
qsale.itemname = qitem.itemname)));
This query can also be written as shown next. Observe how NOT IN functions like
NOT EXISTS. We will use this variation on divide with some of the upcoming queries.
SELECT DISTINCT deptname FROM qdel del1
WHERE NOT EXISTS
(SELECT * FROM qdel del2
WHERE del2.deptname = del1.deptname
AND itemname NOT IN
(SELECT itemname FROM qsale
WHERE deptname = del1.deptname));
A difficult pairing
• List the supplier-department pairs where the department sells all items delivered
to it by the supplier.
This query is yet another variation on divide. An additional complication is that you
have to match the department name and item name of sales and deliveries.
SELECT splname, deptname FROM qdel del1, qspl
WHERE del1.splno = qspl.splno
AND NOT EXISTS
(SELECT * FROM qdel
WHERE qdel.deptname = del1.deptname
AND qdel.splno = del1.splno
AND itemname NOT IN
(SELECT itemname FROM qsale
WHERE qsale.deptname = del1.deptname));
Restricted divide
• Who are the suppliers that deliver all the items of type N?
A slight variation on the standard divide to restrict consideration to the type N items.
SELECT splno, splname FROM qspl
WHERE NOT EXISTS
(SELECT * FROM qitem WHERE itemtype = 'N'
AND NOT EXISTS
(SELECT * FROM qdel
WHERE qdel.splno = qspl.splno
AND qdel.itemname = qitem.itemname));
A NOT IN variation on divide
• List the suppliers that deliver only the items sold by the Books department.
This query may be rewritten as Select suppliers for which there does not exist a
delivery that does not include the items sold by the Books department. Note the use
of the IN clause to limit consideration to those suppliers that have made a delivery.
Otherwise, a supplier that has never delivered an item will be reported as delivering
only the items sold by the Books department.
SELECT splname FROM qspl
WHERE splno IN (SELECT splno FROM qdel)
AND NOT EXISTS
(SELECT * FROM qdel
WHERE qdel.splno = qspl.splno
AND itemname NOT IN
(SELECT itemname FROM qsale
WHERE deptname = 'Books'));
Second, report all the suppliers that deliver all the items of type B to the departments
designated in the previously created view.
SELECT splname FROM qspl
WHERE NOT EXISTS (SELECT * FROM qitem
WHERE itemtype = 'B'
AND NOT EXISTS (SELECT * FROM qdel
WHERE qdel.itemname = qitem.itemname
AND qdel.splno = qspl.splno
AND deptname IN (SELECT deptname FROM v32)));
Triple divide with an intersection
• List the suppliers that deliver all the items of type B to the departments that also
sell all the items of type N.
Defeat this by dividing—that word again—the query into parts. First, identify the
departments that sell all items of type N, and save as a view.
CREATE VIEW v33a AS
(SELECT deptname FROM qdept
WHERE NOT EXISTS (SELECT * FROM qitem
WHERE itemtype = 'N'
AND NOT EXISTS (SELECT * FROM qsale
WHERE qsale.deptname = qdept.deptname
AND qsale.itemname = qitem.itemname)));
Next, select the departments to which all items of type B are delivered, and save as a
view.
CREATE VIEW v33b AS
(SELECT deptname FROM qdept
WHERE NOT EXISTS (SELECT * FROM qitem
WHERE itemtype = 'B'
AND NOT EXISTS (SELECT * FROM qdel
WHERE qdel.deptname = qdept.deptname
AND qdel.itemname = qitem.itemname)));
• Now, find the suppliers that supply all items of type B to the departments that
appear in both views.
SELECT splname FROM qspl
WHERE NOT EXISTS (SELECT * FROM qitem
WHERE itemtype = 'B'
AND NOT EXISTS (SELECT * FROM qdel
WHERE qdel.splno = qspl.splno
AND qdel.itemname = qitem.itemname
AND EXISTS
(SELECT * FROM v33a
WHERE qdel.deptname = v33a.deptname)
AND EXISTS
(SELECT * FROM v33b
WHERE qdel.deptname = v33b.deptname)));
A three-table join
• For each item, give its type, the departments that sell the item, and the floor
locations of these departments.
A three-table join is rather easy after all the divides.
SELECT qitem.itemname, itemtype, qdept.deptname, deptfloor
FROM qitem, qsale, qdept
WHERE qsale.itemname = qitem.itemname
AND qsale.deptname = qdept.deptname;
Something to all
• List the items that are delivered only by the suppliers that deliver something to all
the departments.
This is a variation on query 25 with the additional requirement, handled by the
innermost query, that the supplier delivers to all departments. Specifying that all
departments get a delivery means that the number of departments to which a supplier
delivers (GROUP BY splno HAVING COUNT (DISTINCT deptname)) must equal the number of
departments that sell items (SELECT COUNT(*) FROM qdept WHERE deptname NOT IN
('Management', 'Marketing', 'Personnel', 'Accounting', 'Purchasing')).
SELECT DISTINCT itemname FROM qdel del1
WHERE NOT EXISTS
(SELECT * FROM qdel del2
WHERE del2.itemname = del1.itemname
AND splno NOT IN
(SELECT splno FROM qdel
GROUP BY splno HAVING COUNT(DISTINCT deptname) =
(SELECT COUNT(*) FROM qdept
WHERE deptname NOT IN ('Management',
'Marketing', 'Personnel',
'Accounting', 'Purchasing'))));
Intersection (AND)
• List the items delivered by Nepalese Corp. and sold in the Navigation department.
The two parts to the query—the delivery and the sale—are intersected using AND.
SELECT DISTINCT itemname FROM qdel, qspl
WHERE qdel.splno = qspl.splno
AND splname = 'Nepalese Corp.'
AND itemname IN
(SELECT itemname FROM qsale
WHERE deptname = 'Navigation');
Union (OR)
• List the items delivered by Nepalese Corp. or sold in the Navigation department.
The two parts are the same as query 41, but the condition is OR rather than AND.
SELECT DISTINCT itemname FROM qdel, qspl
WHERE qdel.splno = qspl.splno
AND splname = 'Nepalese Corp.'
OR itemname IN
(SELECT itemname FROM qsale
WHERE deptname = 'Navigation');
Intersection/union
• List the departments selling items of type E that are delivered by Nepalese Corp.
and/or are sold by the Navigation department.
The inner query handles the and/or with OR. Remember OR can mean that the items
are in both tables or one table. The outer query identifies the departments that
receive the delivered items, which satisfy the inner query.
SELECT DISTINCT deptname FROM qsale
WHERE itemname IN
(SELECT qitem.itemname FROM qitem, qdel, qspl
WHERE qitem.itemname = qdel.itemname
AND qdel.splno = qspl.splno
AND splname = 'Nepalese Corp.'
AND itemtype = 'E')
OR itemname IN
(SELECT itemname FROM qsale
WHERE deptname = 'Navigation');
Complex counting
• What is the number of different items delivered by each supplier that delivers to all
departments?
First, determine the suppliers that deliver to each department, excluding
administrative departments. The inner query handles this part of the main query.
Second, count the number of different items delivered by the suppliers identified by
the inner query.
SELECT splname, COUNT(DISTINCT itemname)
FROM qdel del1, qspl
WHERE del1.splno = qspl.splno
AND NOT EXISTS
(SELECT * FROM qdept
WHERE deptname NOT IN
(SELECT deptname FROM qdel
WHERE qdel.splno = del1.splno)
AND deptname NOT IN
('Management', 'Marketing', 'Personnel',
'Accounting', 'Purchasing'))
GROUP BY splname;
Advanced summing
• List suppliers that deliver a total quantity of items of types C and N that is
altogether greater than 100.
The difficult part of this query, and it is not too difficult, is to write the condition for
selecting items of type C and N. Notice that the query says C and N, but don’t
translate this to (itemtype = 'C' AND itemtype = 'N') because an item cannot be both types
simultaneously. The query means that for any delivery, the item should be type C or
type N.
SELECT qdel.splno, splname FROM qspl, qdel, qitem
WHERE qspl.splno = qdel.splno
AND qitem.itemname = qdel.itemname
AND (itemtype = 'C' OR itemtype = 'N')
GROUP BY qdel.splno, splname HAVING SUM(delqty) > 100;
Another possible interpretation is that you have to find the total average salary after
you determine the average salary for each department. To do this, you would first
create a view containing the average salary for each department (see query 52) and
then find the average of these average salaries.
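A sketch of that view, assuming the employee table qemp carries a salary column named empsalary (the names v52 and dpavgsal follow the text):
CREATE VIEW v52 (deptname, dpavgsal) AS
SELECT deptname, AVG(empsalary)
FROM qemp
GROUP BY deptname;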
SELECT AVG(dpavgsal) FROM v52;
Detailed averaging
• What is the average delivery quantity of items of type N delivered by each supplier
to each department (given that the supplier delivers items of type N to the
department)?
Now we take averaging to three levels—supplier, department, item. You can take
averaging to as many levels as you like.
SELECT qdel.splno, splname, deptname, qdel.itemname, AVG(delqty)
FROM qdel, qspl, qitem
WHERE qdel.splno = qspl.splno
AND qdel.itemname = qitem.itemname
AND itemtype = 'N'
GROUP BY qdel.splno, splname, deptname, qdel.itemname;
Counting pairs
• What is the number of supplier-department pairs in which the supplier delivers at
least one item of type E to the department?
First, find all the supplier-department pairs. Without DISTINCT, you would get
duplicates, which would make the subsequent count wrong.
CREATE VIEW v60 AS
(SELECT DISTINCT splno, deptname FROM qdel, qitem
WHERE qdel.itemname = qitem.itemname
AND itemtype = 'E');
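Second, the required number can then be taken directly as a count of the rows in this view:
-- number of supplier-department pairs with at least one type E delivery
SELECT COUNT(*) FROM v60;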
No Booleans
• Is it true that all the departments that sell items of type C are located on the third
floor? (The result can be a Boolean 1 or 0, meaning yes or no.)
SQL cannot return true or false; it always returns a table. But you can get close to
Boolean results by using counts. If we get a count of zero for the following query,
there are no departments that are not on the third floor that sell items of type C.
SELECT COUNT(*) FROM qdept
WHERE deptfloor <> 3
AND EXISTS
(SELECT * FROM qsale, qitem
WHERE qsale.itemname = qitem.itemname
AND qsale.deptname = qdept.deptname
AND itemtype = 'C');
Then you need to check that departments on the third floor sell items of type C. If
the second query returns a nonzero value, then it is true that departments that sell
items of type C are located on the third floor.
SELECT COUNT(*) FROM qdept
WHERE deptfloor = 3
AND EXISTS
(SELECT * FROM qsale, qitem
WHERE qsale.itemname = qitem.itemname
AND qsale.deptname = qdept.deptname
AND itemtype = 'C');
Summary
Although SQL is a powerful retrieval language, alas, formulation of common
business queries is not always trivial.
Key terms and concepts
SELECT
Exercises
Here is an opportunity to display your SQL mastery.
1. List the green items of type C.
2. Find the names of green items sold by the Recreation department.
3. Of those items delivered, find the items not delivered to the Books department.
4. Find the departments that have never sold a geopositioning system.
5. Find the departments that have sold compasses and at least two other items.
6. Find the departments that sell at least four items.
7. Find the employees who are in a different department from their manager’s
department.
8. Find the employees whose salary is less than half that of their manager’s.
9. Find the green items sold by no department on the second floor.
10. Find the items delivered by all suppliers.
11. Find the items delivered by at least two suppliers.
12. Find the items not delivered by Nepalese Corp.
13. Find the items sold by at least two departments.
14. Find the items delivered for which there have been no sales.
15. Find the items delivered to all departments except Administration.
16. Find the name of the highest-paid employee in the Marketing department.
17. Find the names of employees who make 10 percent less than the average salary.
18. Find the names of employees with a salary greater than the minimum salary
paid to a manager.
19. Find the names of suppliers that do not supply compasses or geopositioning
systems.
20. Find the number of employees with a salary under $10,000.
21. Find the number of items of type A sold by the departments on the third floor.
22. Find the number of units sold of each item.
23. Find the green items delivered by all suppliers.
24. Find the supplier that delivers no more than one item.
25. Find the suppliers that deliver to all departments.
26. Find the suppliers that deliver to all the departments that also receive deliveries
from supplier 102.
27. Find the suppliers that have never delivered a compass.
28. Find the type A items delivered by São Paulo Manufacturing.
29. Find, for each department, its floor and the average salary in the department.
30. If Nancy’s boss has a boss, who is it?
31. List each employee and the difference between his or her salary and the average
salary of his or her department.
32. List the departments on the second floor that contain more than one employee.
33. List the departments on the second floor.
34. List the names of employees who earn more than the average salary of
employees in the Shoe department.
35. List the names of items delivered by each supplier. Arrange the report by
supplier name, and within supplier name, list the items in alphabetical order.
36. List the names of managers who supervise only one person.
37. List the number of employees in each department.
38. List the green items delivered by exactly one supplier.
39. Whom does Todd manage?
40. List the departments that have not sold all green items.
41. Find the first name of Sophie’s boss.
42. Find the names of employees who make less than half their manager’s salary.
43. List the names of each manager and their employees arranged by manager’s
name and employee’s name within manager.
44. Who earns the lowest salary?
45. List the names of employees who earn less than the minimum salary of the
Marketing department.
46. List the items sold by every department to which all brown items have been
delivered.
47. List the department and the item where the department is the only seller of that
item.
48. List the brown items sold by the Books department and delivered by All
Seasons.
49. Which department has the highest average salary?
50. List the supplier that delivers all and only brown items.
Section 3: Advanced Data Management
Advancement only comes with habitually doing more than you are asked.
Gary Ryan Blair
The wiring of the world has given us ubiquitous networks and broadened the scope
of issues that data management must now embrace. In everyday life, you might use a
variety of networks (e.g., the Internet, 4G, WiFi, and Bluetooth) to gain access to
information wherever you might be and whatever time it is. As a result, data
managers need to be concerned with both the spatial and temporal dimensions of
data. In a highly connected world, massive amounts of data are exchanged every
minute between computers to enable a high level of global integration in economic
and social activity. XML has emerged as the foundation for data exchange across
many industries. It is a core technology for global economic development. On the
social side, every day people generate millions of messages, photos, and videos that
fuel services such as Twitter, Flickr, and YouTube. Many organizations are
interested in analyzing these data streams to learn about social trends, customers’
opinions, and entrepreneurial opportunities. Organizations need skills in collecting,
processing, and interpreting the myriad data flows that intersect with their everyday
business. Organizational or business intelligence is the general term for describing
an enterprise’s efforts to collect, store, process, and interpret data from internal and
external sources. It is the first stage of data-driven decision making. Once data have
been captured and stored in an organizational repository, there are several
techniques that can be applied.
In a world awash with data, visualization has become increasingly important for
enabling executives to make sense of the business environment, to identify
problems, and to highlight potential new directions. Text mining is a popular tool for
trying to make sense of data streams emanating from tweets and blogs. The many
new sources of data and their high growth rate have made it more difficult to
support real-time analysis of the torrents of data that might contain valuable insights
for an organization’s managers. Fortunately, the Hadoop distributed file system
(HDFS) and MapReduce offer a breakthrough in storing and processing data that
enables faster processing at lower cost. Furthermore, the open source statistics and
graphics package R provides a common foundation for handling text mining, data
visualization, HDFS, and MapReduce. It has become another component of the data
manager’s toolkit.
The section covers the following topics.
• Spatial and temporal data management
• XML
• Organizational intelligence
• Introduction to R
• Data visualization
• Text mining
• HDFS and MapReduce
11. Spatial and Temporal Data Management
Nothing puzzles me more than time and space; and yet nothing troubles me less, as I
never think about them.
Charles Lamb, 1810.
Learning objectives
Students completing this chapter will
✤ be able to define and use a spatial database;
✤ be familiar with the issues surrounding the management of temporal data.
Introduction
The introduction of ubiquitous networks and smartphones has led to the advent of
location-based services. Customers expect information delivered based on, among
other things, where they are. For example, a person touring the historic German city
of Regensburg could receive information about its buildings and parks via her
mobile phone in the language of her choice. Her smartphone will determine her
location and then select from a database details of her immediate environment. Data
managers need to know how to manage the spatial data necessary to support
location-based services.
Some aspect of time is an important fact to remember for many applications. Banks,
for example, want to remember what dates customers made payments on their loans.
Airlines need to recall for the current and future days who will occupy seats on each
flight. Thus, the management of time-varying, or temporal, data would be assisted
if a database management system had built-in temporal support. As a result, there
has been extensive research on temporal data models and DBMSs for more than a
decade. The management of temporal data is another skill required of today’s data
management specialist.
The Open Geospatial Consortium, Inc. (OGC) is a nonprofit international
organization developing standards for geospatial and location-based services. Its
goal is to create open and extensible software application programming interfaces
for geographic information systems (GIS) and other geospatial technologies.
DBMS vendors (e.g., MySQL) have implemented some of OGC’s recommendations
for adding spatial features to SQL. MySQL is gradually adding further GIS features
as it develops its DBMS.
Managing spatial data
A spatial database is a data management system for the collection, storage,
manipulation, and output of spatially referenced information. Also known as a
geographic information system (GIS), it is an extended form of DBMS. Geospatial
modeling is based on three key concepts: theme, geographic object, and map.
A theme refers to data describing a particular topic (e.g., scenic lookouts, rivers,
cities) and is the spatial counterpart of an entity. When a theme is presented on a
screen or paper, it is commonly seen in conjunction with a map. Color may be used
to indicate different themes (e.g., blue for rivers and black for roads). A map will
usually have a scale, legend, and possibly some explanatory text.
A geographic object is an instance of a theme (e.g., a river). Like an instance of an
entity, it has a set of attributes. In addition, it has spatial components that can
describe both geometry and topology. Geometry refers to the location-based data,
such as shape and length, and topology refers to spatial relationships among
objects, such as adjacency. Management of spatial data requires some additional data
types to represent a point, line, and region.
Generic spatial data types
Data type Dimensions Example
Point 0 Scenic lookout
Line 1 River
Region 2 County
Consider the case where we want to create a database to store some details of
political units. A political unit can have many boundaries. The United States, for
example, has a boundary for the continental portion, one for Alaska, one for
Hawaii, and many more to include places such as American Samoa. In its computer
form, a boundary is represented by an ordered set of line segments (a path).
Data model for political units
MySQL has data types for storing geometric data.
MySQL geometric data types
• Point: POINT(x y). A point in space (e.g., a city’s center).
• LineString: LINESTRING(x1 y1, x2 y2, …). A sequence of points with linear interpolation between points.
• Polygon: POLYGON((x1 y1, x2 y2, …), (x1 y1, x2 y2, …)). A polygon (e.g., a boundary), which has a single exterior boundary and zero or more interior boundaries.
The data model in the figure is mapped to MySQL using the statements listed in the
political unit database definition table. In the following definition of the database’s
tables, note two things. The boundpath column of boundary is defined with a data type
of POLYGON. The cityloc column of city is defined as a POINT. Otherwise, there is little
new in the set of statements to create the tables.
Political unit database definition
CREATE TABLE political_unit (
unitname VARCHAR(30) NOT NULL,
unitcode CHAR(2),
unitpop DECIMAL(6,2),
PRIMARY KEY(unitcode));
CREATE TABLE boundary (
boundid INTEGER,
boundpath POLYGON NOT NULL,
unitcode CHAR(2),
PRIMARY KEY(boundid),
CONSTRAINT fk_boundary_polunit FOREIGN KEY(unitcode)
REFERENCES political_unit(unitcode));
CREATE TABLE city (
cityname VARCHAR(30),
cityloc POINT NOT NULL,
unitcode CHAR(2),
PRIMARY KEY(unitcode,cityname),
CONSTRAINT fk_city_polunit FOREIGN KEY(unitcode)
REFERENCES political_unit(unitcode));
The two sets of values for the column boundary define the boundaries of the Republic
of Ireland and Northern Ireland. Because of the coarseness of this sample mapping,
the Republic of Ireland has only one boundary. A finer-grained mapping would have
multiple boundaries, such as one to include the Arran Islands off the west coast near
Galway. Each city’s location is defined by a point or pair of coordinates.
GeomFromText is a MySQL function to convert text into a geometry data form.
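The INSERT statements take the following general form; the coordinates and values below are illustrative rather than the book's actual data.
INSERT INTO political_unit VALUES ('Republic of Ireland', 'ie', 4.59);
-- a boundary is a closed ring of grid coordinates
INSERT INTO boundary VALUES
	(1, GeomFromText('POLYGON((8 4, 9 7, 6 9, 4 7, 5 4, 8 4))'), 'ie');
-- a city is located by a single point
INSERT INTO city VALUES ('Dublin', GeomFromText('POINT(9 6)'), 'ie');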
Skill builder
Create the three tables for the example and insert the rows listed in the preceding
SQL code.
MySQL includes a number of geometry functions and operators for processing
spatial data that simplify the writing of queries. All calculations are done assuming a
flat or planar surface. In other words, MySQL uses Euclidean geometry. For
illustrative purposes, just a few of the geometric functions are described.
Some MySQL geometric functions
Function Description
X(Point) The x-coordinate of a point
Y(Point) The y-coordinate of a point
GLength(LineString) The length of a linestring
NumPoints(LineString) The number of points in a linestring
Area(Polygon) The area of a polygon
Once the database is established, we can do some queries to gain an understanding
of the additional capability provided by the spatial additions. Before starting,
examine the scale on the map and note that one grid unit is about 37.5 kilometers (23
miles). This also means that the area of one grid unit is 1406 km² (526 square miles). An area query, for example, reports the following result:
Area (km²)
71706
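The query producing this figure is not shown above. A sketch, assuming it reports the area of the Republic of Ireland by summing the Area of the unit's boundaries and scaling by the grid-unit area:
SELECT SUM(Area(boundpath))*1406 AS 'Area (km^2)'
-- Area() returns grid units squared; multiply by 1406 to convert to km²
FROM political_unit, boundary
WHERE political_unit.unitcode = boundary.unitcode
AND unitname = 'Republic of Ireland';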
• How far, as the crow flies, is it from Sligo to Dublin?
MySQL has not yet implemented a distance function to measure how far it is
between two points. However, we can get around this limitation by using the
function GLength to compute the length of a linestring. We will not get into the
complications of how the linestring is created, but you should understand that a one
segment linestring is created, with its end points being the locations of the two cities.
Notice also that the query is based on a self-join.
SELECT GLength(LineString(orig.cityloc,dest.cityloc))*37.5
AS 'Distance (kms)'
FROM city orig, city dest
WHERE orig.cityname = 'Sligo'
AND dest.cityname = 'Dublin';
Distance (kms)
167.71
• What is the westernmost city in Ireland?
The first thing to recognize is that by convention the west is shown on the left side
of the map, which means the westernmost city will have the smallest x-coordinate.
SELECT west.cityname FROM city west
WHERE NOT EXISTS
(SELECT * FROM city other WHERE X(other.cityloc) < X(west.cityloc));
cityname
Limerick
Galway
Skill builder
1. What is the area of Northern Ireland? Because Northern Ireland is part of the
United Kingdom and miles are still often used to measure distances, report the
area in square miles.
2. What is the direct distance from Belfast to Londonderry in miles?
3. What is the northernmost city of the Republic of Ireland?
Geocoding using Google Maps
To get the latitude and longitude of a location, you can use Google Maps by
following this procedure.
1. Go to maps.google.com.
2. Enter your address, zip code, airport code, or whatever you wish to geocode.
3. Click on the link that says ‘link to this page.’ It is on the right side, just above the
upper right corner of the map.
4. The address bar (URL) will change. Copy the full link. For example:
http://maps.google.com/maps?
f=q&source=s_q&hl=en&geocode=&q=ahn&aq=&sll=37.0625,-95.677068&sspn=48.82258
Ahn,+1010+Ben+Epps+Dr,+Athens,+Georgia+30605&ll=33.953791,-83.323746&spn=0.025
5. The latitude and longitude are contained in the URL following &ll. In this case,
latitude is: 33.953791 and longitude: -83.323746.
R-tree
Conventional DBMSs were developed to handle one-dimensional data (numbers and
text strings). In a spatial database, points, lines, and rectangles may be used to
represent the location of retail outlets, roads, utilities, and land parcels. Such data
objects are represented by sets of x, y or x, y, z coordinates. Other applications
requiring the storage of spatial data include computer-aided design (CAD),
robotics, and computer vision.
The B-tree, often used to store data in one-dimensional databases, can be extended
to n dimensions, where n ≥ 2. This extension of the B-tree is called an R-tree. As
well as storing pointers to records in the sequence set, an R-tree also stores
boundary data for each object. For a two-dimensional application, the boundary data
are the x and y coordinates of the lower-left and upper-right corners of the minimum
bounding rectangle, the smallest possible rectangle enclosing the object. The index
set, which contains pointers to lower-level nodes as in a B-tree, also contains data
for the minimum bounding rectangle enclosing the objects referenced in the node.
The data in an R-tree permit answers to such problems as Find all pizza stores
within 5 miles of the dorm.
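MySQL exposes R-tree indexing through SPATIAL indexes on geometry columns. The following sketch uses a hypothetical store table and query rectangle; a SPATIAL index requires a NOT NULL geometry column and, in the MySQL versions current when this book was written, the MyISAM storage engine.
CREATE TABLE store (
	storename	VARCHAR(30) NOT NULL,
	storeloc	POINT NOT NULL,
	PRIMARY KEY (storename),
	SPATIAL INDEX (storeloc))
	ENGINE = MyISAM;

-- Find stores falling within a query rectangle (e.g., the area around
-- the dorm). The R-tree behind the SPATIAL index prunes the search
-- using minimum bounding rectangles.
SELECT storename FROM store
WHERE MBRContains(
	GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'),
	storeloc);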
How an R-tree stores data is illustrated in the following figure, which depicts five
two-dimensional objects labeled A, B, C, D, and E. Each object is represented by its
minimum bounding rectangle (the objects could be some other form, such as a
circle). Data about these objects are stored in the sequence set. The index set
contains details of two intermediate rectangles: X and Y. X fully encloses A, B, and
C. Y fully encloses D and E.
An R-tree with sample spatial data
An example demonstrates how these data are used to accelerate searching. Using a
mouse, a person could outline a region on a map displayed on a screen. The
minimum bounding rectangle for this region would then be calculated and the
coordinates used to locate geographic objects falling within the minimum boundary.
Because an R-tree is an index, geographic objects falling within a region can be
found rapidly. In the following figure, the drawn region (it has a bold border)
completely covers object E. The R-tree software would determine that the required
object falls within intermediate region Y, and thus takes the middle node at the next
level of the R-tree. Then, by examining coordinates in this node, it would determine
that E is the required object.
Searching an R-tree
As the preceding example illustrates, an R-tree has the same index structure as a B-
tree. An R-tree stores data about n-dimensional objects in each node, whereas a B-
tree stores data about a one-dimensional data type in each node. Both also store
pointers to the next node in the tree (the index set) or the record (the sequence set).
This short introduction to spatial data has given you some idea of how the relational
model can be extended to support geometric information. Most of the major DBMS
vendors support management of spatial data.
Managing temporal data
With a temporal database, stored data have an associated time period indicating
when the item was valid or stored in the database. By attaching a timestamp to data,
it becomes possible to store and identify different database states and support
queries comparing these states. Thus, you might be able to determine the number of
seats booked on a flight by 3 p.m. on January 21, 2011, and compare that to the
number booked by 3 p.m. on January 22, 2011.
To appreciate the value of a temporal database, you need to know the difference
between transaction and valid time and that bitemporal data combines both valid and
transaction time.
• Transaction time is the timestamp applied by the system when data are
entered and cannot be changed by an application. It can be applied to a
particular item or row. For example, when changing the price of a product, the
usual approach would be to update the existing product row with the new price.
The old price would be lost unless it was stored explicitly. In contrast, with a
temporal database, the old and new prices would automatically have separate
timestamps. In effect, an additional row is inserted to store the new price and
the time when the insert occurred.
• Valid time is the actual time at which an item was or will be a valid or true
value. Consider the case where a firm plans to increase its prices on a specified
date. It might post new prices some time before their effective date. Valid time
can be changed by an application.
• Bitemporal data records both the valid time and transaction time for a fact.
It usually requires four extra columns to record the upper and lower bounds
for valid time and transaction time.
Valid time records when the change takes effect, and transaction time records when
the change was entered. Storing transaction time is essential for database recovery
because the DBMS can roll back the database to a previous state. Valid time provides
a historical record of the state of the database. Both forms of time are necessary for
a temporal database.
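A sketch of how these four extra columns might look for a hypothetical product price history; the table and column names are invented, and a true temporal DBMS would maintain such columns automatically rather than leaving them to the application.
CREATE TABLE product_price (
	prodid		INTEGER,
	price		DECIMAL(8,2),
	valid_from	DATE,		-- when the price takes effect
	valid_to	DATE,		-- when the price ceases to apply
	trans_from	DATETIME,	-- when this row was recorded
	trans_to	DATETIME,	-- when this row was superseded
	PRIMARY KEY (prodid, valid_from, trans_from));

-- What was recorded, as of 3 p.m. on January 21, 2011, as the price
-- of product 1 for February 1, 2011?
SELECT price FROM product_price
WHERE prodid = 1
AND '2011-02-01' BETWEEN valid_from AND valid_to
AND '2011-01-21 15:00:00' BETWEEN trans_from AND trans_to;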
As you might expect, a temporal database will be somewhat larger than a traditional
database because data are never discarded and new timestamped values are inserted
so that there is a complete history of the values of an instance (e.g., the price of a
product since it was first entered in the database). Thus, you can think of most of the
databases we have dealt with previously as snapshots of a particular state of the
database, whereas a temporal database is a record of all states of the database. As
disk storage becomes increasingly cheaper and firms recognize the value of
business intelligence, we are likely to see more attention paid to temporal database
technology.
Times remembered
SQL supports several different data types for storing numeric values (e.g., integer
and float), and a temporal database also needs a variety of data types for storing
time values. The first level of distinction is to determine whether the time value is
anchored or unanchored. Anchored time has a defined starting point (e.g., October
15, 1582), and unanchored time is a block of time with no specified start (e.g., 45
minutes).
Types of temporal data
Planetary data
Planet  Rotational period (hours)  Orbital period (years)
Mercury  1407.51  0.24
Venus  –5832.44  0.62
Recording the share’s price requires further consideration. If the intention is to
record every change in price, then a timestamp is required as there will be multiple
price changes in a day, and even in an hour, in busy trading. If there is less interest
in the volatility of the stock and only the closing price for each day is of interest,
then a date would be recorded.
You can add additional attributes to tables in a relational database to handle temporal
data, but doing so does not make it a temporal database. The problem is that current
relational implementations do not have built-in functions for querying time-varying
data. Such queries can also be difficult to specify in SQL.
A temporal database has additional features for temporal data definition, constraint
specification, data manipulation, and querying. A step in this direction is the
development of TSQL (Temporal Structured Query Language). Based on SQL,
TSQL supports querying of temporal databases without specifying time-varying
criteria. SQL:2011, the seventh revision of the SQL standard, has improved support
for temporal data.
Summary
Spatial database technology stores details about items that have geometric features.
It supports additional data types to describe these features, and has functions and
operators to support querying. The new data types support point, line, and region
values. Spatial technology is likely to develop over the next few years to support
organizations offering localized information services.
Temporal database technology provides data types and functions for managing
time-varying data. Transaction time and valid time are two characteristics of
temporal data. Times can be anchored or unanchored and measured as an instant or
as an interval.
Key terms and concepts
Anchored time
Geographic object
Geographic information system (GIS)
Interval
Map
R-tree
Spatial data
Temporal data
Theme
Transaction time
Valid time
References and additional readings
Gibson, Rich, and Schuyler Erle. 2006. Google maps hacks. Sebastopol, CA:
O’Reilly.
Gregersen, H., and C. S. Jensen. 1999. Temporal entity-relationship models—a
survey. IEEE Transactions on Knowledge and Data Engineering 11 (3):464–497.
Rigaux, P., M. O. Scholl, and A. Voisard. 2002. Spatial databases: with application
to GIS, The Morgan Kaufmann series in data management systems. San
Francisco: Morgan Kaufmann Publishers.
Exercises
1. What circumstances will lead to increased use of spatial data?
2. A national tourist bureau has asked you to design a database to record details of
items of interest along a scenic road. What are some of the entities you might
include? How would you model a road? Draw the data model.
3. Using the map of the Iberian Peninsula in the following figure, populate the
spatial database with details of Andorra, Portugal, and Spain. Answer the
following questions.
a. What is the direct distance, or bee line, from Lisbon to Madrid?
b. What is the farthest Spanish city from Barcelona?
c. Imagine you get lost in Portugal and your geographic positioning system
(GPS) indicates that your coordinates are (3,9). What is the nearest city?
d. Are there any Spanish cities west of Braga?
e. What is the area of Portugal?
f. What is the southernmost city of Portugal?
4. Redesign the data model for political units assuming that your relational
database does not support point and polygon data types.
5. For more precision and to meet universal standards, it would be better to use
latitude and longitude to specify points and paths. You should also recognize that
Earth is a globe and not flat. How would you enter latitude and longitude in
MySQL?
6. When might you use transaction time and when might you use valid time?
7. Design a database to report basketball scores. How would you record time?
8. A supermarket chain has asked you to record what goods customers buy during
each visit. In other words, you want details of each shopping basket. It also wants
to know when each purchase was made. Design the database.
9. An online auction site wants to keep track of the bids for each item that a
supplier sells. Design the database.
10. Complete the Google maps lab exercise listed on the book’s Web site.
Iberian Peninsula
12. XML: Managing Data Exchange
Words can have no single fixed meaning. Like wayward electrons, they can spin away
from their initial orbit and enter a wider magnetic field. No one owns them or has a
proprietary right to dictate how they will be used.
David Lehman, End of the Word, 1991
Learning objectives
Students completing this chapter will be able to
✤ define the purpose of XML;
✤ create an XML schema;
✤ code data in XML format;
✤ create an XML stylesheet;
✤ discuss data management options for XML documents.
Introduction
There are four central problems in data management: capture, storage, retrieval,
and exchange. The focus for most of this book has been on storage (i.e., data
modeling) and retrieval (i.e., SQL). Now it is time to consider capture and exchange.
Capture has always been an important issue, and the guiding principle is to capture
data once in the cheapest possible manner.
SGML
The Standard Generalized Markup Language (SGML) was designed to reduce the
cost and increase the efficiency of document management. Its child, XML, has
essentially replaced SGML. For example, the second edition of the Oxford English
Dictionary was specified in SGML, and the third edition is stored in XML format.
XML enables a shift of processing from the server to the browser. At present, most
processing has to be done by the server because that is where knowledge about the
data is stored. The browser knows nothing about the data and therefore can only
present but not process. However, when XML is implemented, the browser can take
on processing that previously had to be handled by the server.
Imagine that you are selecting a shirt from a mail-order catalog. The merchant’s
Web server sends you data on 20 shirts (100 Kbytes of text and images) with prices
in U.S. dollars. If you want to see the prices in euros, the calculation will be done by
the server, and the full details for the 20 shirts retransmitted (i.e., another 100 Kbytes
are sent from the server to the browser). However, once XML is in place, all that
needs to be sent from the server to the browser is the conversion rate of U.S. dollars
to euros and a program to compute the conversion at the browser end. In most
cases, less data will be transmitted between a server and browser when XML is in
place. Consequently, widespread adoption of XML will reduce network traffic.
Execution of HTML and XML code
HTML
• Retrieve shirt data with prices in USD.
• Retrieve shirt data with prices in EUR.
XML
• Retrieve shirt data with prices in USD.
• Retrieve conversion rate of USD to EUR.
• Retrieve Java program to convert currencies.
• Compute prices in EUR.
XML can also make searching more efficient and effective. At present, search
engines look for matching text strings, and consequently return many links that are
completely irrelevant. For instance, if you are searching for details on the Nomad
speaker system, and specify “nomad” as the sought text string, you will get links to
many items that are of no interest (e.g., The Fabulous Nomads Surf Band).
Searching will be more precise when you can specify that you are looking for a
product name that includes the text “nomad.” The search engine can then confine its
attention to text contained within the tags <productname> and </productname>, assuming these
tags are the XML standard for representing product names.
The major expected gains from the introduction of XML are
• Store once and format many ways—Data stored in XML format can be
extracted and reformatted for multiple presentation styles (e.g., printed report,
DVD).
• Hardware and software independence—One format is valid for all systems.
• Capture once and exchange many times—Data are captured as close to the
source as possible and never again (i.e., no rekeying).
• Accelerated targeted searching—Searches are more precise and faster
because they use XML tags.
• Less network congestion—The processing load shifts from the server to the
browser.
XML language design
XML lets developers design application-specific vocabularies. To create a new
language, designers must agree on three things:
• The allowable tags
• The rules for nesting tagged elements
• Which tagged elements can be processed
The first two, the language’s vocabulary and structure, are typically defined in an
XML schema. Developers use the XML schema to understand the meaning of tags so
they can write software to process an XML file.
XML tags describe meaning, independent of the display medium. An XML
stylesheet, another set of rules, defines how an XML file is automatically formatted
for various devices. This set of rules is called an Extensible Stylesheet Language
(XSL). Stylesheets allow data to be rendered in a variety of ways, such as Braille or
audio for visually impaired people.
XML schema
An XML schema (or just schema for brevity) is an XML file associated with an XML
document that informs an application how to interpret markup tags and valid
formats for tags. The advantage of a schema is that it leads to standardization.
Consistently named and defined tags create conformity and support organizational
efficiency. They avoid the confusion and loss of time when the meaning of data is
not clear. Also, when validation information is built into a schema, some errors are
detected before data are exchanged.
XML does not require the creation of a schema. If a document is well formed, XML
will interpret it correctly. A well-formed document follows XML syntax and has
tags that are correctly nested.
A schema is a very strict specification, and any errors will be detected when parsing.
A schema defines:
• The names and contents of all elements that are permissible in a certain
document
• The structure of the document
• How often an element may appear
• The order in which the elements must appear
• The type of data the element can contain
DOM
The Document Object Model (DOM) is the model underlying XML. It is based on a
tree (i.e., it directly supports one-to-one and one-to-many, but not many-to-many
relationships). A document is modeled as a hierarchical collection of nodes that
have parent/child relationships. The node is the primary object and can be of
different types (such as document, element, attribute, text). Each document has a
single document node, which has no parent, and zero or more children that are
element nodes. It is a good practice to create a visual model of the XML document
and then convert this to a schema, which is XML’s formal representation of the
DOM.
At this point, an example is the best way to demonstrate XML, schema, and DOM
concepts. We will use the familiar CD problem that was introduced in Chapter 3. In
keeping with the style of this text, we define a minimal amount of XML to get you
started, and then more features are added once you have mastered the basics.
CD library case
The CD library case gradually develops, over several chapters, a data model for
recording details of a CD collection, culminating in the model at the end of Chapter
6. Unfortunately, we cannot quickly convert this final model to an XML document
model, because a DOM is based on a tree model. Thus, we must start afresh.
The model, in this case, is based on the observation that a CD library has many
CDs, and a CD has many tracks.
CD library tree data model
CD library schema (cdlib.xsd)
1 <?xml version="1.0" encoding="UTF-8"?>
2 <xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
3 <!--CD library-->
4 <xsd:element name="cdlibrary">
5   <xsd:complexType>
6     <xsd:sequence>
7       <xsd:element name="cd" type="cdType" minOccurs="1" maxOccurs="unbounded"/>
8     </xsd:sequence>
9   </xsd:complexType>
10 </xsd:element>
11
12 <!--CD-->
13 <xsd:complexType name="cdType">
14   <xsd:sequence>
15     <xsd:element name="cdid" type="xsd:string"/>
16     <xsd:element name="cdlabel" type="xsd:string"/>
17     <xsd:element name="cdtitle" type="xsd:string"/>
18     <xsd:element name="cdyear" type="xsd:integer"/>
19     <xsd:element name="track" type="trackType" minOccurs="1" maxOccurs="unbounded"/>
20   </xsd:sequence>
21 </xsd:complexType>
22
23 <!--Track-->
24 <xsd:complexType name="trackType">
25   <xsd:sequence>
26     <xsd:element name="trknum" type="xsd:integer"/>
27     <xsd:element name="trktitle" type="xsd:string"/>
28     <xsd:element name="trklen" type="xsd:time"/>
29   </xsd:sequence>
30 </xsd:complexType>
31 </xsd:schema>
There are several things to observe about the schema:
• The XML declaration {1} states that this is an XML document; its encoding
attribute (i.e., encoding="UTF-8") specifies what form of Unicode is used (in this
case the 8-bit form).
• The XSD Schema namespace is declared {2}.
• Comments are placed inside the tag pair <!-- and --> {3}.
• The CD library is defined {4–10} as a complex element type, which
essentially means that it can have embedded elements, in this case a sequence of
CDs.
• A sequence is a series of child elements embedded in a parent, as illustrated
by a CD library containing a sequence of CDs {7}, and a CD containing
elements of CD identifier, label, and so forth {15–20}. The order of a sequence
must be maintained by any XML document based on the schema.
• A sequence can have a specified range of elements. In this case, there must
be at least one CD (minOccurs="1") but there is no upper limit (maxOccurs=
"unbounded") on how many CDs there can be in the library {7}.
• An element that has a child (e.g., cdlibrary, which is at the 1 end of a 1:m) or
possesses attributes (e.g., track) is termed a complex element type.
• A CD is represented by a complex element type {13–20}, and has the name
cdType {13}.
• The element cd is defined by specifying the name of the complex type (i.e.,
cdType) containing its specification {7}.
• A track is represented by a complex type because it contains elements of
track number, title, and length {24–30}. The name of this complex type is
trackType {24}.
• Notice the reference within the definition of cd to the complex type trackType,
used to specify the element track {19}.
• Simple types (e.g., cdid and cdyear) do not contain any elements, so the type of
data they store must be defined: cdid is a text string, and cdyear is an
integer.
The purpose of a schema is to define the contents and structure of an XML file. It is
also used to verify that an XML file has a valid structure and that all elements in the
XML file are defined in the schema.
Some XML editors, such as Oxygen, can create a visual view of a schema.
A visual depiction of a schema as created by Oxygen
Some common XML schema data types are string, boolean, anyURI, decimal, float,
integer, time, and date; their meanings are obvious.
We can now use the recently defined CDlibrary schema to describe a small CD
library containing the CD information given in the following table.
Data for a small CD library
CD 1: Id A2 1325; Label Atlantic; Title Pyramid; Year 1960
  Track 1 Vendome 2:30
  Track 2 Pyramid 10:46
CD 2: Id D136705; Label Verve; Title Ella Fitzgerald; Year 2000
  Track 1 A tisket, a tasket 2:37
  Track 2 Vote for Mr. Rhythm 2:25
  Track 3 Betcha nickel 2:52
The XML for describing the CD library follows. There are several things to
observe:
• All XML documents begin with an XML declaration.
• The declaration immediately following the XML declaration identifies the
root element of the document (i.e., cdlibrary) and the schema (i.e., cdlib.xsd).
• Details of a CD are enclosed by the tags <cd> and </cd>.
• Details of a track are enclosed by the tags <track> and </track>.
XML for describing a CD (cdlib.xml)
<?xml version="1.0" encoding="UTF-8"?>
<cdlibrary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="cdlib.xsd">
<cd>
<cdid>A2 1325</cdid>
<cdlabel>Atlantic</cdlabel>
<cdtitle>Pyramid</cdtitle>
<cdyear>1960</cdyear>
<track>
<trknum>1</trknum>
<trktitle>Vendome</trktitle>
<trklen>00:02:30</trklen>
</track>
<track>
<trknum>2</trknum>
<trktitle>Pyramid</trktitle>
<trklen>00:10:46</trklen>
</track>
</cd>
<cd>
<cdid>D136705</cdid>
<cdlabel>Verve</cdlabel>
<cdtitle>Ella Fitzgerald</cdtitle>
<cdyear>2000</cdyear>
<track>
<trknum>1</trknum>
<trktitle>A tisket, a tasket</trktitle>
<trklen>00:02:37</trklen>
</track>
<track>
<trknum>2</trknum>
<trktitle>Vote for Mr. Rhythm</trktitle>
<trklen>00:02:25</trklen>
</track>
<track>
<trknum>3</trknum>
<trktitle>Betcha nickel</trktitle>
<trklen>00:02:52</trklen>
</track>
</cd>
</cdlibrary>
Skill builder
1. Use a browser to access this book's Web site, link to the XML section,
and click on cdlib.xml. You will see how the browser displays XML. Investigate
what happens when you click minus (–) and then plus (+).
2. Again using Firefox, save the XML code displayed by your browser (View Page
Source > Save As) and open it in a text editor.
3. Now, add details of the CD displayed in the following CD data table to this XML
code. Save it as cdlibv2.xml, and open it with Firefox.
CD data
Id 314 517 173-2; Label Verve; Title The essential Charlie Parker; Year 1992
Track Title Length
1 Now’s the time 3:01
2 If I should lose you 2:46
3 Mango mangue 2:53
4 Bloomdido 3:24
5 Star eyes 3:28
XSL
As you now know from the prior exercise, the browser display of XML is not
particularly useful. What is missing is a stylesheet that tells the browser how to
display an XML file. The eXtensible Stylesheet Language (XSL) is a language for
defining the rendering of an XML file. An XSL document defines the rules for
presenting an XML document’s data. XSL is an application of XML, and an XSL file
is also an XML file.
The power of XSL is demonstrated by applying the stylesheet that follows to the
preceding XML.
Result of applying a stylesheet to CD library data
Complete List of Songs
Pyramid, Atlantic, 1960 [A2 1325]
1 Vendome 00:02:30
2 Pyramid 00:10:46
Ella Fitzgerald, Verve, 2000 [D136705]
1 A tisket, a tasket 00:02:37
2 Vote for Mr. Rhythm 00:02:25
3 Betcha nickel 00:02:52
Stylesheet for displaying an XML file of CD data (cdlib.xsl)
1 <?xml version="1.0" encoding="UTF-8"?>
2 <xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
3 <xsl:output encoding="UTF-8" indent="yes" method="html" version="1.0" />
4 <xsl:template match="/">
5 <html>
6 <head>
7 <title> Complete List of Songs </title>
8 </head>
9 <body>
10 <h1> Complete List of Songs </h1>
11 <xsl:apply-templates select="cdlibrary" />
12 </body>
13 </html>
14 </xsl:template>
15 <xsl:template match="cdlibrary">
16 <xsl:for-each select="cd">
17 <br/>
18 <font color="maroon">
19 <xsl:value-of select="cdtitle" />
20 ,
21 <xsl:value-of select="cdlabel" />
22 ,
23 <xsl:value-of select="cdyear" />
24 [
25 <xsl:value-of select="cdid" />
26 ] </font>
27 <br/>
28 <table>
29 <xsl:for-each select="track">
30 <tr>
31
32 <td align="left">
33 <xsl:value-of select="trknum" />
34 </td>
35 <td>
36 <xsl:value-of select="trktitle" />
37 </td>
38 <td align="center">
39 <xsl:value-of select="trklen" />
40 </td>
41 </tr>
42 </xsl:for-each>
43 </table>
44 <br/>
45 </xsl:for-each>
46 </xsl:template>
47 </xsl:stylesheet>
To use a stylesheet with an XML file, you must add a line of code to point to the
stylesheet file. In this case, you add the following:
<?xml-stylesheet type="text/xsl" href="cdlib.xsl" media="screen"?>
as the second line of cdlib.xml (i.e., it appears before <cdlibrary … >). The added line
of code points to cdlib.xsl as the stylesheet. This means that when the browser loads
cdlib.xml, it uses the contents of cdlib.xsl to determine how to render the contents of
cdlib.xml.
We now need to examine the contents of cdlib.xsl so that you can learn some basics
of creating XSL commands. You will soon notice that all XSL commands are
preceded by xsl:.
• Tell the browser it is processing an XML file {1}
• Specify that the file is a stylesheet {2}
• Specify a template, which identifies which elements should be processed and
how they are processed. The match attribute {4} indicates the template applies
to the source node. Process the template {11} defined in the file {15–45}. A
stylesheet can specify multiple templates to produce different reports from the
same XML input.
• Specify a template to be applied when the XSL processor encounters the
<cdlibrary> node {15}.
• Create an outer loop for processing each CD {16–44}.
• Define the values to be reported for each CD (i.e., title, label, year, and id)
{19, 21, 23, 25}. The respective XSL commands select the values. For example,
<xsl:value-of select="cdtitle" /> specifies selection of cdtitle.
• Create an inner loop for processing the tracks on a particular CD {29–41}.
• Present the track data in tabular form using HTML table commands
interspersed with XSL {28–42}.
Skill builder
1. Use a browser to access this book’s Web site, link to the XML section, and
download cdlib.xsl into a directory on your machine.
2. Edit a copy of cdlib.xml by inserting the following as the second line of
cdlib.xml:
<?xml-stylesheet type="text/xsl" href="cdlib.xsl" media="screen"?>
3. Save the file as cdlibv3.xml in the same directory as cdlib.xsl and open it with
your browser.
4. Make a similar change to cdlibv2.xml and open it with your browser.
Converting XML
There are occasions when there is a need to convert an XML file:
• Transformation: conversion from one XML vocabulary to another (e.g.,
between financial languages FPML and finML)
• Manipulation: reordering, filtering, or sorting parts of a document
• Rendering in another language: rendering the XML file using another
format
You have already seen how XSL can be used to transform XML for rendering as
HTML. The original XSL has been split into three languages:
• XSLT for transformation and manipulation
• XSL-FO (XSL Formatting Objects) for rendition
• XPath for accessing the structure of an XML file
For a data management course, this is as far as you need to go with learning about
XSL. Just remember that you have only touched the surface. To become proficient
in XML, you will need an entire course on the topic.
XPath for navigating an XML document
XPath is a navigation language for an XML document. It defines how to select nodes
or sets of nodes in a document. The first step to understanding XPath is to know
about the different types of nodes. In the following XML document, the document
node is <cdlibrary> {1}, <trktitle>Vendome</trktitle> {9} is an example of an element node,
and <track length="00:02:30"> {7} is an instance of an attribute node.
An XML document
1 <cdlibrary>
2  <cd>
3   <cdid>A2 1325</cdid>
4   <cdlabel>Atlantic</cdlabel>
5   <cdtitle>Pyramid</cdtitle>
6   <cdyear>1960</cdyear>
7   <track length="00:02:30">
8    <trknum>1</trknum>
9    <trktitle>Vendome</trktitle>
10  </track>
11  <track length="00:10:46">
12   <trknum>2</trknum>
13   <trktitle>Pyramid</trktitle>
14  </track>
15 </cd>
16 </cdlibrary>
A family of nodes
Each element and attribute has one parent node. In the preceding XML document, cd
is the parent of cdid, cdlabel, cdtitle, cdyear, and track. Element nodes may have zero or more
child nodes. Thus cdid, cdlabel, cdtitle, cdyear, and track are the children of cd. Ancestor
nodes are the parent, parent’s parent, and so forth of a node. For example, cd and
cdlibrary are ancestors of cdtitle. Similarly, we have descendant nodes, which are the
children, children’s children, and so on of a node. The descendants of cd include cdid,
cdtitle, track, trknum, and trktitle. Finally, we have sibling nodes, which are nodes that share
a parent. In the sample document, cdid, cdlabel, cdtitle, cdyear, and track are siblings.
Navigation examples
The examples in the following table give you an idea of how you can use XPath to
extract data from an XML document. Our preference is to answer such queries using
SQL, but if your data are in XML format, then XPath provides a means of
interrogating the file.
XPath examples
Example Result
/cdlibrary/cd[1] Selects the first CD
//trktitle Selects all the titles of all tracks
/cdlibrary/cd[last() -1] Selects the second last CD
/cdlibrary/cd[last()]/track[last()]/trklen The length of the last track on the last CD
/cdlibrary/cd[cdyear=1960] Selects all CDs released in 1960
/cdlibrary/cd[cdyear>1950]/cdtitle Titles of CDs released after 1950
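XPath also selects attributes, which are prefixed with @. As a small addition to the table, the following expression selects the length attribute of every track in the preceding document:
//track/@length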
XQuery for querying an XML document
XQuery is a query language for XML, and thus it plays a role similar to the one SQL plays
for a relational database. It is used for finding and extracting elements and attributes
of an XML document. It builds on XPath's method for specifying navigation through
the elements of an XML document. As well as being used for querying, XQuery can
also be used for transforming XML to XHTML.
The first step is to specify the location of the XML file. In this case, we will use the
cdlib.xml file, which is stored on the web site for this book. Using XPath notation it
is relatively straightforward to list some values in an XML file:
• List the titles of CDs.
doc("http://richardtwatson.com/xml/cdlib.xml")/cdlibrary/cd/cdtitle
XQuery commands can also be written using an SQL-like structure called
FLWOR, an easy way to remember 'For, Let, Where, Order by, Return.' Here is an
example.
• List the titles of tracks longer than 5 minutes
for $x in doc("http://richardtwatson.com/xml/cdlib.xml")/cdlibrary/cd/track
where $x/trklen > "00:05:00"
order by $x/trktitle
return $x
If you want to just report the data without the tags, use return data($x).
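As a further sketch, a FLWOR query that lists only the titles, without tags, of CDs released after 1990 might be written as follows (using the same file and element names as above):
for $cd in doc("http://richardtwatson.com/xml/cdlib.xml")/cdlibrary/cd
where $cd/cdyear > 1990
order by $cd/cdtitle
return data($cd/cdtitle)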
XML and databases
XML is more than a document-processing technology. It is also a powerful tool for
data management. For database developers, XML can be used to facilitate middle-
tier data integration and schemas. Most of the major DBMS producers have XML-
centric extensions for their product lines.
Many XML documents are stored for the long term, because they are an important
repository of organizational memory. As a data exchange language, XML is also a means
of moving data between databases, which creates a need for tools for exporting and
importing XML.
XML documents can be stored in the same format as you would store a word
processing or HTML file: You just place them in an appropriately named folder.
File systems, however, have limitations that become particularly apparent when a
large number of files need to be stored, as in the corporate setting.
What is needed is a DBMS for storing, retrieving, and manipulating XML
documents. Such a DBMS should:
• Be able to store many documents
• Be able to store large documents
• Support access to portions of a document (e.g., the data for a single CD in a
library of 20,000 CDs)
• Enable concurrent access to a document but provide locking mechanisms to
prevent the lost update problem, which is discussed later in this book
• Keep track of different versions of a document
• Integrate data from other sources (e.g., insert the results of an SQL query
formatted as an XML fragment into an XML document)
Two possible solutions for XML document management are a relational database
management system (RDBMS) or an XML database.
RDBMS
An XML document can be stored within an RDBMS. Storing an intact XML document
as a CLOB (character large object) is a sensible strategy if the XML document contains static content that
will only be updated by replacing the entire document. Examples include written text
such as articles, advertisements, books, or legal contracts. These document-centric
files (e.g., articles and legal contracts) are retrieved and updated in their entirety.
For more dynamic data-centric XML files (e.g., orders, price lists, airline
schedules), the RDBMS must be extended to support the structure of the data so that
portions of the document (e.g., an element such as the price for a product) can be
retrieved and updated.
XML database
A second approach is to build a special-purpose XML database, such as Tamino.
Querying XML
ExtractValue is the MySQL function for retrieving data from a node. It requires
specification of the XML fragment to be retrieved and the XPath expression to
locate the required node.
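The examples that follow query a MySQL table named cdlib with a single text column doc, with one <cd> fragment stored per row. The chapter does not show how this table is loaded, so the following is only a minimal sketch of a possible setup, using data from the earlier cdlib.xml:
-- assumed setup (not shown in the chapter): one <cd> fragment per row
CREATE TABLE cdlib (doc TEXT);
INSERT INTO cdlib (doc) VALUES
('<cd>
  <cdid>A2 1325</cdid>
  <cdlabel>Atlantic</cdlabel>
  <cdtitle>Pyramid</cdtitle>
  <cdyear>1960</cdyear>
  <track><trknum>1</trknum><trktitle>Vendome</trktitle><trklen>00:02:30</trklen></track>
  <track><trknum>2</trknum><trktitle>Pyramid</trktitle><trklen>00:10:46</trklen></track>
</cd>');
Additional CDs would be inserted as additional rows.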
• Report the title of the first CD in the library
SELECT ExtractValue(doc,'/cd/cdtitle[1]') FROM cdlib;
ExtractValue
Pyramid
• Report the title of the first track on the first CD.
SELECT ExtractValue(doc,'//cd[1]/track[1]/trktitle') FROM cdlib;
ExtractValue
Vendome
Updating XML
UpdateXML replaces a single fragment of XML with another fragment. It requires
specification of the XML fragment to be replaced and the XPath expression to locate
the required node.
• Change the title for the CD with identifier A2 1325 to Stonehenge
UPDATE cdlib SET doc = UpdateXML(doc,'//cd[cdid="A2 1325"]/cdtitle', '<cdtitle>Stonehenge</cdtitle>');
You can repeat the previous query to find the title of the first CD to verify the
update.
Generating XML
A set of user-defined functions (UDF) has been developed to convert the results of
an SQL query into XML. The xql_element function is used to define the name and
value of the XML element to be reported. Here is an example.
• List in XML format the name of all stocks with a PE ratio greater than 14.
SELECT xql_element('firm', shrfirm) FROM share WHERE shrpe > 14;
Query output
<firm>Canadian Sugar</firm>
<firm>Freedonia Copper</firm>
<firm>Sri Lankan Gold</firm>
The preceding is just a brief example of what can be done. Read the documentation
on UDF to learn how to handle more elaborate transformations.
Conclusion
XML has two main roles. The first is to facilitate the exchange of data between
organizations and within those organizations that do not have integrated systems. Its
second purpose is to support exchange between servers.
Mastery of XML is well beyond the scope of a single chapter. Indeed, it is a book-
length topic, and hundreds of books have been written on XML. It is important to
remember that the prime goal of XML is to support data interchange. If you would
like to continue learning about XML, then consider the open content textbook
(en.wikibooks.org/wiki/XML), which was created by students and is under continual
revision. You might want to contribute to this book.
Summary
Electronic data exchange became more important with the introduction of the
Internet. SGML, a precursor of XML, defines the structure of documents. SGML’s
value derives from its reusability, flexibility, support for revision, and format
independence. XML, a derivative of SGML, is designed to support electronic
commerce and overcome some of the shortcomings of SGML. XML supports data
exchange by making information self-describing. It is a metalanguage because it is a
language for generating other languages (e.g., finML). It provides substantial gains
for the management and distribution of data. The XML language consists of an XML
schema, document object model (DOM), and XSL. A schema defines the structure of
a document and how an application should interpret XML markup tags. The DOM is
a tree-based data model of an XML document. XSL is used to specify a stylesheet for
displaying an XML document. XML documents can be stored in either an RDBMS or an
XML database.
Key terms and concepts
Document object model (DOM)
Document type definition (DTD)
Electronic data interchange (EDI)
Extensible markup language (XML)
Extensible stylesheet language (XSL)
Hypertext markup language (HTML)
Markup language
Occurrence indicators
Standard generalized markup language (SGML)
XML database
XML schema
References and additional readings
Anderson, R. 2000. Professional XML. Birmingham, UK; Chicago: Wrox Press.
Watson, R. T., and others. 2004. XML: managing data exchange:
http://en.wikibooks.org/wiki/XML_-_Managing_Data_Exchange.
Exercises
1. A business has a telephone directory that records the first and last name,
telephone number, and e-mail address of everyone working in the firm.
Departments are the main organizing unit of the firm, so the telephone directory
is typically displayed in department order, and shows for each department the
contact phone and fax numbers and e-mail address.
a. Create a hierarchical data model for this problem.
b. Define the schema.
c. Create an XML file containing some directory data.
d. Create an XSL file for a stylesheet and apply the transformation to the XML
file.
2. Create a schema for your university or college’s course bulletin.
3. Create a schema for a credit card statement.
4. Create a schema for a bus timetable.
5. Using the portion of ClassicModels that has been converted to XML, answer the
Year
Region Data 2010 2011 2012 Grand total
Asia Sum of hardware 97 115 102 314
Sum of software 83 78 55 216
Europe Sum of hardware 23 28 25 76
Sum of software 41 65 73 179
North America Sum of hardware 198 224 259 681
Sum of software 425 410 497 1,332
Total sum of hardware 318 367 386 1,071
Total sum of software 549 553 625 1,727
Drill-down
Region           Sales variance
Africa           105%
Asia             57%   -----→ drill down to nation level
Europe           122%
North America    97%
Pacific          85%
South America    163%

Nation (Asia)    Sales variance
China            123%
Japan            52%
India            87%
Singapore        95%
The hypercube
From the analyst’s perspective, a fundamental difference between MDDB and
RDBMS is the representation of data. As you know from data modeling, the
relational model is based on tables, and analysts must think in terms of tables when
they manipulate and view data. The relational world is two-dimensional. In contrast,
the hypercube is the fundamental representational unit of a MDDB. Analysts can
move beyond two dimensions. To envisage this change, consider the difference
between the two-dimensional blueprints of a house and a three-dimensional model.
The additional dimension provides greater insight into the final form of the
building.
A hypercube
Of course, on a screen or paper only two dimensions can be shown. This problem is
typically overcome by selecting an attribute of one dimension (e.g., North region)
and showing the other two dimensions (i.e., product sales by year). You can think of
the third dimension (i.e., region in this case) as the page dimension—each page of
the screen shows one region or slice of the cube.
A three-dimensional hypercube display
Logistic regression, for example, is suited to a nominal or ordinal dimension, such as
customer response (yes or no) to the level of advertising.
This brief introduction to multidimensionality modeling has demonstrated the
importance of distinguishing between types of dimensions and considering how the
form of a dimension (e.g., nominal or continuous) will affect the choice of analysis
tools. Because multidimensional modeling is a relatively new concept, you can
expect design concepts to evolve. If you become involved in designing an MDDB,
then be sure to review carefully current design concepts. In addition, it would be
wise to build some prototype systems, preferably with different vendor
implementations of the multidimensional concept, to enable analysts to test the
usefulness of your design.
Skill builder
A national cinema chain has commissioned you to design a multidimensional
database for its marketing department. What identifier and variable dimensions
would you select?
Multidimensional expressions (MDX)
MDX is a language for reporting data stored in a multidimensional database. On the
surface, it looks like SQL because the language includes SELECT, FROM, and
WHERE. The reality is that MDX is quite different. MDX works with cubes of data,
and the result of an MDX query is a cube, just as SQL produces a table. The basic
syntax of an MDX statement is
SELECT {member selection} ON columns
FROM [cube name]
Measures
Unit Sales
266,773
The first step in using MDX is to define a cube, which has dimensions and measures.
Dimensions are the categories for reporting measures. The dimensions of product
for a supermarket might be food, drink, and non-consumable. Measures are the
variables that typically report outcomes. For the supermarket example, the measures
could be unit sales, cost, and revenue. Once a cube is defined, you can write MDX
queries for it. If we want to see sales broken down by the product dimensions, we
would write
SELECT {[measures].[unit sales]} ON columns,
{[product].[all products].[food]} ON rows
FROM [sales]
In this section, we use the foodmart dataset to illustrate MDX. To start simple, we
just focus on a few dimensions and measures. We have a product dimension
containing three members: food, drink, and non-consumable, and we will use the
measures of unit sales and store sales. Here is the code to report the data for all
products.
SELECT {[measures].[unit sales],[measures].[store sales]} ON columns,
{[product].[all products]} ON rows
FROM [sales]
Measures
Product Unit Sales Store Sales
All Products 266,773 565,238.13
To examine the unit sales and store sales by product, use the following code:
SELECT {[measures].[unit sales], [measures].[store sales]} ON columns,
{[product].[all products],
[product].[all products].[drink],
[product].[all products].[food],
[product].[all products].[non-consumable]} ON rows
FROM [sales]
Measures
Product Unit Sales Store Sales
All Products 266,773 565,238.13
Drink 24,597 48,836.21
Food 191,940 409,035.59
Non-consumable 50,236 107,366.33
These few examples give you some idea of how to write an MDX query and the type
of output it can produce. In practice, most people will use a GUI for defining
queries. The structure of a MDDB makes it well suited to the select-and-click
generation of an MDX query.
Data mining
Data mining is the search for relationships and global patterns that exist in large
databases but are hidden in the vast amounts of data. In data mining, an analyst
combines knowledge of the data with advanced machine learning technologies to
discover nuggets of knowledge hidden in the data. Data mining software can find
meaningful relationships that might take years to find with conventional techniques.
The software is designed to sift through large collections of data and, by using
statistical and artificial intelligence techniques, identify hidden relationships. The
mined data typically include electronic point-of-sale records, inventory, customer
transactions, and customer records with matching demographics, usually obtained
from an external source. Data mining does not require the presence of a data
warehouse. An organization can mine data from its operational files or independent
databases. However, data mining in independent files will not uncover relationships
that exist between data in different files. Data mining will usually be easier and more
effective when the organization accumulates as much data as possible in a single
data store, such as a data warehouse. Recent advances in processing speeds and
lower storage costs have made large-scale mining of corporate data a reality.
Database marketing, a common application of data mining, is also one of the best
examples of the effective use of the technology. Database marketers use data mining
to develop, test, implement, measure, and modify tailored marketing programs. The
intention is to use data to maintain a lifelong relationship with a customer. The
database marketer wants to anticipate and fulfill the customer’s needs as they
emerge. For example, recognizing that a customer buys a new car every three or
four years and with each purchase gets an increasingly more luxurious car, the car
dealer contacts the customer during the third year of the life of the current car with a
special offer on its latest luxury model.
Data mining uses
There are many applications of data mining:
• Predicting the probability of default for consumer loan applications. Data
mining can help lenders reduce loan losses substantially by improving their
ability to predict bad loans.
• Reducing fabrication flaws in VLSI chips. Data mining systems can sift
through vast quantities of data collected during the semiconductor fabrication
process to identify conditions that cause yield problems.
• Predicting audience share for television programs. A market-share
prediction system allows television programming executives to arrange show
schedules to maximize market share and increase advertising revenues.
• Predicting the probability that a cancer patient will respond to radiation
therapy. By more accurately predicting the effectiveness of expensive medical
procedures, health care costs can be reduced without affecting quality of care.
• Predicting the probability that an offshore oil well is going to produce oil.
An offshore oil well may cost $30 million. Data mining technology can
increase the probability that this investment will be profitable.
• Identifying quasars from trillions of bytes of satellite data. This was one of
the earliest applications of data mining systems, because the technology was
first applied in the scientific community.
Data mining functions
Based on the functions they perform, five types of data mining functions exist:
Associations
An association function identifies affinities existing among the collection of items
in a given set of records. These relationships can be expressed by rules such as “72
percent of all the records that contain items A, B, and C also contain items D and E.”
Knowing that 85 percent of customers who buy a certain brand of wine also buy a
certain type of pasta can help supermarkets improve use of shelf space and
promotional offers. Discovering that fathers, on the way home on Friday, often grab
a six-pack of beer after buying some diapers, enabled a supermarket to improve
sales by placing beer specials next to diapers.
Sequential patterns
Sequential pattern mining functions identify frequently occurring sequences from
given records. For example, these functions can be used to detect the set of
customers associated with certain frequent buying patterns. Data mining might
discover, for example, that 32 percent of female customers within six months of
ordering a red jacket also buy a gray skirt. A retailer with knowledge of this
sequential pattern can then offer the red-jacket buyer a coupon or other enticement
to attract the prospective gray-skirt buyer.
Classifying
Classifying assigns records to predefined, mutually exclusive classes (e.g., types of
customers), such that the members of each class are as close as possible to
one another, and different classes are as far as possible from one another, where
distance is measured with respect to specific predefined variables. The classes
are defined before data analysis. Thus, based on sales, customers may be first
categorized as infrequent, occasional, and frequent. A classifier could be used to
identify those attributes, from a given set, that discriminate among the three types of
customers. For example, a classifier might identify frequent customers as those with
incomes above $50,000 and having two or more children. Classification functions
have been used extensively in applications such as credit risk analysis, portfolio
selection, health risk analysis, and image and speech recognition. Thus, when a new
customer is recruited, the firm can use the classifying function to determine the
customer’s sales potential and accordingly tailor its marketing to that person.
Clustering
Whereas classifying starts with predefined categories, clustering starts with just the
data and discovers the hidden categories. These categories are derived from the
data. Clustering divides a dataset into mutually exclusive groups such that the
members of each group are as close as possible to one another, and different groups
are as far as possible from one another, where distance is measured with respect to
all available variables. The goal of clustering is to identify categories. Clustering
could be used, for instance, to identify natural groupings of customers by
processing all the available data on them. Examples of applications that can use
clustering functions are market segmentation, discovering affinity groups, and
defect analysis.
Prediction
Prediction calculates the future value of a variable. For example, it might be used to
predict the revenue value of a new customer based on that person’s demographic
variables.
These various data mining techniques can be used together. For example, a sequence
pattern analysis could identify potential customers (e.g., red jacket leads to gray
skirt), and then classifying could be used to distinguish between those prospects who
are converted to customers and those who are not (i.e., did not follow the sequential
pattern of buying a gray skirt). This additional analysis should enable the retailer to
refine its marketing strategy further to increase the conversion rate of red-jacket
customers to gray-skirt purchasers.
Data mining technologies
Data miners use technologies that are based on statistical analysis and data
visualization.
Decision trees
Tree-shaped structures can be used to represent decisions and rules for the
classification of a dataset. As well as being easy to understand, tree-based models
are suited to selecting important variables and are best when many of the predictors
are irrelevant. A decision tree, for example, can be used to assess the risk of a
prospective renter of an apartment.
A decision tree
Genetic algorithms
Genetic algorithms are optimization techniques based on the concepts of biological
evolution, and use processes such as genetic combination, mutation, and natural
selection. Possible solutions for a problem compete with each other. In an
evolutionary struggle of the survival of the fittest, the best solution survives the
battle. Genetic algorithms are suited for optimization problems with many candidate
variables (e.g., candidates for a loan).
K-nearest-neighbor method
The nearest-neighbor method is used for clustering and classification. In the case of
clustering, the method first plots each record in n-dimensional space, where n
attributes are used in the analysis. Then, it adjusts the weights for each dimension to
cluster together data points with similar goal features. For instance, if the goal is to
identify customers who frequently switch phone companies, the k-nearest-neighbor
method would adjust weights for relevant variables (such as monthly phone bill and
percentage of non-U.S. calls) to cluster switching customers in the same
neighborhood. Customers who did not switch would be clustered some distance
apart.
Neural networks
A neural network, mimicking the neurophysiology of the human brain, can learn
from examples to find patterns in data and classify data. Although neural networks
can be used for classification, they must first be trained to recognize patterns in a
sample dataset. Once trained, a neural network can make predictions from new data.
Neural networks are suited to combining information from many predictor
variables; they work well when many of the predictors are partially redundant. One
shortcoming of a neural network is that it can be viewed as a black box with no
explanation of the results provided. Often managers are reluctant to apply models
they do not understand, and this can limit the applicability of neural networks.
Data visualization
Data visualization can make it possible for the analyst to gain a deeper, intuitive
understanding of data. Because they present data in a visual format, visualization
tools take advantage of our capability to discern visual patterns rapidly. Data mining
can enable the analyst to focus attention on important patterns and trends and
explore these in depth using visualization techniques. Data mining and data
visualization work especially well together.
SQL-99 and OLAP
SQL-99 includes extensions to the GROUP BY clause to support some of the data
aggregation capabilities typically required for OLAP. Prior to SQL-99, the
following questions required separate queries:
1. Find the total revenue.
2. Report revenue by location.
3. Report revenue by channel.
4. Report revenue by location and channel.
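For instance, before SQL-99, question 4 would be answered with a conventional GROUP BY, and the other three would each need their own query. A sketch using the exped table's location, channel, and revenue columns:
SELECT location, channel, SUM(revenue)
FROM exped
GROUP BY location, channel;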
Skill builder
Write SQL to answer each of the four preceding queries using the exped table,
which is a sample of 1,000 sales transactions from The Expeditioner.
Writing separate queries is time-consuming for the analyst and is inefficient because
it requires multiple passes of the table. SQL-99 introduced GROUPING SETS, ROLLUP,
and CUBE as a means of getting multiple answers from a single query and
addressing some of the aggregation requirements necessary for OLAP.
Grouping sets
The GROUPING SETS clause is used to specify multiple aggregations in a single query
and can be used with all the SQL aggregate functions. In the following SQL
statement, aggregations by location and channel are computed. In effect, it combines
questions 2 and 3 of the preceding list.
SELECT location, channel, SUM(revenue)
FROM exped
GROUP BY GROUPING SETS (location, channel);
In the result, a null in the channel column marks a row reporting total revenue for
a location, and a null in the location column marks a row reporting total revenue
for a channel. For example, one row reports total Web revenue across all locations,
and another reports Tokyo's total revenue across all channels.
Cube
CUBE reports all possible values for a set of reporting variables. If SUM is used as the
aggregating function, it will report a grand total, a total for each variable, and totals
for all combinations of the reporting variables.
SELECT location, channel, SUM(revenue)
FROM exped
GROUP BY cube (location, channel);
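ROLLUP, the third extension, aggregates along a hierarchy of the grouping columns: it reports the grand total, a total for each location, and a total for each location and channel combination, but not a total for each channel alone. A sketch, again using the exped table, in SQL-99 syntax and then in MySQL's WITH ROLLUP form:
-- SQL-99 syntax
SELECT location, channel, SUM(revenue)
FROM exped
GROUP BY ROLLUP (location, channel);

-- MySQL's equivalent form
SELECT location, channel, SUM(revenue)
FROM exped
GROUP BY location, channel WITH ROLLUP;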
Skill builder
Using MySQL’s ROLLUP capability, report the sales by location and channel for
each item.
The SQL-99 extensions to GROUP BY are useful, but they certainly do not give SQL
the power of a multidimensional database. It would seem that CUBE could be used as
the default without worrying about the differences among the three options.
Conclusion
Data management is a rapidly evolving discipline. Where once the spotlight was
clearly on TPSs and the relational model, there are now multiple centers of
attention. In an information economy, the knowledge to be gleaned from data
collected by routine transactions can be an important source of competitive
advantage. The more an organization can learn about its customers by studying their
behavior, the more likely it can provide superior products and services to retain
existing customers and lure prospective buyers. As a result, data managers now
have the dual responsibility of administering databases that keep the organization in
business today and tomorrow. They must now master the organizational intelligence
technologies described in this chapter.
Summary
Organizations recognize that data are a key resource necessary for the daily
operations of the business and its future success. Recent developments in hardware
and software have given organizations the capability to store and process vast
collections of data. Data warehouse software supports the creation and management
of huge data stores. The choice of architecture, hardware, and software is critical to
establishing a data warehouse. The two approaches to exploiting data are
verification and discovery. DSS and OLAP are mainly data verification methods.
Data mining, a data discovery approach, uses statistical analysis techniques to
discover hidden relationships. The relational model was not designed for OLAP,
and MDDB is the appropriate data store to support OLAP. MDDB design is based on
recognizing variable and identifier dimensions. MDX is a language for
interrogating an MDDB. SQL-99 includes extensions to GROUP BY to improve
aggregation reporting.
Key terms and concepts
Association
Classifying
Cleaning
Clustering
CUBE
Continuous variable
Database marketing
Data mining
Data visualization
Data warehouse
Decision support system (DSS)
Decision tree
Discovery
Drill-down
Drill-through
Extraction
Genetic algorithm
GROUPING SETS
Hypercube
Identifier dimension
Information systems cycle
K-nearest-neighbor method
Loading
Management information system (MIS)
Metadata
Multidimensional database (MDDB)
Multidimensional expressions (MDX)
Neural network
Nominal variable
Object-relational
Online analytical processing (OLAP)
Operational data store (ODS)
Ordinal variable
Organizational intelligence
Prediction
Relational OLAP (ROLAP)
ROLLUP
Rotation
Scheduling
Sequential pattern
Star model
Transaction processing system (TPS)
Transformation
Variable dimension
Verification
References and additional readings
Codd, E. F., S. B. Codd, and C. T. Salley. 1993. Beyond decision support.
Computerworld, 87–89.
Thomsen, E. 1997. OLAP solutions: Building multidimensional information systems.
New York, NY: Wiley.
Spofford, G. 2001. MDX solutions. New York, NY: Wiley.
Exercises
1. Identify data captured by a TPS at your university. Estimate how much data are
generated in a year.
2. What data does your university need to support decision making? Does the data
come from internal or external sources?
3. If your university were to implement a data warehouse, what examples of dirty
data might you expect to find?
4. How frequently do you think a university should revise its data warehouse?
5. Write five data verification questions for a university data warehouse.
6. Write five data discovery questions for a university data warehouse.
7. Imagine you work as an analyst for a major global auto manufacturer. What
techniques would you use for the following questions?
a. How do sports car buyers differ from other customers?
b. How should the market for trucks be segmented?
c. Where does our major competitor have its dealers?
d. How much money is a dealer likely to make from servicing a customer who
buys a luxury car?
e. What do people who buy midsize sedans have in common?
f. What products do customers buy within six months of buying a new car?
g. Who are the most likely prospects to buy a luxury car?
h. What were last year’s sales of compacts in Europe by country and quarter?
i. We know a great deal about the sort of car a customer will buy based on
demographic data (e.g., age, number of children, and type of job). What is a
simple visual aid we can provide to sales personnel to help them show
customers the car they are most likely to buy?
8. An international airline has commissioned you to design an MDDB for its
marketing department. Choose identifier and variable dimensions. List some of
the analyses that could be performed against this database and the statistical
techniques that might be appropriate for them.
9. A telephone company needs your advice on the data it should include in its
MDDB. It has an extensive relational database that captures details (e.g., calling
and called phone numbers, time of day, cost, length of call) of every call. As well,
it has access to extensive demographic data so that it can allocate customers to
one of 50 lifestyle categories. What data would you load into the MDDB? What
aggregations would you use? It might help to identify initially the identifier and
variable dimensions.
10. What are the possible dangers of data mining? How might you avoid these?
11. Download the file exped.xls from the book’s web site and open it as a
spreadsheet in LibreOffice. This file is a sample of 1,000 sales transactions for
The Expeditioner. For each sale, there is a row recording when it was sold, where
it was sold, what was sold, how it was sold, the quantity sold, and the sales
revenue. Use the PivotTable Wizard (Data>PivotTable) to produce the following
report:
Sum of REVENUE HOW
WHERE Catalog Store Web Grand total
London 50,310 151,015 13,009 214,334
New York 8,712 28,060 2,351 39,123
Paris 32,166 104,083 7,054 143,303
Sydney 5,471 21,769 2,749 29,989
Tokyo 12,103 42,610 2,003 56,716
Grand Total 108,762 347,537 27,166 483,465
Continue to use the PivotTable Wizard to answer the following questions:
a. What was the value of catalog sales for London in the first quarter?
b. What percent of the total were Tokyo web sales in the fourth quarter?
c. What percent of Sydney’s annual sales were catalog sales?
d. What was the value of catalog sales for London in January? Give details of
the transactions.
e. What was the value of camel saddle sales for Paris in 2002 by quarter?
f. How many elephant polo sticks were sold in New York in each month of
2002?
14. Introduction to R
Statistics are no substitute for judgment
Henry Clay, U.S. congressman and senator
Learning objectives
Students completing this chapter will:
✤ Be able to use R for file handling and basic statistics;
✤ Be competent in the use of R Studio.
The R project
The R project supports ongoing development of R, a free software environment for
statistical computing, data visualization, and data analytics. It is a highly extensible
platform, the R programming language is object-oriented, and R runs on the
common operating systems. There is evidence that adoption of R has grown in recent
years and that it is now the most popular analytics platform.
R Studio interface
Creating a project
It usually makes sense to store all R scripts and data in the same folder or directory.
Thus, when you first start R Studio, create a new project.
Project > Create Project…
R Studio remembers the state of each window, so when you quit and reopen, it will
restore the windows to their prior state. You can also open an existing project,
which sets the path to the folder for that project. As a result, all saved scripts and
files during a session are stored in that folder.
Scripting
A script is a set of R commands. You can also think of it as a short program.
# CO2 parts per million for 2000-2009
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
year <- (2000:2009) # a range of values
# show values
co2
year
# compute mean and standard deviation
mean(co2)
sd(co2)
plot(year,co2)
Smart editing
It is not uncommon to find that a dataset you would like to use is in electronic
format, but not in a format that matches your need. In most cases, you can use a
word processor, spreadsheet, or text editor to reformat the data. In the case of the
data in the previous table, here is a recipe for reformatting the data.
1. Copy each column to a word processor
2. Use the convert table to text command
3. Search and replace commas with nulls (i.e., “”)
4. Search and replace returns with commas
5. Edit to put R code around numbers
In some cases, you might find it quicker to copy a table to a spreadsheet, select each
column within the spreadsheet, and then proceed as described above. This technique
works well when the original table is in a PDF document.
Datasets
An R dataset is the familiar table of the relational model. There is one row for each
observation, and the columns contain the observed values or facts about each
observation. R supports multiple data structures and data types.
Vector
A vector is a single-row table where data are all of the same type (e.g., character,
logical, numeric). In the following sample code, two numeric vectors are created.
co2 <- c(369.40,371.07,373.17,375.78,377.52,379.76,381.85,383.71,385.57,384.78)
year <- (2000:2009)
Matrix
A matrix is a table where all data are of the same type. Because it is a table, a matrix
has two dimensions, which need to be specified when defining the matrix. The
sample code creates a matrix with 4 rows and 3 columns, as the results of the
executed code illustrate.
m <- matrix(1:12, nrow=4,ncol=3)
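By default, matrix fills the matrix column by column (add byrow=TRUE to fill by row), so displaying m produces output like the following:
m
#      [,1] [,2] [,3]
# [1,]    1    5    9
# [2,]    2    6   10
# [3,]    3    7   11
# [4,]    4    8   12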
Skill builder
Create a matrix with 6 rows and 3 columns containing the numbers 1 through 18.
Array
An array is a multidimensional table. It extends a matrix beyond two dimensions.
Review the results of running the following code by displaying the array created.
a <- array(letters[seq(1:24)], c(4,3,2))
Data frame
While vectors, matrices, and arrays are all forms of a table, they are restricted to
data of the same type (e.g., numeric). A data frame, like a relational table, can have
columns of different data types. The sample code creates a data frame with character
and numeric data types.
gender <- c("m","f","f")
age <- c(5,8,3)
df <- data.frame(gender,age)
List
The most general form of data storage is a list, which is an ordered collection of
objects. It permits you to store a variety of objects together under a single name. In
other words, a list is an object that contains other objects.
l <- list(co2,m,df)
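Individual objects in a list are retrieved with double square brackets, as this short addition shows:
l[[1]] # the co2 vector
l[[2]] # the matrix m
l[[3]] # the data frame df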
Object
In R, an object is anything that can be assigned to a variable. It can be a constant, a
function, a data structure, a graph, and so on. You find that the various packages in R
support the creation of a wide range of objects and provide functions to work with
these and other objects. A variable is a way of referring to an object. Thus, we might
use the variable named l to refer to the list object defined in the preceding
subsection.
Types of data
R can handle the four types of data: nominal, ordinal, interval, and ratio. Nominal
data, typically character strings, are used for classification (e.g., high, medium, or
low). Ordinal data represent an ordering and thus can be sorted (e.g., the seeding or
ranking of players for a tennis tournament). The intervals between ordinal data are
not necessarily equal. Thus, the top-seeded tennis player (ranked 1) might be far
superior to the second seed (ranked 2), who might be only marginally better than
the third seed (ranked 3). Interval and ratio are forms of measurement data. The
interval between the units of measure for interval data are equal. In other words, the
distance between 50cm and 51cm is the same as the distance between 105cm and
106cm. Ratio data have equal intervals and a natural zero point. Time, distance, and
mass are examples of ratio scales. Celsius and Fahrenheit are interval data types, but
not ratio, because the zero point is arbitrary. As a result, 10°C is not twice as hot as
5°C. Kelvin, however, is a ratio data type because nothing can be colder than 0 K, a
natural zero point.
In R, nominal and ordinal data types are also known as factors. Defining a column
as a factor determines how its data are analyzed and presented. By default, factor
levels for character vectors are created in alphabetical order, which is often not
what is desired. To control the order, specify it explicitly with the levels option.
rating <- c('high','medium','low')
rating <- factor(rating, order=T, levels = c('high','medium','low'))
Thus, the preceding code will result in changing the default reporting of factors
from alphabetical order (i.e., high, low, and medium) to listing them in the specified
order (i.e., high, medium, and low).
Missing values
Missing values are indicated by NA, meaning “not available.”
v <- c(1,2,3,NA)
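Most functions return NA when any value is missing; they usually have an na.rm argument to ignore missing values. For example:
mean(v) # returns NA because one value is missing
mean(v, na.rm=TRUE) # returns 2, ignoring the missing value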
Reading a file
Files are the usual form of input for R scripts. Fortunately, R can handle a wide
variety of input formats, including text (e.g., CSV), statistical package (e.g., SAS),
and XML. A common approach is to use a spreadsheet to prepare a data file, export
it as CSV, and then read it into R. Files can be read from the local computer on
which R is installed or the Internet, as the following sample code illustrates.
# read a local file (this will not work on your computer)
t <- read.table("Documents/R/Data/centralparktemps.txt", header=T, sep=",")
# read using a URL
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
You should indicate whether the first row is a header containing column names (e.g.,
header=T) and define the separator for data fields (e.g., for a tab use sep="\t").
Learning about a file
After reading a file, you might want to learn about its contents. First, you can click
on the file’s name in the top-right window. This will open the file in the top-left
window. If it is a long file, only a portion, such as the first 1000 rows, will be
displayed. Second, you can execute some R commands, as shown in the following
code, to show the first few and last few rows. You can also report the dimensions of
the file, its structure, and the type of object.
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
head(t) # first few rows
tail(t) # last few rows
dim(t) # dimension
str(t) # structure of a dataset
class(t) #type of object
Packages
A major advantage of R is that the basic software can be easily extended by
installing additional packages, of which more than 4,000 exist. You can consult the
R package directory to help find a package that has the functions you need. R
Studio has an interface for finding and installing packages. See the Packages tab on
the lower-right window.
Before running a script, you need to indicate which packages it needs, beyond the
default packages that are automatically loaded. The require or library statement
specifies that a package is required for execution. The following example uses the
weathermetrics package to handle the conversion of Fahrenheit to Celsius.
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
require(weathermetrics) #previously installed
# compute Celsius
t$Ctemp = fahrenheit.to.celsius(t$temperature,round=1)
Skill builder
Install the weathermetrics package and run the preceding code.
Reshaping data
Data are not always in the shape that you want. For example, you might have a
spreadsheet that, for each year, lists the monthly observations (i.e., year and 12
monthly values in one row). For analysis, you typically need to have a row for each
distinct observation (e.g., year, month, observation). Melting converts a document
from what is commonly called wide to narrow format. It is akin to normalization in
that the new table has year and month as the identifier of each observation.
Reshaping data with melting and casting
In the following example of melting, you specify which column contains the
identifier.
require(reshape)
s <- read.table('http://dl.dropbox.com/u/6960256/data/meltExample.csv',sep=',')
colnames(s) <- c('year', 1:12)
m <- melt(s,id='year')
colnames(m) <- c('year','month','co2')
Casting takes a narrow file and converts it to wide format.
c <- cast(m,year~month, value='co2')
Skill builder
Create a CSV file of annual CO2 measurements (PPM) for 1969 through 2011 taken
at the Mauna Loa Observatory. You should set up the file so that it contains three
columns: year, month, and CO2 measurements.
Read the file into R, do some basic stats, and plot year versus CO2.
Referencing columns
Columns within a table are referenced by using the format tablename$columname.
This is similar to the qualification method used in SQL. The following code shows
a few examples. It also illustrates how to add a new column to an existing table.
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# qualify with table name to reference a column
mean(t$temperature)
max(t$year)
range(t$month)
# create a new column with the temperature in Celsius
t$Ctemp = (t$temperature-32)*5/9
Writing files
R can write data in a few file formats. We just focus on text format in this brief
introduction. The following code illustrates how to create a new column containing
the temperature in C and renaming an existing column. The revised table is then
written as a text file to the project’s folder.
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# compute Celsius and round to one decimal place
t$Ctemp = round((t$temperature-32)*5/9,1)
colnames(t)[3] <- 'Ftemp' # rename third column to indicate Fahrenheit
write.table(t,"centralparktempsCF.txt")
Subsetting
You can select individual rows or columns from a table by specifying row and
column selection criteria. To select a particular set of rows, you specify a row selection
criterion and leave the column criterion null. This is equivalent to the WHERE clause in
SQL. The following code selects all rows where the year is equal to 1999. Note the
use of two consecutive equal signs (==) to indicate an equality match.
t[t$year == 1999,]
To select a particular set of columns, you specify a column selection criterion and
leave the row criterion null. This is equivalent to listing columns to be displayed in
SQL. The following code selects columns 1, 2, and 4.
t[,c(1:2,4)]
You can combine row and column selection criteria. In the next example, three
columns are selected where the year is greater than 1989 and less than 2000. Notice
the use of ‘&’ to indicate AND (see the following table for the other logical
operators) and the use of parentheses around the column selection specification.
t[(t$year > 1989 & t$year < 2000) ,c(1:2,4)]
Recoding
Some analyses might be facilitated by the recoding of data. For instance, you might
want to split a continuous measure into two categories. Imagine you decide to
classify all temperatures greater than or equal to 90ºF as ‘hot’ and the rest as ‘other.’
Here are the R commands to create a new variable in table m called Cut. We first set
Cut to ‘Other’ and then set Cut for those observations greater than or equal to 90 to
‘Hot.’ The result is a new column that has one of two values.
m$Cut <- 'Other'
m$Cut[m$Temperature >= 90] <- 'Hot'
Aggregation
The aggregate function is used for summarizing data in a specified way (e.g., mean,
minimum, standard deviation). In the sample code, a file containing the mean
temperature for each year is created.
# average temperature for each year
a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
# name columns
colnames(a) <- c('year', 'mean')
Skill builder
Create a file with the maximum temperature for each year.
Merging files
If there is a common column in two files, then they can be merged. This is like
joining two tables using a primary key and foreign key match. In the following
code, a file is created with the mean temperature for each year, and it is merged with
CO2 readings for the same set of years.
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# average monthly temp for each year
a <- aggregate(t$temperature, by=list(t$year), FUN=mean)
# name columns
colnames(a) = c('year', 'meanTemp')
# read yearly carbon data (source: http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html)
carbon <- read.table("http://dl.dropbox.com/u/6960256/data/carbon1959-2011.txt", sep='\t',header=T)
m <- merge(carbon,a,by="year")
Correlation coefficient
A correlation coefficient is a measure of the covariation between two sets of
observations. In this case, we are interested in whether there is a statistically
significant covariation between temperature and the level of atmospheric CO2.
cor.test(m$meanTemp,m$CO2)
MySQL access
The package RJDBC provides a convenient way for direct connection between R
and MySQL. In order to use this package, you need to load the MySQL JDBC
driver and specify its path in the R script. Note that the driver is also known as the
MySQL Connector/J. In the following sample code, the driver is loaded and the
connection is made to a database. Once the connection has been made, you can run
an SQL query and load the results into an R table.
require(RJDBC)
# Load the driver
drv <- JDBC("com.mysql.jdbc.Driver", "Macintosh HD/Library/Java/Extensions/mysql-connector-java-5.1.23-
bin.jar")
# connect to the database
# change user and pwd for your values
conn <- dbConnect(drv, "jdbc:mysql://richardtwatson.com/Weather", "db2", "student")
# Query the database and create file t for use with R
t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record;")
head(t)
timestamp airTemp
1 2010-01-01 01:00:00.0 44
2 2010-01-01 02:00:00.0 44
3 2010-01-01 03:00:00.0 44
4 2010-01-01 04:00:00.0 44
5 2010-01-01 05:00:00.0 43
6 2010-01-01 06:00:00.0 42
You should consult the R website to learn about support for other database
management systems.
R resources
The vast number of packages makes the learning of R a major challenge. The basics
can be mastered quite quickly, but you will soon find that many problems require
special features or the data need some manipulation. Once you learn how to solve a
particular problem, make certain you save the script, with some embedded
comments, so you can reuse the code for a future problem. There are books that you
might find useful to have in electronic format on your computer or tablet, and two
of these are listed at the end of the chapter. There are, however, many more books,
both of a general and specialized nature. The R Reference Card is handy to keep
nearby when you are writing scripts. I printed and laminated a copy, and it’s in the
top drawer of my desk. A useful website is Quick-R, which is an online reference
for common tasks, such as those covered in this chapter. For a quick tutorial, you
can Try R.
R and data analytics
R is a platform for a wide variety of data analytics, including
• Statistical analysis
• Data visualization
• HDFS and MapReduce
• Text mining
• Energy Informatics
You have probably already completed an introductory statistical analysis course,
and you can now use R for all your statistical needs. In subsequent chapters, we will
discuss data visualization, HDFS and MapReduce, and text mining. Energy
Informatics is concerned with optimizing energy flows, and R is an appropriate tool
for analysis and optimization of energy systems. R is being used extensively to
analyze climate change data.
Learning objectives
✤ Students completing this chapter will:
✤ Understand the principles of the grammar of graphics;
✤ Be able to use ggplot2 to create common business graphics;
✤ Be able to select data from a database for graphic creation;
✤ Be able to depict locations on a Google map.
Visual processing
Humans are particularly skilled at processing visual information because it is an
innate capability, compared to reading which is a learned skill. When we evolved on
the Savannah of Africa, we had to be adept at processing visual information (e.g.,
recognizing danger) and deciding on the right course of action (fight or flight). Our
ancestors were those who were efficient visual processors and quickly detected
threats and used this information to make effective decisions. They selected actions
that led to survival. Those who were inefficient visual information processors did
not survive to reproduce. Even those with good visual skills failed to propagate if
they made poor decisions relative to detected dangers. Consequently, we should seek
opportunities to present information visually and play to our evolved strength. As
people vary in their preference for visual and textual information, it often makes
sense to support both types of reporting.
In order to learn how to visualize data, you need to become familiar with the
grammar of graphics and ggplot2 (an R extension for graphics). In line with the
learning of data modeling and SQL, we will take an intertwined spiral approach.
First we will tackle the grammar of graphics (the abstract) and then move to ggplot2
(the concrete). You will also learn how to take the output of an SQL query and feed
it directly into ggplot2. The end result will be a comprehensive set of practical skills
for data visualization.
The grammar of graphics
A grammar is a system of rules for generating valid statements in a language. A
grammar makes it possible for people to communicate accurately and concisely.
English has a rather complex grammar, as do most languages. In contrast,
computer-related languages, such as SQL, have a relatively simple grammar
because they are carefully designed for expressive power, consistency, and ease of
use. Each of the dialects of data modeling also has a grammar, and each of these
grammars is quite similar in that they all have the same foundational elements:
entities, attributes, identifiers, and relationships.
A grammar for creating graphics has been designed by Wilkinson to enhance their
appeal and pleasing appearance. Consequently, the default choices often result in
attractive graphs. You can also make some representation choices (e.g., the size of a
line representing a continuous variable or the shape of an object representing a
category) to increase a graphic’s aesthetic qualities.
Data
Most structured data, which is what we require for graphing, are typically stored in
spreadsheets or databases. In the prior chapter introducing R, you learned how to
read a spreadsheet exported as a CSV file and access a database. These are also the
skills you need for reading a file containing data to be visualized.
Transformation
A transformation converts data into a format suitable for the intended visualization.
In this case, we want to depict the relative change in carbon levels since pre-
industrial periods, when the value was 280 ppm. Here are sample R commands.
# compute a new column in carbon containing the relative change in CO2
carbon$relCO2 = (carbon$CO2-280)/280
There are many ways that you might want to transform data. The preceding example
just illustrates the general nature of a transformation. You can also think of SQL as a
transformation process as it can select columns from a database.
Coord
A coordinate system describes where things are located. A geopositioning system
(GPS) reading of latitude and longitude describes where you are on the surface of
the globe. It is typically layered onto a map so you can see where you are relative to
your surroundings. Most graphs are plotted on a two-dimensional (2D) grid with
x (horizontal) and y (vertical) coordinates. ggplot2 currently supports six 2D
coordinate systems, as shown in the following table. The default coordinate system
is Cartesian.
Coordinate systems
Name Description
cartesian Cartesian coordinates
equal Equal scale Cartesian coordinates
flip Flipped Cartesian coordinates
trans Transformed Cartesian coordinates
map Map projections
polar Polar coordinates
As you will see later, a bar chart (horizontal bars) is simply a histogram (vertical
bars), rotated clockwise 90º. This means you can create a histogram and then add
one command coord_flip() to flip the coordinates and make it a bar chart.
Element
An element is a graph and its aesthetic attributes. Let’s start with a simple scatterplot
of year against CO2 emissions. We do this in two steps, applying the ggplot approach
of building a graphic by adding successive layers.
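As a sketch of the two-step approach (assuming the carbon data frame read earlier, with year and CO2 columns), you first create a plot object that specifies the data and the aesthetic mapping of columns to the x and y axes, and then add a geom layer to render the points.
require(ggplot2)
# step 1: specify the data and map year to x and CO2 to y
p <- ggplot(carbon,aes(x=year,y=CO2))
# step 2: add a layer of points to draw the scatterplot
p + geom_point(color='blue')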
Some recipes
Learning the full power of ggplot2 is quite an undertaking, so here are a few recipes
for visualizing data.
Histogram
Histograms are useful for showing the distribution of values in a single column. A
histogram has a vertical orientation. The number of occurrences of a particular
value in a column is automatically counted by the ggplot function. In the
following sample code, we show the distribution of temperatures in Celsius for the
Central Park temperature data. Notice that the Celsius conversion is rounded to an
integer for plotting.
require(weathermetrics)
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
t$C <- fahrenheit.to.celsius(t$temperature,0)
ggplot(t,aes(x=t$C)) + geom_histogram(fill='light blue') + xlab('Celsius')
In the following example, we query a database to get data for plotting a histogram.
require(ggplot2)
require(RJDBC)
# Load the driver
drv <- JDBC("com.mysql.jdbc.Driver", "Macintosh HD/Library/Java/Extensions/mysql-connector-java-5.1.23-
bin.jar")
# connect to the database
conn <- dbConnect(drv, "jdbc:mysql://richardtwatson.com/ClassicModels", "db1", "student")
# Query the database and create file for use with R
d <- dbGetQuery(conn,"SELECT productLine from Products;")
# Plot the number of product lines by specifying the appropriate column name
# Internal fill color is red
ggplot(d,aes(x=productLine)) + geom_histogram(fill='red')
Notice how the titles for various categories overlap. If this is the case, it is a good
idea to convert to a bar chart, which is just a case of flipping the coordinates.
Bar chart
A bar chart can be thought of as a histogram turned on its side with some
reorientation of the labels. The coord_flip() function reorients a histogram to create
a bar chart.
d <- dbGetQuery(conn,"SELECT productLine from Products;")
# Plot the number of product lines by specifying the appropriate column name
ggplot(d,aes(x=productLine)) + geom_histogram(fill='gold') + coord_flip()
Radar plot
Here is another way of presenting the same data graphically with a radar plot. This
is generally not recommended because it is hard for humans to compare areas when
presented as a sector rather than a rectangle. Similarly, pie charts are not favored.
Notice the use of ggtitle() to give a title for the plot, and the use of expand_limits()
to ensure the labels for each sector are shown. Try the plot without expand_limits()
to understand its effect.
d <- dbGetQuery(conn,"SELECT productLine from Products;")
# Plot the number of product lines by specifying the appropriate column name
ggplot(d,aes(x=productLine)) + geom_histogram(fill='bisque') +
coord_polar() + ggtitle("Number of products in each product line") +
expand_limits(x=c(0,10))
Skill builder
Create a bar chart using the data in the following table. Set population as the weight
for each observation.
Year                   1804 1927 1960 1974 1987 1999 2012 2027 2046
Population (billions)     1    2    3    4    5    6    7    8    9
Scatterplot
A scatterplot shows points on an x-y grid, which means you need to have an x and y
with numeric values.
# Get the monthly value of orders
d <- dbGetQuery(conn,"SELECT MONTH(orderDate) AS orderMonth, sum(quantityOrdered*priceEach) AS orderValue FRO
# Plot data orders by month
# Show the points and the line
ggplot(d,aes(x=orderMonth,y=orderValue)) + geom_point(color='red') + geom_line(color='blue')
It is sometimes helpful to show multiple scatterplots on the one grid. In ggplot2, you
can create groups of points for plotting. Sometimes, you might find it convenient to
recode data so you can use different colors or lines to distinguish values in set
categories. Let’s examine grouping by year.
require(scales)
require(ggplot2)
# Get the value of orders by year and month
d <- dbGetQuery(conn,"SELECT YEAR(orderDate) AS orderYear, MONTH(orderDate) AS Month, sum((quantityOrdered*pr
# Plot data orders by month and grouped by year
# ggplot expects grouping variables to be character, so convert
# load scales package for formatting as dollars
d$Year <- as.character(d$orderYear)
ggplot(d,aes(x=Month,y=Value,group=Year)) +
geom_line(aes(color=Year)) +
# Format as dollars
scale_y_continuous(label = dollar)
Sometimes the data that you want to graph will be in multiple files. Here is a case
where we want to show ClassicCars orders and payments by month on the same plot.
require(scales)
require(ggplot2)
orders <- dbGetQuery(conn,"SELECT MONTH(orderDate) AS month, sum((quantityOrdered*priceEach)) AS orderValue FRO
payments <- dbGetQuery(conn,"SELECT MONTH(paymentDate) AS month, SUM(amount) AS payValue FROM Payments W
ggplot(orders,aes(x=month)) +
geom_line(aes(y=orders$orderValue, color='Orders')) + geom_line(aes(y=payments$payValue, color='Payments')) +
xlab('Month') + ylab('') +
# Format as dollars and show each month
scale_y_continuous(label = dollar) + scale_x_continuous(breaks=c(1:12)) +
# Remove the legend
theme(legend.title=element_blank())
Here is another example showing the power of combining SQL and ggplot2 to
depict air temperature at 5pm in 2011 for the weather data. Because of the many
values of timestamp, the individual values cannot be discerned and appear as a dark
bar on the x-axis.
conn <- dbConnect(drv, "jdbc:mysql://richardtwatson.com/Weather", "db2", "student")
t <- dbGetQuery(conn,"SELECT timestamp, airTemp from record WHERE year(timestamp)=2011 and hour(timestamp)=17;")
ggplot(t,aes(x=timestamp, y=airTemp)) + geom_point(color='blue')
Smooth
The geom_smooth() function helps to detect a trend in a line plot. The following
example shows the average mean temperatures for August for the Central Park data.
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T, sep=",")
# select the August data
t1 <- t[t$month==8,]
ggplot(t1,aes(x=year,y=temperature)) + geom_line(color="red") + geom_smooth()
Skill builder
Using the Central Park data, plot the temperatures for February and fit a straight line
to the points (geom_smooth(method=lm)). What do you observe?
Box Plot
A box plot is an effective means of displaying information about a single variable. It
shows minimum and maximum, range, lower quartile, median, and upper quartile.
In the sample code, possible outliers are shown in red.
conn <- dbConnect(drv, "jdbc:mysql://richardtwatson.com/ClassicModels", "db1", "student")
d <- dbGetQuery(conn,"SELECT * from Payments;")
# Boxplot of amounts paid
ggplot(d,aes(factor(0),amount)) + geom_boxplot(outlier.colour='red') + xlab("") + ylab("Check")
Fluctuation plots
A fluctuation plot visualizes tabular data. It is useful when you have two categorical
variables cross tabulated. Consider the case for the ClassicModels database where
we want to get an idea of the different model scales in each product line. We can get
a quick picture with the following code.
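A minimal sketch (not necessarily the book’s exact code), assuming the ClassicModels connection conn created earlier and the productLine and productScale columns of the Products table, cross-tabulates the two columns and maps each combination’s frequency to point size.
require(ggplot2)
d <- dbGetQuery(conn,"SELECT productLine, productScale FROM Products;")
# cross tabulate product line by product scale
ct <- as.data.frame(table(d$productLine,d$productScale))
colnames(ct) <- c('productLine','productScale','count')
# drop combinations that do not occur
ct <- ct[ct$count > 0,]
# larger points indicate more models of that scale in the product line
ggplot(ct,aes(x=productScale,y=productLine)) + geom_point(aes(size=count),color='steelblue')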
Skill builder
Create the following maps.
1. Offices in Europe
2. Office location in Paris
3. Customers in the US
4. Customers in Sydney
R resources
The developer of ggplot2, Hadley Wickham, has written a book on the package and
also maintains a web site. Another useful web site is FlowingData, which has an
associated book, titled Visualize This. Both books are listed in the reference section
at the end of this chapter. If you become heavily involved in visualization, we
suggest you read some of the works of Edward Tufte. One of his books is listed in
the references.
Summary
Because of evolutionary pressure, humans became strong visual processors.
Consequently, graphics can be very effective for presenting information. The
grammar of graphics provides a logical foundation for the construction of a
graphic by adding successive layers of information. ggplot2 is a package
implementing the grammar of graphics in R, the open source statistics platform.
Data can be extracted from a MySQL database for processing and graphical
presentation in R. Spatial data can be selected from a database and displayed on a
Google map.
Key terms and concepts
Bar chart
Fluctuation plot
Grammar of graphics
Graph
Graphic
Heat map
Histogram
Parallel coordinates
Scatterplot
Smooth
References
Kabacoff, R. I. (2009). R in action: data analysis and graphics with R. Greenwich,
CT: Manning Publications.
Kahle, D., & Wickham, H. (forthcoming). ggmap: Spatial visualization with
ggplot2. The R Journal.
Tufte, E. (1983). The visual display of quantitative information. Cheshire, CT:
Graphics Press.
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York:
Springer.
Wilkinson, L. (2005). The grammar of graphics (2nd ed.). New York: Springer.
Yau, N. (2011). Visualize this: The FlowingData guide to design, visualization, and
statistics. Indianapolis, IN: Wiley.
Exercises
SQL input
1. Visualize in blue the number of items for each product scale.
2. Prepare a line plot with appropriate labels for total payments for each month in
2004.
3. Create a histogram with appropriate labels for the value of orders received from
the Nordic countries (Denmark, Finland, Norway, Sweden).
4. Create a heatmap for product lines and Norwegian cities.
5. Show on a Google map the customers in Japan.
6. Show on a Google map the European customers who have never placed an
order.
File input
7. Access http://dl.dropbox.com/u/6960256/data/manheim.txt, which contains
details of the sales of three car models: X, Y, and Z.
a. Create a histogram for sales of each model.
b. Create a bar chart for sales by each form of sale (online or auction).
8. Access https://dl.dropbox.com/u/6960256/data/electricityprices2010-12.csv,
which contains details of electricity prices for a major city. Do a Box plot of
can be a good servant, but enter its realm with realistic expectations of what is
achievable with the current state-of-the-art.
Levels of processing
There are three levels to consider when processing language.
Semantics
Semantics focuses on the meaning of words and the interactions between words to
form larger units of meaning (such as sentences). Words in isolation often provide
little information. We normally need to read or hear a sentence to understand the
sender’s intent. One word can change the meaning of a sentence (e.g., “Help needed
versus Help not needed”). It is typically an entire sentence that conveys meaning. Of
course, elaborate ideas or commands can require many sentences.
Discourse
Building on semantic analysis, discourse analysis aims to determine the
relationships between sentences in a communication, such as a conversation,
consisting of multiple sentences in a particular order. Most human communications
are a series of connected sentences that collectively disclose the sender’s goals.
Typically, interspersed in a conversation are one or more sentences from one or
more receivers as they try to understand the sender’s purpose and maybe interject
their thoughts and goals into the discussion. The points and counterpoints of a blog
are an example of such a discourse. As you might imagine, making sense of
discourse is frequently more difficult, for both humans and machines, than
comprehending a single sentence. However, the braiding of question and answer in
a discourse can sometimes help to reduce ambiguity.
Pragmatics
Finally, pragmatics studies how context, world knowledge, language conventions,
and other abstract properties contribute to the meaning of human conversation. Our
shared experiences and knowledge often help us to make sense of situations. We
derive meaning from the manner of the discourse, where it takes place, its time and
length, who else is involved, and so forth. Thus, we usually find it much easier to
communicate with those with whom we share a common culture, history, and
socioeconomic status because the great collection of knowledge we jointly share
assists in overcoming ambiguity.
Tokenization
Tokenization is the process of breaking a document into chunks (e.g., words), which
are called tokens. Whitespaces (e.g., spaces and tabs) are used to determine where a
break occurs. Tokenization typically creates a bag of words for subsequent
processing. Many text mining functions use words as the foundation for analysis.
Counting words
To count words, text is split into an array of words. Thus, the number of words is
the length of the array. In the following example, str_split() uses a space (" ") to
separate a text string into words (the tokens in this case).
require(stringr)
# split a string into a list of words
y <- str_split("The dead batteries were given out free of charge", " ")
# report length of the vector
length(y[[1]]) # double square bracket "[[]]" to reference a list member
Sentiment analysis
Sentiment analysis is a popular and simple method of measuring aggregate feeling.
In its simplest form, it is computed by giving a score of +1 to each “positive” word
and -1 to each “negative” word and summing the total to get a sentiment score. A
text is decomposed into words. Each word is then checked against a list to find its
score (i.e., +1 or -1), and if the word is not in the list, it doesn’t score. A list of
positive and negative words has been created and is available online. The following
function splits input text into sentences, and then sentences into words. It first does some
cleaning up and converts everything to lower case. It then looks up each word to get
its score and computes a score for each sentence.
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  # split sentence into words
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare words to the list of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
The preceding function needs to be compiled before use in an R session. Copy the
code into a new R script window, select all of it, and then select run. Save the script
as a separate file.
The following script is an example of using the function for sentiment analysis. The
positive and negative words have been color coded for ease of understanding.
sample = c("You're awesome and I love you", "I hate and hate and hate. So angry. Die!", "Impressed and amazed: you are
hu.liu.pos = scan('http://dl.dropbox.com/u/6960256/data/positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('http://dl.dropbox.com/u/6960256/data/negative-words.txt', what='character', comment.char=';')
pos.words = c(hu.liu.pos)
neg.words = c(hu.liu.neg)
result = score.sentiment(sample, pos.words, neg.words)
# reports score by sentence
result$score
sum(result$score)
mean(result$score)
You can see from the following output that the scores for each sentence are 2, -5,
and 4, respectively, the overall score for the text is 1, and the mean score per
sentence is 0.33. Notice that the negative tone of the third sentence is not recognized
in its score.
[1] 2 -5 4
> sum(result$score)
[1] 1
> mean(result$score)
[1] 0.3333333
Buffett’s letters, available in HTML or PDF, were converted to separate text files
using Abbyy Fine Reader. Tables, page numbers, and graphics were removed during the conversion.
The following R code sets up a loop to read each of the letters and add it to a data
frame. When all the letters have been read, they are turned into a corpus.
require(stringr)
require(tm)
#set up a data frame to hold up to 100 letters
df <- data.frame(num=100)
begin <- 1998 # date of first letter
i <- begin
# read the letters
while (i < 2013) {
y <- as.character(i)
# create the file name
f <- str_c('http://www.richardtwatson.com/BuffettLetters/',y, 'ltr.txt',sep='')
# read the letter as one large string
d <- readChar(f,nchars=1e6)
# add letter to the data frame
df[i-begin+1,] <- d
i <- i + 1
}
# create the corpus
letters <- Corpus(DataframeSource(as.data.frame(df), encoding = "UTF-8"))
Readability
There are several approaches to estimating the readability of a selection of text.
They are usually based on counting the words in each sentence and the number of
syllables in each word. For example, the Flesch-Kincaid method uses the formula:
(11.8 * syllables_per_word) + (0.39 * words_per_sentence) - 15.59
It estimates the grade-level or years of education required of the reader. The bands
are:
13-16 Undergraduate
16-18 Masters
19- PhD
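For example, text averaging 1.5 syllables per word and 20 words per sentence scores (11.8 * 1.5) + (0.39 * 20) - 15.59 = 9.91, roughly a tenth-grade reading level.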
The R package koRpus has a number of methods for calculating readability scores.
You first need to tokenize the text using the package’s tokenize function. Then
complete the calculation.
require(koRpus)
# tokenize the first letter in the corpus
tagged.text <- tokenize(letters[[1]], format="obj",lang="en")
# score
readability(tagged.text, "Flesch.Kincaid", hyphen=NULL,force.lang="en")
Preprocessing
Before commencing analysis, a text file typically needs to be prepared for
processing. Several steps are usually taken.
Case conversion
For comparison purposes, all text should be of the same case. Conventionally, the
choice is to convert to all lower case.
clean.letters <- tm_map(letters,tolower)
Punctuation removal
Punctuation is usually removed when the focus is just on the words in a text and not
on higher level elements such as sentences and paragraphs.
clean.letters <- tm_map(clean.letters,removePunctuation)
Number removal
You might also want to remove all numbers.
clean.letters <- tm_map(clean.letters,removeNumbers)
Stop word removal
Common words that carry little meaning on their own (e.g., “the,” “a,” “of”), known as
stop words, are usually removed. As you might expect, each language has a set of stop
words. Stop word lists are typically all lowercase, thus you should convert to
lowercase before removing stop words.
clean.letters <- tm_map(clean.letters,removeWords,stopwords("english"))
Specific word removal
You can also specify particular words to be removed via a character vector. For
instance, you might not be interested in tracking references to Berkshire Hathaway
in Buffett’s letters. You can set up a dictionary with words to be removed from the
corpus.
dictionary <- c("berkshire","hathaway", "million", "billion", "dollar")
clean.letters <- tm_map(clean.letters,removeWords,dictionary)
• A document-term matrix contains one row for each document and one
column for each term.
dtm <- DocumentTermMatrix(clean.letters,control = list(minWordLength=3))
dim(dtm)
Entering dtm reports the number of documents and distinct terms (the vocabulary) in
the corpus, while dim(dtm) returns these counts as the matrix dimensions.
Term frequency
Words that occur frequently within a document are usually a good indicator of the
document’s content. Term frequency (tf) measures word frequency.
tf(t,d) = number of times term t occurs in document d.
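As a quick check on frequent terms (a sketch using the document-term matrix dtm created earlier), tm’s findFreqTerms() lists every term that occurs at least a given number of times across the corpus.
# list terms appearing at least 100 times in the letters
findFreqTerms(dtm, lowfreq=100)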
A word cloud
Skill builder
Start with the original letters corpus (i.e., prior to preprocessing) and identify the 20
most common words and create a word cloud for these words.
Co-occurrence and association
Co-occurrence measures the frequency with which two words appear together. In the
case of document level association, if the two words both appear or neither appears,
then the correlation or association is 1. If two words never appear together in the
same document, their association is -1.
A simple example illustrates the concept. The following code sets up a corpus of
five elementary documents.
data <- c("word1", "word1 word2","word1 word2 word3","word1 word2 word3 word4","word1 word2 word3 word4 word5")
frame <- data.frame(data)
test <- Corpus(DataframeSource(frame, encoding = "UTF-8"))
tdm <- TermDocumentMatrix(test)
as.matrix(tdm)
Docs
Terms 1 2 3 4 5
word1 1 1 1 1 1
word2 0 1 1 1 1
word3 0 0 1 1 1
word4 0 0 0 1 1
word5 0 0 0 0 1
We compute the correlation of rows to get a measure of association across
documents.
# Correlation between word2 and word3, word4, and word5
cor(c(0,1,1,1,1),c(0,0,1,1,1))
cor(c(0,1,1,1,1),c(0,0,0,1,1))
cor(c(0,1,1,1,1),c(0,0,0,0,1))
> cor(c(0,1,1,1,1),c(0,0,1,1,1))
[1] 0.6123724
> cor(c(0,1,1,1,1),c(0,0,0,1,1))
[1] 0.4082483
> cor(c(0,1,1,1,1),c(0,0,0,0,1))
[1] 0.25
Alternatively, use the findAssocs function, which computes all correlations between
a given term and all terms in the term-document matrix and reports those higher
than the correlation threshold.
# find associations greater than 0.1
findAssocs(tdm,"word2",0.1)
Now that you have an understanding of how association works across documents,
here is an example for the corpus of Buffett letters.
# Select the first ten letters
tdm <- TermDocumentMatrix(clean.letters[1:10])
# compute the associations
findAssocs(tdm, "investment",0.80)
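Cluster analysis
Hierarchical clustering can group letters that use similar vocabularies. The following is a minimal sketch (the book’s original code may differ): convert the document-term matrix to a standard matrix, scale it, compute the distances between letters, and then cluster and plot the result.
# hierarchical clustering of the letters based on the document-term matrix
m <- as.matrix(dtm)
d <- dist(scale(m))
fit <- hclust(d)
plot(fit)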
The cluster is shown in the following figure as a dendrogram, a tree diagram, with
a leaf for each year. The diagram suggests four main clusters, namely (2010, 2008,
2002), (2006, 2005, 2004, 2003, 2009, 2007, 2001), (2000, 1999, 1998), and (2012,
2011).
Dendrogram for Buffett letters from 1998-2012
To learn more about why there might be a difference, we can use termFreq() to
generate a frequency vector for specific documents. Here is the coding for the
letters of 2010 and 2011, which as the dendrogram suggests are far apart relative to
other letters.
t2010 <- as.matrix(termFreq(clean.letters[[13]])) # 2010 letter
t2011 <- as.matrix(termFreq(clean.letters[[14]])) # 2011 letter
s2010 <- t2010[order(-t2010[,1]),]
head(s2010)
s2011 <- t2011[order(-t2011[,1]),]
head(s2011)
You will probably notice that the most frequent words (e.g., value, business) don’t
indicate much difference between the two sets of letters. You will need to use the
dictionary capability to remove some of these words to see if you can discern which
words make a difference between letters assigned to different clusters.
Skill builder
Use the dictionary feature of text mining to remove selected words from the Buffett
letters’ corpus to see if you can determine what differentiates the letters of 2010 and
2011.
Topic modeling
Topic modeling is a set of statistical techniques for identifying the themes that occur
in a document set. The key assumption is that a document on a particular topic will
contain words that identify that topic. For example, a report on gold mining will
likely contain words such as “gold” and “ore,” whereas a document on France
would likely contain the terms “France,” “French,” and “Paris.”
The package topicmodels implements topic modeling techniques within the R
framework. It extends tm to provide support for topic modeling. It implements two
methods: latent Dirichlet allocation (LDA), which assumes topics are uncorrelated;
and correlated topic models (CTM), an extension of LDA that permits correlation
between topics. Both LDA and CTM require that the number of topics to be
extracted is determined a priori. For example, you might decide in advance that five
topics gives a reasonable spread and is sufficiently small for the diversity to be
understood.
Words that occur frequently in many documents are not good at distinguishing
among documents. The weighted term frequency inverse document frequency (tf-
idf) is a measure designed for determining which terms discriminate among
documents. It is based on the term frequency (tf), defined earlier, and the inverse
document frequency.
Inverse document frequency
The inverse document frequency (idf) measures the frequency of a term across
documents.
idf(t) = log2(m/df(t))
where
m = number of documents (i.e., rows in the case of a document-term matrix);
df(t) = number of documents containing term t.
If a term occurs in every document, then its idf = 0, whereas if a term occurs in only
one document out of 15, its idf = 3.91.
To calculate and display the idf for the letters corpus, we can use the following R
script.
# calculate idf for each term
idf <- log2(nDocs(dtm)/col_sums(dtm > 0))
# create dataframe for graphing
df.idf <- data.frame(idf)
ggplot(df.idf,aes(idf)) + geom_histogram(fill="chocolate") + xlab("Inverse document frequency")
The preceding graphic shows that about 5,000 terms occur in only one document
(i.e., the idf = 3.91) and less than 500 terms occur in every document. The terms with
an idf in the range 1 to 2 are likely to be the most useful in determining the topic of
each document.
Term frequency inverse document frequency (tf-idf)
The weighted term frequency inverse document frequency (tf-idf, or ω(t,d)) is
ω(t,d) = tf(t,d) × idf(t)
where
tf(t,d) = frequency of term t in document d.
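In practice, you rarely compute tf-idf by hand. As a sketch, tm can apply tf-idf weighting when the document-term matrix is created by passing its weightTfIdf function as the weighting option.
# document-term matrix weighted by tf-idf rather than raw term frequency
dtm.tfidf <- DocumentTermMatrix(clean.letters, control=list(weighting=weightTfIdf, minWordLength=3))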
As mentioned earlier, the topic modeling method assumes a set number of topics,
and it is the responsibility of the analyst to estimate the correct number of topics to
extract. It is common practice to fit models with a varying number of topics, and use
the various results to establish a good choice for the number of topics. The analyst
will typically review the output of several models and make a judgment on which
model appears to provide a realistic set of distinct topics. Here is some code that
starts with five topics. For your convenience, we provide a complete set of code for
setting up the corpus prior to topic modeling. Note the use of a dictionary to remove
some common words.
require(topicmodels)
require(tm)
require(stringr) # provides str_c, used to build the file names
#set up a data frame to hold up to 100 letters
df <- data.frame(num=100)
begin <- 1998 # date of first letter
i <- begin
# read the letters
while (i < 2013) {
y <- as.character(i)
# create the file name
f <- str_c('http://www.richardtwatson.com/BuffettLetters/',y, 'ltr.txt',sep='')
# read the letter as one large string
d <- readChar(f,nchars=1e6)
# add letter to the data frame
df[i-begin+1,] <- d
i <- i + 1
}
# create the corpus
letters <- Corpus(DataframeSource(as.data.frame(df), encoding = "UTF-8"))
# convert to lower case
clean.letters <- tm_map(letters,tolower)
# remove punctuation
clean.letters <- tm_map(clean.letters,removePunctuation)
# remove numbers
clean.letters <- tm_map(clean.letters,removeNumbers)
# remove extra white space
clean.letters <- tm_map(clean.letters,stripWhitespace)
# remove stop words
clean.letters <- tm_map(clean.letters,removeWords,stopwords("en"))
dictionary <- c("berkshire","hathaway", "charlie", "million", "billion", "dollar","ceo","earnings","company","business","directo
clean.letters <- tm_map(clean.letters,removeWords,dictionary)
dtm <- DocumentTermMatrix(clean.letters,control = list(minWordLength=3))
# set number of topics to extract
k <- 5 # number of topics
SEED <- 2010 # seed for initializing the model rather than the default random
# try multiple methods – takes a while for a big corpus
TM <- list(VEM = LDA(dtm, k = k, control = list(seed = SEED)),
VEM_fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
Gibbs = LDA(dtm, k = k, method = "Gibbs", control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)), CTM = CTM
topics(TM[["VEM"]], 1)
terms(TM[["VEM"]], 5)
> topics(TM[["VEM"]], 1)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 5 1 2 5 3 3 5 3 2 2 4 4 4 4
> terms(TM[["VEM"]], 5)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
[1,] "value" "insurance" "insurance" "value" "value"
[2,] "cost" "businesses" "value" "businesses" "shareholders"
[3,] "float" "value" "shareholders" "insurance" "businesses"
[4,] "berkshires" "losses" "berkshires" "float" "managers"
[5,] "businesses" "companies" "time" "berkshires" "insurance"
The output indicates that the first letter (1998) is about topic 1, the second (1999)
topic 5, and so on. The last four letters (2009-2012) relate to topic 4.
Topic 1 is defined by the following terms: value, cost, float, berkshires, and
businesses. Observe that topic 2 has some overlap, but it differs in that it contains
insurance, losses, and companies. As we have seen previously, some of these words
(e.g., value and cost) are not useful differentiators, and the dictionary could be
extended to remove them from consideration and topic modeling repeated. For this
particular case, it might be that Buffett’s letters don’t vary much from year to year,
and he returns to the same topics in each annual report.
Skill builder
Experiment with the topicmodels package to identify the topics in Buffett’s letters.
You might need to use the dictionary feature of text mining to remove selected
words from the corpus to develop a meaningful distinction between topics.
Named-entity recognition (NER)
Named-entity recognition (NER) places terms in a corpus into predefined categories
such as the names of persons, organizations, locations, expressions of times,
quantities, monetary values, and percentages. It identifies some or all mentions of
these categories, as shown in the following figure, where an organization, place,
and date are recognized.
Named-entity recognition example
Tags are added to the corpus to denote the category of the terms identified.
The <organization>Olympics</organization> were in <place>London</place> in <date>2012</date>.
Apache OpenNLP is a machine learning based toolkit for the processing of natural
language in text format. It is a collection of natural language processing tools,
including a sentence detector, tokenizer, parts-of-speech (POS) tagger, syntactic
parser, and named-entity detector. The NER tool
can recognize people, locations, organizations, dates, times, percentages, and
money. You will need to write a Java program to take advantage of the toolkit.
KNIME (Konstanz Information Miner) is a general purpose data management and
analytics tool. You
will need to add the community contributed plugin, Palladian, to run the analyzer.
See Preferences > Install/Update > Available Software Sites and the community
contribution page <http://tech.knime.org/community>.
3. Merge the annual reports for Berkshire Hathaway (i.e., Buffett’s letters) and UPS
into a single corpus.
a. Undertake a cluster analysis and identify which reports are similar in nature.
b. Build a topic model for the combined annual reports.
c. Do the cluster analysis and topic model suggest considerable differences in
the two sets of reports?
17. HDFS & MapReduce
We concatenated the files to bring them close to and less than 64Mb and the
difference was huge without changing anything else we went from 214 minutes to 3
minutes!
Elia Mazzaw
Learning objectives
Students completing this chapter will:
✤ Understand the paradigm shift in decision-oriented data processing;
✤ Be able to write MapReduce programs using R.
A paradigm shift
There is much talk of big data, and much of it is not very informative. Rather a lot
of big talk but not much smart talk. Big data is not just about greater variety and
volumes of data at higher velocity, which is certainly occurring. The more
important issue is the paradigm shift in data processing so that large volumes of
data can be handled in a timely manner to support decision making. The foundations
of this new model are MapReduce and the Hadoop distributed file system (HDFS).
We start by considering what is different between the old and new paradigms for
decision data analysis. Note that we are not considering transaction processing, for
which the relational model is a sound solution. Rather, we are interested in the
processing of very large volumes of data at a time, and the relational model was not
designed for this purpose. It is suited for handling transactions, which typically
involve only a few records. The multidimensional database (MDDB) is the “old”
approach for large datasets and MapReduce is the “new.” The logical model of the
MDDB is the hypercube, which is an extension of the two-dimensional table of the
relational model to n-dimensions. MapReduce works with a key-value pair, which is
simply an identifier (the key) and some related facts (the value).
Another difference is the way data are handled. The old approach is to store data on
a high speed disk drive and load it into computer memory for processing. To speed
up processing, the data might be moved to multiple computers to enable parallel
processing and the results merged into a single file. Because data files are typically
much larger than programs, moving data from disks to computers is time
consuming. Also, high performance disk storage devices are expensive. The new
method is to spread a large data file across multiple commodity computers using
HDFS and then send each computer a copy of the program to run in parallel. The
results from the individual jobs are then merged. While data still need to be moved
to be processed, they are moved across a high speed data channel within a computer
rather than the lower speed cables of a storage network.
Old New
Multidimensional database MapReduce
Data to the program Program to the data
Mutable data Immutable data
Special purpose hardware Commodity hardware
The drivers
Exploring the drivers promoting the paradigm shift is a good starting point for
understanding this important change in data management. First, you will recall that
you learned in Chapter 1 that decision making is the central organizational activity.
Furthermore, because data-driven decision making increases organizational
performance, many executives are now demanding data analytics to support their
decision making.
Second, as we also explained in Chapter 1, there is a societal shift in dominant logic
as we collectively recognize that we need to focus on reducing environmental
degradation and carbon emissions. Service and sustainability dominant logics are
both data intensive. Customer service decisions are increasingly based on the
analysis of large volumes of operational and social data. Sustainability oriented
decisions also require large volumes of operational data, which are combined with
environmental data collected by massive sensor networks to support decision
making that reduces an organization’s environmental impact.
Third, the world is in the midst of a massive data generating digital transformation.
Large volumes of data are collected about the operation of an aircraft’s jet engines,
how gamers play massively online games, how people interact in social media
space, and the operation of cell phone networks, for example. The digital
transformation of life and work is creating a bits and bytes tsunami.
The bottleneck and its solution
In a highly connected and competitive world, speedy high quality decisions can be a
competitive advantage. However, large data sets can take some time and expense to
process, and so as more data are collected, there is a danger that decision
making will gradually slow down and its quality will be lowered. Data analytics becomes a
bottleneck when the conversion of data to information is too slow. Second, decision
quality is lowered when there is a dearth of skills for determining what data should
be converted to information and interpreting the resulting conversion. We capture
these problems in the elaboration of the diagram that was introduced in Chapter 1,
which now illustrates the causes of the conversion, request, and interpretation
bottlenecks.
Data analytics bottleneck
The people skills problem is being addressed by the many universities that have
added graduate courses in data analytics. The Lambda Architecture is a proposed
solution to the conversion bottleneck.
Consider first what happens while batch(n-1) is processed:
1. During the processing of batch(n-1), increment(n-1) was created from the records
received.
2. As the data for increment(n-1) were received, speed layer results (Sresults(n-1)) were
dynamically recomputed.
3. A current report can be created by combining speed layer and batch layer results
(i.e., Sresults(n-1) and Bresults(n-1)).
Then, while the next batch, batch(n), is processed:
1. Sresults(n) are computed from the data collected (increment(n)) while batch(n) is being
processed.
2. Current reports are based on Bresults(n-1), Sresults(n-1), and Sresults(n).
In summary, the batch layer pre-computes reports using all the currently available
data. The serving layer indexes the results of the batch layers and creates views that
are the foundation for rapid responses to queries. The speed layer does incremental
updates as data are received. Queries are handled by merging data from the serving
and speed layers.
Lambda Architecture (source: Marz and Warren 2012)
Benefits of the Lambda Architecture
The Lambda Architecture provides some important advantages for processing large
datasets, and these are now considered.
Robust and fault-tolerant
Programming for batch processing is relatively simple, and a batch job can easily be
restarted if there is a problem. Replication of the batch layer dataset across
computers increases fault tolerance. If a block is unreadable, the batch processor
can shift to the identical block on another node in the cluster. Also, the redundancy
intrinsic to a distributed file system and distributed processors provides fault-
tolerance.
The speed layer is the complex component of the Lambda Architecture. Because
complexity is isolated to this layer, it does not impact other layers. In addition, since
the speed layer produces temporary results, these can be discarded in the event of an
error. Eventually the system will right itself when the batch layer produces a new set
of results, though intermediate reports might be a little out of date.
Low latency reads and updates
The speed layer overcomes the long delays associated with batch processing. Real-
time reporting is possible through the combination of batch and speed layer outputs.
Scalable
Scalability is achieved using a distributed file system and distributed processors. To
scale, new computers are added.
Support a wide variety of applications
The general architecture can support reporting for a wide variety of situations
provided that a key-value pair can be extracted from a dataset.
Extensible
New data types can be added to the master dataset or new master datasets created.
Furthermore, new computations can be added to the batch and speed layers to create
new views.
Ad hoc queries
On the fly queries can be run on the output of the batch layer provided the required
data are available in a view.
Minimal maintenance
The batch and serving layers are relatively simple programs because they don’t deal
with random updates or writes. Simple code requires less maintenance.
Debuggable
Batch programs are easier to debug because you can have a clear link between the
input and output. Data immutability means that debugging is easier because no
records have been overwritten during batch processing.
Relational and Lambda Architectures
Relational technology can support both transaction processing and data analytics. As
a result, it needs to be more complex than the Lambda Architecture. Separating out
data analytics from transaction processing simplifies the supporting technology and
makes it suitable for handling large volumes of data efficiently. Relational systems
can continue to support transaction processing and, as a byproduct, produce data
that are fed to Lambda Architecture based business analytics.
Hadoop
Hadoop, an Apache project, supports distributed processing of large data sets
across clusters of computers.
On each node, blocks are stored sequentially to minimize disk head movement.
Blocks are grouped into files, and all files for a dataset are grouped into a single
folder. As part of the simplification to support batch processing, there is no random
access to records and new data are added as a new file.
Scalability is facilitated by the addition of new nodes, which means adding a few
more pieces of inexpensive commodity hardware to the cluster. Appending new data
as files on the cluster also supports scalability.
HDFS also supports partitioning of data into folders for processing at the folder
level. For example, you might want all employment related data in a single folder.
MapReduce
MapReduce is a distributed computing method that provides primitives (standard
functions) for scalable and fault-tolerant batch computation. It supports the batch
processing component of the Lambda Architecture by distributing computation
across multiple nodes. MapReduce is typically used with HDFS, and these are
bundled together in Hadoop.
MapReduce, as the name implies, consists of a map and reduce component. Map is
key to enabling distributed processing because it creates a computing task that can
be executed in parallel on many computers. Reduce puts the results of the parallel
computing together.
An input file is received from HDFS, processed by the mapper, and sent to the
reducer, which then produces the output. The following diagram is a general overview of the data
flow.
MapReduce elements
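Map
A Map function is applied to each input key-value pair and emits new key-value pairs, which MapReduce groups by key before the Reduce phase. As a sketch in the style of the rmr2 package used later in this chapter (not necessarily the book’s original example), the following mapper emits each input value as a key with a count of 1, ready for the counting reducer shown next.
# mapper: emit (value, 1) for each input key-value pair
# keyval() is provided by the rmr2 package
mapper <- function(k,v) {
  keyval(v,1)
}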
Reduce
A Reduce function is applied, for each input key, to its associated list of values. The
result of that application is a new pair consisting of the key and whatever is
produced by the Reduce function when applied to the list of values. The output of the
entire MapReduce job is what results from the application of the Reduce function to
each key and its list.
The following Reduce function reports the number of items in a list. It is useful for
counting the number of observations of a key. It produces a new key-value pair
consisting of the key and its length, which is also the frequency of occurrence of the
key.
reducer <- function(k,v) {
key <- k
value <- length(v)
keyval(key,value)
}
When a MapReduce function is executed, multiple Map and Reduce tasks are
created. For a node in the cluster, a Map task executes the Map function for a portion
of the input key-value pairs. Similarly, a Reduce task executes the Reduce function
for a subset of key-value pairs received from the Map phase. Because both Map and
Reduce tasks can be executed in parallel, the level of parallelism is determined by
the number of nodes used for each task.
R and Hadoop
There are a number of programming options for writing map and reduce functions.
Hadoop is Java-based, so one choice is to use Java. There is also Pig Latin, which
requires less coding than Java because it is designed to work with Hadoop. We have
opted to go with R because it offers the easiest way to get started with Hadoop, and
also because R is used in other parts of this book. Specifically, we are using the
Revolution Analytics open source tool RHadoop, which integrates R and Hadoop.
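R
In plain R, squaring the integers 1 through 10 needs only a single vectorized expression, such as the following, which produces the output shown below.
(1:10)^2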
[1] 1 4 9 16 25 36 49 64 81 100
MapReduce
The MapReduce approach requires the creation of a HDFS file containing the list of
numbers to be squared and the definition of the key-value pair. Notice in the second
line of the following code the use of rmr.options(). It is a good idea to debug your
MapReduce task in local mode (backend = "local") with a small dataset before running it
on a Hadoop cluster (backend= "hadoop").
require(rmr2)
rmr.options(backend = "local") # local or hadoop
# load a list of 10 integers into HDFS
hdfs.ints = to.dfs(1:10)
# mapper for the key-value pairs to compute squares
mapper <- function(k,v) {
key <- v
value <- key^2
keyval(key,value)
}
# run MapReduce
out = mapreduce(input = hdfs.ints, map = mapper)
# convert to a data frame
df1 = as.data.frame(from.dfs(out))
colnames(df1) = c('n', 'n^2')
#display the results
df1
n n^2
1 1 1
2 2 4
3 3 9
4 4 16
5 5 25
6 6 36
7 7 49
8 8 64
9 9 81
10 10 100
The heart of the code is the map function. It receives as input the HDFS file containing
the key-value pairs, where the key is null in each case and the value is an integer.
Thus, the second key-value pair received is (null,2).
The map function creates a key-value pair for each integer consisting of the integer
and its square. For example, the second key-value pair is (2,4). In this case there is
no reduce element, and mapreduce() reports just the mapping.
(key,value) pairs
Input        Map        Reduce    Output
(null,1)     (1,1)                (1,1)
(null,2)     (2,4)                (2,4)
…            …                    …
(null,10)    (10,100)             (10,100)
Skill builder
Use the map component of the mapreduce function to create the reciprocals of the
integers from 1 to 25.
Have a look at the output from mapreduce() by entering from.dfs(out).
Tabulation with MapReduce
Let us now learn about the Reduce component of MapReduce. In this example, we
have a list of average monthly temperatures for New York’s Central Park and we
want to tabulate how often each temperature, converted to Celsius and rounded to the
nearest degree, occurs.
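R
One way to produce this tally in plain R (a sketch; the original code may differ) is to read the file, convert each Fahrenheit value to Celsius, round to the nearest degree, and count the occurrences of each value with table().
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T,sep=',')
# convert to Celsius, round to an integer, and tabulate frequencies
table(round((t$temperature-32)*5/9,0))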
The results indicate that -7ºC occurred once, -6ºC once, …, and 27ºC eight times.
MapReduce
We now apply the Reduce element of MapReduce to the same problem. The code
reads the file and extracts the temperature column to create a separate HDFS file.
The map function receives a key-value pair of the form (null,F) and generates a
key-value pair of the form (C,1) for each data point. Because we want to use the
reduce component to compute the frequency of occurrence for each temperature, we
use a weight of 1. In essence, map passes the converted data to reduce for tallying
frequencies.
The reduce function receives a key-value pair of the form (C, v), where v is the list of
1s for that temperature, and outputs (C, length(v)). The length function tells you how
many values there are in a list. The results are then converted
from HDFS to a data frame and reported. For the reader’s convenience, some of the
rows of output are not shown, and you can confirm that they agree with those
produced by R’s table function.
(key,value) pairs
Input        Map (F to C conversion)    Reduce                     Output
(null,35.1)  (2,1)                      (-7,c(1))                  (-7,1)
(null,37.5)  (3,1)                      (-6,c(1))                  (-6,1)
…            …                          …                          …
(null,43.3)  (6,1)                      (27,c(1,1,1,1,1,1,1,1))    (27,8)
require(rmr2)
rmr.options(backend = "local") # local or hadoop
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T,sep=',')
# save temperature in hdfs file
hdfs.temp <- to.dfs(t$temperature)
# mapper for conversion to C
mapper <- function(k,v) {
key <- round((v-32)*5/9,0)
value <- 1
keyval(key,value)
}
#reducer to count frequencies
reducer <- function(k,v) {
key <- k
value = length(v)
keyval(key,value)
}
out = mapreduce(
input = hdfs.temp,
map = mapper,
reduce = reducer)
df2 = as.data.frame(from.dfs(out))
colnames(df2) = c('temperature', 'count')
df2
temperature count
1 -7 1
2 -6 1
…
34 26 26
35 27 8
Skill builder
Redo the tabulation example with temperatures in Fahrenheit.
Basic statistics with MapReduce
We now take the same temperature dataset and calculate mean, min, and max
monthly average temperatures for each year and put the results in a single file.
R
The aggregate function is used to calculate the maximum, mean, and minimum monthly average temperatures for each year. Because aggregate can apply only one function at a time, we create three data frames and then stack them together. Finally, we reshape the stacked data so that each observation consists of a year and its max, mean, and min. Use the head function to look at the contents of each data frame.
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T,sep=',')
a1 <- aggregate(t$temperature,by=list(t$year),FUN=max)
colnames(a1) = c('year', 'value')
a1$measure = 'max'
a2 <- aggregate(t$temperature,by=list(t$year),FUN=mean)
colnames(a2) = c('year', 'value')
a2$value = round(a2$value,1)
a2$measure = 'mean'
a3 <- aggregate(t$temperature,by=list(t$year),FUN=min)
colnames(a3) = c('year', 'value')
a3$measure = 'min'
# stack the results
stack <- rbind(a1,a2,a3)
library(reshape)
# reshape with year, max, mean, min in one row
stats <- cast(stack,year ~ measure,value="value")
head(stats)
MapReduce
The mapper function receives a key-value pair consisting of a null key and a record
containing three fields, year, month, and average temperature. The first of these
fields becomes the output key, and the third is the output value. In other words,
mapper creates key-value pairs consisting of a year and an average temperature for
a month in that year (e.g., (1869,35.1)).
The input to the reducer function is a key-value pair consisting of a year and a list of the temperatures for that year (e.g., (1869, c(35.1, 34.5, 34.8, 49.2, 57.7, 69.3, 72.8, 71.8, 65.6, 50.9, 40.3, 34.7))). The output is a set of key-value pairs containing the year and a computed statistic for that year (e.g., (1869,72.8), (1869,51.4), (1869,34.5)).
The output of MapReduce is reshaped in a similar manner to that described for the R
code.
(key,value) pairs
Input            Map                   Reduce                           Output
(null, record)   (year, temperature)   (year, vector of temperatures)   (year, max)
                                                                        (year, mean)
                                                                        (year, min)
require(rmr2)
require(reshape)
rmr.options(backend = "local") # local or hadoop
t <- read.table("http://dl.dropbox.com/u/6960256/data/centralparktemps.txt", header=T,sep=',')
# save temperature in hdfs file
hdfs.temp <- to.dfs(t)
# mapper for computing temperature measures for each year
mapper <- function(k,v) {
key <- v[[1]] # year
value <- v[[3]] # temperature
keyval(key,value)
}
#reducer to report stats
reducer <- function(k,v) {
key <- k #year
value <- c(max(v),round(mean(v),1),min(v)) #v is list of values for a year
keyval(key,value)
}
out = mapreduce(
input = hdfs.temp,
map = mapper,
reduce = reducer)
# convert to a frame
df3 = as.data.frame(from.dfs(out))
# add measure identifiers
df3$measure <- c('max','mean','min')
# reshape with year, max, mean, min in one row
stats2 <- cast(df3,key ~ measure,value="val")
head(stats2)
Skill builder
Using a file of electricity prices for 2010–2012 for a major city, compute the average hourly price.
The file contains a timestamp and price, separated by a comma. Use the lubridate
package to extract the hour.
Word counting with MapReduce
Counting the number of words in a document is commonly used to illustrate
MapReduce programming. We use the data shown in the earlier diagram illustrating
the MapReduce approach.
R
When dealing with text, you frequently want to read the entire file as a character string and then process this string. Thus, in this example, we use readChar() to load a text file, indicating that at most a million characters (1e6) should be read. Before counting words, it is necessary to convert all text to lowercase (so that Word and word are treated as identical) and remove punctuation (such as apostrophes and periods). The text can then be split into words, using white space (spaces or tabs) to separate them. The list returned by string splitting is converted to a vector for ease of subsequent processing. Finally, the table function can be used to tabulate the frequency of occurrence of each word.
require(stringr)
# read as a single character string
t <- readChar("http://dl.dropbox.com/u/6960256/data/yogiquotes.txt", nchars=1e6)
t1 <- tolower(t[[1]]) # convert to lower case
t2 <- str_replace_all(t1,"[[:punct:]]","") # get rid of punctuation
wordList <- str_split(t2, "\\s") #split into strings
wordVector <- unlist(wordList) # convert list to vector
table(wordVector)
MapReduce
The mapper function is identical to the first part of the R version of word counting. It converts to lowercase, removes all punctuation, and splits the text into words. The key-value pair created consists of (word,1) for every word in the input text. The reducer simply outputs the count for each word by reporting the length of its vector.
(key,value) pairs
Input         Map        Reduce          Output
(null,text)   (word,1)   (word,vector)   (word, length(vector))
              (word,1)
              …
require(rmr2)
require(stringr)
rmr.options(backend = "local") # local or hadoop
# read as a single character string
t <- readChar("http://dl.dropbox.com/u/6960256/data/yogiquotes.txt", nchars=1e6)
text.hdfs <- to.dfs(t)
mapper=function(k,v){
t1 <- tolower(v) # convert to lower case
t2 <- str_replace_all(t1,"[[:punct:]]","") # get rid of punctuation
wordList <- str_split(t2, "\\s") #split into words
wordVector <- unlist(wordList) # convert list to vector
keyval(wordVector,1)
}
reducer = function(k,v) {
keyval(k,length(v))
}
out <- mapreduce (input = text.hdfs,
map = mapper,
reduce = reducer)
# convert output to a frame
df1 = as.data.frame(from.dfs(out))
colnames(df1) = c('word', 'count')
#display the results
df1
word count
1 a 1
2 about 1
3 arent 1
4 come 1
5 fork 1
6 half 1
7 i 2
8 if 1
9 in 1
10 it 1
11 lies 1
12 me 1
13 most 1
14 never 1
15 of 1
16 road 1
17 said 2
18 take 1
19 tell 1
20 the 3
21 they 1
22 things 1
23 to 1
24 true 1
25 you 1
Summary
Big data is a paradigm shift to a new file structure (HDFS) and algorithm
(MapReduce) for processing large volumes of data. The new file structure approach
is to spread a large data file across multiple commodity computers using HDFS and
then send each computer a copy of the program to run in parallel. The key-value
pair is the foundation of the MapReduce approach. The drivers of the
transformation are the need for high-quality data-driven decisions, a societal shift in dominant logic, and digital transformation. The speed and cost of converting data to information are a critical bottleneck, as is a dearth of skills for determining what data should be converted to information and for interpreting the resulting conversion. The
people skills problem is being addressed by universities’ graduate courses in data
analytics. The Lambda Architecture, a solution for handling the speed and cost
problem, consists of three layers: speed, serving, and batch. The batch layer is used
to precompute queries by running them with the most recent version of the dataset.
The serving layer processes views computed by the batch layer so they can be
queried. The purpose of the speed layer is to process incremental data as they arrive
so they can be merged with the latest batch data report to give current results. The
Lambda Architecture provides some important advantages for processing large
datasets. Relational systems can continue to support transaction processing and, as a
byproduct, produce data that are fed to Lambda Architecture based business
analytics.
Hadoop supports distributed processing of large data sets across a cluster of
computers. A Hadoop cluster consists of many standard processors, nodes, with
associated main memory and disks. The Hadoop distributed file system (HDFS) and
MapReduce are the two major components of Hadoop. HDFS is a highly scalable,
fault-tolerant, distributed file system. MapReduce is a distributed computing
method for scalable and fault-tolerant batch computation. It supports the batch
processing component of the Lambda Architecture. MapReduce is typically used
with HDFS, and these are bundled together in Hadoop. Hadoop moves the
MapReduce program to the data rather than the data to the program.
Key terms and concepts
Batch layer
Bottleneck
Hadoop
HDFS
Immutable data
Key-value pair
Lambda Architecture
Map
Parallelism
Reduce
Serving layer
Speed layer
References and additional readings
Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large
clusters. Communications of the ACM, 51(1), 107-113.
Lam, C. (2010). Hadoop in action. Manning Publications.
Marz, N., & Warren, J. (2012). Big Data. Manning Publications.
Ullman, J. D. (2012). Designing good MapReduce algorithms. XRDS: Crossroads,
The ACM Magazine for Students, 19(1), 30-34.
Exercises
1. Write a MapReduce function for the following situations. In each case, write the
corresponding R code to understand how MapReduce and conventional
programming differ.
a. Compute the square and cube of the numbers in the range 1 to 25. Display
the results in a data frame.
b. Using the average monthly temperatures for New York's Central Park, compute the maximum, mean, and minimum temperature in Celsius for each month.
c. Using the average monthly temperatures for New York’s Central Park,
compute the max, mean, and min for August.
d. Using the electricity price data, compute the average hourly cost.
e. Read the national GDP file, which records GDP in millions, and count how many countries have a GDP greater than 10,000 million and how many have a GDP less than that amount.
Section 4: Managing Organizational Memory
Every one complains of his memory, none of his judgment.
François, Duc de La Rochefoucauld, Sentences et Maximes Morales, No. 89, 1678
In order to manage data, you need a database architecture, a design for the storage
and processing of data. Organizations strive to find an architecture that
simultaneously achieves multiple goals:
1. It responds to queries in a timely manner.
2. It minimizes the cost of processing data.
3. It minimizes the cost of storing data.
4. It minimizes the cost of data delivery.
The first part of this section deals with the approaches that can be used to achieve
each of these goals. It covers database structure and storage alternatives. It
provides the knowledge necessary to determine an appropriate data storage
structure and device for a given situation. It also addresses the fundamental
questions of where to store the data and where they should be processed.
Java has become a popular choice for writing programs that operate on multiple
operating systems. Combining the interoperability of Java with standard SQL
provides software developers with an opportunity to develop applications that can
run on multiple operating systems and multiple DBMSs. Thus, for a comprehensive
understanding of data management, it is important to learn how Java and SQL
interface, which is the subject of the third chapter in this section.
As you now realize, organizational memory is an important resource requiring
management. An inaccurate memory can result in bad decisions and poor customer
service. Some aspects of organizational memory (e.g., chemical formulae,
marketing strategy, and R&D plans) are critical to the well-being of an
organization. The financial consequences can be extremely significant if these
memories are lost or fall into the hands of competitors. Consequently, organizations
must develop and implement procedures for maintaining data integrity. They need
policies to protect the existence of data, maintain its quality, and ensure its
confidentiality. Some of these procedures may be embedded in organizational
memory technology, and others may be performed by data management staff. Data
integrity is also addressed in this section.
When organizations recognize a resource as important to their long-term viability,
they typically create a formal mechanism to manage it. For example, most
companies have a human resources department responsible for activities such as
compensation, recruiting, training, and employee counseling. People are the major
resource of nearly every company, and the human resources department manages
this resource. Similarly, the finance department manages a company’s financial
assets.
In the information age, data—the raw material of information—need to be managed.
Consequently, data administration has become a formal organizational structure in
many enterprises.
18. Data Structure and Storage
The modern age has a false sense of superiority because it relies on the mass of
knowledge that it can use, but what is important is the extent to which knowledge is
organized and mastered.
Goethe, 1810
Learning objectives
Students completing this chapter will be able to
✤ understand the implications of the data deluge;
✤ recommend a data storage structure for a given situation;
✤ recommend a storage device for a given situation.
Every quarter, The Expeditioner's IS group measures the quality of its service, and the latest survey showed that some clients were unhappy with response times.
As though some grumpy clients were not enough, Alice dumped another problem
on Ned’s desk. The Marketing department had complained to her that the product
database had been down for 30 minutes during a peak selling period. What was Ned
going to do to prevent such an occurrence in the future?
Ned had just finished reading Alice’s memo about the database problem when the
Chief Accountant poked his head in the door. Somewhat agitated, he was waving an
article from his favorite accounting journal that claimed that data stored on
magnetic tape decayed with time. So what was he to do with all those financial
records on magnetic tapes stored in the fireproof safe in his office? Were the
magnetic bits likely to disappear tonight, tomorrow, or next week? “This business
had lasted for centuries with paper ledgers. Why, you can still read the financial
transactions for 1527,” which he did whenever he had a few moments to spare. “But,
if what I read is true, I soon won’t be able to read the balance sheet from last year!”
It was just after 10 a.m. on a Monday, and Ned was faced with finding a way to
improve response time, ensure that databases were continually available during
business hours, and protect the long-term existence of financial records. It was
going to be a long week.
The data deluge
With petabytes of new data being created daily, and the volume continuing to grow,
many IS departments and storage vendors struggle to handle this data flood. “Big
Data,” as the deluge is colloquially known, arises from the flow of data created by
Internet searches, Web site visits, social networking activity, streaming of videos,
electronic health care records, sensor networks, large-scale simulations, and a host
of other activities that are part of everyday business in today’s world. The deluge
requires a continuing investment in the management and storage of data.
A byte size table
Abbreviation   Prefix   Factor   Equivalent to
k              kilo     10^3
M              mega     10^6
G              giga     10^9     A digital audio recording of a symphony
T              tera     10^12
P              peta     10^15    50 percent of all books in U.S. academic libraries
E              exa      10^18    5 times all the world's printed material
Z              zetta    10^21
Y              yotta    10^24
The following pages explore territory that is not normally the concern of
application programmers or database users. Fortunately, the relational model keeps
data structures and data access methods hidden. Nevertheless, an overview of what
happens under the hood is part of a well-rounded education in data management
and will equip you to work on some of the problems of the data deluge.
Data structures and access methods are the province of the person responsible for
physically designing the database so that it responds in a timely manner to both
queries and maintenance operations. Of course, there may be installations where
application programmers have such responsibilities, and in these situations you will
need to know physical database design.
Data structures
An in-depth consideration of the internal level of database architecture provides an
understanding of the basic structures and access mechanisms underlying database
technology. As you will see, the overriding concern of the internal level is to
minimize disk access. In dealing with this level, we will speak in terms of files,
records, and fields rather than the relational database terms of tables, rows, and
columns. We do this because the discussion extends beyond the relational model to
file structures in general.
The time required to access data on a magnetic disk, the storage device for many databases, is relatively long compared to that for main memory. Disk access times are measured in milliseconds (10^-3 seconds), whereas main memory access times are measured in nanoseconds (10^-9 seconds). There is a huge difference between disk and main memory access—it takes about 10^5 times longer to retrieve data from disk. When the DBMS retrieves a record, the following sequence of events occurs:
1. The DBMS determines which record is required and passes a request to the file
manager to retrieve a particular record in a file.
2. The file manager converts this request into the address of the unit of storage
(usually called a page) containing the specified record. A page is the minimum
amount of storage accessed at one time and is typically around 1–4 kbytes. A page
will often contain several short records (e.g., 200 bytes), but a long record (e.g.,
10 kbytes) might be spread over several pages. In this example, we assume that
records are shorter than a page.
3. The disk manager determines the physical location of the page, issues the
retrieval instructions, and passes the page to the file manager.
4. The file manager extracts the requested record from the page and passes it to the
DBMS.
The disk manager
The disk manager is that part of the operating system responsible for physical input
and output (I/O). It maintains a directory of the location of each page on the disk
with all pages identified by a unique page number. The disk manager's main
functions are to retrieve pages, replace pages, and keep track of free pages.
Page retrieval requires the disk manager to convert the page number to a physical
address and issue the command to read the physical location. Since a page can
contain multiple records, when a record is updated, the disk manager must retrieve
the relevant page, update the appropriate portion containing the record, and then
replace the page without changing any of the other data on it.
The disk manager thinks of the disk as a collection of uniquely numbered pages.
Some of these pages are allocated to the storage of data, and others are unused.
When additional storage space is required, the disk manager allocates a page
address from the set of unused page addresses. When a page becomes free because a
file or some records are deleted, the disk manager moves that page’s address to the
unallocated set. Note, it does not erase the data, but simply indicates the page is
available to be overwritten. This means that it is sometimes possible to read
portions of a deleted file. If you want to ensure that an old file is truly deleted, you
need to use an erase program that writes random data to the deleted page.
Disk manager’s view of the world
The itemtype index is a file. It contains 10,000 records and two fields. There is one
record in the index for each record in item. The first field contains a value of itemtype
and the second contains a pointer, an address, to the matching record of item. Notice
that the index is in itemtype sequence. Storing the index in a particular order is another
means of reducing disk accesses, as we will see shortly. The index is quite small.
One byte is required for itemtype and four bytes for the pointer. So the total size of the
index is 50 kbytes, which in this case is 50 pages of disk space.
Now consider finding all records with an item type of E. One approach is to read the
entire index into memory, search it for type E items, and then use the pointers to
retrieve the required records from item. This method requires 2,050 disk accesses—
50 to load the index and 2,000 accesses of item, as there are 2,000 items of type E.
Creating an index for item results in substantial savings in disk accesses for this
example. Here, we assume that 20 percent of the records in item contained itemtype =
'E'. The number of disk accesses saved varies with the proportion of records meeting
the query’s criteria. If there are no records meeting the criteria, 9,950 disk accesses
are avoided. At the other extreme, when all records meet the criteria, it takes 50
extra disk accesses to load the index.
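The arithmetic behind these figures can be sketched in a few lines of R, assuming, as above, a 10,000-page item file, a 50-page index, and one record per page.
# disk accesses with and without the index for three selectivities
file_pages <- 10000
index_pages <- 50
matching_records <- c(0, 2000, 10000) # no matches, 20 percent, all records
with_index <- index_pages + matching_records
data.frame(matching_records, with_index, without_index = file_pages,
  saving = file_pages - with_index)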
The SQL for creating the index is
CREATE INDEX itemtypeindx ON item (itemtype);
The entire index need not be read into memory. As you will see when we discuss
tree structures, we can take advantage of an index’s ordering to reduce disk accesses
further. Nevertheless, the advantage of an index is clear: it speeds up
retrieval by reducing disk accesses. Like many aspects of database management,
however, indexes have a drawback. Adding a record to a file without an index
requires a single disk write. Adding a record to an indexed file requires at least two,
and maybe more, disk writes, because an entry has to be added to both the file and
its index. The trade-off is between faster retrievals and slower updates. If there are
many retrievals and few updates, then opt for an index, especially if the indexed
field can have a wide variety of values. If the file is very volatile and updates are
frequent and retrievals few, then an index may cost more disk accesses than it saves.
Indexes can be used for both sequential and direct access. Sequential access means
that records are retrieved in the sequence defined by the values in the index. In our
example, this means retrieving records in itemtype sequence with a range query such
as
• Find all items with a type code in the range E through K.
Direct access means records are retrieved according to one or more specified
values. A sample query requiring direct access would be
• Find all items with a type code of E or N.
Indexes are also handy for existence testing. Remember, the EXISTS clause of SQL
returns true or false and not a value. An index can be searched to check whether the
indexed field takes a particular value, but there is no need to access the file because
no data are returned. The following query can be answered by an index search:
• Are there any items with a code of R?
Multiple indexes
Multiple indexes can be created for a file. The item file could have an index defined
on itemcolor or any other field. Multiple indexes may be used independently, as in this
query:
• List red items.
or jointly, with a query such as
• Find red items of type C.
The preceding query can be solved by using the indexes for itemtype and itemcolor.
Indexes for fields itemtype and itemcolor
Examination of the itemtype index indicates that items of type C are stored at addresses
d6, d7, d8, and d9. The only red item recorded in the itemcolor index is stored at
address d6, and since it is the only record satisfying the query, it is the only record
that needs to be retrieved.
Multiple indexes, as you would expect, involve a trade-off. Whenever a record is
added or updated, each index must also be updated. Savings in disk accesses for
retrieval are exchanged for additional disk accesses in maintenance. Again, you
must consider the balance between retrieval and maintenance operations.
Indexes are not restricted to a single field. It is possible to specify an index that is a
combination of several fields. For instance, if item type and color queries were very
common, then an index based on the concatenation of both fields could be created.
As a result, such queries could be answered with a search of a single index rather
than scanning two indexes as in the preceding example. The SQL for creating the
combined index is
CREATE INDEX typecolorindx ON item (itemtype, itemcolor);
Sparse indexes
Indexes are used to reduce disk accesses to accelerate retrieval. The simple model
of an index introduced earlier suggests that the index contains an entry for each
record of the file. If we can shrink the index to eliminate an entry for each record,
we can save more disk accesses. Indeed, if an index is small enough, it, or key parts
of it, can be retained continuously in primary memory.
There is a physical sequence to the records in a file. Records within a page are in a
physical sequence, and pages on a disk are in a physical sequence. A file can also
have a logical sequence, the ordering of the file on some field within a record. For
instance, the item file could be ordered on itemno. Making the physical and logical
sequences correspond is a way to save disk accesses. Remember that the item file was
assumed to have a record size of 1,024 bytes, the same size as a page, and one
record was stored per page. If we now assume the record size is 512 bytes, then two
records are stored per page. Furthermore, suppose that item is physically stored in
itemno sequence. The index can be compressed by storing itemno for the second record
on each page and that page’s address.
A sparse index for item
Consider the process of finding the record with itemno = 7. First, the index is scanned
to find the first value for itemno that is greater than or equal to 7, the entry for itemno =
8 . Second, the page on which this record is stored (page p + 3) is loaded into
memory. Third, the required record is extracted from the page.
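A minimal sketch of this lookup in R follows; the page addresses are assumptions, and each index entry holds the highest itemno stored on its page.
# sparse index: highest itemno on each page and that page's address (assumed values)
sparse_index <- data.frame(itemno = c(2, 4, 6, 8, 10),
  page = c("p", "p+1", "p+2", "p+3", "p+4"), stringsAsFactors = FALSE)
target <- 7
# the first index entry with itemno >= target identifies the page to load
entry <- which(sparse_index$itemno >= target)[1]
sparse_index$page[entry] # "p+3", the page holding records 7 and 8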
Indexes that take advantage of the physical sequencing of a file are known as sparse
or non-dense because they do not contain an entry for every value of the indexed
field. (A dense index is one that contains an entry for every value of the indexed
field.)
As you would expect, a sparse index has pros and cons. One major advantage is that
it takes less storage space and so requires fewer disk accesses for reading. One
disadvantage is that it can no longer be used for existence tests because it does not
contain a value for every record in the file.
A file can have only one sparse index because it can have only one physical
sequence. The field on which a sparse index is based is often called the primary key.
Other indexes, which must be dense, are called secondary indexes.
In SQL, a sparse index is created using the CLUSTER option. For example, to define a
sparse index on item the command is
CREATE INDEX itemnoindx ON item (itemno) CLUSTER;
B-trees
The B-tree is a particular form of index structure that is frequently the main storage
structure for relational systems. It is also the basis for IBM’s VSAM (Virtual
Storage Access Method), the file structure underlying DB2, IBM’s relational
database. A B-tree is an efficient structure for both sequential and direct accessing of
a file. It consists of two parts: the sequence set and the index set.
The sequence set is a single-level index to the file with pointers to the records (the
vertical arrows in the lower part of the following figure). It can be sparse or dense,
but is normally dense. Entries in the sequence set are grouped into pages, and these
pages are linked together (the horizontal arrows) so that the logical ordering of the
sequence set is the physical ordering of the file. Thus, the file can be processed
sequentially by processing the records pointed to by the first page (records with
identifiers 1, 4, and 5), the records pointed to by the next logical page (6, 19, and
20), and so on.
Structure of a simple B-tree
The index set is a tree-structured index to the sequence set. The top of the index set
is a single node called the root. In this example, it contains two values (29 and 57)
and three pointers. Records with an identifier less than or equal to 29 are found in
the left branch, records with an identifier greater than 29 and less than or equal to 57
are found in the middle branch, and records with an identifier greater than 57 are
found in the right branch. The three pointers are the page numbers of the left,
middle, and right branches. The nodes at the next level have similar interpretations.
The pointer to a particular record is found by moving down the tree until a node
entry points to a value in the sequence set; this value can be used to retrieve the
record. Thus, the index set provides direct access to the sequence set and then the
data.
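As a small illustration of how a node directs the search, the following sketch picks a branch within a single node; the node values match the example root (29 and 57), and the search key is an assumption.
node_values <- c(29, 57)
search_key <- 44
# keys <= 29 follow the left pointer, keys in (29, 57] the middle, keys > 57 the right
branch <- sum(node_values < search_key) + 1
branch # 2, so the search follows the middle pointer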
The index set is a B-tree. The combination of index set and sequence set is generally
known as a B+ tree (B-plus tree). The example simplifies the B-tree concept in two ways.
First, the number of data values and pointers for any given node is not restricted to
2 and 3, respectively. In its general form, a B-tree of order n can have at least n and
no more than 2n data values. If it has k values, the B-tree will have k + 1 pointers (in
the example tree, nodes have two data values, k = 2, and there are three, k + 1 = 3,
pointers). Second, B-trees typically have free space to permit rapid insertion of data
values and possible updating of pointers when a new record is added.
As usual, there is a trade-off with B-trees. Retrieval will be fastest when each node
occupies one page and is packed with data values and pointers. This lack of free
space will slow down the addition of new records, however. Most implementations
of the B+ tree permit a specified portion of free space to be defined for both the
index and sequence set.
Hashing
Hashing reduces disk accesses by allowing direct access to a file. As you know,
direct accessing via an index requires at least two disk accesses. The index must
be loaded and searched, and then the record retrieved. For some applications, direct
accessing via an index is too slow. Hashing can reduce the number of accesses to
almost one by using the value in some field (the hash field, which is usually the
primary key) to compute a record’s address. A hash function converts the hash
field into a hash address.
Consider the case of a university that uses the nine-digit Social Security number
(SSN) as the student key. If the university has 10,000 students, it could simply use the
last four digits of the SSN as the address. In effect, the file space is broken up into
10,000 slots with one student record in each slot. For example, the data for the
student with SSN 417-03-4356 would be stored at address 4356. In this case, the hash
field is SSN, and the hash function is
hash address = remainder after dividing SSN by 10,000.
What about the student with SSN 532-67-4356? Unfortunately, the hashing function
will give the same address because most hashing schemes cannot guarantee a unique
hash address for every record. When two hash fields have the same address, they are
called synonyms, and a collision is said to have occurred.
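A quick sketch in R of the hash function just described, using the two SSNs from the example:
ssn <- c(417034356, 532674356)
hash_address <- ssn %% 10000 # remainder after dividing by 10,000
hash_address # both give 4356, so the two records are synonyms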
There are techniques for handling synonyms. Essentially, you store the colliding
record in an overflow area and point to it from the hash address. Of course, more
than two records can have the same hash address, which in turn creates a synonym
chain. The following figure shows an example of hashing with a synonym chain.
Three SSNs hash to the same address. The first record (417-03-4356) is stored at the
hash address. The second record (532-67-4356) is stored in the overflow area and is
connected by a pointer to the first record. The third record (891-55-4356) is also
stored in the overflow area and connected by a pointer from the second record.
Because each record contains the full key (SSN in this case), during retrieval the
system can determine whether it has the correct record or should follow the chain to
the next record.
An example of hashing
If there are no synonyms, hashing gives very fast direct retrieval, taking only one
disk access to retrieve a record. Even with a small percentage of synonyms,
retrieval via hashing is very fast. Access time degrades, however, if there are long
synonym chains.
There are a number of different approaches to defining hashing functions. The most
common method is to divide by a prime and use the remainder as the address.
Before adopting a particular hashing function, test several functions on a
representative sample of the hash field. Compute the percentage of synonyms and
the length of synonym chains for each potential hashing function and compare the
results.
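A rough sketch of such a test, with made-up sample keys and two illustrative divisors (10,000 and the nearby prime 9,973):
set.seed(1)
keys <- floor(runif(5000, 1e8, 1e9)) # 5,000 sample nine-digit keys
for (divisor in c(10000, 9973)) {
  addresses <- keys %% divisor
  pct_synonyms <- 100 * sum(duplicated(addresses)) / length(keys)
  cat("divisor", divisor, "gives", round(pct_synonyms, 1), "percent synonyms\n")
}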
Of course, hashing has trade-offs. There can be only one hashing field. In contrast, a
file can have many indexed fields. A hashed file can no longer be processed sequentially in any meaningful order, because its physical sequence corresponds neither to the primary key nor to any other field.
Linked lists
A linked list is a useful data structure for interfile clustering. Suppose that the query,
Find all stocks of country X, is a frequent request. Disk accesses can be reduced by
storing a nation and its corresponding stocks together in a linked list.
A linked list
In this example, we have two files: nation and stock. Records in nation are connected by
pointers (e.g., the horizontal arrow between Australia and USA). The pointers are
used to maintain the nation file in logical sequence (by nation, in this case). Each
record in nation is linked to its stock records by a forward-pointing chain. The nation
record for Australia points to the first stock record (Indooroopilly Ruby), which
points to the next (Narembeen Plum), which points to the final record in the chain
(Queensland Diamond). Notice that the last record in this stock chain points to the
nation record to which the chain belongs (i.e., Australia). Similarly, there is a chain
for the two USA stocks. In any chain, the records are maintained in logical sequence
(by firm name, in this case).
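The chain traversal can be sketched in R with a named list standing in for pointers; the record names come from the example, while the pointer representation is an assumption.
nation <- list(Australia = list(first_stock = "Indooroopilly Ruby"))
stock <- list(
  "Indooroopilly Ruby" = list(next_rec = "Narembeen Plum"),
  "Narembeen Plum" = list(next_rec = "Queensland Diamond"),
  "Queensland Diamond" = list(next_rec = "Australia")) # last pointer returns to the parent
# follow the chain from the parent until it points back to the parent
current <- nation$Australia$first_stock
while (current != "Australia") {
  print(current)
  current <- stock[[current]]$next_rec
}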
This linked-list structure is also known as a parent/child structure. (The parent in
this case is nation and the child is stock.) Although it is a suitable structure for
representing a one-to-many (1:m) relationship, it is possible to depict more than one
parent/child relationship. For example, stocks could be grouped into classes (e.g.,
high, medium, and low risk). A second chain linking all stocks of the same risk
class can run through the file. Interfile clustering of a nation and its corresponding
stocks will speed up access; so, the record for the parent Australia and its three
children should be stored on one page. Similarly, all American stocks could be
clustered with the USA record and so on for all other nations.
Of course, you expect some trade-offs. What happens with a query such as Find the
country in which Minnesota Gold is listed? There is no quick way to find Minnesota
Gold except by sequentially searching the stock file until the record is found and then
following the chain to the parent nation record. This could take many disk accesses.
One way to circumvent this is to build an index or hashing scheme for stock so that
any record can be found directly. Building an index for the parent, nation, will speed
up access as well.
Linked lists come in a variety of flavors:
• Lists that have two-way pointers, both forward and backward, speed up
deletion.
• For some lists, every child record has a parent pointer. This helps prevent
chain traversal when the query is concerned with finding the parent of a
particular child.
Bitmap index
A bitmap index uses a single bit, rather than multiple bytes, to indicate the specific
value of a field. For example, instead of using three bytes to represent red as an
item’s color, the color red is represented by a single bit. The relative position of the
bit within a string of bits is then mapped to a record address.
Conceptually, you can think of a bitmap as a matrix. The following figure shows a
bitmap containing details of an item’s color and code. An item can have three
possible colors; so, three bits are required, and two bits are needed for the two
codes for the item. Thus, you can see that, in general, n bits are required if a field
can have n possible values.
A bitmap index
            color              code
itemcode    Red  Green  Blue   A  N    Disk address
1001        0    0      1      0  1    d1
1002        1    0      0      1  0    d2
1003        1    0      0      1  0    d3
1004        0    1      0      1  0    d4
When a field can take a large number of values (i.e., n is large), the bitmap for that field
will be very sparse, containing a large number of zeros. Thus, a bitmap is typically
useful when the value of n is relatively small. When is n small or large? There is no
simple answer; rather, database designers have to simulate alternative designs and
evaluate the trade-offs.
In some situations, the bit string for a field can have multiple bits set to on (set to 1).
A location field, for instance, may have two bits on to represent an item that can be
found in Atlanta and New York.
The advantage of a bitmap is that it usually requires little storage. For example, we
can recast the bitmap index as a conventional index. The core of the bitmap in the
previous figure (i.e., the cells storing data about the color and code) requires 5 bits
for each record, but the core of the traditional index requires 9 bytes, or 72 bits, for
each record.
An index
itemcode   color (char(8))   code (char(1))   Disk address
1001       Blue              N                d1
1002       Red               A                d2
1003       Red               A                d3
1004       Green             A                d4
Bitmaps can accelerate query time for complex queries.
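For instance, a query such as "Find red items with code A" reduces to a bitwise AND of two bit strings. A minimal sketch, with the bit columns of the bitmap above stored as logical vectors:
itemcode <- c(1001, 1002, 1003, 1004)
red   <- c(FALSE, TRUE, TRUE, FALSE)  # the Red column of the bitmap
codeA <- c(FALSE, TRUE, TRUE, TRUE)   # the A column of the bitmap
itemcode[red & codeA] # 1002 and 1003 satisfy both conditions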
Join index
RDBMS queries frequently require two or more tables to be joined. Indeed,
some people refer to the RDBMS as a join machine. A join index can be used to
improve the execution speed of joins by creating indexes based on the matching
columns of tables that are highly likely to be joined. For example, natcode is the
common column used to join nation and stock, and each of these tables can be indexed
on natcode.
Indexes on natcode for nation and stock
nation index                 stock index
natcode   Disk address       natcode   Disk address
UK        d1                 UK        d101
USA       d2                 UK        d102
                             UK        d103
                             USA       d104
                             USA       d105
A join index is a list of the disk addresses of the rows for matching columns. All
indexes, including the join index, must be updated whenever a row is inserted into
or deleted from the nation or stock tables. When a join of the two tables on the
matching column is made, the join index is used to retrieve only those records that
will be joined. This example demonstrates the advantage of a join index. If you think
of a join as a product with a WHERE clause, then without a join index, 10 (2*5)
rows have to be retrieved, but with the join index only 5 rows are retrieved. Join
indexes can also be created for joins involving several tables. As usual, there is a
trade-off. Joins will be faster, but insertions and deletions will be slower because of
the need to update the indexes.
Join index
nation disk address    stock disk address
d1                     d101
d1                     d102
d1                     d103
d2                     d104
d2                     d105
Data coding standards
The successful exchange of data between two computers requires agreement on how
information is coded. The American Standard Code for Information Interchange
(ASCII) and Unicode are two widely used data coding standards.
ASCII
ASCII is the most common format for digital text files. In an ASCII file, each
alphabetic, numeric, or special character is represented by a 7-bit code. Thus, 128 (2^7) possible characters are defined. Because most computers process data in eight-bit bytes or multiples thereof, an ASCII code usually occupies one byte.
Unicode
Unicode, officially the Unicode Worldwide Character Standard, is a system for the
interchange, processing, and display of the written texts of the diverse languages of
the modern world. Unicode provides a unique binary code for every character, no
matter what the platform, program, or language. The Unicode standard currently
contains 34,168 distinct coded characters derived from 24 supported language
scripts. These characters cover the principal written languages of the world.
Unicode provides for two encoding forms: a default 16-bit form, and a byte (8-bit)
form called UTF-8 that has been designed for ease of use with existing ASCII-based
systems. As the default encoding of HTML and XML, Unicode is required for
Internet protocols. It is implemented in all modern operating systems and computer
languages such as Java. Unicode is the basis of software that must function globally.
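A small sketch of these codes in R; the sample characters are illustrative and assume a UTF-8 session.
utf8ToInt("A") # 65, the 7-bit ASCII code for A, stored in one byte
nchar("data", type = "bytes") # 4 bytes: ASCII characters need one byte each in UTF-8
nchar("\u03A9", type = "bytes") # 2 bytes: the Greek omega lies outside the ASCII range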
Data storage devices
Many corporations double the amount of data they need to store every two to three
years. Thus, the selection of data storage devices is a key consideration for data
managers. When evaluating data storage options, data managers need to consider
possible uses, which include:
1. Online data
2. Backup files
3. Archival storage
Many systems require data to be online—continually available. Here the prime
concerns are usually access speed and capacity, because many firms require rapid
response to large volumes of data. Backup files are required to provide security
against data loss. Ideally, backup storage is high volume capacity at low cost.
Archived data may need to be stored for many years; so the archival medium should
be highly reliable, with no data decay over extended periods, and low-cost.
In deciding what data will be stored where, database designers need to consider a
number of variables:
1. Volume of data
2. Volatility of data
3. Required speed of access to data
4. Cost of data storage
5. Reliability of the data storage medium
6. Legal standing of stored data
The design options are discussed and considered in terms of the variables just
identified. First, we review some of the options.
Magnetic technology
Over USD 20 billion is spent annually on magnetic storage devices.
When a file is written to a striping array of three drives, for instance, one half of the
file is written to the first drive and the second half to the second drive. The third
drive is used for error correction. A parity bit is constructed for each
corresponding pair of bits written to drives one and two, and this parity bit is written
to the third drive. A parity bit is a bit added to a chunk of data (e.g., a byte) to ensure
that the number of bits in the chunk with a value of one is even or odd. Parity bits
can be checked when a file is read to detect data errors, and in many cases the data
errors can be corrected. Parity bits demonstrate how redundancy supports recovery.
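A minimal sketch of even parity and recovery, using bitwise XOR on a few made-up bytes; in a real array, the controller performs this for every corresponding pair of bits.
drive1 <- c(77L, 97L, 110L)        # bytes written to drive one (sample data)
drive2 <- c(97L, 103L, 101L)       # bytes written to drive two (sample data)
parity <- bitwXor(drive1, drive2)  # parity bytes written to the third drive
rebuilt <- bitwXor(drive1, parity) # rebuild drive two after a failure
identical(rebuilt, drive2)         # TRUE: the lost data are recovered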
When a file is read by a striping array, portions are retrieved from each drive and
assembled in the correct sequence by the controller. If a read error occurs, the lost
bits can be reconstructed by using the parity data on the third drive. If a drive fails, it
can be replaced and the missing data restored on the new drive. Striping requires at
least three drives. Normally, data are written to every drive but one, and that
remaining drive is used for the parity bit. Striping gives added data security without
requiring considerably more storage, but it does not have the same response time
increase as mirroring.
Striping
RAID subsystems are divided into seven levels, labeled 0 through 6. All RAID
levels, except level 0, have common features:
• There is a set of physical disk drives viewed by the operating system as a
single, logical drive.
• Data are distributed across corresponding physical drives.
• Parity bits are used to recover data in the event of a disk failure.
Level 0 has been in use for many years. Data are broken into blocks that are
interleaved or striped across disks. By spreading data over multiple drives, read and
write operations can occur in parallel. As a result, I/O rates are higher, which makes
level 0 ideal for I/O intensive applications such as recording video. There is no
additional parity information and thus no data recovery when a drive failure occurs.
Level 1 implements mirroring, as described previously. This is possibly the most
popular form of RAID because of its effectiveness for critical nonstop applications,
although high levels of data availability and I/O rates are counteracted by higher
storage costs. Another disadvantage is that every write command must be executed
twice (assuming a two-drive array), and thus level 1 is inappropriate for
applications that have a high ratio of writes to reads.
Level 2 implements striping by interleaving blocks of data on each disk and
maintaining parity on the check disk. This is a poor choice when an application has
frequent, short random disk accesses, because every disk in the array is accessed for
each read operation. However, RAID 2 provides excellent data transfer rates for
large sequential data requests. It is therefore suitable for computer-aided drafting
and computer-aided manufacturing (CAD/CAM) and multimedia applications,
which typically use large sequential files. RAID 2 is rarely used, however, because
the same effect can be achieved with level 3 at a lower cost.
Level 3 utilizes striping at the bit or byte level, so only one I/O operation can be
executed at a time. Compared to level 1, RAID level 3 gives lower-cost data storage
at lower I/O rates and tends to be most useful for the storage of large amounts of
data, common with CAD/CAM and imaging applications.
Level 4 uses sector-level striping; thus, only a single disk needs to be accessed for a
read request. Write requests are slower, however, because there is only one parity
drive.
Level 5, a variation of striping, reads and writes data to separate disks independently
and permits simultaneous reading and writing of data. Data and parity are written on
the same drive. Spreading parity data evenly across several drives avoids the
bottleneck that can occur when there is only one parity drive. RAID 5 is well
designed for the high I/O rates required by transaction processing systems and
servers, particularly e-mail servers. It is the most balanced implementation of the
RAID concept in terms of price, reliability, and performance. It requires less
capacity than mirroring with level 1 and higher I/O rates than striping with level 3,
although performance can decrease with update-intensive operations. RAID 5 is
frequently found in local-area network (LAN) environments.
RAID level 5
RAID 6 takes fault tolerance to another level by enabling a RAID system to still
function when two drives fail. Using a method called double parity, it distributes
parity information across multiple drives so that, on the fly, a disk array can be rebuilt even if another drive fails before the rebuild is complete.
RAID does have drawbacks. It often lowers a system’s performance in return for
greater reliability and increased protection against data loss. Extra disk space is
required to store parity information. Extra disk accesses are required to access and
update parity information. You should remember that RAID is not a replacement for
standard backup procedures; it is a technology for increasing the fault tolerance of
critical online systems. RAID systems offer terabytes of storage.
Removable magnetic disk
Removable drives can be plugged into a system as required. The disk’s removability
is its primary advantage, making it ideal for backup and transport. For example, you
might plug in a removable drive to the USB port of your laptop once a day to
back up its hard drive. Removable disk is also useful when applications need not be
continuously online. For instance, the monthly payroll system can be stored on a
removable drive and mounted as required. Removable drives cost about USD 100
per terabyte.
Magnetic tape
A magnetic tape is a thin ribbon of plastic coated with ferric oxide. The once
commonly used nine-track 2,400-foot (730 m) tape has a capacity of about 160
Mbytes and a data transfer rate of 2 Mbytes per second. The designation nine-track
means nine bits are stored across the tape (8 bits plus one parity bit). Magnetic tape
was used extensively for archiving and backup in early database systems; however,
its limited capacity and sequential nature have resulted in its replacement by other
media, such as magnetic tape cartridge.
Magnetic tape cartridges
Tape cartridges, with a capacity measured in Gbytes and transfer rates of up to 6
Mbytes per second, have replaced magnetic tape. They are the most cost effective
(from a purchase price $/GB perspective) magnetic recording device. Furthermore,
a tape cartridge does not consume any power when unmounted and stored in a
library. Thus, the operating costs for a tape storage archival system are much lower
than an equivalent hard disk storage system, though hard disks provide faster access
to archived data.
Mass storage
There exists a variety of mass storage devices that automate labor-intensive tape and
cartridge handling. The storage medium, with a capacity of terabytes, is typically
located and mounted by a robotic arm. Mass storage devices can currently handle
hundreds of petabytes of data. They have slow access time because of the
mechanical loading of tape storage, and it might take several minutes to load a file.
Solid-state memory
A solid-state disk (SSD) connects to a computer in the same way as a regular magnetic disk drive does, but it stores data on arrays of memory chips. SSDs consume less power (longer battery life), come in smaller sizes (smaller and lighter devices), and have faster access times. However, they
cost around USD 1.50 per Gbyte, compared to hard disks which are around USD
0.10 per Gbyte. SSD is rapidly becoming the standard for laptops.
A flash drive, also known as a keydrive or jump drive, is a small, removable
storage device with a Universal Serial Bus (USB) connector. It contains solid-state
memory and is useful for transporting small amounts of data. Capacity is in the
range 1 to 128 Gbytes. The price for low-capacity drives is about USD 1–2 per Gbyte.
While price is a factor, the current major drawback holding back the adoption of
SSD is the lack of manufacturing capacity for the NAND chips used in SSDs.
Data compression
Data compression is a method for encoding digital data so that they require less
storage space and thus less communication bandwidth. There are two basic types of
compression: lossless methods, in which no data are lost when the files are restored
to their original format, and lossy methods, in which some data are lost when the
files are decompressed.
Lossless compression
During lossless compression, the data-compression software searches for redundant
or repetitive data and encodes it. For example, a string of 100 asterisks (*) can be
stored more compactly by a compression system that records the repeated character
(i.e., *) and the length of the repetition (i.e., 100). The same principle can be applied
to a photo that contains a string of horizontal pixels of the same color. Clearly, you
want to use lossless compression with text files (e.g., a business report or
spreadsheet).
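Run-length encoding of this kind is built into R as rle(); a minimal sketch using the string of asterisks from the example:
x <- rep("*", 100)                 # one hundred repeated characters
encoded <- rle(x)                  # record the character and the run length
encoded$lengths                    # 100
encoded$values                     # "*"
identical(inverse.rle(encoded), x) # TRUE: lossless, nothing is lost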
Lossy compression
Lossy compression is used for graphics, video, and audio files because humans are
often unable to detect minor data losses in these formats. Audio files can often be
compressed to 10 percent of their original size (e.g., an MP3 version of a CD
recording).
Skill builder
An IS department in a major public university records the lectures for 10 of its
classes for video streaming to its partner universities in India and China. Each
twice-weekly lecture runs for 1.25 hours and a semester is 15 weeks long. A video
streaming expert estimates that one minute of digital video requires 6.5 Mbytes
using MPEG-4 and Apple’s QuickTime software. What is MPEG-4? Calculate how
much storage space will be required and recommend a storage device for the
department.
Comparative analysis
Details of the various storage devices are summarized in the following table. A
simple three-star rating system has been used for each device—the more stars, the
better. In regard to access speed, RAID gets more stars than DVD-RAM because it
retrieves a stored record more quickly. Similarly, optical rates three stars because it
costs less per megabyte to store data on a magneto-optical disk than on a fixed disk.
The scoring system is relative. The fact that removable disk gets two stars for
reliability does not mean it is an unreliable storage medium; it simply means that it
is not as reliable as some other media.
Relative merits of data storage devices
Device           Access speed   Volume   Volatility   Cost per megabyte   Reliability   Legal standing
Solid state      ***            *        ***          *                   ***           *
Fixed disk       ***            ***      ***          ***                 **            *
RAID             ***            ***      ***          ***                 ***           *
Removable disk   **             **       ***          **                  **            *
Flash memory     **             *        ***          *                   ***           *
Tape             *              **       *            ***                 **            *
Cartridge        **             ***      *            ***                 **            *
Mass Storage     **             ***      *            ***                 **            *
SAN              ***            ***      ***          ***                 ***           *
Optical-ROM      *              ***      *            ***                 ***           ***
Optical-R        *              ***      *            ***                 ***           **
Optical-RW       *              ***      **           ***                 ***           *
Legend
Characteristic More stars mean …
Access speed Faster access to data
Volume Device more suitable for large files
Volatility Device more suitable for files that change frequently
Cost per megabyte Less costly form of storage
Reliability Device less susceptible to an unrecoverable read error
Legal standing Media more acceptable as evidence in court
Conclusion
The internal and physical aspects of database design are a key determinant of system
performance. Selection of appropriate data structures can substantially curtail disk
access and reduce response times. In making data structure decisions, the database
administrator needs to weigh the pros and cons of each choice. Similarly, in
selecting data storage devices, the designer needs to be aware of the trade-offs.
Various devices and media have both strengths and weaknesses, and these need to be
considered.
Summary
The data deluge is increasing the importance of data management for organizations.
It takes considerably longer to retrieve data from a hard disk than from main
memory. Appropriate selection of data structures and data access methods can
considerably reduce delays by reducing disk accesses. The key characteristics of
disk storage devices that affect database access are rotational speed and access arm
speed. Access arm movement can be minimized by storing frequently used data on
the same track on a single surface or on the same track on different surfaces.
Records that are frequently used together should be clustered together. Intrafile
clustering applies to the records within a single file. Interfile clustering applies to
multiple files. The disk manager, the part of the operating system responsible for
physical I/O, maintains a directory of pages. The file manager, a level above the
disk manager, contains a directory of files.
Indexes are used to speed up retrieval by reducing disk accesses. An index is a file
containing the value of the index field and the address of its full record. The use of
indexes involves a trade-off between faster retrievals and slower updates. Indexes
can be used for both sequential and direct access. A file can have multiple indexes. A
sparse index does not contain an entry for every value of the indexed field. The B-
tree, a particular form of index structure, consists of two parts: the sequence set and
the index set. Hashing is a technique for reducing disk accesses that allows direct
access to a file. There can be only one hashing field. A hashed file can no longer be
processed sequentially because its physical sequence has lost any logical meaning. A
linked list is a useful data structure for interfile clustering. It is a suitable structure
for representing a 1:m relationship. Pointers between records are used to maintain a
logical sequence. Lists can have forward, backward, and parent pointers.
Systems designers have to decide what data storage devices will be used for online
data, backup files, and archival storage. In making this decision, they must consider
the volume of data, volatility of data, required speed of access to data, cost of data
storage, reliability of the data storage medium, and the legal standing of the stored
data. Magnetic technology, the backbone of data storage for six decades, is based on
magnetization and demagnetization of spots on a magnetic recording surface. Fixed
disk, removable disk, magnetic tape, tape cartridge, and mass storage are examples
of magnetic technology. RAID uses several cheaper drives whose total cost is less
than one high-capacity drive. RAID uses a combination of mirroring or striping to
provide greater data protection. RAID subsystems are divided into seven levels, labeled
0 through 6. A storage-area network (SAN) is a high-speed network for connecting
and sharing different kinds of storage devices, such as tape libraries and disk arrays.
Optical technology offers high storage densities, low-cost medium, and direct
access. CD, DVD, and Blu-ray are examples of optical technology. Optical disks can
reliably store records for at least 10 years and possibly up to 30 years. Optical
technology is not susceptible to head crashes.
Data compression techniques reduce the need for storage capacity and bandwidth.
Lossless methods result in no data loss, whereas with lossy techniques, some data
are lost during compression.
Key terms and concepts
Access time
Archival file
ASCII
B-tree
Backup file
Bitmap index
Blu-ray disc (BD)
Compact disc (CD)
Clustering
Conceptual schema
Cylinder
Data compression
Data deluge
Data storage device
Database architecture
Digital versatile disc (DVD)
Disk manager
External schema
File manager
Hash address
Hash field
Hash function
Hashing
Index
Index set
Interfile clustering
Internal schema
Intrafile clustering
Join index
Linked list
Lossless compression
Lossy compression
Magnetic disk
Magnetic tape
Mass storage
Mirroring
Page
Parity
Pointer
Redundant arrays of inexpensive or independent drives (RAID)
Sequence set
Solid-state disk (SSD)
Sparse index
Storage-area network (SAN)
Striping
Track
Unicode
VSAM
Exercises
1. Why is a disk drive considered a bottleneck?
2. What is the difference between a record and a page?
3. Describe the two types of delay that can occur prior to reading a record from a
disk. What can be done to reduce these delays?
4. What is clustering? What is the difference between intrafile and interfile
clustering?
5. Describe the differences between a file manager and a disk manager.
6. What is an index?
7. What are the advantages and disadvantages of indexing?
8. Write the SQL to create an index on the column natcode in the nation table.
9. A Paris insurance firm keeps paper records of all policies and claims made on it.
The firm now has a vault containing 100 filing cabinets full of forms. Because
Paris rental costs are so high, the CEO has asked you to recommend a more
compact medium for long-term storage of these documents. Because some
insurance claims are contested, she is very concerned with ensuring that
documents, once stored, cannot be altered. What would you recommend and why?
10. A video producer has asked for your advice on a data storage device. She has
specified that she must be able to record video at 5 to 7 Mbytes per second. What
would you recommend and why?
11. The national weather research center of a large South American country has
asked you to recommend a data storage strategy for its historical weather data.
The center electronically collects hourly weather information from 2,000 sites
around the country. This database must be maintained indefinitely. Make a rough
estimate of the storage required for one year of data and then recommend a
storage device.
12. A German consumer research company collects scanning data from
supermarkets throughout central Europe. The scanned data include product code
identifier, price, quantity purchased, time, date, supermarket location, and
supermarket name, and in some cases where the supermarket has a frequent buyer
plan, it collects a consumer identification code. It has also created a table
containing details of the manufacturer of each product. The database contains
over 500 Tbytes of data. The data are used by market researchers in consumer
product companies. A researcher will typically request access to a slice of the
database (e.g., sales of all detergents) and analyze these data for trends and
patterns. The consumer research company promises rapid access to its data. Its
goal is to give clients access to requested data within one to two minutes. Once
clients have access to the data, they expect very rapid response to queries. What
data storage and retrieval strategy would you recommend?
13. A magazine subscription service has a Web site for customers to place orders,
inquire about existing orders, or check subscription rates. All customers are
uniquely identified by an 11-digit numeric code. All magazines are identified by a
2- to 4-character code. The company has approximately 10 million customers
who subscribe to an average of four magazines. Subscriptions are available to
126 magazines. Draw a data model for this situation. Decide what data structures
you would recommend for the storage of the data. The management of the
company prides itself on its customer service and strives to answer customer
queries as rapidly as possible.
14. A firm offers a satellite-based digital radio service to the continental U.S.
market. It broadcasts music, sports, and talk-radio programs from a library of 1.5
million digital audio files, which are sent to a satellite uplink and then beamed to
car radios equipped to accept the service. Consumers pay $9.95 per month to
access 100 channels.
a. Assuming the average size of a digital audio file is 5 Mbytes (~4 minutes of
music), how much storage space is required?
b. What storage technology would you recommend?
15. Is MP3 a lossless or lossy compression standard?
16. What is the data deluge? What are the implications for data management?
19. Data Processing Architectures
The difficulty in life is the choice.
George Moore, The Bending of the Bough, 1900
Learning Objectives
Students completing this chapter will be able to
✤ recommend a data architecture for a given situation;
✤ understand multi-tier client/server architecture;
✤ discuss the fundamental principles that a hybrid architecture should satisfy;
✤ demonstrate the general principles of distributed database design.
On a recent transatlantic flight, Sophie, the Personnel and PR manager, had been
After an extremely pleasant week of wining, dining, and seeing the town—amid, of
course, many hours of valuable work for The Expeditioner—Sophie returned to
London. Now, she must ask Ned what all these weird terms meant. After all, it was
difficult to carry on a conversation when she did not understand the language. That’s
the problem with IS, she thought, new words and technologies were always being
invented. It makes it very hard for the rest of us to make sense of what’s being said.
Architectural choices
In general terms, data can be stored and processed locally or remotely.
Combinations of these two options provide the basic information systems
architectures.
Basic architectures
Remote job entry
In remote job entry, data are stored locally and processed remotely. Data are sent
over a network to a remote computer for processing, and the output is returned the
same way. This once fairly common form of data processing is still used today
because remote job entry can overcome speed or memory shortcomings of a
personal computer. Scientists and engineers typically need occasional access to a
supercomputer for applications, such as simulating global climate change, that are data intensive or computationally intensive. Supercomputers typically have the main memory and processing power to handle such problems.
Local storage is used for several reasons. First, it may be cheaper than storing on a
supercomputer. Second, the analyst might be able to do some local processing and
preparation of the data before sending them to the supercomputer. Supercomputing
time can be expensive. Where possible, local processing is used for tasks such as
data entry, validation, and reporting. Third, local data storage might be deemed
more secure for particularly sensitive data.
Personal database
People can store and process their data locally when they have their own computers.
Many personal computer database systems (e.g., MS Access and FileMaker) permit
people to develop their own applications, and many common desktop applications
require local database facilities (e.g., a calendar).
Of course, there is a downside to personal databases. First, there is a great danger
of repetition and redundancy. The same application might be developed in a slightly
different way by many users. The same data get stored on many different systems. (It
is not always the same, however, because data entry errors or maintenance
inconsistencies result in discrepancies between personal databases.) Second, data are
not readily shared because various users are not aware of what is available or find it
inconvenient to share data. Personal databases are exactly that: personal. Yet much of the data
may be of corporate value and should be shared. Third, data integrity procedures are
often quite lax for personal databases. People might not make regular backups,
databases are often not secured from unauthorized access, and data validation
procedures are often ignored. Fourth, often when the employee leaves the
organization or moves to another role, the application and data are lost because they
are not documented and the organization is unaware of their existence. Fifth, most
personal databases are poorly designed and, in many cases, people use a spreadsheet
as a poor substitute for a database. Personal databases are clearly very important for
many organizations—when used appropriately. Data that are shared require a
different processing architecture.
Client/server
The client/server architecture, in which multiple computers interact in a superior
and subordinate mode, is the dominant architecture these days. A client process
initiates a request and a server responds. The client is the dominant partner because
it initiates the request to which the server responds. Client and server processes can
run on the same computer, but generally they run on separate, linked computers. In
the three-tier client/server model, the application and database are on separate
servers.
A generic client/server architecture consists of several key components. It usually
includes a mix of operating systems, data communications managers (usually
abbreviated to DC manager), applications, clients (e.g., browser), and database
management systems. The DC manager handles the transfer of data between clients
and servers.
Cloud computing relies on large server farms, which are located where energy costs are low and minimal cooling is required. Iceland
and parts of Canada are good locations for server farms because of inexpensive
hydroelectric power and a cool climate. Information can be moved to and fro on a
fiber optic network to be processed at a cloud site and then presented on the
customer's computer.
Cloud computing can be more than a way to save some data center management
costs. It has several features, or capabilities, that have strategic implications. We first review the forms of cloud ownership and then consider these capabilities.
Cloud ownership
Type Description
Public A cloud provided to the general public by its owner
Private A cloud restricted to an organization. It has many of the same capabilities as a public cloud
Community A cloud shared by several organizations to support a community project
Hybrid Multiple distinct clouds sharing a standardized or proprietary technology that enables data and application portability
Cloud capabilities
Understanding the strategic implications of cloud computing is of more interest
than the layers and types because a cloud’s seven capabilities offer opportunities to
address the five strategic risks facing all organizations.
Interface control
Some cloud vendors provide their customers with an opportunity to shape the form
of interaction with the cloud. Under open co-innovation, a cloud vendor has a very
open and community-driven approach to services. Customers and the vendor work
together to determine the roadmap for future services. Amazon follows this
approach. The qualified retail model requires developers to acquire permission and
be certified before releasing any new application on the vendor's platform. This is
the model Apple uses with its iTunes application store. Qualified co-innovation
involves the community in the creation and maintenance of application
programming interfaces (APIs). Third parties using the APIs, however, must be
certified before they can be listed in the vendor's application store. This is the model
salesforce.com uses. Finally, we have the open retail model. Application developers
have some influence on APIs and new features, and they are completely free to write
any program on top of the system. This model is favored by Microsoft.
Location independence
This capability means that data and applications can be accessed, with appropriate
permission, without needing to know the location of the resources. This capability is
particularly useful for serving customers and employees across a wide variety of
regions. There are some challenges to location independence. Some countries
restrict where data about their citizens can be stored and who can access it.
Ubiquitous access
Ubiquity means that customers and employees can access any information service
from any platform or device via a Web browser from anywhere. Some clouds are
not accessible in all parts of the globe at this stage because of the lack of
connectivity, sufficient bandwidth, or local restrictions on cloud computing
services.
Sourcing independence
One of the attractions of cloud computing is that processing power becomes a utility. As a result, a company could change cloud vendors easily and at low cost. Sourcing independence is a goal rather than a reality at this stage of cloud development.
Virtual business environments
Under cloud computing, some special needs systems can be built quickly and later
abandoned. For example, in 2009 the U.S. Government had a Cash for Clunkers
program that ran for a few months to encourage people to trade in their old car for a more fuel-efficient new car. Such a short-lived system, whose processing needs are hard to estimate, is well suited to a cloud environment. No resources need to be
acquired, and processing power can be obtained as required.
Addressability and traceability
Traceability enables a company to track customers and employees who use an
information service. Depending on the device accessing an information service, a
company might be able to precisely identify the person and the location of the
request. This is particularly the case with smartphones or tablets that have a GPS
capability.
Rapid elasticity
Organizations cannot always judge exactly their processing needs, especially for
new systems, such as the previously mentioned Cash for Clunkers program. Ideally,
an organization should be able to pull on a pool of computer processing resources
that it can scale up and down as required. It is often easier to scale up than down, as
this is in the interests of the cloud vendor.
Strategic risk
Every firm faces five strategic risks: demand, innovation, inefficiency, scaling, and
control risk.
Strategic risks
Risk Description
Demand Fluctuating demand or market collapse
Inefficiency Inability to match competitors’ unit costs
Innovation Not innovating as well as competitors
Scaling Not scaling fast and efficiently enough to meet market growth
Control Inadequate procedures for acquiring or managing resources
We now consider how the various capabilities of cloud computing relate to these risks; the relationships are summarized in a table following the discussion.
Demand risk
Most companies are concerned about loss of demand, and make extensive use of
marketing techniques (e.g., advertising) to maintain and grow demand. Cloud
computing can boost demand by enabling customers to access a service wherever
they might be. For example, Amazon combines ubiquitous access to its cloud and
the Kindle, its book reader, to permit customers to download and start reading a
book wherever they have access. Addressability and traceability enable a business to
learn about customers and their needs. By mining the data collected by cloud
applications, a firm can learn the connections between the when, where, and what of
customers’ wants. In the case of excessive demand, the elasticity of cloud resources
might help a company to serve higher than expected demand.
Inefficiency risk
If a firm’s cost structure is higher than its competitors’, its profit margin will be lower, and it will have less money to invest in new products and services. Cloud computing offers several opportunities to reduce costs. First, location independence means a firm
can use a cloud to provide many services to customers across a large geographic
region through a single electronic service facility. Second, sourcing independence
should result in a competitive market for processing power. Cloud providers
compete in a way that a corporate data center does not, and competition typically
drives down costs. Third, the ability to quickly create and test new applications with
minimal investment in resources avoids many of the costs of procurement,
installation, and the danger of obsolete hardware and software. Finally, ubiquitous
access means employees can work from a wide variety of locations. For example,
ubiquity enables some to work conveniently and productively from home or on the
road, and thus the organization needs smaller and fewer buildings.
Innovation risk
In most industries, enterprises need to continually introduce new products and
services. Even in quite stable industries with simple products (e.g., coal mining),
process innovation is often important for lowering costs and delivering high quality
products. The ability to modify a cloud’s interface could be of relevance to
innovation. Apple’s App store restrictions might push some to Android’s more open
model. As mentioned in the previous section, the capability to put up and tear down
a virtual business environment quickly favors innovation because the costs of
experimenting are low. Ubiquitous access makes it easier to engage customers and
employees in product improvement. Customers use a firm’s products every day, and
some reinvent them or use them in ways the firm never anticipated. A firm that
learns of these unintended uses might identify new markets and new features that
will extend sales. Addressability and traceability also enhance a firm’s ability to
learn about how, when, and where customers use electronic services and network
connected devices. Learning is the first step of innovation.
Scaling risk
Sometimes a new product will take off, and a company will struggle to meet
unexpectedly high demand. If it can’t satisfy the market, competitors will move in
and capture the sales that the innovator was unable to satisfy. In the case of digital
products, a firm can use the cloud’s elasticity to quickly acquire new storage and
processing resources. It might also take advantage of sourcing independence to use
multiple clouds, another form of elasticity, to meet burgeoning demand.
Control risk
The 2008 recession clearly demonstrated the risk of inadequate controls. In some
cases, an organization acquired financial resources whose value was uncertain. In
others, companies lost track of what they owned because of the complexity of
financial instruments and the multiple transactions involving them. A system’s
interface is a point of data capture, and a well-designed interface is a control
mechanism. Although it is usually not effective at identifying falsified data, it can
require that certain elements of data are entered. Addressability and traceability also
mean that the system can record who entered the data, from which device, and
when.
Cloud capabilities and strategic risks
Capability: risks addressed
Interface control: Innovation, Control
Location independence: Inefficiency
Sourcing independence: Inefficiency, Scaling
Virtual business environment: Inefficiency, Innovation
Ubiquitous access: Demand, Inefficiency, Innovation
Addressability and traceability: Demand, Innovation, Control
Rapid elasticity: Demand, Scaling
Most people think of cloud computing as an opportunity to lower costs by shifting
processing from the corporate data center to a third party. More imaginative
thinkers will see cloud computing as an opportunity to gain a competitive
advantage.
Distributed database
The client/server architecture is concerned with minimizing processing costs by
distributing processing between multiple clients and servers. Another factor in the
total processing cost equation is communication. The cost of transmitting data
usually increases with distance, and there can be substantial savings by locating a
database close to those most likely to use it. The trade-off for a distributed database
is lowered communication costs versus increased complexity. While the Internet and
fiber optics have lowered the costs of communication, for a globally distributed
organization, data transmission costs can still be an issue because many countries
still lack the bandwidth found in advanced economies.
Distributed database architecture describes the situation where a database is in more
than one location but still accessible as if it were centrally located. For example, a
multinational organization might locate its Indonesian data in Jakarta, its Japanese
data in Tokyo, and its Australian data in Sydney. If most queries deal with the local
situation, communication costs are substantially lower than if the database were
centrally located. Furthermore, since the database is still treated as one logical
entity, queries that require access to different physical locations can be processed.
For example, the query “Find total sales of red hats in Australia” is likely to
originate in Australia and be processed using the Sydney part of the database. A
query of the form “Find total sales of red hats” is more likely to come from a
headquarters’ user and is resolved by accessing each of the local databases, though
the user need not be aware of where the data are stored because the database appears
as a single logical entity.
A distributed database management system (DDBMS) is a federation of individual
DBMSs. Each site has a local DBMS and DC manager. In many respects, each site
acts semi-independently. Each site also contains additional software that enables it to
be part of the federation and act as a single database. It is this additional software
that creates the DDBMS and enables the multiple databases to appear as one.
A DDBMS introduces a need for a data store containing details of the entire system.
Information must be kept about the structure and location of every database, table,
row, and column and their possible replicas. The system catalog is extended to
include such information. The system catalog must also be distributed; otherwise,
every request for information would have to be routed through some central point.
For instance, if the system catalog were stored in Tokyo, a query on Sydney data
would first have to access the Tokyo-based catalog. This would create a bottleneck.
A hybrid, distributed architecture
Any organization of a reasonable size is likely to have a mix of the data processing
architectures. Databases will exist on stand-alone personal computers, multiple
client/server networks, distributed mainframes, clouds, and so on. Architectures
continue to evolve because information technology is dynamic. Today’s best
solutions for data processing can become obsolete very quickly. Yet, organizations
have invested large sums in existing systems that meet their current needs and do not
warrant replacement. As a result, organizations evolve a hybrid architecture—a mix
of the various forms. The concern of the IS unit is how to patch this hybrid together
so that clients see it as a seamless system that readily provides needed information.
In creating this ideal system, there are some underlying key concepts that should be
observed. These fundamental principles were initially stated in terms of a distributed
database. However, they can be considered to apply broadly to the evolving, hybrid
architecture that organizations must continually fashion.
The fundamental principles of a hybrid architecture
Principle
Transparency
No reliance on a central site
Local autonomy
Continuous operation
Distributed query processing
Distributed transaction processing
Fragmentation independence
Replication independence
Hardware independence
Operating system independence
Network independence
DBMS independence
Transparency
The user should not have to know where data are stored nor how they are processed.
The location of data, its storage format, and access method should be invisible to
the client. The system should accept queries and resolve them expeditiously. Of
course, the system should check that the person is authorized to access the requested
data. Transparency is also known as location independence—the system can be
used independently of the location of data.
No reliance on a central site
Reliance on a central site for management of a hybrid architecture creates two
major problems. First, because all requests are routed through the central site,
bottlenecks develop during peak periods. Second, if the central site fails, the entire
system fails. A controlling central site is too vulnerable, and control should be
distributed throughout the system.
Local autonomy
A high degree of local autonomy avoids dependence on a central site. Data are
locally owned and managed. The local site is responsible for the security, integrity,
and storage of local data. There cannot be absolute local autonomy because the
various sites must cooperate in order for transparency to be feasible. Cooperation
always requires relinquishing some autonomy.
Continuous operation
The system must be accessible when required. Since business is increasingly global
and clients are geographically dispersed, the system must be continuously available.
Many data centers now describe their operations as “24/7” (24 hours a day and 7
days a week). Clouds must operate continuously.
Distributed query processing
The time taken to execute a query should be generally independent of the location
from which it is submitted. Deciding the most efficient way to process the query is
the system’s responsibility, not the client’s. For example, a Sydney analyst could
submit the query, “Find sales of winter coats in Japan.” The system is responsible
for deciding which messages and data to send between the various sites where tables
are located.
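Expressed in SQL, such a query contains no reference to servers or sites; the DDBMS works out which fragments to access. A rough sketch, with the table and column names assumed for illustration, is:
-- Sketch only; sale, saleqty, saleitem, and salenation are assumed names.
-- The DDBMS decides which sites hold the relevant rows and how to combine them.
SELECT SUM(saleqty) AS total_sales
FROM sale
WHERE saleitem = 'Winter coat'
AND salenation = 'Japan';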
Distributed transaction processing
In a hybrid system, a single transaction can require updating of multiple files at
multiple sites. The system must ensure that a transaction is successfully executed for
all sites. Partial updating of files will cause inconsistencies.
Fragmentation independence
Fragmentation independence means that any table can be broken into fragments and
then stored in separate physical locations. For example, the sales table could be
fragmented so that Indonesian data are stored in Jakarta, Japanese data in Tokyo,
and so on. A fragment is any piece of a table that can be created by applying
restriction and projection operations. Using join and union, fragments can be
assembled to create the full table. Fragmentation is the key to a distributed database.
Without fragmentation independence, data cannot be distributed.
Replication independence
Fragmentation is good when local data are mainly processed locally, but there are
some applications that also frequently process remote data. For example, the New
York office of an international airline may need both American (local) and
European (remote) data, and its London office may need American (remote) and
European (local) data. In this case, fragmentation into American and European data
may not substantially reduce communication costs.
Replication means that a fragment of data can be copied and physically stored at
multiple sites; thus the European fragment could be replicated and stored in New
York, and the American fragment replicated and stored in London. As a result,
applications in both New York and London will reduce their communication costs.
Of course, the trade-off is that when a replicated fragment is updated, all copies also
must be updated. Reduced communication costs are exchanged for increased update
complexity.
Replication independence implies that replication happens behind the scenes. The
client is oblivious to replication and requires no knowledge of this activity.
There are two major approaches to replication: synchronous or asynchronous
updates. Synchronous replication means all databases are updated at the same time.
Although this is ideal, it is not a simple task and is resource intensive.
Asynchronous replication occurs when changes made to one database are relayed
to other databases within a certain period established by the database administrator.
It does not provide real-time updating, but it takes fewer IS resources.
Asynchronous replication is a compromise strategy for distributed DBMS
replication. When real-time updating is not absolutely necessary, asynchronous
replication can save IS resources.
Hardware independence
A hybrid architecture should support hardware from multiple suppliers without
affecting the users’ capacity to query files. Hardware independence is a long-term
goal of many IS managers. With virtualization, under which a computer can run
another operating system, hardware independence has become less of an issue. For
example, a Macintosh can run OS X (the native system), a variety of Windows
operating systems, and many variations of Linux.
Operating system independence
Operating system independence is another goal much sought after by IS executives.
Ideally, the various DBMSs and applications of the hybrid system should work on a
range of operating systems on a variety of hardware. Browser-based systems are a
way of providing operating system independence.
Network independence
Clearly, network independence is desired by organizations that wish to avoid the
electronic shackles of being committed to any single hardware or software supplier.
DBMS independence
The drive for independence is contagious and has been caught by the DBMS as well.
Since SQL is a standard for relational databases, organizations may well be able to
achieve DBMS independence. By settling on the relational model as the
organizational standard, ensuring that all DBMSs installed conform to this model,
and using only standard SQL, an organization may approach DBMS independence.
Nevertheless, do not forget all those old systems from the pre-relational days—a
legacy that must be supported in a hybrid architecture.
Organizations can gain considerable DBMS independence by using Open Database
Connectivity (ODBC) technology, which was covered earlier in the chapter on SQL.
An application that uses the ODBC interface can access any ODBC-compliant
DBMS. In a distributed environment, such as an n-tier client/server, ODBC enables
application servers to access a variety of vendors’ databases on different data
servers.
Conclusion—paradise postponed
For data managers, the 12 principles just outlined are ideal goals. In the hurly-burly
of everyday business, incomplete information, and an uncertain future, data
managers struggle valiantly to meet clients’ needs with a hybrid architecture that is
an imperfect interpretation of IS paradise. It is unlikely that these principles will
ever be totally achieved. They are guidelines and something to reflect on when
making the inevitable trade-offs that occur in data management.
Now that you understand the general goals of a distributed database architecture, we
will consider the major aspects of the enabling technology. First, we will look at
distributed data access methods and then distributed database design. In keeping with
our focus on the relational model, illustrative SQL examples are used.
Distributed data access
When data are distributed across multiple locations, the data management logic
must be adapted to recognize multiple locations. The various types of distributed
data access methods are considered, and an example is used to illustrate the
differences between the methods.
Remote request
A remote request occurs when an application issues a single data request to a single
remote site. Consider the case of a branch bank requesting data for a specific
customer from the bank’s central server located in Atlanta. The SQL command
specifies the name of the server (atlserver), the database (bankdb), and the name of the
table (customer).
SELECT * FROM atlserver.bankdb.customer
WHERE custcode = 12345;
A remote request can extract a table from the database for processing on the local
database. For example, the Athens branch may download balance details of
customers at the beginning of each day and handle queries locally rather than
issuing a remote request. The SQL is
SELECT custcode, custbalance FROM atlserver.bankdb.customer
WHERE custbranch = 'Athens';
Remote transaction
Multiple data requests are often necessary to execute a complete business
transaction. For example, adding a new customer account might require inserting a row in each of two tables: one row for the account and another row in the table relating the customer to the new account. A remote transaction contains multiple data requests
for a single remote location. The following example illustrates how a branch bank
creates a new customer account on the central server:
BEGIN WORK;
INSERT INTO atlserver.bankdb.account
(accnum, acctype)
VALUES (789, 'C');
INSERT INTO atlserver.bankdb.cust_acct
(custnum, accnum)
VALUES (123, 789);
COMMIT WORK;
The commands BEGIN WORK and COMMIT WORK surround the SQL commands
necessary to complete the transaction. The transaction is successful only if both SQL
statements are successfully executed. If one of the SQL statements fails, the entire
transaction fails.
Distributed transaction
A distributed transaction supports multiple data requests for data at multiple
locations. Each request is for data on a single server. Support for distributed
transactions permits a client to access tables on different servers.
Consider the case of a bank that operates in the United States and Norway and keeps
details of employees on a server in the country in which they reside. The following
example illustrates a revision of the database to record details of an employee who
moves from the United States to Norway. The transaction copies the data for the
employee from the Atlanta server to the Oslo server and then deletes the entry for
that employee on the Atlanta server.
BEGIN WORK;
INSERT INTO osloserver.bankdb.employee
(empcode, emplname, …)
SELECT empcode, emplname, …
FROM atlserver.bankdb.employee
WHERE empcode = 123;
DELETE FROM atlserver.bankdb.employee
WHERE empcode = 123;
COMMIT WORK;
As in the case of the remote transaction, the transaction is successful only if both
SQL statements are successfully executed.
Distributed request
A distributed request is the most complicated form of distributed data access. It
supports processing of multiple requests at multiple sites, and each request can
access data on multiple sites. This means that a distributed request can handle data
replicated or fragmented across multiple servers.
Let’s assume the bank has had a good year and decided to give all employees a 15
percent bonus based on their annual salary and add USD 1,000 or NOK 7,500 to
their retirement account, depending on whether the employee is based in the United
States or Norway.
BEGIN WORK;
CREATE VIEW temp
(empcode, empfname, emplname, empsalary)
AS
SELECT empcode, empfname, emplname, empsalary
FROM atlserver.bankdb.employee
UNION
SELECT empcode, empfname, emplname, empsalary
FROM osloserver.bankdb.employee;
SELECT empcode, empfname, emplname, empsalary*.15 AS bonus
FROM temp;
UPDATE atlserver.bankdb.employee
SET empusdretfund = empusdretfund + 1000;
UPDATE osloserver.bankdb.employee
SET empkrnretfund = empkrnretfund + 7500;
COMMIT WORK;
The transaction first creates a view containing all employees by a union on the
employee tables for both locations. This view is then used to calculate the bonus.
Two SQL update commands then update the respective retirement fund records of
the U.S. and Norwegian employees. Notice that retirement funds are recorded in U.S.
dollars or Norwegian kroner.
Ideally, a distributed request should not require the application to know where data
are physically located. A DDBMS should not require the application to specify the
name of the server. So, for example, it should be possible to write the following
SQL:
SELECT empcode, empfname, emplname, empsalary*.15 AS bonus
FROM bankdb.employee;
It is the responsibility of the DDBMS to determine where data are stored. In other
words, the DDBMS is responsible for ensuring data location and fragmentation
transparency.
Distributed database design
Designing a distributed database is a two-stage process. First, develop a data model
using the principles discussed in Section 2. Second, decide how data and processing
will be distributed by applying the concepts of partitioning and replication.
Partitioning is the fragmentation of tables across servers. Tables can be
fragmented horizontally, vertically, or by some combination of both. Replication is
the duplication of tables across servers.
Horizontal fragmentation
A table is split into groups of rows when horizontally fragmented. For example, a
firm may fragment its employee table into three because it has employees in Tokyo,
Sydney, and Jakarta and store the fragment on the appropriate DBMS server for
each city.
Horizontal fragmentation
Three separate employee tables would be defined. Each would have a different name
(e.g., emp_syd) but exactly the same columns. To insert a new employee, the SQL code
is
INSERT INTO emp_syd
SELECT * FROM new_emp
WHERE emp_nation = 'Australia';
INSERT INTO emp_tky
SELECT * FROM new_emp
WHERE emp_nation = 'Japan';
INSERT INTO emp_jak
SELECT * FROM new_emp
WHERE emp_nation = 'Indonesia';
Vertical fragmentation
When vertically fragmented, a table is split into columns. For example, a firm may
fragment its employee table vertically to spread the processing load across servers.
There could be one server to handle address lookups and another to process
payroll. In this case, the columns containing address information would be stored
on one server and payroll columns on the other server. Notice that the primary key
column (c1) must be stored on both servers; otherwise, the entity integrity rule is
violated.
Vertical fragmentation
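A minimal sketch of the two vertical fragments, with column names assumed for illustration, follows. Each fragment carries the primary key (empcode here) so that complete employee rows can be reassembled with a join.
-- Sketch only; column names are assumed.
-- Address fragment, stored on the server handling address lookups
CREATE TABLE emp_address (
	empcode   INTEGER,
	empstreet VARCHAR(50),
	empcity   VARCHAR(30),
	PRIMARY KEY (empcode));
-- Payroll fragment, stored on the server handling payroll processing
CREATE TABLE emp_payroll (
	empcode   INTEGER,
	empsalary DECIMAL(10,2),
	PRIMARY KEY (empcode));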
Hybrid fragmentation
Hybrid fragmentation is a mixture of horizontal and vertical. For example, the
employee table could be first horizontally fragmented to distribute the data to where
employees are located. Then, some of the horizontal fragments could be vertically
fragmented to split processing across servers. Thus, if Tokyo is the corporate
headquarters with many employees, the Tokyo horizontal fragment of the employee
database could be vertically fragmented so that separate servers could handle
address and payroll processing.
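Continuing the sketch above, with names again assumed, the Tokyo horizontal fragment could itself be split into address and payroll fragments held on separate servers:
-- Sketch only; the Tokyo fragment is split vertically across two servers.
CREATE TABLE emp_tky_address (
	empcode   INTEGER,
	empstreet VARCHAR(50),
	empcity   VARCHAR(30),
	PRIMARY KEY (empcode));
CREATE TABLE emp_tky_payroll (
	empcode   INTEGER,
	empsalary DECIMAL(10,2),
	PRIMARY KEY (empcode));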
Horizontal fragmentation distributes data and thus can be used to reduce
communication costs by storing data where they are most likely to be needed.
Vertical fragmentation is used to distribute data across servers so that the processing
load can be distributed. Hybrid fragmentation can be used to distribute both data and
applications across servers.
Replication
Under replication, tables can be fully or partly replicated. Full replication means
that tables are duplicated at each of the sites. The advantages of full replication are
greater data integrity (because replication is essentially mirroring) and faster
processing (because it can be done locally). However, replication is expensive
because of the need to synchronize inserts, updates, and deletes across the replicated
tables. When one table is altered, all the replicas must also be modified. A
compromise is to use partial replication by duplicating the indexes only. This will
increase the speed of query processing. The index can be processed locally, and then
the required rows retrieved from the remote database.
Skill builder
A company supplies pharmaceuticals to sheep and cattle stations in outback
Australia. Its agents often visit remote areas and are usually out of reach of a mobile
phone network. To advise station owners, what data management strategy would you
recommend for the company?
Conclusion
The two fundamental skills of data management, data modeling and data querying,
are not changed by the development of a distributed data architecture such as
client/server and the adoption of cloud computing. Data modeling remains
unchanged. A high-fidelity data model is required regardless of where data are
stored and whichever architecture is selected. SQL can be used to query local,
remote, and distributed databases. Indeed, the adoption of client/server technology
has seen a widespread increase in the demand for SQL skills.
Summary
Data can be stored and processed locally or remotely. Combinations of these two
options provide the basic architecture options: remote job entry, personal database,
and client/server. Many organizations are looking to cloud computing to reduce the
processing costs and to allow them to focus on building systems. Cloud computing
can also provide organizations with a competitive advantage if they exploit its seven
capabilities to reduce one or more of the strategic risks.
Under distributed database architecture, a database is in more than one location but
still accessible as if it were centrally located. The trade-off is lowered
communication costs versus increased complexity. A hybrid architecture is a mix of
data processing architectures. The concern of the IS department is to patch this
hybrid together so that clients see a seamless system that readily provides needed
information. The fundamental principles that a hybrid architecture should satisfy are
transparency, no reliance on a central site, local autonomy, continuous operation,
distributed query processing, distributed transaction processing, fragmentation
independence, replication independence, hardware independence, operating system
independence, network independence, and DBMS independence.
There are four types of distributed data access. In order of complexity, these are
remote request, remote transaction, distributed transaction, and distributed request.
Distributed database design is based on the principles of fragmentation and
replication. Horizontal fragmentation splits a table by rows and reduces
communication costs by placing data where they are likely to be required. Vertical
fragmentation splits a table by columns and spreads the processing load across
servers. Hybrid fragmentation is a combination of horizontal and vertical
fragmentation. Replication is the duplication of identical tables at different sites.
Partial replication involves the replication of indexes. Replication speeds up local
processing at the cost of maintaining the replicas.
Key terms and concepts
Application server
Client/server
Cloud computing
Continuous operation
Data communications manager (DC)
Data processing
Data server
Data storage
Database architecture
Database management system (DBMS)
DBMS independence
DBMS server
Distributed data access
Distributed database
Distributed query processing
Distributed request
Distributed transaction
Distributed transaction processing
Fragmentation independence
Hardware independence
Horizontal fragmentation
Hybrid architecture
Hybrid fragmentation
Local autonomy
Mainframe
N-tier architecture
Network independence
Personal database
Remote job entry
Remote request
Remote transaction
Replication
Replication independence
Server
Software independence
Strategic risk
Supercomputer
Three-tier architecture
Transaction processing monitor
Transparency
Two-tier architecture
Vertical fragmentation
References and additional readings
Child, J. (1987). Information technology, organizations, and the response to
strategic challenges. California Management Review, 30(1), 33-50.
Iyer, B., & Henderson, J. C. (2010). Preparing for the future: understanding the seven capabilities of cloud computing. MIS Quarterly Executive, 9(2), 117-131.
Morris, C. R., & Ferguson, C. H. (1993). How architecture wins technology wars. Harvard Business Review, 71(2), 86-96.
Watson, R. T., Wynn, D., & Boudreau, M.-C. (2005). JBoss: The evolution of professional open source software. MIS Quarterly Executive, 4(3), 329-341.
Exercises
1. How does client/server differ from cloud computing?
2. In what situations are you likely to use remote job entry?
3. What are the disadvantages of personal databases?
4. What is a firm likely to gain when it moves from a centralized to a distributed
database? What are the potential costs?
5. In terms of a hybrid architecture, what does transparency mean?
6. In terms of a hybrid architecture, what does fragmentation independence mean?
7. In terms of a hybrid architecture, what does DBMS independence mean?
8. How does ODBC support a hybrid architecture?
9. A university professor is about to develop a large simulation model for
describing the global economy. The model uses data from 65 countries to
simulate alternative economic policies and their possible outcomes. In terms of
volume, the data requirements are quite modest, but the mathematical model is
very complex, and there are many equations that must be solved for each quarter
the model is run. What data processing/data storage architecture would you
recommend?
10. A multinational company has operated relatively independent organizations in
15 countries. The new CEO wants greater coordination and believes that
marketing, production, and purchasing should be globally managed. As a result,
the corporate IS department must work with the separate national IS departments
to integrate the various national applications and databases. What are the
implications for the corporate data processing and database architecture? What
are the key facts you would like to know before developing an integration plan?
What problems do you anticipate? What is your intuitive feeling about the key
features of the new architecture?
20. SQL and Java
The vision for Java is to be the concrete and nails that people use to build this
incredible network system that is happening all around us.
James Gosling, 2000
Learning objectives
Students completing this chapter will be able to
✤ write a Java program to process a parameterized SQL query;
✤ read a CSV file and insert rows into tables;
✤ write a program using HTML and JavaServer Pages (JSP) to insert data collected by a form into a relational database;
✤ understand how SQL commands are used for transaction processing.
Introduction
Java is a platform-independent application development language. Its object-
oriented nature makes it easy to create and maintain software and prototypes. These
features make Java an attractive development language for many applications,
including database systems.
This chapter assumes that you have some knowledge of Java programming, can use
an integrated development environment (IDE) (e.g., BlueJ, Eclipse, or NetBeans),
and know how to use a Java library. It also requires some HTML skills because
JavaServer Pages (JSP) are created by embedding Java code within an HTML
document. MySQL is used for all the examples, but the fundamental principles are
the same for all relational databases. With a few changes, your program will work
with another implementation of the relational model.
JDBC
Java Database Connectivity (JDBC), a Java version of the SQL call-level interface (CLI), is modeled on Open Database Connectivity (ODBC). JDBC enables
programmers to write Java software that is both operating system and DBMS
independent. The JDBC core, which handles 90 percent of database programming,
contains seven interfaces and two classes, which are part of the java.sql package.
The purpose of each of these is summarized in the following table.
JDBC core interfaces and classes
Interfaces Description
Connection Connects an application to a database
Statement A container for an SQL statement
PreparedStatement Precompiles an SQL statement and then uses it multiple times
CallableStatement Executes a stored procedure
ResultSet The rows returned when a query is executed
ResultSetMetaData The number, types, and properties of the result set
Classes
DriverManager Loads driver objects and creates database connections
DriverPropertyInfo Used by specialized clients
For each DBMS, implementations of these interfaces are required because they are
specific to the DBMS and not part of the Java package. For example, specific drivers
are needed for MySQL, DB2, Oracle, and MS SQL. Before using JDBC, you will
need to install these implementations for the DBMS you plan to access. Standard practice is to refer to this DBMS-specific software as the driver.
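As a brief illustration of two of the interfaces in the preceding table, the following sketch (which assumes an existing Connection conn and a table named share) uses ResultSetMetaData to report the name and type of each column returned by a query.
// Sketch only; conn and the share table are assumed.
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM share");
ResultSetMetaData meta = rs.getMetaData();
// Report each column's name and SQL type
for (int i = 1; i <= meta.getColumnCount(); i++) {
	System.out.println(meta.getColumnName(i) + " " + meta.getColumnTypeName(i));
}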
Java EE
Java EE (Java Platform Enterprise Edition) is a platform for multi-tiered enterprise
applications. In the typical three-tier model, a Java EE application server sits
between the client’s browser and the database server. A Java EE compliant server, of
which there are a variety, is needed to process JSP.
Using SQL within Java
In this section, you will need a Java integrated development environment to execute
the sample code. A good option is Eclipse, a widely used open source IDE.
We now examine each of the major steps in processing an SQL query and, in the
process, create Java methods to query a database.
Connect to the database
The getConnection method of the DriverManager specifies the URL of the database,
the account identifier, and its password. You cannot assume that connecting to the
database will be trouble-free. Thus, good coding practice requires that you detect
and report any errors using Java’s try-catch structure.
You connect to a DBMS by supplying its url and the account name and password.
try {
conn = DriverManager.getConnection(url, account, password);
} catch (SQLException error) {
System.out.println("Error connecting to database: " + error.toString());
System.exit(1);
}
The format of the url parameter varies with the JDBC driver, and you will need to
consult the documentation for the driver. In the case of MySQL, the possible formats
are
jdbc:mysql:database
jdbc:mysql://host/database
jdbc:mysql://host:port/database
The default value for host is “localhost” and, for MySQL, the default port is “3306.”
For example:
jdbc:mysql://www.richardtwatson.com:3306/Text
Report a SELECT
The rows in the table containing the results of the query are processed a row at a
time using the next method of the ResultSet object (see the following code).
Columns are retrieved one at a time using a get method within a loop, where the
integer value specified in the method call refers to a column (e.g., rs.getString(1) returns
the first column of the result set as a string). The get method used depends on the
type of data returned by the query. For example, getString() gets a character string,
getInt() gets an integer, and so on.
Reporting a SELECT
while (rs.next()) {
String firm = rs.getString(1);
int div = rs.getInt(2);
System.out.println(firm + " " + div);
}
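The snippet above assumes that a ResultSet named rs has already been produced. A minimal sketch of how rs might be created with a parameterized query, with the share table and its columns assumed for illustration, is:
// Sketch only; the table and column names are assumed.
PreparedStatement stmt = conn.prepareStatement(
	"SELECT shrfirm, shrdiv FROM share WHERE shrdiv > ?");
stmt.setInt(1, 2); // bind the parameter marker (?) to a value
ResultSet rs = stmt.executeQuery();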
Inserting a row
The prepareStatement() method is also used for inserting a row in a similar manner, as the
following example shows. The executeUpdate() method performs the insert. Notice
the use of try and catch to detect and report an error during an insert.
Inserting a row
try {
stmt = conn.prepareStatement("INSERT INTO alien (alnum, alname) VALUES (?,?)");
stmt.setInt(1,10);
stmt.setString(2, "ET");
stmt.executeUpdate();
}
catch (SQLException error)
{
System.out.println("Error inserting row into database: "
+ error.toString());
System.exit(1);
}
CSV file
A comma-separated values (CSV) file is a common form of text file because many applications, including spreadsheets and database systems, can read and write data in this format. In the following example, the CSV file is stored on a remote server and is identified by its URL. You can also connect to text files on your personal computer.
Connecting to a CSV file
try {
csvurl = new URL("https://dl.dropbox.com/u/6960256/data/painting.csv");
} catch (IOException error) {
System.out.println("Error accessing url: " + error.toString());
System.exit(1);
}
A CSV file is read a record at a time, and the required fields are extracted for
further processing. The following code illustrates how a loop is used to read each
record in a CSV file and extract some fields. Note the following:
• The first record, the header, contains the names of each field and is read
before the loop starts.
• Field names can be used to extract each field with a get() method.
• You often need to convert the input string data to another format, such as
integer.
• Methods addArtist and addArt are included in the loop to add details of an artist to the artist table and a piece of art to the art table.
Processing a CSV file
try {
input = new CsvReader(new InputStreamReader(csvurl.openStream()));
input.readHeaders();
while (input.readRecord())
{
// Artist
String firstName = input.get("firstName");
String lastName = input.get("lastName");
String nationality = input.get("nationality");
int birthYear = Integer.parseInt(input.get("birthyear"));
int deathYear = Integer.parseInt(input.get("deathyear"));
artistPK = addArtist(conn, firstName, lastName, nationality, birthYear, deathYear);
// Painting
String title = input.get("title");
double length = Double.parseDouble(input.get("length"));
double breadth = Double.parseDouble(input.get("breadth"));
int year = Integer.parseInt(input.get("year"));
addArt(conn, title, length, breadth, year,artistPK);
}
input.close();
} catch (IOException error) {
System.out.println("Error reading CSV file: " + error.toString());
System.exit(1);
}
We now need to consider the addArtist method. Its purpose is to add a row to the
artist table. The primary key, artistID, is specified as AUTO_INCREMENT, which means
it is set by the DBMS. Because of the 1:m relationship between artist and art, we need
to know the value of artistID, the primary key of artist, to set the foreign key for the
corresponding piece of art. We can use the following code to determine the primary
key of the last insert.
Determining the value of the primary key of the last insert
rs = stmt.executeQuery("SELECT LAST_INSERT_ID()");
if (rs.next()) {
autoIncKey = rs.getInt(1);
}
The addArtist method returns the value of the primary key of the most recent insert
so that we can then use this value as the foreign key for the related piece of art.
artistPK = addArtist(conn, firstName, lastName, nationality, birthYear, deathYear);
The call to the addArt method includes the returned value as a parameter. That is, it
passes the primary key of artist to be used as foreign key of art.
addArt(conn, title, length, breadth, year,artistPK);
The Java code for the complete application is available on the book’s web site.
Skill builder
1. Add a few sculptors and a piece of sculpture to the CSV file. Also, decide how
you will differentiate between a painting and a sculpture.
2. Add a method to ArtCollector.java to insert an artist and that person’s sculpture.
3. Run the revised program so that it adds both paintings and sculptures for various
artists.
JavaServer Pages (JSP)
Standard HTML pages are static. They contain text and graphics that don’t change
unless someone recodes the page. JSP and other technologies, such as PHP and ASP,
enable you to create dynamic pages whose content can change to suit the
goals of the person accessing the page, the browser used, or the device on which the
page is displayed.
A JSP contains standard HTML code (the static part) and JSP elements (the dynamic part). When someone requests a JSP, the server combines the HTML and JSP elements to deliver a dynamic page. JSP is also useful for server-side processing. For example,
when someone submits a form, JSP elements can be used to process the form on the
server.
Map collection case
The Expeditioner has added a new product line to meet the needs of its changing
clientele. It now stocks a range of maps. The firm’s data modeler has created a data
model describing the situation. A map has a scale; for example, a scale of 1:1 000
000 means that 1 unit on the map is 1,000,000 units on the ground (or 1 cm is 10 km,
and 1 inch is ~16 miles). There are three types of maps: road, rail, and canal.
Initially, The Expeditioner decided to stock maps for only a few European countries.
The Expeditioner’s map collection data model
Data entry
Once the tables have been defined, we want to insert some records. This could be
done using the insertSQL method of the Java code just created, but it would be very
tedious, as we would have to type an insert statement for each row (e.g., INSERT INTO map VALUES (1, 1000000, 'Rail');). A better approach is to use an HTML data entry form. An
appropriate form and its associated code follow.
Map collection data entry form
The data entry form captures a map’s identifier, scale, type, and the number of
countries covered by the map. A map identifier is of the form M001 to M999, and a
regular expression is used to validate the input. A map scale must be numeric with a
minimum of 1000 and maximum of 100000 in increments of 1000. The type of map
is selected from the list, which is defined so that only one type can be selected.
Multiple countries can be selected by holding down an appropriate key combination
when selecting from the displayed list of countries. When the “Add map” button is
clicked, the page mapinsert.jsp is called, which is specified in the first line of the
HTML body code.
Passing data from a form to a JSP
In a form, the various entry fields are each identified by a name specification (e.g.,
name="mapid" and name="mapscale"). In the corresponding JSP, which is the one defined in
the form's action statement (i.e., action="mapinsert.jsp"), these same fields are accessed
using the getParameter method when there is a single value or getParameterValues
when there are multiple values. The following chunks of code show corresponding
form and JSP code for maptype.
<label>Map type: <select name="maptype" size="3">
<option>Road</option><option>Rail</option><option>Canal</option>
</select></label>
maptype = request.getParameter("maptype");
Transaction processing
A transaction is a logical unit of work. In the case of adding a map, it means
inserting a row in map for the map and inserting one row in mapCountry for each
country on the map. All of these inserts must be executed without failure for the
entire transaction to be processed. SQL has two commands for supporting
transaction processing: COMMIT and ROLLBACK. If no errors are detected, then the
various changes made by a transaction can be committed. In the event of any errors
during the processing of a transaction, the database should be rolled back to the state
prior to the transaction.
Autocommit
Before processing a transaction, you need to turn off autocommit to avoid
committing each database change separately before the entire transaction is
complete. It is a good idea to set the value for autocommit immediately after a
successful database connection, which is what you will see when you inspect the
code for mapinsert.jsp. The following code sets autocommit to false in the case
where conn is the connection object.
conn.setAutoCommit(false);
Commit
The COMMIT command is executed when all parts of a transaction have been
successfully executed. It makes the changes to the database permanent.
conn.commit(); // all inserts successful
Rollback
The ROLLBACK command is executed when any part of a transaction fails. All
changes to the database since the beginning of the transaction are reversed, and the
database is restored to its state before the transaction commenced.
conn.rollback(); // at least one insert failed
mapinsert.jsp
<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
pageEncoding="ISO-8859-1"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<%@ page import="java.util.*"%>
<%@ page import="java.lang.*"%>
<%@ page import="java.sql.*"%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Map insert page</title>
</head>
<body>
<%
String url;
String jdbc = "jdbc:mysql:";
String database = "//localhost:3306/MapCollection";
String username = "student", password = "student";
String mapid, maptype, countries;
String[] country;
int mapscale = 0;
boolean transOK = true;
PreparedStatement insertMap;
PreparedStatement insertMapCountry;
Connection conn = null;
// make connection
url = jdbc + database;
try {
conn = DriverManager.getConnection(url, username, password);
} catch (SQLException error) {
System.out.println("Error connecting to database: "
+ error.toString());
System.exit(1);
}
try {
conn.setAutoCommit(false);
} catch (SQLException error) {
System.out.println("Error turning off autocommit"
+ error.toString());
System.exit(2);
}
//form data
mapid = request.getParameter("mapid");
mapscale = Integer.parseInt(request.getParameter("mapscale"));
maptype = request.getParameter("maptype");
country = request.getParameterValues("countries");
transOK = true;
// insert the map
try {
insertMap = conn.prepareStatement("INSERT INTO map (mapid, mapscale, maptype) VALUES (?,?,?)");
insertMap.setString(1, mapid);
insertMap.setInt(2, mapscale);
insertMap.setString(3, maptype);
System.out.println(insertMap);
insertMap.executeUpdate();
// insert the countries
for (int loopInx = 0; loopInx < country.length; loopInx++) {
insertMapCountry = conn.prepareStatement("INSERT INTO mapCountry (mapid ,cntrycode ) VALUES (?,?)");
insertMapCountry.setString(1, mapid);
insertMapCountry.setString(2, country[loopInx]);
System.out.println(insertMapCountry);
insertMapCountry.executeUpdate();
}
} catch (SQLException error) {
System.out.println("Error inserting row: " + error.toString());
transOK = false;
}
try {
if (transOK) {
conn.commit(); // all inserts successful
System.out.println("Transaction commit");
} else {
conn.rollback(); // at least one insert failed
System.out.println("Transaction rollback");
}
conn.close(); // close database
System.out.println("Database close");
} catch (SQLException error) {
System.out.println("Error completing transaction: " + error.toString());
}
%>
</body>
</html>
21. Data Integrity
Integrity without knowledge is weak and useless, and knowledge without integrity is
dangerous and dreadful.
Samuel Johnson, Rasselas, 1759
Learning objectives
After completing this chapter, you will
✤ understand the three major data integrity outcomes;
✤ understand the strategies for achieving each of the data integrity outcomes;
✤ understand the possible threats to data integrity and how to deal with them;
✤ understand the principles of transaction management;
✤ realize that successful data management requires making data available and
maintaining data integrity.
The Expeditioner has become very dependent on its databases. The day-to-day
operations of the company would be adversely affected if the major operational
databases were lost. Indeed, The Expeditioner may not be able to survive a major
data loss. Recently, there have also been a few minor problems with the quality and
confidentiality of some of the databases. A part-time salesperson was discovered
making a query about staff salaries. A major order was nearly lost when it was
shipped to the wrong address because the complete shipping address had not been
entered when the order was taken. The sales database had been offline for 30
minutes last Monday morning because of a disk sector read error.
The Expeditioner had spent much time and money creating an extremely effective
and efficient management system. It became clear, however, that more attention
needed to be paid to maintaining the system and ensuring that high-quality data were
continuously available to authorized users.
Introduction
The management of data is driven by two goals: availability and integrity.
Availability deals with making data available to whoever needs it, whenever and
wherever he or she needs it, and in a meaningful form. As illustrated in the following
figure, availability concerns the creation, interrogation, and update of data stores.
Although most of the book, thus far, has dealt with making data available, a database
is of little use to anyone unless it has integrity. Maintaining data integrity implies
three goals: 74
1. Protecting existence: data are available when needed.
2. Maintaining quality: data are accurate, complete, and current.
3. Ensuring confidentiality: data are accessed only by those authorized to do so.
This chapter considers the three types of strategies for maintaining data integrity:
1. Legal strategies are externally imposed laws, rules, and regulations. Privacy
laws are an example.
2. Administrative strategies are organizational policies and procedures. An
example is a standard operating procedure of storing all backup files in a locked
vault.
3. Technical strategies are those incorporated in the computer system (e.g.,
database management system [DBMS], application program, or operating
system). An example is the inclusion of validation rules (e.g., NOT NULL) in a
database definition that are used to ensure data quality when a database is updated.
A consistent database is one in which all data integrity constraints are satisfied.
Our focus is on data stored in multiuser computer databases and on the technical and
administrative strategies for maintaining integrity in computer system
environments, commonly referred to as database integrity. More and more,
organizational memories are being captured in computerized databases. From an
integrity perspective, this is a very positive development. Computers offer some
excellent mechanisms for controlling data integrity, but this does not eliminate the
need for administrative strategies. Both administrative and technical strategies are
still needed.
Who is responsible for database integrity? Some would say the data users, others
the database administrator. Both groups are right; database integrity is a shared
responsibility, and the way it is managed may differ across organizations. Our focus
is on the tools and strategies for maintaining data integrity, regardless of who is
responsible.
The strategies for achieving the three integrity outcomes are summarized in the
following table. We will cover procedures for protecting existence, followed by
those for maintaining quality, and finally those used to ensure confidentiality.
Before considering each of these goals, we need to examine the general issue of
transaction management.
Strategies for maintaining database integrity
Database integrity outcome | Strategies for achieving the outcome
Protecting existence | Isolation (preventive); database backup and recovery (curative)
Maintaining quality | Update authorization; integrity constraints/data validation; concurrent update control
Ensuring confidentiality | Access control; encryption
Transaction management
Transaction management focuses on ensuring that transactions are correctly
recorded in the database. The transaction manager is the element of a DBMS that
processes transactions. A transaction is a series of actions to be taken on the
database such that they must be entirely completed or entirely aborted. A transaction
is a logical unit of work. All its elements must be processed; otherwise, the database
will be incorrect. For example, with a sale of a product, the transaction consists of at
least two parts: an update to the inventory on hand, and an update to the customer
information for the items sold in order to bill the customer later. Updating only the
inventory or only the customer information would leave the database inconsistent
and without integrity.
Transaction managers are designed to accomplish the ACID (atomicity, consistency,
isolation, and durability) concept. These attributes are:
1. Atomicity: If a transaction has two or more discrete pieces of information,
either all of the pieces are committed or none are.
2. Consistency: Either a transaction creates a valid new database state or, if any
failure occurs, the transaction manager returns the database to its prior state.
3. Isolation: A transaction in process and not yet committed must remain isolated
from any other transaction.
4. Durability: Committed data are saved by the DBMS so that, in the event of a
failure and system recovery, these data are available in their correct state.
Transaction atomicity requires that all transactions are processed on an all-or-
nothing basis and that any collection of transactions is serializable. When a
transaction is executed, either all its changes to the database are completed or none
of the changes are performed. In other words, the entire unit of work must be
completed. If a transaction is terminated before it is completed, the transaction
manager must undo the executed actions to restore the database to its state before the
transaction commenced (the consistency concept). Once a transaction is successfully
completed, it does not need to be undone. For efficiency reasons, transactions
should be no larger than necessary to ensure the integrity of the database. For
example, in an accounting system, a debit and credit would be an appropriate
transaction, because this is the minimum amount of work needed to keep the books
in balance.
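To make the accounting example concrete, here is a minimal JDBC sketch (not from the book; the account table, its columns, and the method name are assumed) that treats the debit and credit as a single transaction, committing only when both updates succeed.
// Sketch: debit one account and credit another as one all-or-nothing transaction
static void transfer(Connection conn, int debitAcct, int creditAcct, java.math.BigDecimal amount) throws SQLException {
    conn.setAutoCommit(false);
    try {
        PreparedStatement debit = conn.prepareStatement("UPDATE account SET balance = balance - ? WHERE acctno = ?");
        debit.setBigDecimal(1, amount);
        debit.setInt(2, debitAcct);
        debit.executeUpdate();
        PreparedStatement credit = conn.prepareStatement("UPDATE account SET balance = balance + ? WHERE acctno = ?");
        credit.setBigDecimal(1, amount);
        credit.setInt(2, creditAcct);
        credit.executeUpdate();
        conn.commit(); // both the debit and the credit succeeded
    } catch (SQLException error) {
        conn.rollback(); // undo any partial work so the books stay in balance
    }
}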
Serializability relates to the execution of a set of transactions. An interleaved
execution schedule (i.e., the elements of different transactions are intermixed) is
serializable if its outcome is equivalent to a non-interleaved (i.e., serial) schedule.
Interleaved operations are often used to increase the efficiency of computing
resources, so it is not unusual for the components of multiple transactions to be
interleaved. Interleaved transactions cause problems when they interfere with each
other and, as a result, compromise the correctness of the database.
The ACID concept is critical to concurrent update control and recovery after a
transaction failure.
Concurrent update control
When updating a database, most applications implicitly assume that their actions do
not interfere with any other user's actions. If the DBMS is a single-user system, then a
lack of interference is guaranteed. Most DBMSs, however, are multiuser systems,
where many analysts and applications can be accessing a given database at the same
time. When two or more transactions are allowed to update a database concurrently,
the integrity of the database is threatened. For example, multiple agents selling
airline tickets should not be able to sell the same seat twice. Similarly, inconsistent
results can be obtained by a retrieval transaction when retrievals are being made
simultaneously with updates. This gives the appearance of a loss of database
integrity. We will first discuss the integrity problems caused by concurrent updates
and then show how to control them to ensure database quality.
Lost update
Uncontrolled concurrent updates can result in the lost-update or phantom-record
problem. To illustrate the lost-update problem, suppose two concurrent update
transactions simultaneously want to update the same record in an inventory file.
Both want to update the quantity-on-hand field (quantity). Assume quantity has a
current value of 40. One update transaction wants to add 80 units (a delivery) to
quantity, while the other transaction wants to subtract 20 units (a sale).
Suppose the transactions have concurrent access to the record; that is, each
transaction is able to read the record from the database before a previous
transaction has been committed. This sequence is depicted in the following figure.
Note that the first transaction, A, has not updated the database when the second
transaction, B, reads the same record. Thus, both A and B read in a value of 40 for
quantity. Both make their calculations, then A writes the value of 120 to disk,
followed by B promptly overwriting the 120 with 20. The result is that the delivery
of 80 units, transaction A, is lost during the update process.
Lost update when concurrent accessing is allowed (both transactions read part P10 with a quantity of 40)
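The lost update stems from the read-modify-write pattern: each transaction copies quantity into its own program variable, computes a new value from that stale copy, and writes it back. The following JDBC fragment (illustrative only; the item table and its column names are assumed) shows the vulnerable pattern and one way to avoid it by letting the DBMS apply the change to the stored value in a single statement.
// Vulnerable: both transactions read 40, compute independently, and the later write wins
ResultSet rs = stmt.executeQuery("SELECT quantity FROM item WHERE partno = 'P10'");
rs.next();
int quantity = rs.getInt("quantity");
stmt.executeUpdate("UPDATE item SET quantity = " + (quantity + 80) + " WHERE partno = 'P10'");

// Safer: the DBMS applies the delivery of 80 units to the current stored value
stmt.executeUpdate("UPDATE item SET quantity = quantity + 80 WHERE partno = 'P10'");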
recognizes that data quality is determined by the customer. It also implies that data
quality is relative to a task. Data could be high-quality for one task and low-quality
for another. The data provided by a flight-tracking system are very useful when you
are planning to meet someone at the airport, but not particularly helpful for
selecting a vacation spot. Defect-free data are accessible, accurate, timely, complete,
and consistent with other sources. Desirable features include relevance,
comprehensiveness, appropriate level of detail, easy-to-navigate source, high
readability, and absence of ambiguity.
Poor-quality data have several detrimental effects. Customer service decreases when
there is dissatisfaction with poor and inaccurate information or a lack of
appropriate information. For many customers, information is the heart of customer
service, and they lose confidence in firms that can’t or don’t provide relevant
information. Bad data interrupt information processing because they need to be
corrected before processing can continue. Poor-quality data can lead to a wrong
decision because inaccurate inferences and conclusions are made.
As we have stated previously, data quality varies with circumstances, and the model
in the following figure will help you to understand this linkage. By considering
variations in a customer's uncertainty about a firm's products and a firm's ability to
deliver consistently, we arrive at four fundamental strategies for customer-oriented
data quality.
Customer-oriented data quality strategies
Transaction processing: When customers know what they want and firms can
deliver consistently, customers simply want fast and accurate transactions and data
confirming details of the transaction. Most banking services fall into this category.
Customers know they can withdraw and deposit money, and banks can perform
reliably.
Expert system: In some circumstances, customers are uncertain of their needs. For
instance, Vanguard offers personal investors a choice from more than 150 mutual
funds. Most prospective investors are confused by such a range of choices, and
Vanguard, by asking a series of questions, helps prospective investors narrow their
choices and recommends a small subset of its funds. A firm’s recommendation will
vary little over time because the characteristics of a mutual fund (e.g., intermediate
tax-exempt bond) do not change.
Tracking: Some firms operate in environments where they don’t have a lot of
control over all the factors that affect performance. Handling more than 2,200 take-offs
and landings and over 250,000 passengers per day, Atlanta's Hartsfield
Data quality control does not end with the application of integrity constraints.
Whenever an error or unusual situation is detected by the DBMS, some form of
response is required. Response rules need to be given to the DBMS along with the
integrity constraints. The responses can take many different forms, such as abort the
entire program, reject entire update transaction, display a message and continue
processing, or let the DBMS attempt to correct the error. The response may vary
depending on the type of integrity constraint violated. If the DBMS does not allow
the specification of response rules, then it must take a default action when an error
is detected. For example, if alphabetic data are entered in a numeric field, most
DBMSs will have a default response and message (e.g., nonnumeric data entered in
numeric field). In the case of application programs, an error code is passed to the
program from the DBMS. The program would then use this error code to execute
an error-handling procedure.
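As an illustration (not from the book), a Java program might branch on the SQLSTATE returned with an SQLException; SQLSTATE class 23 indicates an integrity constraint violation, and the response, whether a message, a retry, or an abort, is the application's choice. The stmt and insertStatement variables are assumed to exist.
try {
    stmt.executeUpdate(insertStatement); // insertStatement is assumed to hold an SQL INSERT
} catch (SQLException error) {
    // SQLSTATE values beginning with "23" signal an integrity constraint violation
    if (error.getSQLState() != null && error.getSQLState().startsWith("23")) {
        System.out.println("Update rejected: integrity constraint violated");
    } else {
        System.out.println("Update failed: " + error.toString());
    }
}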
Ensuring confidentiality
Thus far, we have discussed how the first two goals of data integrity can be
accomplished: Data are available when needed (protecting existence); data are
accurate, complete, and current (maintaining quality). This section deals with the
final goal: ensuring confidentiality or data security. Two DBMS functions—access
control and encryption—are the primary means of ensuring that the data are
accessed only by those authorized to do so. We begin by discussing an overall
model of data security.
General model of data security
The following figure depicts the two functions for ensuring data confidentiality:
access control and encryption. Access control consists of two basic steps—
identification and authorization. Once past access control, the user is permitted to
access the database. Access control is applied only to the established avenues of
entry to a database. Clever people, however, may be able to circumvent the controls
and gain unauthorized access. To counteract this possibility, it is often desirable to
hide the meaning of stored data by encrypting them, so that it is impossible to
interpret their meaning. Encrypted data are stored in a transformed or coded format
that can be decrypted and read only by those with the appropriate key.
A general model of data security
Class | Examples
Something a person knows (remembered information) | Name, account number, password
Something the person has (possessed object) | Badge, plastic card, key
Something the person is (personal characteristic) | Fingerprint, voiceprint, signature, hand
Many systems use remembered information to control access. The problem with
such information is that it does not positively identify the client. Passwords have
been the most widely used form of access control. If used correctly, they can be very
effective. Unfortunately, people leave them around where others can pick them up,
allowing unauthorized people to gain access to databases.
To deal with this problem, organizations are moving toward using personal
characteristics and combinations of authenticating mechanisms to protect sensitive
data. Collectively, these mechanisms can provide even greater security. For
example, access to a large firm’s very valuable marketing database requires a smart
card and a fingerprint, a combination of personal characteristic and a possessed
object. The database can be accessed through only a few terminals in specific
locations, an isolation strategy. Once the smart card test is passed, the DBMS
requests entry of other remembered information—password and account number—
before granting access.
Data access authorization is the process of permitting clients whose identity has
been authenticated to perform certain operations on certain data objects in a shared
database. The authorization process is driven by rules incorporated into the DBMS.
Authorization rules are in a table that includes subjects, objects, actions, and
constraints for a given database. An example of such a table is shown in the
following table. Each row of the table indicates that a particular subject is
authorized to take a certain action on a database object, perhaps depending on some
constraint. For example, the last entry of the table indicates that Brier is authorized
to delete supplier records with no restrictions.
Sample authorization table
Subject/Client | Action | Object | Constraint
Accounting department | Insert | Supplier table | None
Purchase department clerk | Insert | Supplier table | If quantity < 200
Purchase department supervisor | Insert | Delivery table | If quantity >= 200
Production department | Read | Delivery table | None
Todd | Modify | Item table | Type and color only
Order-processing program | Modify | Sale table | None
Brier | Delete | Supplier table | None
We have already discussed subjects, but not objects, actions, and constraints. Objects
are database entities protected by the DBMS. Examples are databases, views, files,
tables, and data items. In the preceding table, the objects are all tables. A view is
another form of security. It restricts the client’s access to a database. Any data not
included in a view are unknown to the client. Although views promote security,
several persons may share a view or unauthorized persons may gain access. Thus, a
view is another object to be included in the authorization process. Typical actions on
objects are shown in the table: read, insert, modify, and delete. Constraints are
particular rules that apply to a subject-action-object relationship.
Implementing authorization rules
Most contemporary DBMSs do not implement the complete authorization table
shown in the table. Usually, they implement a simplified version. The most common
form is an authorization table for subjects with limited applications of the constraint
column. Consider the granting of table privileges, which are needed in order to
authorize subjects to perform operations on both tables and views.
Authorization commands
SQL Command | Result
SELECT | Permission to retrieve data
UPDATE | Permission to change data; can be column specific
DELETE | Permission to delete records or tables
INSERT | Permission to add records or tables
The GRANT and REVOKE SQL commands discussed in Chapter 10 are used to define
and delete authorization rules. Some examples:
GRANT SELECT ON qspl TO vikki;
GRANT SELECT, UPDATE (splname) ON qspl TO huang;
GRANT ALL PRIVILEGES ON qitem TO vikki;
GRANT SELECT ON qitem TO huang;
The GRANT commands have essentially created two authorization tables, one for
Huang and the other for Vikki. These tables illustrate how most current systems
create authorization tables for subjects using a limited set of objects (e.g., tables)
and constraints.
A sample authorization table
Client | Object (table) | Action | Constraint
vikki | qspl | SELECT | None
vikki | qitem | UPDATE | None
vikki | qitem | INSERT | None
vikki | qitem | DELETE | None
vikki | qitem | SELECT | None
huang | qspl | SELECT | None
huang | qspl | UPDATE | splname only
huang | qitem | SELECT | None
Because authorization tables contain highly sensitive data, they must be protected by
stringent security rules and encryption. Normally, only selected persons in data
administration have authority to access and modify them.
Encryption
Encryption techniques complement access control. As the preceding figure
illustrates, access control applies only to established avenues of access to a database.
There is always the possibility that people will circumvent these controls and gain
unauthorized access to a database. To counteract this possibility, encryption can be
used to obscure or hide the meaning of data. Encrypted data cannot be read by an
intruder unless that person knows the method of encryption and has the key.
Encryption is any transformation applied to data that makes it difficult to extract
meaning. Encryption transforms data into cipher text or disguised information, and
decryption reconstructs the original data from cipher text.
Public-key encryption is based on a pair of private and public keys. A person’s
public key can be freely distributed because it is quite separate from his or her
private key. To send and receive messages, communicators first need to create
private and public keys and then exchange their public keys. The sender encodes a
message with the intended receiver's public key, and upon receiving the message,
the receiver applies her private key. The receiver's private key, the only one that can
decode the message, must be kept secret to provide secure message exchange.
Public-key encryption
In a DBMS environment, encryption techniques can be applied to transmitted data.
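As a small, self-contained illustration (not from the book), the following Java program uses the standard java.security and javax.crypto classes to generate an RSA key pair, encrypt a short message with the receiver's public key, and recover it with the matching private key.
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import javax.crypto.Cipher;

public class PublicKeyDemo {
    public static void main(String[] args) throws Exception {
        // The receiver creates a key pair and publishes only the public key
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        KeyPair receiverKeys = generator.generateKeyPair();

        // The sender encrypts with the receiver's public key
        Cipher cipher = Cipher.getInstance("RSA");
        cipher.init(Cipher.ENCRYPT_MODE, receiverKeys.getPublic());
        byte[] cipherText = cipher.doFinal("Meet at gate B27".getBytes("UTF-8"));

        // Only the receiver's private key can decrypt the message
        cipher.init(Cipher.DECRYPT_MODE, receiverKeys.getPrivate());
        System.out.println(new String(cipher.doFinal(cipherText), "UTF-8"));
    }
}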
Skill builder
A university has decided that it will e-mail students their results at the end of each
semester. What procedures would you establish to ensure that only the designated
student opened and viewed the e-mail?
Monitoring activity
Sometimes no single activity will be detected as an example of misuse of a system.
However, examination of a pattern of behavior may reveal undesirable behavior
(e.g., persistent attempts to log into a system with a variety of userids and
passwords). Many systems now monitor all activity using audit trail analysis. A
time- and date-stamped audit trail of all system actions (e.g., database accesses) is
maintained. This audit log is dynamically analyzed to detect unusual behavior
patterns and alert security personnel to possible misuse.
A form of misuse can occur when an authorized user violates privacy rules by using
a series of authorized queries to gain access to private data. For example, some
systems aim to protect individual privacy by restricting authorized queries to
aggregate functions (e.g., AVG and COUNT). Since it is impossible to do non-
aggregate queries, this approach should prevent access to individual-level data,
which it does at the single-query level. However, multiple queries can be constructed
to circumvent this restriction.
Assume we know that a professor in the IS department is aged 40 to 50, is single,
and attended the University of Minnesota. Consider the results of the following set
of queries. 78
The database developer shoulders the bulk of the responsibility for developing data
models and implementing the database. This can be seen in the table, where most of
the cells in the database developer column are labeled “Does.” The database
developer does project planning, requirements definition, database design, database
testing, and database implementation, and in addition, is responsible for database
evolution.
The client’s role is to establish the goals of a specific database project, provide the
database developers with access to all information needed for project development,
and review and regularly scrutinize the developer's work.
The data administrator's prime responsibilities are implementing and controlling,
but the person also may be required to perform activities and consult. In some
situations, the database developer is not part of the data administration staff and may
be located in a client department, or may be an analyst from an IS project team. In
these cases, the data administrator advises the developer on organizational standards
and policies as well as provides specific technical guidelines for successful
construction of a database. When the database developer is part of the data
administration staff, developer and data administration activities may be carried out
by the same person, or by the person(s) occupying the data administration role. In
all cases, the data administrator should understand the larger business context in
which the database will be used and should be able to relate business needs to
specific technical capabilities and requirements.
Database development life cycle (DDLC)
Previously, we discussed the various roles involved in database development and
how they may be assigned to different persons. In this section, we will assume that
administration and development are carried out by the data administration staff,
since this is the typical situation encountered in many organizations. The activities
of developer and administrator, shown in the first two columns of the table, are
assumed to be performed by data administration staff. Now, let us consider data
administration project-level support activities in detail. These activities are
discussed in terms of the DDLC phase they support.
Database development life cycle
Database project planning
Database project planning includes establishing project goals, determining project
feasibility (financial, technical, and operational), creating an implementation plan
and schedule, assigning project responsibilities (including data stewards), and
establishing standards. All project stakeholders, including clients, senior
management, and developers, are involved in planning. They are included for their
knowledge as well as to gain their commitment to the project.
Requirements definition
During requirements definition, clients and developers establish requirements and
develop a mutual understanding of what the new system will deliver. Data are
defined and the resulting definitions stored in the data dictionary. Requirements
definition generates documentation that should serve as an unambiguous reference
for database development. Although in theory the clients are expected to sign off on
the specifications and accept the developed database as is, their needs may actually
change in practice. Clients may gain a greater understanding of their requirements
and business conditions may change. Consequently, the original specifications may
require revision. In the preceding figure, the arrows connecting phase 4 (testing)
and phase 3 (design) to phase 2 (requirements definition) indicate that modeling and
testing may identify revisions to the database specification, and these amendments
are then incorporated into the design.
Database design
Conceptual and internal models of the database are developed during database
design. Conceptual design, or data modeling, is discussed extensively in Section 2
of this book. Database design should also include specification of procedures for
testing the database. Any additional controls for ensuring data integrity are also
specified. The external model should be checked and validated by the user.
Database testing
Database testing requires previously developed specifications and models to be
tested using the intended DBMS. Clients are often asked to provide operational data
to support testing the database with realistic transactions. Testing should address a
number of key questions.
• Does the DBMS support all the operational and security requirements?
• Is the system able to handle the expected number of transactions per second?
• How long does it take to process a realistic mix of queries?
Testing assists in making early decisions regarding the suitability of the selected
DBMS. Another critical aspect of database testing is verifying data integrity
controls. Testing may include checking backup and recovery procedures, access
control, and data validation rules.
Database implementation
Testing is complete when the clients and developers are extremely confident the
system meets specified needs. Data integrity controls are implemented, operational
data are loaded (including historical data, if necessary), and database documentation
is finalized. Clients are then trained to operate the system.
Database use
Clients may need considerable support as they learn and adapt to the system.
Monitoring database performance is critical to keeping them satisfied. It enables the
data administrator to anticipate problems before clients begin to notice and
complain about them, to tune the system to meet organizational needs, and to
enforce data standards and policies during the initial stages of database
implementation.
Database evolution
Since organizations cannot afford to stand still in today’s dynamic business
environment, business needs are bound to change over time, perhaps even after a
few months. Data administration should be prepared to meet the challenge of
change. Minor changes, such as changes in display formats, or performance
improvements, may be continually requested. These have to be attended to on an
ongoing basis. Other evolutionary changes may emerge from constant monitoring
of database use by the data administration staff. Implementing these evolutionary
changes involves repeating phases 3 to 6. Significant business changes may merit a
radical redesign of the database. Major redesign may require repeating all phases of
the DDLC.
Data administration interfaces
Data administration is increasingly a key corporate function, and it requires the
existence of established channels of communication with various organizational
groups. The key data administration interfaces are with clients, management,
development staff, and computer operations. The central position of the data
administration staff, shown in the following figure, reflects the liaison role that it
plays in managing databases. Each of the groups has a different focus and different
terminology and jargon. Data administration should be able to communicate
effectively with all participants. For instance, operations staff will tend to focus on
technical, day-to-day issues to which management is likely to pay less attention.
These different focuses can, and frequently do, lead to conflicting views and
expectations among the different groups. Good interpersonal skills are a must for
data administration staff in order to deal with a variety of conflict-laden situations.
Data administration, therefore, needs a balance of people, technical, and business
skills for effective execution of its tasks.
Major data administration interfaces
Data administration probably will communicate most frequently with computer
operations and development staff, somewhat less frequently with clients, and least
frequently with management. These differences, however, have little to do with the
relative importance of communicating with each group. The interactions between
the data administration staff and each of the four groups are discussed next.
Management
Management sets the overall business agenda for data administration, which must
ensure that its actions directly contribute to the achievement of organizational goals.
In particular, management establishes overall policy guidelines, approves data
administration budgets, evaluates proposals, and champions major changes. For
instance, if the data administration staff is interested in introducing new technology,
management may request a formal report on the anticipated benefits of the proposed
expenditure.
Interactions between the data administration staff and management may focus on
establishing and evolving the information architecture for the organization. In some
instances, these changes may involve the introduction of a new technology that
could fundamentally transform roles or the organization.
Clients
On an ongoing basis, most clients will be less concerned with architectural issues
and will focus on their personal needs. Data administration must determine what
data should be collected and stored, how they should be validated to ensure integrity,
and in what form and frequency they should be made available.
Typically, the data administration staff is responsible for managing the database,
while the data supplier is responsible for ensuring accuracy. This may cause
conflict, however, if there are multiple clients from separate departments. If conflict
does arise, data administration has to arbitrate.
Development staff
Having received strategic directions from management, and after determining
clients’ needs, data administration works with application and database developers,
in order to fulfill the organization’s goals for the new system. On an ongoing basis,
this may consist of developing specifications for implementation. Data
administration works on an advisory basis with systems development, providing
inputs on the database aspects. Data administration is typically responsible for
establishing standards for program and database interfaces and making developers
aware of these standards. Developers may need to be told which commands may be
used in their programs and which databases they can access.
In many organizations, database development is part of data administration, and it
has a very direct role in database design and implementation. In such instances,
communication between data administration and database development is within the
group. In other organizations, database development is not part of data
administration, and communication is between groups.
Computer operations
The focus of computer operations is on the physical hardware, procedures,
schedules and shifts, staff assignments, physical security of data, and execution of
programs. Data administration responsibilities include establishing and monitoring
procedures for operating the database. The data administration staff needs to
establish and communicate database backup, recovery, and archiving procedures to
computer operations. Also, the scheduling of new database and application
installations needs to be coordinated with computer operations personnel.
Computer operations provide data administration with operational statistics and
exception reports. These data are used by data administration to ensure that
corporate database objectives are being fulfilled.
Communication
The diverse parties with which data administration communicates often see things
differently. This can lead to misunderstandings and result in systems that fail to
meet requirements. Part of the problem arises from a difference in perspective and
approaches to viewing database technology.
Management is interested in understanding how implementing database technology
will contribute to strategic goals. In contrast, clients are interested in how the
proposed database and accompanying applications will affect their daily work.
Developers are concerned with translating management and client needs into
conceptual models and converting these into tables and applications. Operations
staff are concerned primarily with efficient daily management of DBMS, computer
hardware, and software.
Data models can serve as a common language for bridging the varying goals of
clients, developers, management, and operational staff. A data model can reduce the
ambiguity inherent in verbal communications and thereby ensure that overall needs
are more closely met and all parties are satisfied with the results. A data model
provides a common meeting point and language for understanding the needs of
each group.
Data administration does not work in isolation. It must communicate successfully
with all its constituents in order to be successful. The capacity to understand and
correctly translate the needs of each stakeholder group is the key to competent data
administration.
Data administration tools
Several computer-based tools have emerged to support data administration. There
are five major classes of tools: data dictionary, DBMS, performance monitoring,
computer-aided software engineering (CASE), and groupware tools. Each of these
tools is now examined and its role in supporting data administration considered. We
focus on how these tools support the DDLC. Note, however, that groupware is not
shown in the following table because it is useful in all phases of the life cycle.
Data administration tool use during the DDLC (columns: database development phase, data dictionary, DBMS, performance monitoring, CASE tools)
The organizational designs described are encountered quite frequently but are not
the only ways to organize data administration. Each situation demands a different
way to organize, and the particular needs of the organization need to be fully
considered before adopting any particular structure. In most organizations, the
nature, importance, and location of data administration change over time and are
influenced by new technologies, organizational growth, and evolving patterns of
database usage. It is important to remain alert to such changes and reorganize data
administration when necessary to ensure that the organization’s data requirements
are constantly satisfied.
Summary
Data administration is the task of managing that part of organizational memory that
involves electronically available data. Managing electronic data stores is important
because key organizational decisions are based on information drawn from them,
and it is necessary to ensure that reliable data are available when needed. Data
administration is carried out at both the system level, which involves overall
policies and alignment with organizational goals, and the project level, where the
specific details of each database are handled. Key modules in data administration are
the DBMS, the DD/DS, user interfaces, and external databases.
Data administration is a function performed by those with assigned organizational
roles. Data administration may be carried out by a variety of persons either within
the IS department or in user departments. Also, this function may occur at the
personal, workgroup, or organizational level.
Data administration involves communication with management, clients, developers,
and computer operations staff. It needs the cooperation of all four groups to
perform its functions effectively. Since each group may hold very different
perspectives, which could lead to conflicts and misunderstandings, it is important
for data administration staff to possess superior communication skills. Successful
data administration requires a combination of interpersonal, technical, and business
skills.
Data administration is complex, and its success partly depends on a range of
computer-based tools. Available tools include DD/DS, DBMS, performance
monitoring tools, CASE tools, and groupware.
A variety of options are available for organizing data administration, and a choice
has to be made based on the prevailing organizational context.
Key terms and concepts
2 richardtwatson.com/dm6e/
3 wb.mysql.com/
4 Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.
5 Watson, R. T., Boudreau, M.-C., Li, S., & Levis, J. (2010). Telematics at UPS: En route to Energy Informatics.
MISQ Executive, 9(1), 1-11.
6 Daft, R. L., & Lengel, R. H. (1986). Organizational information requirements, media richness, and structural
design. Management Science, 32(5), 554-571.
8 Now would be a good time to install the MySQL Community server <www.mysql.com/downloads/mysql/>
on your computer, unless your instructor has set up a class server.
9 wb.mysql.com/
10 dev.mysql.com/doc/workbench/en/wb-getting-started-tutorial-creating-a-model.html
12 dev.mysql.com/doc/workbench/en/wb-getting-started-tutorial-adding-data.html
14 http://regexlib.com
15 “Two women share a name, birthday, and S.S. number!” Athens [Georgia] Daily News, January 29 1990, 7A.
Also, see wnyt.com/article/stories/S1589778.shtml?cat=10114
16 MySQL does not support INTERSECT. Use another AND in the WHERE statement.
17 The international system of units of measurement; SI is from the French Système International.
18 http://www.garyryanblair.com
19 Adapted from: Goralwalla, I. A., M. T. Özsu, and D. Szafron. 1998. An object-oriented framework for
temporal data models. In Temporal databases: research and practice, edited by O. Etzion, S. Jajoda, and S.
Sripada. Berlin: Springer-Verlag
20 Cowlishaw, M. F. (1987). Lexx—a programmable structured editor. IBM Journal of Research and
Development, 31(1), 73-80.
21 www.ofx.net
22 You should use an XML editor for creating XML files. I have not found a good open source product. Among
the commercial products, my preference is for Oxygen, which you can get for a 30-day trial.
23 In this chapter, numbers in {} refer to line numbers in the corresponding XML code.
24 A namespace is a collection of valid names of attributes, types, and elements for a schema. Think of a
namespace as a dictionary.
26 www.softwareag.com/Corporate/products/wm/tamino/default.asp
27 The lib_mysqludf_xql library is housed at www.mysqludf.org. It might not be installed for the version of
MySQL you are using.
28 http://richardtwatson.com/xml/customerpayments.xml
29 http://r4stats.com/articles/popularity/
30 http://cran.r-project.org/web/packages/
31 http://co2now.org/Current-CO2/CO2-Now/noaa-mauna-loa-co2-data.html
32 http://dev.mysql.com/downloads/connector/j/
33 See http://www.r-project.org/doc/bib/R-books.html
34 http://chartsgraphs.wordpress.com/category/r-climate-data-analysis-tool/
35 Note these prices have been transformed from the original values, but are still representative of the changes
over time.
36 http://www.statmethods.net/management/functions.html
37 Wilkinson, L. (2005). The grammar of graphics (2nd ed.). New York: Springer.
41 Note these prices have been transformed from the original values, but are still representative of the changes
over time.
42 http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf
43 Ambiguities are often the inspiration for puns. "You can tune a guitar, but you can't tuna fish. Unless of course,
you play bass," by Douglas Adams
44 http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
45 http://jeffreybreen.wordpress.com/tag/sentiment-analysis/
46 http://www.berkshirehathaway.com/letters/letters.html
49 The mathematics of LDA and CTM are beyond the scope of this text. For details, see Grün, B., & Hornik, K.
(2011). Topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1-30.
50 Humans have a capacity to handle about 7±2 concepts at a time. Miller, G. A. (1956). The magical number
seven, plus or minus two: some limits on our capacity for processing information. The Psychological Review,
63(2), 81-97.
51 http://opennlp.apache.org
52 http://knime.org
53 http://www.investors.ups.com/phoenix.zhtml?c=62900&p=irol-reportsannual
54 http://tech.knime.org/named-entity-recognizer-and-tag-cloud-example
55 http://www.manamplified.org/archives/2008/06/hadoop-quotes.html
57 http://hadoop.apache.org
58 http://www.revolutionanalytics.com
59 http://www.erh.noaa.gov/okx/climate/records/monthannualtemp.html
60 http://dl.dropbox.com/u/6960256/data/electricityprices2010-12.csv
61 http://dl.dropbox.com/u/6960256/data/GDP.csv
62 http://www.gartner.com/it/page.jsp?id=2149415
63 www.pcworld.com/businesscenter/article/217883/seagate_solidstate_disks_are_doomed_at_least_for_now.html
64 The further electrons are moved, the more energy is lost. As much as a third of the initial energy generated can
be lost during transmission across the grid before it gets to the final customer.
65 This section is based on Iyer, B., & Henderson, J. C. (2010). Preparing for the future: understanding the seven
capabilities of cloud computing. MISQ Executive, 9(2), 117-131.
66 Mell, P., & Grance, T. (2009). The NIST definition of cloud computing: National Institute of Standards and
Technology.
69 http://en.wikipedia.org/wiki/Comma-separated_values
71 http://richardtwatson.com/dm6e/Reader/java.html
73 http://richardtwatson.com/dm6e/Reader/java.html
74 To the author’s knowledge, Gordon C. Everest was the first to define data integrity in terms of these three
goals. See Everest, G. (1986). Database management: objectives, systems functions, and administration. New
York, NY: McGraw-Hill.
75 Source: Lewis Jr, W., Watson, R. T., & Pickren, A. (2003). An empirical assessment of IT disaster probabilities.
Communications of the ACM, 46(9), 201-206.
76 Redman, T. C. (2001). Data quality : the field guide. Boston: Digital Press.
78 Adapted from Helman, P. (1994). The science of database management. Burr Ridge, IL: Richard D. Irwin, Inc.
p. 434
79 http://www.sqlite.org
80 For more on information architecture, see Smith, H. A., Watson, R. T., & Sullivan, P. (2012). Delivering
Effective Enterprise Architecture at Chubb Insurance. MISQ Executive, 11(2).
81 http://www.tpc.org
Table of Contents
Title Page
Preface
Section 1: The Managerial Perspective
1. Managing Data
2. Information
Section 2: Data Modeling and SQL
3. The Single Entity
4. The One-to-Many Relationship
5. The Many-to-Many Relationship
6. One-to-One and Recursive Relationships
7. Data Modeling
Reference 1: Basic Structures
8. Normalization and Other Data Modeling Methods
9. The Relational Model and Relational Algebra
10. SQL
Reference 2: SQL Playbook
Section 3: Advanced Data Management
11. Spatial and Temporal Data Management
12. XML: Managing Data Exchange
13. Organizational Intelligence
14. Introduction to R
15. Data visualization
16. Text mining & natural language processing
17. HDFS & MapReduce
Section 4: Managing Organizational Memory
18. Data Structure and Storage
19. Data Processing Architectures
20. SQL and Java
21. Data Integrity
22. Data Administration
Footnotes