Identify Physical Database Requirements LO2
UNDER
Ethiopian TVET-System
INFORMATION TECHNOLOGY
DATABASE ADMINISTRATION
LEVEL III
LEARNING GUIDE # 6
Unit of Competence: Identify Physical Database Requirements
Functionality
These questions deal with what the system is supposed to accomplish and, to a lesser extent, how. It is
usually best to avoid deciding how the system should do anything until you thoroughly understand what it
should do, so you don't become locked into one idea too early, but it is still useful to record any impressions
the customers have of how the system should work.
· What should the system do?
· Why are you building this system? What do you hope it will accomplish?
· What should it look like? Sketch out the user interface.
· What response times do you need for different parts of the system? (Typically, interactive response times
should be under five seconds, whereas reports and other offline activities may take longer.)
· What reports are needed?
· Do the end users need to be able to define new reports?
· Who are the players? (ties to previous section)
· Do power users and administrators need to be able to define new reports?
Data Needs
These questions help clarify the project's data needs. Knowing what data is needed will help you start
defining the database's tables.
· What data is needed for the user interface?
· Where should that data come from? How are those pieces of data related?
· How are these tasks handled today? Where does the data come from?
Data Integrity
These questions deal with data integrity. They help you define some of the integrity constraints that you will
build into the database.
· What values are allowed in which fields?
· Which fields are required? (For example, does a customer record need a phone number? A fax number?
An email address? One of those but not all of them?)
· What are the valid domains (allowed values) for various fields? What phone number formats are allowed?
How long can customer names be? Addresses? Do addresses need extra lines for suite or apartment number?
Do addresses need to handle U.S. ZIP Codes such as 12345? ZIP+4 Codes such as 12345-6789? Canadian
postal codes such as T1A 6G9? Or other countries’ postal codes?
· Which fields should refer to foreign keys? (For example, an address's State field might need to be in the
States table and a Customer field might need to be in the Customers table. I've seen customers with a big
list of standard comments and a Comments field that could only take those values.)
· Should the system validate cities against postal codes? (For example, should it verify that the 10005 ZIP
Code is in New York City, New York? That’s cool but a bit tricky and can involve a lot of data.)
· Do you need a customer record before you can place orders?
· If a customer cancels an account, do you want to delete the corresponding records or just
flag them as inactive?
· What level of system reliability is needed?
· Does the system need 24/7 access?
· How volatile is the data? How often does it need to be backed up?
· How disastrous will it be if the system crashes?
· How quickly do you need to be back up and running?
· How painful will it be if you lose some data during a crash?
Security
These questions focus on the application’s security. The answers to these questions will help you decide
which database product will work best (different products provide different forms of security) and what
architecture to use.
· Does each user need a separate password? (Generally, a good idea.)
· Do different users need access to different pieces of data? (For example, sales clerks might need to access
customer credit card numbers but order fulfillment technicians probably don’t.)
· Does the data need to be encrypted within the database?
· Do you need to provide audit trails recording every action taken and by whom? (For example, you can see
which clerk increased the priority of a customer who was ordering the latest iPod and then ask that clerk
why that happened.)
· What different classes of users will there be?
There are often three classes of users. First, clerks do most of the regular work. They enter orders, print
invoices, and so forth. Second, supervisors can do anything that clerks can and they also perform managerial
tasks. They can view reports, logs, and audit trails; assign clerks to tasks; grant bonuses; and so forth. Third,
super users or key users can do everything. They can reset user passwords, go directly into database tables to
fix problems, change system parameters such as the states that users can pick from dropdowns, and so forth.
There should only be a couple of super users and they should usually log in as supervisors, not as super users,
to prevent accidental catastrophes.
· How many of each class of user will there be? Will only one person need access to the data
at a time? Will
there be hundreds or even thousands (as is the case with some Web applications)?
· Is there existing documentation describing the users’ tasks and responsibilities?
Environment
These questions deal with the project’s surrounding environment. They gather information
about other systems and processes that the project will replace or with which it will interact.
· Does this system enhance or replace an existing system?
· Is there documentation describing the existing system?
· Does the existing system have paper forms that you can study?
· What features in the existing system are required? Which are not?
· What kinds of data does the existing system use? How is it stored? How are different pieces of
data related?
· Is there documentation for the existing system's data?
· Are there other systems with which this one must interact? Exactly how will it interact with them?
· Will the new project send data to existing systems? How?
· Will the new project receive data from existing systems? How?
· Is there documentation for those systems?
· How does your business work? (Try to understand how this project fits into the bigger picture.)
Brainstorm
At this point, you should have a decent understanding of the customers’ business and needs. To make sure
the customer hasn't left anything out, you can hold brainstorming sessions. Bring in as many stakeholders as
you can and let them run wild. Don’t rule out anything yet. If a stakeholder says the database should record
the color of customers’ shoes when they make a purchase, write it down. Continue brainstorming until
everyone has had their say and it’s clear that no new ideas are appearing. Occasionally extra creative people
look like they’re going to go on forever. Let them go for a while but if it’s clear they really can’t stem the
flood of ideas, split up. Have everyone go off separately and write down anything else relevant that they can
think of. Then come back and dump all of the ideas in a big pile. Try not to let the Customer Champion
suppress the others' creativity too early. Though the Customer Champion has the final say, the goal right now
is to gather ideas, not to decide which ones are the best. The goal at this point isn’t to accept or eliminate
anything as much as it is to write everything down. You want to be sure that everything relevant is
considered before you start designing. Later, when you’ve started laying out tables and indexes and changes
are more difficult to make, you don't want someone to step in and say, "Owl voltages!
Why didn't someone think of owl voltages?" Hopefully you have owl voltages written down somewhere and
crossed out so you can say they were considered and everyone agreed they were not a priority.
Look to the Future
During the brainstorming process, think about future needs. Explicitly ask the customers what they might
like to have in future releases. You may be able to include some of those ideas in the current project, but
even if you can’t it’s nice to know where things are headed. That will help you design your database flexibly
so you can more easily incorporate changes in the future. For example, suppose your customer Paula Marble
runs a plumbing supply shop but thinks someday it might be nice to add a little café and call the whole thing
‘‘Paula’s Plumbing and Pastries.’’ Think about how this might affect the database and the rest of the project.
Plumbing supplies are generally non-perishable, but pastries must be baked fresh daily and the ingredients
that go into pastries are perishable. You may want to think about using separate inventory tables to hold
information about non-perishable plumbing items that clients can purchase (gaskets, thread tape, pipe
wrenches) and perishable cooking items that the clients won’t buy directly (flour, eggs, raisins). You might
not even track quantity in stock for finished pastries (the clients either see them in the case or not) but you
probably want to be able to record prices for them nonetheless. In that case, you will have entries in an
inventory table that will contain prices but that will never hold quantities. You don’t necessarily need to start
planning the future database just yet, but you can keep these future changes in mind as you study the rest of
the problem.
The second kind of use case lists more specific steps that actors take to perform a task, although the steps are
still listed at a fairly high level. Neither of these kinds of use case diagram provides enough detail to use as a
script for testing, although they do list the cases that you must test. Because they are shown at such a high
level, they are great for executive presentations. For more information on use case diagrams, look for books
about UML (Unified Modeling Language), which includes use case diagrams, or search the Web for "use
case diagram."
Typical use cases might include:
· The user logging in.
· The user logging out.
· Switching users (if the program allows that).
· Creating a new customer record.
· Editing a customer record.
· Marking a customer record as inactive.
· Creating a new order for an existing customer.
· Creating a new order for a new customer.
· Creating an invoice for an order.
· Sending out late payment notices.
· Creating a replacement invoice in case the customer lost one.
· Receiving a payment.
· Defining a new inventory item (when the CEO decides that you should start selling Rogaine for Dogs).
· Adding new items to inventory (for example, when you restock your fuzzy dice supply).
Etc.
The list can go on practically forever. A large project can include hundreds of use cases and it may take quite
a while to write them all down and then later verify that the finished project handles them all. In addition to
being measurable, use cases should be as realistic as possible. There’s no point in verifying that the program
can handle a situation that will never occur in real life.
Summary
Building any custom product is largely a translation process whether you’re building a small database, a
gigantic Internet sales system similar to the one used by Amazon, or a really tricked-out snowboard. You
need to translate the half-formed ideas floating around in the minds of your customers into reality. The first
step in the translation process is understanding the customers’ needs. This information sheet explained ways
you can gather information about the customers’ problems, wishes, and desires so you can take the next step
in the process.
In this information sheet you learned how to:
· Try to decide which customers will play which roles.
· Pick the customers' brains for information.
· Look for documentation about user roles and responsibilities, existing procedures, and existing data.
· Watch customers at work and study their current operations directly.
· Brainstorm and categorize the results into priority 1, 2, and 3 items.
· Write use cases.
After you’ve achieved a good understanding of the customers’ needs and expectations, you can start turning
them into data models.
K2
Match the following customer roles with their corresponding descriptions.
Both data volume and access frequencies are shown in Figure 4-1. For example, there are 3,000 PARTs in
this database. The supertype PART has two subtypes, MANUFACTURED (40 percent of all PARTs are
manufactured) and PURCHASED (70 percent are purchased; because some PARTs are of both subtypes, the
percentages sum to more than 100 percent). The analysts at Pine Valley estimate that there are typically 150
SUPPLIERs, and Pine Valley receives, on average, 40 SUPPLIES instances from each SUPPLIER, yielding
a total of 6,000 SUPPLIES. The dashed arrows represent access frequencies. So, for example, across all
applications that use this database, there are on average 20,000 accesses per hour of PART data, and this
yields, based on subtype percentages, 14,000 accesses per hour to PURCHASED PART data. There are an
additional 6,000 direct accesses to PURCHASED PART data. Of this total of 20,000 accesses to
PURCHASED PART, 8,000 accesses then also require SUPPLIES data and of these 8,000 accesses to
SUPPLIES, there are 7,000 subsequent accesses to SUPPLIER data. For online and Web-based applications,
usage maps should show the accesses per second. Several usage maps may be needed to show vastly
different usage patterns for different times of day. Performance will also be affected by network
specifications.
The volume and frequency statistics are generated during the systems analysis phase of the systems
development process when systems analysts are studying current and proposed data processing and business
activities. The data volume statistics represent the size of the business and should be calculated assuming
business growth over a period of at least several years. The access frequencies are estimated from the timing
of events, transaction volumes, the number of concurrent users, and reporting and querying activities.
Because many databases support ad hoc access, such accesses may change significantly over time, and
known database access can peak and dip over a day, week, or month, the access frequencies tend to be less
certain and less stable than the volume statistics. Fortunately, precise numbers are not necessary. What is crucial is the
relative size of the numbers, which will suggest where the greatest attention needs to be given during
physical database design in order to achieve the best possible performance. For example,
in Figure 4-1, notice that
· There are 3,000 PART instances, so if PART has many attributes and some, like description, would be quite
long, then the efficient storage of PART might be important.
· For each of the 4,000 times per hour that SUPPLIES is accessed via SUPPLIER, PURCHASED PART is
also accessed; thus, the diagram would suggest possibly combining these two co-accessed entities into a
database table (or file). This act of combining normalized tables is an example of denormalization.
· There is only a 10 percent overlap between MANUFACTURED and PURCHASED parts, so it might make
sense to have two separate tables for these entities and redundantly store data for those parts that are both
manufactured and purchased; such planned redundancy is okay if purposeful. Further, there are a total of
20,000 accesses an hour of PURCHASED PART data (14,000 from access to PART and 6,000 independent
access of PURCHASED PART) and only 8,000 accesses of MANUFACTURED PART per
hour. Thus, it might make sense to organize tables for MANUFACTURED and PURCHASED PART data
differently due to the significantly different access volumes.
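As a sketch, the subtype access arithmetic described above can be checked with a few lines of Python. The figures come from the Figure 4-1 discussion; the variable names are purely illustrative:

```python
# Access-frequency arithmetic from the Figure 4-1 discussion.
part_accesses = 20_000                    # accesses per hour to PART data
via_part = part_accesses * 70 // 100      # 70% of PARTs are PURCHASED subtypes
direct = 6_000                            # independent accesses to PURCHASED PART
total_purchased = via_part + direct       # total accesses to PURCHASED PART

print(via_part, total_purchased)  # 14000 20000
```

Working the relative sizes of these numbers through a small script like this, rather than by hand, makes it easier to redo the estimates when the analysts revise the volumes.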
It can be helpful for subsequent physical database design steps if you can also explain the nature of the
access for the access paths shown by the dashed lines. For example, it can be helpful to know that of the
20,000 accesses to PART data, 15,000 ask for a part or a set of parts based on the primary key, Part No (e.g.,
access a part with a particular number); the other 5,000 accesses qualify part data for access by the value of
a nonkey attribute (e.g., quantity on hand). (These specifics are not shown in Figure 4-1.) This more precise description can help in
selecting indexes, one of the major topics we discuss later in this chapter. It might also be helpful to know
whether an access results in data creation, retrieval, update, or deletion. Such a refined description of access
frequencies can be handled by additional notation on a diagram such as in Figure 4-1, or by text and tables
kept in other documentation.
Designing Fields
A field is the smallest unit of application data recognized by system software, such as a programming
language or database management system. A field corresponds to a simple attribute in the logical data model,
and so in the case of a composite attribute, a field represents a single component.
The basic decisions you must make in specifying each field concern the type of data (or storage type) used to
represent values of this field, data integrity controls built into the database, and the mechanisms that the
DBMS uses to handle missing values for the field. Other field specifications, such as display format, also
must be made as part of the total specification of the information system, but we will not be concerned here
with those specifications that are often handled by applications rather than the DBMS.
CHOOSING DATA TYPES
A data type is a detailed coding scheme recognized by system software, such as a DBMS, for representing
organizational data. The bit pattern of the coding scheme is usually transparent to you, but the space to store
data and the speed required to access data are of consequence in physical database design. The specific
DBMS you will use will dictate which choices are available to you. For example, Table 4-1 lists some of the
data types available in the Oracle 11g DBMS, a typical DBMS that uses the SQL data definition and
manipulation language. Additional data types might be available in some DBMSs for currency, voice, image,
and user-defined data.
Table 4-1 Commonly Used Data Types in Oracle 11g

VARCHAR2: Variable-length character data with a maximum length of 4,000 characters; you must enter a
maximum field length (e.g., VARCHAR2(30) specifies a field with a maximum length of 30 characters). A
value shorter than the maximum consumes only the space required.

CHAR: Fixed-length character data with a maximum length of 2,000 characters; the default length is 1
character (e.g., CHAR(5) specifies a field with a fixed length of 5 characters, capable of holding a value
from 0 to 5 characters long).

CLOB: Character large object, capable of storing up to 4 gigabytes of variable-length character data in one
field (e.g., to hold a medical instruction or a customer comment).

NUMBER: Positive or negative number in the range 10^-130 to 10^126; you can specify the precision (total
number of digits to the left and right of the decimal point) and the scale (the number of digits to the right of
the decimal point) (e.g., NUMBER(5) specifies an integer field with a maximum of 5 digits, and
NUMBER(5,2) specifies a field with no more than 5 digits and exactly 2 digits to the right of the decimal
point).

INTEGER: Positive or negative integer with up to 38 digits (same as SMALLINT).

DATE: Any date from January 1, 4712 B.C., to December 31, 9999 A.D.; DATE stores the century, year,
month, day, hour, minute, and second.

BLOB: Binary large object, capable of storing up to 4 gigabytes of binary data (e.g., a photograph or sound
clip).
Selecting a data type involves four objectives that will have different relative levels of importance for different applications:
1. Represent all possible values.
2. Improve data integrity.
3. Support all data manipulations.
4. Minimize storage space.
An optimal data type for a field can, in minimal space, represent every possible value (while eliminating
illegal values) for the associated attribute and can support the required data manipulation (e.g., numeric data
types for arithmetic operations and character data types for string manipulation). Any attribute domain
constraints from the conceptual data model are helpful in selecting a good data type for that attribute.
Achieving these four objectives can be subtle. For example, consider a DBMS for which a data type has a
maximum width of 2 bytes. Suppose this data type is sufficient to represent a Quantity Sold field. When
Quantity Sold fields are summed, the sum may require a number larger than 2 bytes. If the DBMS uses the
field’s data type for results of any mathematics on that field, the 2-byte length will not work. Some data types
have special manipulation capabilities; for example, only the DATE
data type allows true date arithmetic.
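The 2-byte overflow pitfall described above can be simulated in Python using ctypes to mimic a signed 16-bit field. The Quantity Sold values here are invented for illustration:

```python
import ctypes

# Each Quantity Sold value fits comfortably in a signed 16-bit field
# (range -32,768 to 32,767).
quantity_sold = [20_000, 15_000]

true_total = sum(quantity_sold)  # 35,000

# If the DBMS kept the sum in the field's own 2-byte type, the result
# would wrap around, because 35,000 exceeds the int16 maximum of 32,767.
wrapped_total = ctypes.c_int16(true_total).value

print(true_total, wrapped_total)  # 35000 -30536
```

The wrapped result is not an error message but a silently wrong number, which is exactly why the data type chosen for a field must also accommodate the results of the manipulations performed on it.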
Coding Techniques
Some attributes have a sparse set of values or are so large that, given data volumes, considerable storage
space will be consumed. A field with a limited number of possible values can be translated into a code that
requires less space. Consider the example of the Product Finish field illustrated in Figure 4-2. Products at
Pine Valley Furniture come in only a limited number of woods: Birch, Maple, and Oak. By creating a code or
translation table, each Product Finish field value can be replaced by a code, a cross-reference to the lookup
table, similar to a foreign key. This will decrease the amount of space for the Product Finish field and hence
for the PRODUCT file. There will be additional space for the PRODUCT FINISH lookup table, and when
the Product Finish field value is needed, an extra access (called a join) to this lookup table will be required. If
the Product Finish field is infrequently used or if the number of distinct
Product Finish values is very large, the relative advantages of coding may outweigh the costs. Note that the
code table would not appear in the conceptual or logical model. The code table is a physical construct to
achieve data processing performance improvements, not a set of data with business value.
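The coding technique above can be sketched with Python's built-in sqlite3 module. The table and column names follow the Pine Valley example, but the numeric codes themselves are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Lookup (code) table: one short code per Product Finish value.
cur.execute("CREATE TABLE product_finish (code INTEGER PRIMARY KEY, finish TEXT)")
cur.executemany("INSERT INTO product_finish VALUES (?, ?)",
                [(1, "Birch"), (2, "Maple"), (3, "Oak")])

# PRODUCT stores only the compact code, like a foreign key,
# instead of repeating the full finish string in every row.
cur.execute("""CREATE TABLE product (
                   product_id  INTEGER PRIMARY KEY,
                   finish_code INTEGER REFERENCES product_finish(code))""")
cur.execute("INSERT INTO product VALUES (101, 3)")

# Recovering the finish value costs an extra access (a join) to the lookup table.
row = cur.execute("""SELECT p.product_id, f.finish
                     FROM product p
                     JOIN product_finish f ON p.finish_code = f.code""").fetchone()
print(row)  # (101, 'Oak')
```

As the text notes, the code table is a physical construct only: it would appear in neither the conceptual nor the logical data model.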
CONTROLLING DATA INTEGRITY
For many DBMSs, data integrity controls (i.e., controls on the possible value a field can assume)
can be built into the physical structure of the fields and controls enforced by the DBMS on those
fields. The data type enforces one form of data integrity control because it may limit the type of
data (numeric or character) and the length of a field value. The following are some other typical
integrity controls that a DBMS may support:
FIGURE 4-2 EXAMPLE OF A CODE LOOKUP TABLE (PINE VALLEY FURNITURE
COMPANY)
Default value. A default value is the value a field will assume unless a user enters an
explicit value for an instance of that field. Assigning a default value to a field can reduce
data entry time because entry of a value can be skipped. It can also help to reduce data
entry errors for the most common value.
Range control. A range control limits the set of permissible values a field may assume.
The range may be a numeric lower-to-upper bound or a set of specific values. Range
controls must be used with caution because the limits of the range may change over time.
A combination of range controls and coding led to the year 2000 problem that many
organizations faced, in which a field for year was represented by only the numbers 00 to
99. It is better to implement any range controls through a DBMS because range controls
in applications may be inconsistently enforced. It is also more difficult to find and change
them in applications than in a DBMS.
Null value control. A null value is defined as an empty value. Each primary key must
have an integrity control that prohibits a null value. Any other required field may also
have a null value control placed on it if that is the policy of the organization. For
example, a university may prohibit adding a course to its database unless that course has
a title as well as a value of the primary key, Course ID. Many fields legitimately may have
a null value, so this control should be used only when truly required by business rules.
Referential integrity. Referential integrity on a field is a form of range control in which
the value of that field must exist as the value in some field in another row of the same or
(most commonly) a different table. That is, the range of legitimate values comes from the
dynamic contents of a field in a database table, not from some pre-specified set of values.
Note that referential integrity guarantees that only some existing cross-referencing value
is used, not that it is the correct one. A coded field will have referential integrity with the
primary key of the associated lookup table.
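The four controls above can be sketched in DDL. This example uses Python's built-in sqlite3 module; the table and column names are illustrative, and the exact syntax varies by DBMS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.execute("CREATE TABLE state (state_code TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,               -- null value control: PK never null
        name         TEXT NOT NULL,                     -- null value control on a required field
        credit_limit NUMERIC DEFAULT 0
                     CHECK (credit_limit BETWEEN 0 AND 50000),  -- default value + range control
        state_code   TEXT REFERENCES state(state_code)  -- referential integrity
    )""")

conn.execute("INSERT INTO state VALUES ('NY')")
# credit_limit is omitted, so the DBMS supplies the default value 0.
conn.execute("INSERT INTO customer (customer_id, name, state_code) VALUES (1, 'Ann', 'NY')")

# A value outside the range is rejected by the DBMS itself, not by application code.
try:
    conn.execute("INSERT INTO customer (customer_id, name, credit_limit) "
                 "VALUES (2, 'Bob', 99999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

Because the controls live in the table definitions, every application that touches the database inherits them automatically, which is the consistency argument the text makes for enforcing ranges in the DBMS rather than in applications.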
Handling Missing Data
When a field may be null, simply entering no value may be sufficient. For example, suppose a
customer zip code field is null and a report summarizes total sales by month and zip code. How
should sales to customers with unknown zip codes be handled? Two options for handling or
preventing missing data have already been mentioned: using a default value and not permitting
missing (null) values. Missing data are inevitable. The following are some other possible
methods for handling missing data:
Substitute an estimate of the missing value. For example, for a missing sales value when
computing monthly product sales, use a formula involving the mean of the existing
monthly sales values for that product indexed by total sales for that month across all
products. Such estimates must be marked so that users know that these are not actual
values.
Track missing data so that special reports and other system elements cause people to
resolve unknown values quickly. This can be done by setting up a trigger in the database
definition. A trigger is a routine that will automatically execute when some event occurs
or time period passes. One trigger could log the missing entry to a file when a null or
other missing value is stored, and another trigger could run periodically to create a report
of the contents of this log file.
Perform sensitivity testing so that missing data are ignored unless knowing a value might
significantly change results (e.g., if total monthly sales for a particular salesperson are
almost over a threshold that would make a difference in that person’s compensation).
This is the most complex of the methods mentioned and hence requires the most
sophisticated programming. Such routines for handling missing data may be written in
application programs. All relevant modern DBMSs now have more sophisticated
programming capabilities, such as case expressions, user-defined functions, and triggers,
so that such logic can be available in the database for all users without application-
specific programming.
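The trigger-based tracking method described above can be sketched with Python's built-in sqlite3 module. The table, column, and trigger names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, zip TEXT, amount NUMERIC)")
cur.execute("""CREATE TABLE missing_zip_log (
                   sale_id   INTEGER,
                   logged_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

# Trigger: whenever a sale arrives without a zip code, record it in the log
# so a periodic report can prompt someone to resolve the unknown value.
cur.execute("""
    CREATE TRIGGER log_missing_zip
    AFTER INSERT ON sale
    WHEN NEW.zip IS NULL
    BEGIN
        INSERT INTO missing_zip_log (sale_id) VALUES (NEW.sale_id);
    END""")

cur.execute("INSERT INTO sale VALUES (1, '12345', 10.0)")
cur.execute("INSERT INTO sale VALUES (2, NULL, 25.0)")   # logged by the trigger

logged = [r[0] for r in cur.execute("SELECT sale_id FROM missing_zip_log")]
print(logged)  # [2]
```

A second, scheduled job (or another trigger) would then report the contents of the log table, matching the two-trigger arrangement described in the text.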
Summary
During physical database design, you, the designer, translate the logical description of data into
the technical specifications for storing and retrieving data. The goal is to create a design for
storing data that will provide adequate performance and ensure database integrity, security, and
recoverability. In physical database design, you consider normalized relations and data volume
estimates, data definitions, data processing requirements and their frequencies, user expectations,
and database technology characteristics to establish the specifications that are used to implement
the database using a database management system.
A field is the smallest unit of application data, corresponding to an attribute in the logical data
model. You must determine the data type, integrity controls, and how to handle missing values
for each field, among other factors. A data type is a detailed coding scheme for representing
organizational data. Data may be coded to reduce storage space. Field integrity control includes
specifying a default value, a range of permissible values, null value permission, and referential
integrity.