Database Module
Chapter One
Introduction to Database Systems
1.1 Database - Definition and Usage
Database systems are designed to manage large data sets in an organization. Data
management involves both the definition and the manipulation of data, ranging from simple
representation of the data to considerations of structures for the storage of information. Data
management also covers the provision of mechanisms for the manipulation of information.
Today, databases are essential to every business. They are used to maintain internal records, to
present data to customers and clients on the World Wide Web, and to support many other
commercial processes. Databases are likewise found at the core of many modern organizations.
The power of databases comes from a body of knowledge and technology that has developed
over several decades and is embodied in specialized software called a database management
system, or DBMS. A DBMS is a powerful tool for creating and managing large amounts of data
efficiently and allowing it to persist over long periods of time, safely. These systems are among
the most complex types of software available.
Thus, to answer our question: what is a database? In essence, a database is nothing more than a
collection of shared information that exists over a long period of time, often many years. In
common usage, the term database refers to a collection of data that is managed by a DBMS.
Thus the DB course is about:
How to organize data
Supporting multiple users
Efficient and effective data retrieval
Secured and reliable storage of data
Maintaining consistent data
Making information useful for decision making
For example, one user, the grade reporting office, may keep a file on students and their grades.
Programs to print a transcript and to enter new grades into the file are implemented as part of the
application. A second user, the accounting office, may keep track of students' fees and their
payments. Although both are interested in the data about students, each user maintains separate
files and programs to manipulate the files because each requires some data not available from the
other user's files.
Summary
• File based systems were an early attempt to computerize the manual filing system.
• This approach is the decentralized computerized data handling method.
The introduction of shared files solves the problem of inconsistent data across different versions
of the same file held by different departments, but other problems may emerge, including:
• When each department had its own version of a file for processing, each department
could ensure that the structure of the file suited their specific application. If departments
have to share files, the file structure that suits one department might not suit another, for
example, data might need to be sorted in a different sequence for different applications
(for instance, customer details could be stored in alphabetical order, or numerical order,
or ascending or descending order of customer number).
• Some applications may require access to more data than others. For instance, a credit
control application will need access to customer credit limit information, whereas a
delivery note printing application will only need access to customer name and address
details. The file will still need to contain the additional information to support the
application that requires it.
• If the structure of the data file needs to be changed in some way (for example, to reflect a
change in currency), this alteration will need to be reflected in all application programs
that use that data file. This problem is known as physical data dependence, and will be
examined in more detail later in the unit.
• While a data file is being processed by one application, the file will not be available for
other applications or for ad hoc queries. This is because, if more than one application is
allowed to alter data in a file at one time, serious problems can arise in ensuring that the
updates made by each application do not clash with one another. This issue of ensuring
consistent, concurrent updating of information is an extremely important one, and is dealt
with in detail for database systems in the unit on concurrency control. File-based systems
avoid these problems by not allowing more than one application to access a file at one
time.
DBMS Engine
The engine is the central component of a DBMS. This component provides access to the
database and coordinates all of the functional elements of the DBMS. An important source of
data for the DBMS engine, and the database system as a whole, is known as metadata. Metadata
means data about data. Metadata is contained in a part of the DBMS called the data dictionary
(described below), and is a key source of information to guide the processes of the DBMS
engine. The DBMS engine receives logical requests for data (and metadata) from human users
and from applications, determines the secondary storage location (i.e. the disk address of the
requested data), and issues physical input/output requests to the computer operating system. The
data requested is fetched from physical storage into computer main memory, where it is held
in special data structures provided by the DBMS. Whilst the data remains in memory, it is
managed by the DBMS engine. Additional data structures are created by the database system
itself, or by users of the system, in order to provide rapid access to data being processed by the
system. These data structures include indexes to speed up access to the data, buffer areas into
which particular types of data are retrieved, lists of free space, etc. The management of these
additional data structures is also carried out by the DBMS engine.
3. Data Dictionary:
Because a database is a self-describing system, a tool called the Data Dictionary is
used to store and organize information about the data stored in the database.
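The idea of metadata can be made concrete with a small sketch. It assumes SQLite (through Python's standard sqlite3 module) as a convenient example DBMS; the STUDENT table and its columns are hypothetical. SQLite keeps its catalog in a table named sqlite_master and reports column details through PRAGMA table_info, so the database can be asked to describe itself:

```python
import sqlite3

# In-memory database; the student table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (stu_id INTEGER PRIMARY KEY, stu_lname TEXT NOT NULL, gpa REAL)"
)

# Metadata: which tables exist, according to the DBMS's own catalog.
table_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# Metadata: column names and declared types of the student table.
# Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(student)")]

print(table_names)
print(columns)
```

Other DBMS products expose their data dictionary differently, so the catalog names above apply to SQLite only.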
Hardware:
Hardware consists of the components that one can touch and feel. These include various types
of personal computers, mainframes, or server computers used in a multi-user system, network
infrastructure, and other peripherals required by the system.
Software:
Software is the collection of commands and programs used to make the hardware perform a
function. It includes the DBMS software, application programs, operating systems,
network software, language software, and other relevant software.
Data:
Since the goal of any database system is to provide better control of data and to make data useful, data
is the most important component to the user of the database. There are two categories of data in any
database system: operational data and metadata. Operational data is the data actually stored in the
system to be used by the user. Metadata is the data that is used to store information about the database
itself. The structure of the data in the database is called the schema, which is composed of the entities,
the properties of entities, and the relationships between entities.
Procedure:
Procedures are the rules and regulations on how to design and use a database. They include
how to log on to the DBMS, how to use its facilities, how to start and stop transactions, how to make
backups, how to handle hardware and software failures, and how to change the structure of the database.
People:
This component is composed of the people in the organization who are responsible for, or play a role in,
designing, implementing, managing, administering, and using the resources in the database. It
ranges from people with a high level of knowledge about the database and the design
technology to others with no knowledge of the system beyond using the data in the database. In general,
database users include the following people:
Database Administrator
Database Designer
Application Programmers and System Analysts
End Users
4. End Users
Workers whose jobs require accessing the database frequently for various purposes. There are different
groups of users in this category.
i. Naïve Users:
Sizable proportion of users
Unaware of the DBMS
Only access the database based on their access level and demand
Use standard and pre-specified types of queries.
These users can again be classified as “Actors on the Scene” and “Workers Behind the Scene”.
Actors on the Scene:
Data Administrator
Database Administrator
Database Designer
End Users
1. Planning: identifying the information gap in an organization and proposing a database solution to
solve the problem.
2. Data analysis and requirements: concentrates on fact finding about the problem or the
opportunity. Feasibility analysis, requirement determination and structuring, and selection of the best
design method are also performed at this phase.
i. Designer’s efforts are focused on
a) Information needs of information users.
b) Information sources.
c) Information constitution.
ii. Sources of information for the designer
a) Developing and gathering end user data views
b) Direct observation of the current system: existing and desired output
c) Interface with the systems design group
iii. The designer must identify the company’s business rules and analyze their impacts.
4. Design: in database design, the most emphasis is given to this phase. The phase is further divided into
three sub-phases.
a) Conceptual Design: a concise description of the data, data types, relationships between data,
and constraints on the data. There is no consideration of implementation or physical detail.
It is used to elicit and structure all information requirements.
5. DBMS Selection
The selection of an appropriate DBMS to support the database application.
Undertaken at any time prior to logical design provided sufficient information is available
regarding system requirements.
Also design the user interface and the application programs using the selected DBMS
2) Network Model
The network model was first implemented by Honeywell in 1964-65 (the IDS system). It was adopted
heavily because of support from CODASYL (the CODASYL DBTG report of 1971). It was later implemented in a
large variety of systems: IDMS (Cullinet, now CA), DMS 1100 (Unisys), IMAGE (H.P.), and VAX-DBMS
(Digital Equipment Corp).
The network model is a database model conceived as a flexible way of representing objects and their
relationships. Its distinguishing feature is that the schema, viewed as a graph in which object types are
nodes and relationship types are arcs, is not restricted to being a hierarchy or lattice. The nodes
correspond to record types and the links to pointers or relationships. All the relationships are hardwired
or pre-computed and built into the structure of the database itself, because this is very efficient in space
utilization and query execution time.
The network data structure looks like a tree structure except that a dependent node which is called a
child or member, may have more than one parent or owner node. The network model replaces the
hierarchical model with a graph thus allowing more general connections among the nodes. The main
difference between the network model and the hierarchical model is its ability to handle many-to-many
relationships. In other words, it allows a record to have more than one parent.
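The owner/member structure can be sketched in a few lines of Python. This is only an illustration of the idea that a member record may be linked to more than one owner; the Record class, the link helper, and the sample data are hypothetical and do not reproduce the CODASYL data definition language.

```python
class Record:
    """A record occurrence with explicit (hard-wired) links to related records."""

    def __init__(self, name):
        self.name = name
        self.owners = []    # parent (owner) records
        self.members = []   # child (member) records


def link(owner, member):
    """Connect a member record to an owner record in both directions."""
    owner.members.append(member)
    member.owners.append(owner)


# Hypothetical data: courses own the students enrolled in them.
databases = Record("Databases")
networks = Record("Networks")
abebe = Record("Abebe")
sara = Record("Sara")

link(databases, abebe)
link(networks, abebe)   # a second owner: not allowed in a strict hierarchy
link(databases, sara)

owner_names = [o.name for o in abebe.owners]
member_names = [m.name for m in databases.members]
print(owner_names, member_names)
```

Navigation follows the stored links directly, which is why no searching or joining is needed to move from an owner to its members.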
A schema diagram displays only some aspects of a schema, such as the names of record types
and data items, and some types of constraints.
Other aspects are not specified in the schema diagram; for example, the previous figure shows
neither the data type of each data item, nor the relationships among the various files.
Many types of constraints are not represented in schema diagrams.
For example, a constraint such as "students majoring in computer science must take CS1310
before the end of their sophomore year" is quite difficult to represent diagrammatically.
Internal (physical) level: This is the lowest level of abstraction and the one closest to the physical
storage device. It describes how data are actually stored on the storage medium. The internal schema,
which contains the definition of the stored record, the method of representing the data fields, and
the access aids used, expresses the internal view.
ANSI-SPARC Architecture and Database Design Phases
Data Independence
The three-schema architecture can be used to further explain the concept of data independence, which
can be defined as the capacity to change the schema at one level of a database system without having to
change the schema at the next higher level.
Application programs interact with the external database schema, which has an interface, or mapping, to
the conceptual schema. The conceptual schema is concerned with the identity and relationships between
elements of data of interest to an organization, and has an interface or mapping to the internal schema.
The internal schema controls how the data is stored on physical media, such as magnetic disks.
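A minimal sketch of data independence, assuming SQLite via Python's sqlite3 module as the example DBMS. The table, view, and index names are hypothetical. The view plays the role of the external schema, the base table stands for the conceptual schema, and creating an index stands for a change at the internal level; the application query is the same before and after both changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: a hypothetical STUDENT base table.
conn.execute("CREATE TABLE student (stu_id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Abebe", "CS"), (2, "Sara", "IS")],
)

# External level: the view is all the application program sees.
conn.execute(
    "CREATE VIEW cs_students AS SELECT stu_id, name FROM student WHERE dept = 'CS'"
)

APP_QUERY = "SELECT stu_id, name FROM cs_students ORDER BY stu_id"
before = conn.execute(APP_QUERY).fetchall()

# Internal-level change: add an access path (physical data independence).
conn.execute("CREATE INDEX idx_student_dept ON student (dept)")
# Conceptual-level change: add a new attribute (logical data independence).
conn.execute("ALTER TABLE student ADD COLUMN phone TEXT")

after = conn.execute(APP_QUERY).fetchall()
print(before, after)
```

Not every schema change is absorbed this way; dropping or renaming a column that the view uses would still require the mapping to be updated.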
All the DBMS functionality, application program execution, and user interface processing were carried
out on one machine. The figure above illustrates the physical components in a centralized architecture.
Gradually, DBMS systems started to exploit the available processing power at the user side, which led to
client/server DBMS architectures.
Client-Server Architectures:
The client/server architecture was developed to deal with computing environments in which a large
number of PCs, workstations, file servers, printers, database servers, Web servers, e-mail servers, and
other software and equipment are connected via a network. The idea is to define specialized servers with
specific functionalities.
A client-server DBMS environment has different kinds of components. These include specialized servers
with specialized functions, clients, and the DBMS server.
Specialized Servers with Specialized functions.
File Servers: maintain the files of the client machines.
Printer Servers: are connected to various printers; all print requests by the clients are
forwarded to this machine.
Web Servers and E-mail Servers also fall into the specialized server category.
Clients: The resources provided by specialized servers can be accessed by many client machines.
The client machines provide appropriate interfaces and a client version of the system to access
and utilize the server resources, as well as local processing power to run local applications.
Clients may be diskless machines, or PCs or workstations with disks that have only the client software
installed. Others would have both client and server functionality.
Clients are connected to the servers via some form of network (LAN: local area network, wireless
network, etc.).
DBMS Server
A server is a system containing both hardware and software that can provide services to the
client machines, such as file access, printing, archiving, or database access.
Provides database query and transaction services to the clients
Sometimes called query and transaction servers
Two main types of basic DBMS architectures were created on this underlying client/server framework:
Two-tier client/server architecture
Three-tier client/server architecture
The architectures described here are called two-tier architectures because the software components are
distributed over two systems: client and server. The advantages of this architecture are its simplicity and
seamless compatibility with existing systems.
Chapter Three
Relational Data Model
3.1 Properties of Relational Databases
Each row of a table is uniquely identified by a Primary Key composed of one or more columns
Each tuple in a relation must be unique
A group of columns that uniquely identifies a row in a table is called a Candidate Key
The Entity Integrity rule of the model states that no component of the primary key may contain a
NULL value.
A column or combination of columns that matches the primary key of another table is called a
Foreign Key. It is used to cross-reference tables.
The Referential Integrity Rule of the model states that, for every foreign key value in a table,
there must be a corresponding primary key value in another table in the database, or the
foreign key value must be NULL.
All tables are Logical Entities
A table is either a BASE TABLE (Named Relation) or a VIEW (Unnamed Relation)
Only Base Tables are physically stored
VIEWS are derived from BASE TABLES with SQL instructions like: [SELECT .. FROM
.. WHERE .. ORDER BY]
In the collection of tables, each entity is represented in one table and Attributes are fields
(columns) in the table
Order of rows and columns is immaterial
Entries with repeating groups are said to be un-normalized
Entries are single-valued
Each column (field or attribute) has a distinct name
All values in a column represent the same attribute and have the same data format
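The entity integrity and referential integrity rules listed above can be tried out with a short sketch. It assumes SQLite via Python's sqlite3 module; the DEPARTMENT and STUDENT tables and the try_insert helper are hypothetical. Two SQLite-specific details matter here: foreign keys are only checked after PRAGMA foreign_keys = ON, and a non-integer primary key needs an explicit NOT NULL for NULL keys to be rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute(
    "CREATE TABLE department (dept_code TEXT PRIMARY KEY NOT NULL, dept_name TEXT)"
)
conn.execute(
    """CREATE TABLE student (
           stu_id    INTEGER PRIMARY KEY,
           name      TEXT,
           dept_code TEXT REFERENCES department (dept_code)  -- foreign key, may be NULL
       )"""
)
conn.execute("INSERT INTO department VALUES ('CS', 'Computer Science')")


def try_insert(sql, params):
    """Return True if the row is accepted, False if an integrity rule rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    # Entity integrity: the primary key may not be NULL and must be unique.
    "null_primary_key": try_insert("INSERT INTO department VALUES (?, ?)", (None, "Unknown")),
    "duplicate_primary_key": try_insert("INSERT INTO department VALUES (?, ?)", ("CS", "Copy")),
    # Referential integrity: the foreign key matches a primary key or is NULL.
    "matching_foreign_key": try_insert("INSERT INTO student VALUES (?, ?, ?)", (1, "Abebe", "CS")),
    "null_foreign_key": try_insert("INSERT INTO student VALUES (?, ?, ?)", (2, "Sara", None)),
    "dangling_foreign_key": try_insert("INSERT INTO student VALUES (?, ?, ?)", (3, "Musa", "XX")),
}
print(results)
```

Only the two inserts that respect both rules are accepted; the other three are rejected by the DBMS rather than by application code.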
Types of Attributes
(1) Simple (atomic) Vs Composite attributes
• Simple: an attribute that cannot be subdivided (contains a single value). E.g. Age, gender
• Composite: an attribute that can be further subdivided to yield additional attributes.
• For example, the attribute ADDRESS can be subdivided into street_Address, city, state, and
postal code. Similarly, the attribute PHONE_NUMBER can be subdivided into area code and
exchange number.
• To facilitate detailed queries, it is wise to change composite attributes into a series of simple
attributes.
(2) Single-valued Vs multi-valued attributes
• A single-valued attribute is an attribute that can have only a single value (the value may change
but there is only one value at any one time). For example, a person can have only one Name, Sex, Id.
No., color_of_eyes, and Social Security number, and a manufactured part can have only one serial
number.
• Keep in mind that a single-valued attribute is not necessarily a simple attribute. For instance, a
part’s serial number, such as SE-08-02-189935, is single-valued, but it is a composite attribute.
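A short sketch of turning composite attributes into simple ones. The comma-separated address format, the sample address, and the split_address helper are assumptions made for illustration; the serial number is the one from the text, split at its hyphens without assigning a meaning to each component.

```python
from dataclasses import dataclass


@dataclass
class Address:
    """ADDRESS decomposed into four simple attributes."""
    street_address: str
    city: str
    state: str
    postal_code: str


def split_address(raw: str) -> Address:
    """Split 'street, city, state, postal code' (assumed format) into simple attributes."""
    street, city, state, postal = [part.strip() for part in raw.split(",")]
    return Address(street, city, state, postal)


addr = split_address("12 Main Street, Springfield, IL, 62701")  # hypothetical address

# A single-valued attribute can still be composite: one serial number, several parts.
serial_parts = "SE-08-02-189935".split("-")

print(addr.city, serial_parts)
```

Once the components are stored separately, a query such as "all students in a given city" no longer has to parse a free-text address.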
Relationships
Relationships between entities exist and must be taken into account when processing
information. In any business process, one object may be associated with another object due to some
event. Such an association is what we call a RELATIONSHIP between entity objects.
• One external event or process may affect several related entities.
• Related entities require setting of LINKS from one part of the database to another.
• A relationship should be named by a word or phrase which explains its function
• Role names are different from the names of entities forming the relationship: one entity may
take on many roles, the same role may be played by different entities
• For each RELATIONSHIP, one can talk about the Number of Entities and the Number of
Tuples participating in the association. These two concepts are called Degree and Cardinality
of a relationship respectively.
Degree of a Relationship
An important point about a relationship is how many entities participate in it. The number of entities
participating in a relationship is called the Degree of the relationship. The basic degrees of
relationship are the following:
Unary/Recursive Relationship: Tuples/records of a single entity are related to each other.
Binary Relationship: Tuples/records of two entities are associated in a relationship.
Ternary Relationship: Tuples/records of three different entities are associated.
N-ary Relationship (a generalized one): Tuples from an arbitrary number of entity sets
participate in a relationship.
Cardinality of a Relationship
Another important concept about a relationship is the number of instances/tuples that can be associated
with a single instance from one entity in a single relationship. The number of instances participating or
associated with a single instance from an entity in a relationship is called the Cardinality of the
relationship. The major cardinalities of a relationship are:
ONE-TO-ONE: one tuple is associated with only one other tuple.
E.g. Building – Location, as a single building will be located in a single location and a
single location will only accommodate a single building.
ONE-TO-MANY: one tuple can be associated with many other tuples, but not the reverse.
E.g. Department-Student, as one department can have multiple students.
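The cardinality definitions can be illustrated with a small function that inspects sample pairs of related instances. The function name and the sample data are hypothetical. Note the limitation: it reports only what the sample happens to show, whereas the real cardinality of a relationship is a business rule that data alone cannot prove.

```python
from collections import defaultdict


def observed_cardinality(pairs):
    """Classify (left, right) instance pairs as '1:1', '1:M', 'M:1' or 'M:N'."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)

    # Does any left instance relate to several right instances, and vice versa?
    left_has_many = any(len(rights) > 1 for rights in left_to_right.values())
    right_has_many = any(len(lefts) > 1 for lefts in right_to_left.values())

    if left_has_many and right_has_many:
        return "M:N"
    if left_has_many:
        return "1:M"
    if right_has_many:
        return "M:1"
    return "1:1"


building_location = [("Bldg1", "LocA"), ("Bldg2", "LocB")]
department_student = [("CS", "Abebe"), ("CS", "Sara"), ("IS", "Musa")]

print(observed_cardinality(building_location))
print(observed_cardinality(department_student))
```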
Relational Views
Relations are perceived as tables from the users’ perspective. There are two kinds of relations
in a relational database: Named and Unnamed Relations. The
basic difference lies in how the relation is created, used, and updated:
1. Base Relation:- A Named Relation corresponding to an entity in the conceptual schema, whose
tuples are physically stored in the database.
Chapter Four
Data Modeling Using the Entity-Relationship (ER) Model
4.1 Database Design
Database design is the process of producing the different kinds of specifications for the data to be stored
in the database. It is one of the middle phases of information systems
development when the system uses a database approach. In the design phase we
describe how the data should be perceived at different levels and, finally, how it is going to be
stored in a computer system.
The ability to design databases and associated applications is critical to the success of the modern
enterprise. Database design requires understanding both the operational and business requirements of an
organization as well as the ability to model and realize those requirements using a database.
Developing database and information systems is performed using a development lifecycle, which
consists of a series of steps. As it is one component in most information system development tasks, there
are several steps to follow in designing a database system.
Developing an information system with a database application consists of several tasks, which include:
Planning of Information Systems Design
Requirements Analysis
Design (Conceptual, Logical and Physical Design)
Tuning
Implementation
Operation and Support
The requirements gathering and specification provides you with a high-level understanding of the
organization, its data, and the processes that you must model in the database. Database design involves
constructing a suitable model of this information. Since the design process is complicated, especially for
large databases, database design is mainly focused on these three phases:
1. Conceptual Design
2. Logical Design, and
3. Physical Design
In general, one has to go back and forth between these tasks to refine a database design, and decisions in
one task can influence the choices in another task.
(b) Attributes
Are properties used to describe each Entity or real world object.
Are used to store pieces of information about entities.
Attributes will give rise to recorded items of data in the database
For example, the STUDENT entity includes, among many others, the attributes STU_LNAME,
STU_FNAME, and STU_INITIAL.
In the original Chen notation, attributes are represented by ovals and are connected to the entity
rectangle with a line.
(c) Relationships
Relationships describe associations among data (exist between entities).
Most relationships describe associations between two entities.
Relationship (relationship type) is a meaningful association among entity types.
Generally, a relationship is represented as a connection between (or among) entities.
The standard ER model uses a diamond shape to connect the participating entities.
The relationship name is an active or passive verb; for example, a STUDENT takes a CLASS,
a PROFESSOR teaches a CLASS, a DEPARTMENT employs a PROFESSOR, a
DIVISION is managed by an EMPLOYEE.
The entities that participate in a relationship are also known as participants, and each
relationship is identified by a name that describes the relationship.
When the basic data model components were introduced, three types of relationships among data were
illustrated:
One-to-Many (1:M)
Many-to-Many (M:N), and
One-to-One (1:1)
The ER model uses the term connectivity to label the relationship types.
The name of the relationship is usually an active or passive verb.
For example, a PAINTER paints many PAINTINGs; an EMPLOYEE learns many SKILLs;
an EMPLOYEE manages a STORE.
Before working on the conceptual design of the database, one has to know and answer the following
basic questions.
• What are the entities and relationships in the enterprise?
• What information about these entities and relationships should we store in the database?
• What are the integrity constraints that hold? These are constraints on each data item with respect
to update, retrieval, and storage.
• Represent this information pictorially in ER diagrams, then map ER diagram into a relational
schema.
Total participation:
Every tuple in the entity or relation participates in at least one relationship by taking a role. This means
every tuple in a relation will be attached to at least one other tuple. The entity with total participation
in a relationship is connected to the relationship using a double line. The existence of a mandatory
relationship indicates that the minimum cardinality is at least 1 for the mandatory entity.
Let’s examine a few more scenarios. Suppose that Tiny College employs some professors who
conduct research without teaching classes.
If you examine the “PROFESSOR teaches CLASS” relationship, it is quite possible for a
PROFESSOR not to teach a CLASS. Therefore, CLASS is optional to PROFESSOR. On the
other hand, a CLASS must be taught by a PROFESSOR. Therefore, PROFESSOR is mandatory
to CLASS.
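One common way to express this optional/mandatory distinction in a relational schema is sketched below, assuming SQLite via Python's sqlite3 module. The table layout and sample rows are hypothetical. Declaring the foreign key in CLASS as NOT NULL makes PROFESSOR mandatory to CLASS, while nothing forces a professor to appear in any CLASS row, so CLASS stays optional to PROFESSOR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE professor (prof_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """CREATE TABLE class (
           class_id INTEGER PRIMARY KEY,
           title    TEXT,
           prof_id  INTEGER NOT NULL REFERENCES professor (prof_id)  -- mandatory side
       )"""
)

conn.execute("INSERT INTO professor VALUES (1, 'Researcher')")  # teaches no class
conn.execute("INSERT INTO professor VALUES (2, 'Lecturer')")
conn.execute("INSERT INTO class VALUES (10, 'Databases', 2)")

# A class without a professor violates the mandatory participation.
try:
    conn.execute("INSERT INTO class VALUES (11, 'Orphan class', NULL)")
    class_without_professor_allowed = True
except sqlite3.IntegrityError:
    class_without_professor_allowed = False

# A professor without a class is perfectly legal (optional participation).
professors_without_class = [
    row[0]
    for row in conn.execute(
        "SELECT p.name FROM professor p "
        "LEFT JOIN class c ON c.prof_id = p.prof_id "
        "WHERE c.class_id IS NULL"
    )
]
print(class_without_professor_allowed, professors_without_class)
```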
Problem: Which car (Car1, Car3, or Car5) is used by Employee 6 (Emp6) working in Branch 1 (Bra1)?
From this ER model one cannot tell which car is used by which staff member, since a branch can have more
than one car and a branch is also populated by more than one employee. Thus we need to restructure the
model to avoid the connection trap.
To avoid the Fan Trap problem we can go for restructuring of the E-R Model. This will result in the
following E-R Model.
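In code form, the difference between the two structures can be sketched as follows. The restructured relationships used here (a BRANCH has EMPLOYEEs and an EMPLOYEE uses CARs) and the assignment of Car3 to Emp6 are assumptions made purely for illustration.

```python
# Fan-trap structure: BRANCH 1:M CAR and BRANCH 1:M EMPLOYEE,
# with no direct link between CAR and EMPLOYEE.
branch_cars = {"Bra1": ["Car1", "Car3", "Car5"]}
employee_branch = {"Emp6": "Bra1"}


def cars_possibly_used(employee):
    """All the fan-trap model can say: every car of the employee's branch."""
    return branch_cars[employee_branch[employee]]


# Restructured model (assumed): BRANCH 1:M EMPLOYEE and EMPLOYEE 1:M CAR.
employee_cars = {"Emp6": ["Car3"]}  # hypothetical assignment


def cars_used(employee):
    """The restructured model records the car(s) of each employee directly."""
    return employee_cars.get(employee, [])


ambiguous = cars_possibly_used("Emp6")
resolved = cars_used("Emp6")
print(ambiguous, resolved)
```

The first lookup can only narrow the answer down to the cars of the branch; the second answers the question directly because the pathway now runs through EMPLOYEE.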
If we have a set of projects that are not currently active, then we cannot assign a project manager to
these projects. So there are projects with no project manager, which gives the participation a minimum
value of zero.
Problem:
How can we identify which BRANCH is responsible for which PROJECT? We know that, whether the
PROJECT is active or not, there is a responsible BRANCH. But which branch it is remains a question to be
answered, and since we have a minimum participation of zero between EMPLOYEE and PROJECT we
cannot identify the BRANCH responsible for each PROJECT.
The solution to this Chasm Trap problem is to add another relationship between the extreme entities
(BRANCH and PROJECT).
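The chasm trap and its fix can be sketched with three small tables, assuming SQLite via Python's sqlite3 module. The column names and sample rows are hypothetical. ProjB is inactive and has no manager, so the pathway through EMPLOYEE cannot reach its branch, while the added BRANCH-PROJECT relationship can.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE branch   (branch_id TEXT PRIMARY KEY);
    CREATE TABLE employee (emp_id TEXT PRIMARY KEY,
                           branch_id TEXT REFERENCES branch (branch_id));
    CREATE TABLE project  (
        proj_id    TEXT PRIMARY KEY,
        manager_id TEXT REFERENCES employee (emp_id),  -- NULL while inactive
        branch_id  TEXT REFERENCES branch (branch_id)  -- added direct relationship
    );
    INSERT INTO branch   VALUES ('Bra1');
    INSERT INTO employee VALUES ('Emp1', 'Bra1');
    INSERT INTO project  VALUES ('ProjA', 'Emp1', 'Bra1');
    INSERT INTO project  VALUES ('ProjB', NULL,   'Bra1');
    """
)

# Pathway through EMPLOYEE only: ProjB has no manager, so no branch is found.
via_employee = dict(
    conn.execute(
        "SELECT p.proj_id, e.branch_id FROM project p "
        "LEFT JOIN employee e ON e.emp_id = p.manager_id ORDER BY p.proj_id"
    )
)

# Direct BRANCH-PROJECT relationship: every project reports its branch.
direct = dict(conn.execute("SELECT proj_id, branch_id FROM project ORDER BY proj_id"))

print(via_employee)
print(direct)
```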
Example:
The company is organized into departments. Each department has a unique name, a unique number, and
a particular employee who manages the department. We keep track of the start date when that employee
began managing the department. A department may have several locations. A department controls a
number of projects, each of which has a unique name, a unique number, and a single location.
We store each employee’s name, Social Security number, address, salary, sex (gender), and birth date.
An employee is assigned to one department, but may work on several projects, which are not necessarily
controlled by the same department. We keep track of the current number of hours per week that an
employee works on each project. We also keep track of the direct supervisor of each employee (who is
another employee). We want to keep track of the dependents of each employee for insurance purposes.
We keep each dependent’s first name, sex, birth date, and relationship to the employee.
So far, we have not represented the fact that an employee can work on several projects, nor have we
represented the number of hours per week an employee works on each project. This characteristic is
listed as part of the third requirement and it can be represented by a multivalued composite attribute of
EMPLOYEE called Works_on with the simple components (Project, Hours). Alternatively, it can be
represented as a multivalued composite attribute of PROJECT called Workers with the simple
Exercises
1. Consider the following set of requirements for a UNIVERSITY database that is used to keep track of
students’ transcripts.
a) The university keeps track of each student’s name, student number, Social Security number,
current address and phone number, permanent address and phone number, birth date, sex,
class (freshman, sophomore, ..., graduate), major department, minor department (if any), and
degree program (B.A., B.S., ..., Ph.D.). Some user applications need to refer to the city, state,
and ZIP Code of the student’s permanent address and to the student’s last name. Both Social
Security number and student number have unique values for each student.
b) Each department is described by a name, department code, office number, office phone
number, and college. Both name and code have unique values for each department.
c) Each course has a course name, description, course number, number of semester hours, level,
and offering department. The value of the course number is unique for each course.
2. Design an ER schema for keeping track of information about votes taken in the U.S. House of
Representatives during the current two-year congressional session. The database needs to keep track
of each U.S. STATE’s Name (e.g., ‘Texas’, ‘New York’, ‘California’) and include the Region of
the state (whose domain is {‘Northeast’, ‘Midwest’, ‘Southeast’, ‘Southwest’, ‘West’}). Each
CONGRESS_PERSON in the House of Representatives is described by his or her Name, plus the
District represented, the Start_date when the congressperson was first elected, and the political
Party to which he or she belongs (whose domain is {‘Republican’, ‘Democrat’, ‘Independent’,
‘Other’}). The database keeps track of each BILL (i.e., proposed law), including the Bill_name, the
Date_of_vote on the bill, whether the bill Passed_or_failed (whose domain is {‘Yes’, ‘No’}), and the
Sponsor (the congressperson(s) who sponsored, that is, proposed, the bill). The database also
keeps track of how each congressperson voted on each bill (domain of Vote attribute is {‘Yes’, ‘No’,
‘Abstain’, ‘Absent’}). Draw an ER schema diagram for this application. State clearly any
assumptions you make.
3. A database is being constructed to keep track of the teams and games of a sports league. A team has
a number of players, not all of whom participate in each game. It is desired to keep track of the
players participating in each game for each team, the positions they played in that game, and the
result of the game. Design an ER schema diagram for this application, stating any assumptions you
make. Choose your favorite sport (e.g., soccer, baseball, football).
4. Consider an entity type SECTION in a UNIVERSITY database, which describes the section
offerings of courses. The attributes of SECTION are Section_number, Semester, Year,
Course_number, Instructor, Room_no (where section is taught), Building (where section is taught),
Weekdays (domain is the possible combinations of weekdays in which a section can be offered
{‘MWF’, ‘MW’, ‘TT’, and so on}), and Hours (domain is all possible time periods during which
sections are offered {‘9–9:50 A.M.’, ‘10–10:50 A.M.’, ..., ‘3:30–4:50 P.M.’, ‘5:30–6:20 P.M.’,
and so on}). Assume that Section_number is unique for each course within a particular
semester/year combination (that is, if a course is offered multiple times during a particular semester,
its section offerings are numbered 1, 2, 3, and so on). There are several composite keys for section,
and some attributes are components of more than one key. Identify three composite keys, and show
how they can be represented in an ER schema diagram.
Superclass/Supertype Entity
• Is the generalized entity
• An entity type whose tuples share common attributes. Attributes that are shared by all entity
occurrences (including the identifier) are associated with the supertype.
Subclass/Subtype Entity
• An entity type whose tuples have attributes that distinguish its members from tuples of the
generalized or Superclass entities.
• When one generalized Superclass has various subgroups with distinguishing features and these
subgroups are represented by specialized form, the groups are called subclasses.
• Subclasses can be either mutually exclusive (disjoint) or overlapping (inclusive).
• A single subclass may inherit attributes from two distinct superclasses.
• A mutually exclusive category/subclass is when an entity instance can be in only one of the
subclasses.
E.g.: An EMPLOYEE can either be SALARIED or PART-TIMER but not both.
• An overlapping category/subclass is when an entity instance may be in two or more subclasses.
E.g.: A PERSON who works for a university can be both EMPLOYEE and a
STUDENT at the same time.
Consider the EMPLOYEE supertype entity shown above. This entity can have several different
subtype entities (for example: HOURLY and SALARIED), each with distinct properties not
shared by other subtypes. But whether the employee is HOURLY or SALARIED, the same
attributes (EmployeeId, Name, and DateHired) are shared.
The Supertype EMPLOYEE stores all properties that subclasses have in common. And
HOURLY employees have the unique attribute Wage (hourly wage rate), while SALARIED
employees have two unique attributes, StockOption and Salary.
Completeness Constraint.
• The Completeness Constraint addresses the issue of whether or not an occurrence of a
superclass must also have a corresponding subclass occurrence.
• The completeness constraint requires that all instances of the subtype be represented in the
supertype.
• The Total Specialization Rule specifies that every entity occurrence in the superclass must be a
member of at least one of the subclasses. Total participation of superclass instances in subclasses
is diagrammed with a double line from the supertype to the circle as shown below.
E.g.: If we have EXTENSION and REGULAR as subclasses of a superclass STUDENT,
then each student must be either an EXTENSION or a REGULAR student.
Thus the participation of instances of STUDENT in EXTENSION and REGULAR
subclasses will be total.
• The Partial Specialization Rule specifies that it is not necessary for all entity occurrences in the
superclass to be a member of one of the subclasses. Here we have an optional participation on
the specialization. Partial Participation of superclass instances on subclasses is diagrammed with
a single line from the Supertype to the circle.
E.g.: If we have MANAGER and SECRETARY as subclasses of a superclass EMPLOYEE,
then it is not the case that all employees are either managers or secretaries. Thus the
participation of instances of EMPLOYEE in MANAGER and SECRETARY subclasses
will be partial.
The two types of constraints on generalization and specialization (Disjointness and Completeness
constraints) are independent of one another. That is, whether a specialization is disjoint does not
determine whether the tuples in the superclass have Total or Partial participation in that specialization.
From the two types of constraints we can have four possible combinations:
• Disjoint AND Total
• Disjoint AND Partial
• Overlapping AND Total
• Overlapping AND Partial
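As an illustration, the two constraints can be checked independently on sample data. The following is a minimal Python sketch, not part of the ER notation itself; the function name classify_specialization and the sample identifiers are assumptions made for this example.

```python
def classify_specialization(superclass_ids, subclasses):
    """Return (disjointness, completeness) for one specialization.

    superclass_ids: set of identifiers of all superclass instances.
    subclasses: dict mapping a subclass name to the set of its member ids.
    """
    # Count how many subclasses each superclass instance belongs to.
    counts = {
        entity_id: sum(entity_id in members for members in subclasses.values())
        for entity_id in superclass_ids
    }
    disjointness = "overlapping" if any(n > 1 for n in counts.values()) else "disjoint"
    completeness = "total" if all(n >= 1 for n in counts.values()) else "partial"
    return disjointness, completeness


# Hypothetical instances (identifiers invented for illustration).
employees = {1, 2, 3, 4}
by_pay = {"SALARIED": {1, 2}, "PART-TIMER": {3, 4}}
by_role = {"MANAGER": {1}, "SECRETARY": {2}}
persons = {1, 2, 3}
by_status = {"EMPLOYEE": {1, 2}, "STUDENT": {2, 3}}

print(classify_specialization(employees, by_pay))
print(classify_specialization(employees, by_role))
print(classify_specialization(persons, by_status))
```

The same superclass can be disjoint and total under one specialization and disjoint and partial under another, which is why the two constraints are recorded separately.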
Chapter Five
Logical and Physical Database Design
5.1 Logical Database Design
Logical design is the process of constructing a model of the information used in an
enterprise based on a specific data model (e.g. relational, hierarchical or network or
object), but independent of a particular DBMS and other physical considerations.
A collection of rules must be maintained in logical database design. These rules help us to
discover new entities and revise attributes based on the rules and the discovered entities.
Applying these rules is called the normalization process.
The first step before applying the rules in relational data model is converting the
conceptual design to a form suitable for relational logical model, which is in a form of
tables.
5.1.2 Normalization
A relational database is merely a collection of data, organized in a particular manner. As
the father of the relational database approach, Codd created a series of rules called
normal forms that help define that organization.
Deletion Anomalies:
If the employee with ID 16 is deleted, then every piece of information about the skill C++ and
the type of skill is deleted from the database. We will then have no information about C++ and
its skill type.
Insertion Anomalies:
What if we have a new employee with a skill called Pascal? We cannot decide whether
Pascal is allowed as a value for skill and we have no clue about the type of skill that
Pascal should be categorized as.
Since the type of Wine served depends on the type of Dinner, we say Wine is
functionally dependent on Dinner.
Dinner → Wine
Dinner Course   Type of Wine   Type of Fork
Meat            Red            Meat fork
Fish            White          Fish fork
Cheese          Rose           Cheese fork
Since both Wine type and Fork type are determined by the Dinner type, we say Wine is
functionally dependent on Dinner and Fork is functionally dependent on Dinner.
Dinner → Wine
Dinner → Fork
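Whether a table instance is consistent with a functional dependency can be tested mechanically. The sketch below is one possible check in Python; the helper name holds is an assumption for this example. Note that a functional dependency is a rule about all valid instances, so a check on one instance can only show that the instance does not violate it.

```python
def holds(rows, lhs, rhs):
    """Return True if the instance `rows` does not violate lhs -> rhs.

    rows: list of dicts, one per tuple; lhs, rhs: lists of attribute names.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for a key and returns the
        # stored one; a different value later means the dependency fails.
        if seen.setdefault(key, value) != value:
            return False
    return True


dinner = [
    {"Dinner": "Meat", "Wine": "Red", "Fork": "Meat fork"},
    {"Dinner": "Fish", "Wine": "White", "Fork": "Fish fork"},
    {"Dinner": "Cheese", "Wine": "Rose", "Fork": "Cheese fork"},
]

print(holds(dinner, ["Dinner"], ["Wine"]))
print(holds(dinner, ["Dinner"], ["Fork"]))

# A hypothetical extra row serving white wine with meat breaks Dinner -> Wine.
violating = dinner + [{"Dinner": "Meat", "Wine": "White", "Fork": "Meat fork"}]
print(holds(violating, ["Dinner"], ["Wine"]))
```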
Partial Dependency
If an attribute which is not a member of the primary key is dependent on some part of the
primary key (if we have composite primary key) then that attribute is partially
functionally dependent on the primary key.
Let {A,B} be the primary key and C a non-key attribute.
If {A,B} → C and also B → C (or A → C),
then C is partially functionally dependent on {A,B}.
Full Dependency
If an attribute which is not a member of the primary key is not dependent on some part of
the primary key but the whole key (if we have composite primary key) then that attribute
is fully functionally dependent on the primary key.
Let {A,B} be the primary key and C a non-key attribute.
If {A,B} → C, and neither B → C nor A → C holds,
then C is fully functionally dependent on {A,B}.
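The difference between the two cases can be made concrete with a small Python sketch. It tests every proper subset of the key; the names holds and dependency_kind and the sample rows are assumptions for this example only.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the instance `rows` does not violate lhs -> rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


def dependency_kind(rows, key, attr):
    """Classify a non-key attribute as 'full', 'partial' or 'none' w.r.t. key."""
    if not holds(rows, key, [attr]):
        return "none"
    # If any proper, non-empty part of the key already determines attr,
    # the dependency on the whole key is only partial.
    for size in range(1, len(key)):
        for part in combinations(key, size):
            if holds(rows, list(part), [attr]):
                return "partial"
    return "full"


# Hypothetical instance: C is determined by A alone, D needs both A and B.
rows = [
    {"A": 1, "B": 1, "C": "x", "D": 10},
    {"A": 1, "B": 2, "C": "x", "D": 20},
    {"A": 2, "B": 1, "C": "y", "D": 30},
    {"A": 2, "B": 2, "C": "y", "D": 40},
]

print(dependency_kind(rows, ["A", "B"], "C"))
print(dependency_kind(rows, ["A", "B"], "D"))
```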
Transitive Dependency
In mathematics and logic, a transitive relationship is a relationship of the following form:
"If A implies B, and if also B implies C, then A implies C."
First Normal Form (1NF): Remove all repeating groups. Distribute the multi-valued
attributes into different rows and identify a unique identifier for the relation so that it can
be said to be a relation in a relational database.
EmpID FirstName LastName SkillID Skill SkillType School SchoolAdd SkillLevel
EMP_PROJ rearranged
EmpID ProjNo EmpName ProjName ProjLoc ProjFund ProjMangID Incentive
Business rule: Whenever an employee participates in a project, he/she will be entitled to
an incentive.
This schema is in 1NF since we don’t have any repeating groups or attributes with
multi-valued property. To convert it to 2NF we need to remove all partial dependencies
of non-key attributes on part of the primary key.
{EmpID, ProjNo} → EmpName, ProjName, ProjLoc, ProjFund, ProjMangID, Incentive
But in addition to this we have the following dependencies:
FD1: {EmpID} → EmpName
FD2: {ProjNo} → ProjName, ProjLoc, ProjFund, ProjMangID
FD3: {EmpID, ProjNo} → Incentive
PROJECT
ProjNo ProjName ProjLoc ProjFund ProjMangID
EMP_PROJ
EmpID ProjNo Incentive
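The decomposition can be pictured as projecting the original relation onto the attributes of each dependency and dropping duplicate tuples. The following Python sketch illustrates this under stated assumptions: the sample tuples are invented, the helper project_onto is hypothetical, and an EMPLOYEE relation derived from FD1 is assumed alongside the two relations listed above.

```python
ATTRS = ["EmpID", "ProjNo", "EmpName", "ProjName",
         "ProjLoc", "ProjFund", "ProjMangID", "Incentive"]

# Hypothetical sample tuples (not taken from the text).
DATA = [
    (1, 10, "Abebe", "Payroll", "Addis Ababa", 5000, 7, 300),
    (1, 20, "Abebe", "Inventory", "Jimma", 8000, 9, 450),
    (2, 10, "Lemma", "Payroll", "Addis Ababa", 5000, 7, 200),
]
emp_proj = [dict(zip(ATTRS, row)) for row in DATA]


def project_onto(rows, attrs):
    """Keep only attrs and remove duplicate tuples, preserving first-seen order."""
    seen, result = set(), []
    for row in rows:
        values = tuple(row[a] for a in attrs)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(attrs, values)))
    return result


# One relation per dependency: FD1, FD2 and FD3 respectively.
employee = project_onto(emp_proj, ["EmpID", "EmpName"])
project_rel = project_onto(
    emp_proj, ["ProjNo", "ProjName", "ProjLoc", "ProjFund", "ProjMangID"])
emp_proj_2nf = project_onto(emp_proj, ["EmpID", "ProjNo", "Incentive"])

print(len(employee), len(project_rel), len(emp_proj_2nf))
```

In this sample the employee name and the project details are each stored once after the decomposition instead of once per participation.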
Generally, even though there are four additional levels of normalization, a table is
said to be normalized if it reaches 3NF. A database with all tables in 3NF is said to be
a normalized database.
A mnemonic for remembering the rationale for normalization up to 3NF could be the
following:
No Repeating or Redundancy: no repeating fields in the table.
The Fields Depend Upon the Key: the fields should depend solely on the key.
The Whole Key: no partial key dependency.
And Nothing But The Key: no inter-data dependency.
So Help Me Codd: since Codd came up with these rules.
It is considered desirable to keep these three levels quite separate -- one of Codd's requirements for an
RDBMS is that it should maintain logical-physical data independence. The generality of the relational
model means that RDBMSs are potentially less efficient than those based on one of the older data
models where access paths were specified once and for all at the design stage. However the relational
data model does not preclude the use of traditional techniques for accessing data - it is still essential to
exploit them to achieve adequate performance with a database of any size.
We can consider the topic of physical database design from three aspects:
• What techniques for storing and finding data exist
• Which are implemented within a particular DBMS
• Which might be selected by the designer for a given application knowing the properties of the
data
Thus physical database design addresses the following questions:
1. How to map the logical database design to a physical database design.
2. How to design base relations for target DBMS.
3. How to design enterprise constraints for target DBMS.
4. How to select appropriate file organizations based on analysis of transactions.
5. When to use secondary indexes to improve performance.
6. How to estimate the size of the database.
7. How to design user views.
8. How to design security mechanisms to satisfy user requirements.
Physical database design is the process of producing a description of the implementation of the database
on secondary storage.
Physical design describes the base relation, file organization, and indexes used to achieve efficient
access to the data, and any associated integrity constraints and security measures.
c) Choose indexes
The aim of this step is to determine whether adding indexes will improve the performance of the system.
One approach is to keep tuples unordered and create as many secondary indexes as necessary.
Another approach is to order tuples in the relation by specifying a primary or clustering index.
In this case, choose the attribute for ordering or clustering the tuples as:
• Attribute that is used most often for join operations - this makes join operation more efficient, or
• Attribute that is used most often to access the tuples in a relation in order of that attribute.
If the ordering attribute chosen is a key of the relation, the index will be a primary index; otherwise,
the index will be a clustering index.
Each relation can have either a primary index or a clustering index, but not both.
Secondary indexes provide a mechanism for specifying an additional key for a base relation that can be
used to retrieve data more efficiently.
The overhead involved in the maintenance and use of secondary indexes has to be balanced against
the performance improvement gained when retrieving data.
This overhead includes:
• Adding an index record to every secondary index whenever tuple is inserted;
• Updating a secondary index when corresponding tuple is updated;
• Increase in disk space needed to store the secondary index;
• Possible performance degradation during query optimization to consider all secondary
indexes.
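The effect of a secondary index on a retrieval can be observed directly in a small experiment. The sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN statement; the table contents, the index name idx_employee_skill and the helper query_plan are assumptions for illustration, and other DBMSs expose their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, FName TEXT, Skill TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(12, "Abebe", "SQL"), (16, "Lemma", "C++"), (24, "Dereje", "Oracle")],
)


def query_plan(connection, sql, params=()):
    """Return SQLite's textual plan for a query as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is a human-readable description.
    return " | ".join(row[-1] for row in rows)


sql = "SELECT EmpID, FName FROM Employee WHERE Skill = ?"

plan_before = query_plan(conn, sql, ("SQL",))

# Secondary index on a non-key attribute that is used in the WHERE clause.
conn.execute("CREATE INDEX idx_employee_skill ON Employee (Skill)")
plan_after = query_plan(conn, sql, ("SQL",))

print(plan_before)
print(plan_after)
```

Every later insert or update of Skill must now also maintain idx_employee_skill, which is the overhead listed above.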
Two mathematical query languages form the basis for relational languages:
• Relational Algebra
• Relational Calculus
We may describe the relational algebra as procedural language: it can be used to tell the DBMS how
to build a new relation from one or more relations in the database.
We may describe relational calculus as a non-procedural language: it can be used to formulate the
definition of a relation in terms of one or more database relations.
Formally the relational algebra and relational calculus are equivalent to each other. For every
expression in the algebra, there is an equivalent expression in the calculus.
Neither language is user friendly. They have been used as the basis for other, higher-level data
manipulation languages for relational databases.
A query is applied to relation instances, and the result of a query is also a relation instance.
Schemas of input relations for a query are fixed
The schema for the result of a given query is also fixed! Determined by definition of query
language constructs.
The result of the retrieval is a new relation, which may have been formed from one or more relations. The
algebra operations thus produce new relations, which can be further manipulated using operations of the
same algebra.
A sequence of relational algebra operations forms a relational algebra expression, whose result will also
be a relation that represents the result of a database query (or retrieval request).
Relational algebra is a theoretical language with operations that work on one or more relations to
define another relation without changing the original relation.
The output from one operation can become the input to another operation (nesting is possible)
There are different basic operations that could be applied on relations on a database based on the
requirement.
♠ Selection ( σ ) Selects a subset of rows from a relation.
♠ Projection ( π ) Deletes unwanted columns from a relation.
♠ Renaming: assigns a name to the intermediate relation produced by a single operation.
♠ Cross-Product ( × ) Allows us to combine two relations.
♠ Set-Difference ( − ) Tuples in relation1, but not in relation2.
♠ Union ( ∪ ) Tuples in relation1 or in relation2.
♠ Intersection ( ∩ ) Tuples in relation1 and in relation2.
♠ Join ( ⋈ ) Tuples joined from two relations based on a condition.
Using these we can build up sophisticated database queries.
Table1:
Sample table used to illustrate different kinds of relational operations. The relation contains information
about employees, IT skills they have and the school where they attend each skill.
Employee
EmpID FName LName SkillID Skill SkillType School SchoolAdd SkillLevel
12 Abebe Mekuria 2 SQL Database AAU Sidist_Kilo 5
16 Lemma Alemu 5 C++ Programming Unity Gerji 6
28 Chane Kebede 2 SQL Database AAU Sidist_Kilo 10
25 Abera Taye 6 VB6 Programming Helico Piazza 8
65 Almaz Belay 2 SQL Database Helico Piazza 9
24 Dereje Tamiru 8 Oracle Database Unity Gerji 5
51 Selam Belay 4 Prolog Programming Jimma Jimma City 8
94 Alem Kebede 3 Cisco Networking AAU Sidist_Kilo 7
18 Girma Dereje 1 IP Programming Jimma Jimma City 4
13 Yared Gizaw 7 Java Programming AAU Sidist_Kilo 6
6.2.1 Selection
Selects subset of tuples/rows in a relation that satisfy selection condition.
Selection operation is a unary operator (it is applied to a single relation)
The Selection operation is applied to each tuple individually
The degree of the resulting relation is the same as the original relation but the cardinality (no. of
tuples) is less than or equal to the original relation.
The Selection operator is commutative.
A set of conditions can be combined using the Boolean operations ∧ (AND), ∨ (OR), and ~ (NOT).
No duplicates in result!
Schema of result identical to schema of (only) input relation.
Result relation can be the input for another relational algebra operation! (Operator composition.)
It is a filter that keeps only those tuples that satisfy a qualifying condition (those satisfying the
condition are selected while others are discarded.)
Notation:
σ <Selection Condition> (<Relation Name>)
Example: Find all Employees with skill type of Database.
If the query is all employees with a SkillType Database and School Unity the relational algebra operation
and the resulting relation will be as follows.
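In the notation above, the two queries could be written as σ SkillType='Database' (Employee) and σ SkillType='Database' ∧ School='Unity' (Employee). The following Python sketch mimics the operator as a row filter over part of Table1; the helper name select and the representation of tuples as dictionaries are assumptions for illustration only.

```python
COLUMNS = ("EmpID", "FName", "Skill", "SkillType", "School", "SkillLevel")
ROWS = [
    (12, "Abebe", "SQL", "Database", "AAU", 5),
    (16, "Lemma", "C++", "Programming", "Unity", 6),
    (28, "Chane", "SQL", "Database", "AAU", 10),
    (25, "Abera", "VB6", "Programming", "Helico", 8),
    (65, "Almaz", "SQL", "Database", "Helico", 9),
    (24, "Dereje", "Oracle", "Database", "Unity", 5),
    (51, "Selam", "Prolog", "Programming", "Jimma", 8),
    (94, "Alem", "Cisco", "Networking", "AAU", 7),
    (18, "Girma", "IP", "Programming", "Jimma", 4),
    (13, "Yared", "Java", "Programming", "AAU", 6),
]
employee = [dict(zip(COLUMNS, row)) for row in ROWS]


def select(relation, condition):
    """Keep the tuples for which condition(tuple) is true; the schema is unchanged."""
    return [row for row in relation if condition(row)]


database_staff = select(employee, lambda r: r["SkillType"] == "Database")
database_at_unity = select(
    employee,
    lambda r: r["SkillType"] == "Database" and r["School"] == "Unity",
)

print([r["EmpID"] for r in database_staff])
print([r["EmpID"] for r in database_at_unity])
```

The result of each call has the same attributes as Employee and never more tuples than Employee, matching the degree and cardinality properties listed above.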
Notation:
π <Selected Attributes> (<Relation Name>)
Example: To display Name, Skill, and Skill Level of an employee, the query and the resulting relation
will be:
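In the notation above this query could be written as π FName, LName, Skill, SkillLevel (Employee). The Python sketch below imitates projection over part of Table1, including the removal of duplicate tuples from the result; the helper name project is an assumption for this example.

```python
COLUMNS = ("FName", "LName", "Skill", "SkillType", "SkillLevel")
ROWS = [
    ("Abebe", "Mekuria", "SQL", "Database", 5),
    ("Lemma", "Alemu", "C++", "Programming", 6),
    ("Chane", "Kebede", "SQL", "Database", 10),
    ("Abera", "Taye", "VB6", "Programming", 8),
    ("Almaz", "Belay", "SQL", "Database", 9),
    ("Dereje", "Tamiru", "Oracle", "Database", 5),
    ("Selam", "Belay", "Prolog", "Programming", 8),
    ("Alem", "Kebede", "Cisco", "Networking", 7),
    ("Girma", "Dereje", "IP", "Programming", 4),
    ("Yared", "Gizaw", "Java", "Programming", 6),
]
employee = [dict(zip(COLUMNS, row)) for row in ROWS]


def project(relation, attributes):
    """Keep only the listed attributes and drop duplicate result tuples."""
    seen, result = set(), []
    for row in relation:
        values = tuple(row[a] for a in attributes)
        if values not in seen:
            seen.add(values)
            result.append(values)
    return result


names_and_skills = project(employee, ["FName", "LName", "Skill", "SkillLevel"])
skill_types = project(employee, ["SkillType"])

print(names_and_skills[0])
print(skill_types)
```

Projecting onto SkillType alone shows why duplicates must be eliminated: ten input tuples collapse to three distinct values.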
1. Write the operations as a single relational algebra expression by nesting the operations.
2. Apply one operation at a time and create intermediate result relations. In the latter case, we must give
names to the relations that hold the intermediate results.
Rename Operation
If we want to have the Name, Skill, and Skill Level of an employee with salary greater than 1500 and
working for department 5, we can write the expression for this query using the two alternatives:
1. A single algebraic expression:
The above used query is using a single algebra operation, which is:
Then Result will be equivalent with the relation we get using the first alternative.
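The equivalence of the two alternatives can be checked on a small example. The Python sketch below is illustrative only: the Salary and DeptNo values are invented (Table1 does not contain them), and select, project, DEP5_EMPS and RESULT are hypothetical names.

```python
# Hypothetical employee tuples with the Salary and DeptNo attributes the query needs.
employee = [
    {"FName": "Abebe", "Skill": "SQL", "SkillLevel": 5, "Salary": 2000, "DeptNo": 5},
    {"FName": "Lemma", "Skill": "C++", "SkillLevel": 6, "Salary": 1200, "DeptNo": 5},
    {"FName": "Chane", "Skill": "SQL", "SkillLevel": 10, "Salary": 1800, "DeptNo": 4},
    {"FName": "Almaz", "Skill": "SQL", "SkillLevel": 9, "Salary": 2500, "DeptNo": 5},
]


def select(relation, condition):
    return [row for row in relation if condition(row)]


def project(relation, attributes):
    seen, result = set(), []
    for row in relation:
        values = tuple(row[a] for a in attributes)
        if values not in seen:
            seen.add(values)
            result.append(values)
    return result


def condition(row):
    return row["Salary"] > 1500 and row["DeptNo"] == 5


wanted = ["FName", "Skill", "SkillLevel"]

# Alternative 1: a single nested expression.
nested = project(select(employee, condition), wanted)

# Alternative 2: one operation at a time, naming the intermediate relation.
DEP5_EMPS = select(employee, condition)
RESULT = project(DEP5_EMPS, wanted)

print(nested == RESULT)
```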
Type Compatibility
Two relations R1 and R2 are said to be Type Compatible if:
1. The operand relations R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) have the same number of
attributes, and
2. The domains of corresponding attributes must be compatible; that is, Dom(Ai)=Dom(Bi) for
i=1, 2, ..., n.
To illustrate the three set operations, we will make use of the following two tables:
Employee
EmpID FName LName SkillID Skill SkillType School SkillLevel
12 Abebe Mekuria 2 SQL Database AAU 5
16 Lemma Alemu 5 C++ Programming Unity 6
28 Chane Kebede 2 SQL Database AAU 10
25 Abera Taye 6 VB6 Programming Helico 8
65 Almaz Belay 2 SQL Database Helico 9
24 Dereje Tamiru 8 Oracle Database Unity 5
51 Selam Belay 4 Prolog Programming Jimma 8
94 Alem Kebede 3 Cisco Networking AAU 7
18 Girma Dereje 1 IP Programming Jimma 4
13 Yared Gizaw 7 Java Programming AAU 6
RelationOne: Employees who attend Database Course
EmpID FName LName SkillID Skill SkillType School SkillLevel
12 Abebe Mekuria 2 SQL Database AAU 5
28 Chane Kebede 2 SQL Database AAU 10
65 Almaz Belay 2 SQL Database Helico 9
24 Dereje Tamiru 8 Oracle Database Unity 5
The resulting relation for R1 ∪ R2, R1 ∩ R2, or R1 − R2 has the same attribute names as the first operand
relation R1 (by convention).
Some Properties of the Set Operators
Notice that both union and intersection are commutative operations; that is
R ∪ S = S ∪ R, and R ∩ S = S ∩ R
Both union and intersection can be treated as n-ary operations applicable to any number of relations as
both are associative operations; that is
R ∪ (S ∪ T) = (R ∪ S) ∪ T, and (R ∩ S) ∩ T = R ∩ (S ∩ T)
The minus operation is not commutative; that is, in general
R − S ≠ S − R
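These properties can be tried out with Python's built-in set type, treating each relation as a set of tuples. In the sketch below R1 holds a few attributes of the Database course attendees shown earlier, while R2 (employees who attend a course at AAU) is an assumed second relation introduced only for this example; comparing Python types per position is a rough stand-in for comparing attribute domains.

```python
# (EmpID, FName, Skill, School)
R1 = {
    (12, "Abebe", "SQL", "AAU"),
    (28, "Chane", "SQL", "AAU"),
    (65, "Almaz", "SQL", "Helico"),
    (24, "Dereje", "Oracle", "Unity"),
}
# Assumed second relation: employees attending a course at AAU.
R2 = {
    (12, "Abebe", "SQL", "AAU"),
    (28, "Chane", "SQL", "AAU"),
    (94, "Alem", "Cisco", "AAU"),
    (13, "Yared", "Java", "AAU"),
}


def signature(relation):
    """Set of per-position Python type tuples found in the relation."""
    return {tuple(type(v) for v in t) for t in relation}


def type_compatible(r1, r2):
    """Same number of attributes and the same type at every position."""
    return len(signature(r1) | signature(r2)) == 1


assert type_compatible(R1, R2)

union = R1 | R2
intersection = R1 & R2
r1_minus_r2 = R1 - R2
r2_minus_r1 = R2 - R1

print(sorted(t[0] for t in union))
print(sorted(t[0] for t in intersection))
print(sorted(t[0] for t in r1_minus_r2), sorted(t[0] for t in r2_minus_r1))
```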
Employee
ID FName LName
123 Abebe Lemma
567 Belay Taye
822 Kefle Kebede
Dept
DeptID DeptName MangID
2 Finance 567
3 Personnel 123
Then the Cartesian product between Employee and Dept relations will be of the form:
Employee X Dept:
ID FName LName DeptID DeptName MangID
123 Abebe Lemma 2 Finance 567
123 Abebe Lemma 3 Personnel 123
567 Belay Taye 2 Finance 567
567 Belay Taye 3 Personnel 123
822 Kefle Kebede 2 Finance 567
822 Kefle Kebede 3 Personnel 123
Even though it is very important in query processing, the Cartesian Product is not useful by itself,
since it relates every tuple in the first relation with every tuple in the second relation. Thus, to make
use of the Cartesian Product, one has to combine it with the Selection operation, which filters the tuples of a
relation by testing whether each one satisfies the selection condition.
In our example, to extract employee information about managers of the departments (Managers of each
department), the algebra query and the resulting relation will be.
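One way to write this query is σ ID=MangID (Employee × Dept). The Python sketch below reproduces the product and the selection on the two tables above; representing tuples positionally is an assumption made to keep the example short.

```python
from itertools import product

# (ID, FName, LName)
employee = [
    (123, "Abebe", "Lemma"),
    (567, "Belay", "Taye"),
    (822, "Kefle", "Kebede"),
]
# (DeptID, DeptName, MangID)
dept = [
    (2, "Finance", 567),
    (3, "Personnel", 123),
]

# Cartesian product: every employee tuple concatenated with every dept tuple.
cross = [e + d for e, d in product(employee, dept)]

# Selection ID = MangID: position 0 is ID, position 5 is MangID.
managers = [t for t in cross if t[0] == t[5]]

print(len(cross))
for row in managers:
    print(row)
```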
This operation is very important for any relational database with more than a single relation, because it
allows us to process relationships among relations.
The general form of a join operation on two relations
R(A1, A2, . . ., An) and S(B1, B2, . . ., Bm) is:
R ⋈ <Join Condition> S
where R and S can be any relations that result from general relational algebra expressions.
Since JOIN is an operation that needs two relations, it is a binary operation.
The standard definition of natural join requires that the two join attributes, or each pair of corresponding
join attributes, have the same name in both relations. If this is not the case, a renaming operation on the
attributes is applied first.
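The renaming step followed by a natural join can be sketched in Python on the Employee and Dept tables used for the Cartesian product. The helpers rename and natural_join are hypothetical names for this illustration, and tuples are represented as dictionaries keyed by attribute name.

```python
employee = [
    {"ID": 123, "FName": "Abebe", "LName": "Lemma"},
    {"ID": 567, "FName": "Belay", "LName": "Taye"},
    {"ID": 822, "FName": "Kefle", "LName": "Kebede"},
]
dept = [
    {"DeptID": 2, "DeptName": "Finance", "MangID": 567},
    {"DeptID": 3, "DeptName": "Personnel", "MangID": 123},
]


def rename(relation, old, new):
    """Return a copy of the relation with attribute `old` renamed to `new`."""
    return [{(new if k == old else k): v for k, v in row.items()}
            for row in relation]


def natural_join(r, s):
    """Combine tuples that agree on every attribute name the relations share."""
    result = []
    for row_r in r:
        for row_s in s:
            common = row_r.keys() & row_s.keys()
            if all(row_r[a] == row_s[a] for a in common):
                # The shared attribute appears only once in the result.
                result.append({**row_r, **row_s})
    return result


# MangID is renamed to ID so that both relations share the join attribute.
managers = natural_join(employee, rename(dept, "MangID", "ID"))

for row in managers:
    print(row)
```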
1. RIGHT OUTER JOIN: where non matching tuples from the second (Right) relation are included
in the result with NULL value for attributes of the first (Left) relation.
2. LEFT OUTER JOIN: where non matching tuples from the first (Left) relation are included in the
result with NULL value for attributes of the second (Right) relation.
R <Join Condition> S
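The difference between the two outer joins can be seen on the Employee and Dept tables from the Cartesian product example. The Python sketch below uses None in place of NULL; the function names and the dictionary representation are assumptions for illustration.

```python
employee = [
    {"ID": 123, "FName": "Abebe", "LName": "Lemma"},
    {"ID": 567, "FName": "Belay", "LName": "Taye"},
    {"ID": 822, "FName": "Kefle", "LName": "Kebede"},
]
dept = [
    {"DeptID": 2, "DeptName": "Finance", "MangID": 567},
    {"DeptID": 3, "DeptName": "Personnel", "MangID": 123},
]
DEPT_ATTRS = ["DeptID", "DeptName", "MangID"]


def left_outer_join(left, right, condition, right_attrs):
    """Keep every left tuple; pad with None when no right tuple matches."""
    result = []
    for l_row in left:
        matches = [r_row for r_row in right if condition(l_row, r_row)]
        if matches:
            result.extend({**l_row, **r_row} for r_row in matches)
        else:
            result.append({**l_row, **{a: None for a in right_attrs}})
    return result


def right_outer_join(left, right, condition, left_attrs):
    """Keep every right tuple; implemented by swapping the operands."""
    return left_outer_join(right, left, lambda r, l: condition(l, r), left_attrs)


emp_left = left_outer_join(
    employee, dept, lambda e, d: e["ID"] == d["MangID"], DEPT_ATTRS)
dept_right = right_outer_join(
    dept, employee, lambda d, e: d["MangID"] == e["ID"], DEPT_ATTRS)

for row in emp_left:
    print(row)
```

Kefle manages no department, so that tuple is kept with None for all Dept attributes in both results.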
6.3 Relational Calculus
A relational calculus expression creates a new relation, which is specified in terms of variables that range
over rows of the stored database relations (in tuple calculus) or over columns of the stored relations (in
domain calculus).
In a calculus expression, there is no order of operations to specify how to retrieve the query result. A
calculus expression specifies only what information the result should contain rather than how to retrieve
it.
In Relational calculus, there is no description of how to evaluate a query; this is the main distinguishing
feature between relational algebra and relational calculus.
Relational calculus is considered to be a nonprocedural language. This differs from relational algebra,
where we must write a sequence of operations to specify a retrieval request; hence relational algebra can
be considered as a procedural way of stating a query.
When applied to relational databases, the calculus is not the calculus of derivatives and differentials but a
form of first-order logic or predicate calculus. A predicate is a truth-valued function with arguments.
When we substitute values for the arguments in the predicate, the function yields an expression, called a
proposition, which can be either true or false.
If a predicate contains a variable, as in ‘x is a member of staff’, there must be a range for x. When we
substitute some values of this range for x, the proposition may be true; for other values, it may be false.
If COND is a predicate, then the set of all tuples for which the predicate COND evaluates to true will be
expressed as follows:
{t | COND(t)}
Where t is a tuple variable and COND (t) is a conditional expression involving t. The result of such a
query is the set of all tuples t that satisfy COND (t).
If we have a set of predicates to evaluate for a single query, the predicates can be connected using
∧ (AND), ∨ (OR), and ~ (NOT).
6.3.1 Tuple-oriented Relational Calculus
The tuple relational calculus is based on specifying a number of tuple variables. Each tuple variable
usually ranges over a particular database relation, meaning that the variable may take as its value any
individual tuple from that relation.
Tuple relational calculus is interested in finding tuples for which a predicate is true for a relation.
Based on use of tuple variables.
Tuple variable is a variable that ‘ranges over’ a named relation: that is, a variable whose only
permitted values are tuples of the relation.
If E is a tuple variable that ranges over a relation EMPLOYEE, then this is represented as EMPLOYEE(E), i.e.
the range of E is EMPLOYEE.
Then to extract all tuples that satisfy a certain condition, we represent it as all tuples E such that
COND(E) evaluates to true.
{E | COND(E)}
The predicates can be connected using the Boolean operators:
∧ (AND), ∨ (OR), ~ (NOT)
To find only the EmpId, FName, LName, Skill and the School where the skill is attended for
employees with skill level greater than or equal to 8, the tuple based relational calculus expression will
be:
{E.EmpId, E.FName, E.LName, E.Skill, E.School | Employee(E) ∧ E.SkillLevel >= 8}
E.FName means the value of the First Name (FName) attribute for the tuple E.
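A comprehension in Python has almost the same shape as a tuple calculus expression: it states which tuples are wanted and leaves the iteration order to the language. The sketch below applies the condition to part of Table1; the variable names are assumptions for this example.

```python
COLUMNS = ("EmpId", "FName", "LName", "Skill", "School", "SkillLevel")
ROWS = [
    (12, "Abebe", "Mekuria", "SQL", "AAU", 5),
    (16, "Lemma", "Alemu", "C++", "Unity", 6),
    (28, "Chane", "Kebede", "SQL", "AAU", 10),
    (25, "Abera", "Taye", "VB6", "Helico", 8),
    (65, "Almaz", "Belay", "SQL", "Helico", 9),
    (24, "Dereje", "Tamiru", "Oracle", "Unity", 5),
    (51, "Selam", "Belay", "Prolog", "Jimma", 8),
    (94, "Alem", "Kebede", "Cisco", "AAU", 7),
    (18, "Girma", "Dereje", "IP", "Jimma", 4),
    (13, "Yared", "Gizaw", "Java", "AAU", 6),
]
employee = [dict(zip(COLUMNS, row)) for row in ROWS]

# {E.EmpId, E.FName, E.LName, E.Skill, E.School | Employee(E) AND E.SkillLevel >= 8}
result = [
    (E["EmpId"], E["FName"], E["LName"], E["Skill"], E["School"])
    for E in employee           # Employee(E): E ranges over the relation
    if E["SkillLevel"] >= 8     # the condition on the tuple variable
]

for row in result:
    print(row)
```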
This means, there exists at least one tuple of the relation employee where the value for
the SkillLevel is greater than or equal to 8.
This means, for all tuples of relation employee where value for the SkillLevel attribute
is greater than or equal to 8.
Example:
To find employees who work on projects controlled by department 5 the query will be:
{E | Employee(E) ∧ (∃P)(Project(P) ∧ (∃W)(WorksOn(W) ∧ P.Dept=5 ∧ E.EID=W.EID))}
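The existential quantifier corresponds closely to Python's built-in any(). The sketch below evaluates a query of this shape on invented data; all three relations and their contents are hypothetical, and the sketch additionally assumes a PNo attribute that links WorksOn to Project, which the expression above does not show.

```python
# Hypothetical relations (contents invented for illustration).
employee = [
    {"EID": 1, "Name": "Abebe"},
    {"EID": 2, "Name": "Belay"},
    {"EID": 3, "Name": "Kefle"},
]
project = [
    {"PNo": 10, "Dept": 5},
    {"PNo": 20, "Dept": 4},
]
works_on = [
    {"EID": 1, "PNo": 10},
    {"EID": 2, "PNo": 20},
    {"EID": 3, "PNo": 10},
    {"EID": 3, "PNo": 20},
]

# An employee E qualifies if there exists a project P controlled by
# department 5 and there exists a WorksOn tuple W connecting E to P.
result = [
    E for E in employee
    if any(
        P["Dept"] == 5
        and any(W["EID"] == E["EID"] and W["PNo"] == P["PNo"] for W in works_on)
        for P in project
    )
]

print([E["EID"] for E in result])
```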