Fundamentals of Database Systems
Bonga University
College of Engineering and Technology
Department of Computer Science
January 2023
Bonga University, Ethiopia
CHAPTER ONE
INTRODUCTION TO DATABASE SYSTEM
Database System
Database systems are designed to manage large data sets in an organization. Data
management involves both the definition and the manipulation of the data, ranging from
simple representation of the data to considerations of structures for the storage of
information. Data management also covers the provision of mechanisms for the
manipulation of information.
The power of databases comes from a body of knowledge and technology that has
developed over several decades and is embodied in specialized software called a database
management system, or DBMS. A DBMS is a powerful tool for creating and managing
large amounts of data efficiently and allowing it to persist over long periods of time, safely.
These systems are among the most complex types of software available.
Thus, to our question: What is a database? In essence, a database is nothing more than a
collection of shared information that exists over a long period of time, often many years.
In common parlance, the term database refers to a collection of data that is managed by a
DBMS.
The major approaches to data handling are:
1. Manual Approach
2. Traditional File-Based Approach
3. Database Approach
• Files for as many events and objects as the organization has are used to store
information.
• Each of the files containing various kinds of information is labelled and stored in
one or more cabinets.
• The cabinets could be kept in safe places for security purposes, based on the
sensitivity of the information contained in them.
• Insertion and retrieval are done by searching first for the right cabinet, then for the
right file, then for the information.
• One could have an indexing system to facilitate access to the data
Limitations of the manual approach:
• Prone to error
• Difficult to update, retrieve, and integrate
• You have the data, but it is difficult to compile the information
• Limited to small amounts of information
• Cross-referencing is difficult
An alternative approach to data handling is a computerized way of dealing with the
information. The computerized approach can be either decentralized or centralized,
based on where the data resides in the system.
A DBMS provides SHARED (many users can access the same database, or even the same
data, simultaneously) storage of, and access to, MASSIVE amounts of PERSISTENT (data
outlives the programs that operate on it) data. A DBMS also provides a systematic
method for creating, updating, storing, and retrieving data in a database, along with
services for controlling data access, enforcing data integrity, managing concurrency
control, and recovery. With this in mind, a full-scale DBMS should provide at least the
following services to the user.
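The core create/store/update/retrieve services described above can be sketched with Python's built-in sqlite3 module; the employee table and its values here are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database; a hypothetical "employee" table for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Creating (data definition)
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Storing (insertion)
cur.execute("INSERT INTO employee (id, name) VALUES (?, ?)", (1, "Abebe"))

# Updating
cur.execute("UPDATE employee SET name = ? WHERE id = ?", ("Abebe K.", 1))

# Retrieving
row = cur.execute("SELECT name FROM employee WHERE id = 1").fetchone()
print(row[0])  # Abebe K.

conn.commit()
conn.close()
```

Even in this tiny sketch, the DBMS (here SQLite) handles storage, data types, and the NOT NULL integrity rule on our behalf.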
1. Hardware: the components that one can touch and feel. These comprise
various types of personal computers, mainframes or other server
computers used in multi-user systems, network infrastructure, and other
peripherals required by the system.
3. Data: since the goal of any database system is to gain better control of the data
and make the data useful, data is the most important component to the user of the
database. There are two categories of data in any database system:
operational data and metadata. Operational data is the data actually stored in the system
to be used by the user. Metadata is data that stores information about
the database itself.
The structure of the data in the database is called the schema, which is composed
of the Entities, Properties of entities, and relationship between entities.
4. Procedures: the rules and regulations on how to design and use a
database. They include procedures such as how to log on to the DBMS, how to use its
facilities, how to start and stop transactions, how to make backups, how to handle
hardware and software failures, and how to change the structure of the database.
5. People: this component is composed of the people in the organization who are
responsible for, or play a role in, designing, implementing, managing, administering and
using the resources in the database. This component ranges from people with a
high level of knowledge about the database and the design technology to others with
no knowledge of the system beyond using the data in the database.
2. Analysis: concentrates on fact finding about the problem or the
opportunity. Feasibility analysis, requirement determination and structuring, and
selection of the best design method are also performed in this phase.
3. Design: in database development, the most emphasis is given to this phase. The phase is
further divided into three sub-phases.
a. Conceptual Design: concise description of the data, data type, relationship
between data and constraints on the data.
• There is no implementation or physical detail consideration.
• Used to elicit and structure all information requirements
b. Logical Design: mapping of the conceptual design onto a selected specific
data model used to implement the data structure.
• It is independent of any particular DBMS and involves no other physical
considerations.
c. Physical Design: physical implementation of the upper level design of the
database with respect to internal storage and file structure of the database for
the selected DBMS.
• To develop all technology and organizational specification.
4. Implementation: the deployment and testing of the designed database for use.
5. Operation and Support: administering and maintaining the operation of the
database system and providing support to users.
4. End Users
Workers whose jobs require accessing the database frequently for various purposes.
There are different groups of users in this category.
5. Naïve Users:
➢ Sizable proportion of users
➢ Unaware of the DBMS
➢ Only access the database based on their access level and demand
➢ Use standard and pre-specified types of queries.
6. Sophisticated Users
➢ Are users familiar with the structure of the Database and facilities of the
DBMS.
➢ Have complex requirements
➢ Have higher level queries
➢ Are most of the time engineers, scientists, business analysts, etc
ANSI-SPARC Architecture
The purpose and origin of the Three-Level database architecture
All users should be able to access the same data. This is important since the
database has a shared-data feature: all the data is stored in one
location, and each user has his or her own customized way of interacting with
the data.
External Level: Users' view of the database. Describes that part of the database that is
relevant to a particular user. Different users have their own customized view of the
database independent of other users.
Conceptual Level: Community view of the database. Describes what data is stored in
database and relationships among the data.
Internal Level: Physical representation of the database on the computer. Describes how
the data is stored in the database.
The following example can be taken as an illustration for the difference between the
three levels in the ANSI-SPARC database Architecture. Where: The first level is
concerned about the group of users and their respective data requirement independent of
the other.
The second level describes the whole content of the database where
one piece of information will be represented once. The third level
describes the physical storage of the data.
Data Independence
Logical Data Independence:
Refers to the immunity of external schemas to changes in the conceptual schema:
the capacity to change the conceptual schema (e.g., adding or removing entities)
without having to change the external schemas or rewrite application programs.
Data Model
A specific DBMS has its own specific Data Definition Language, but this type of
language is too low level to describe the data requirements of an organization in a
way that is readily understandable by a variety of users.
We need a higher-level language.
Such a higher-level language is called a data model.
Data Model: a set of concepts used to describe the structure of a database, together with
certain constraints that the database should obey.
A data model is a description of the way that data is stored in a database. A data model
helps one understand the relationships between entities and create the most effective
structure to hold data.
Data Model is a collection of tools or concepts for describing
Data
Data relationships
Data semantics
Data constraints
The main purpose of Data Model is to represent the data in an understandable way.
Categories of data models include:
Object-based
Record-based
Physical
Record-based Data Models
Consist of a number of fixed format records.
Each record type defines a fixed number of fields,
Each field is typically of a fixed length.
Hierarchical Data Model
Network Data Model
Relational Data Model
1. Hierarchical Model
• The simplest data model
• Record type is referred to as node or segment
• The top node is the root node
• Nodes are arranged in a hierarchical structure, as a sort of upside-down tree
• A parent node can have more than one child node
• A child node can only have one parent node
• The relationship between parent and child is one-to-many
(Diagram: a hierarchy with Department as the root node and Employee and Job as its child nodes.)
2. Network Model
◼ Allows record types to have more than one parent, unlike the hierarchical model
◼ A network data model sees records as set members
◼ Each set has an owner and one or more members
◼ Allow many to many relationship between entities
(Diagram: record types Department, Job, Employee, Activity, and Time Card linked in owner-member sets; e.g., Department and Job own Employee records, and Employee and Activity own Time Card records.)
Alternative terminologies
Relation = Table = File; Tuple = Row = Record; Attribute = Column = Field
CHAPTER TWO
RELATIONAL DATA MODEL
Properties of Relational Databases
All values in a column represent the same attribute and have the same data format
1. The ENTITIES (persons, places, things etc.) which the organization has to deal with.
Relations can also describe relationships
The name given to an entity should always be a singular noun descriptive of each
item to be stored in it. E.g.: student NOT students.
Every relation has a schema, which describes the columns, or fields; the relation itself
corresponds to our familiar notion of a table:
A relation is a collection of tuples, each of which contains values for a fixed number
of attributes
◼ Weak entity: an entity that cannot exist without the entity with which it has
a relationship; it is indicated by a double rectangle
2. The ATTRIBUTES - the items of information which characterize and describe these
entities.
Attributes are pieces of information ABOUT entities. The analysis must, of course,
identify those which are actually relevant to the proposed application. Attributes
give rise to recorded items of data in the database.
Types of Attributes
(1) Simple (atomic) Vs Composite attributes
• Simple: contains a single value (not divided into sub parts) E.g. Age,
gender
• Composite: Divided into sub parts (composed of other attributes)
E.g., Name, address
(2) Single-valued vs. multi-valued attributes
• Single-valued: have only a single value (the value may change, but there is
only one value at any one time)
E.g. Name, Sex, Id. No., color_of_eyes
• Multi-valued: have more than one value
E.g. Address, college_degree (a person may have several college degrees)
(3) Stored vs. derived attributes
• Stored: not possible to derive or compute from other attributes
E.g. Name, Address
• Derived: the value may be derived (computed) from the values of
other attributes
E.g. Age (current year - year of birth)
If the structure (City, Woreda, Kebele, etc.) is important, e.g. if we want to retrieve
employees in a given city, then address must be modeled as an entity (attribute values
are atomic).
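The attribute categories above can be illustrated in Python; the Person class, its field names, and the sample values are hypothetical. full_name plays the role of a composite attribute assembled from its parts, degrees is multi-valued, and age is a derived attribute computed from the stored birth_year.

```python
from dataclasses import dataclass
import datetime

@dataclass
class Person:
    first_name: str   # part of the composite attribute Name
    last_name: str
    sex: str          # simple, single-valued attribute
    birth_year: int   # stored attribute
    degrees: list     # multi-valued attribute (may hold several values)

    @property
    def full_name(self) -> str:
        # Composite attribute assembled from its sub-parts
        return f"{self.first_name} {self.last_name}"

    @property
    def age(self) -> int:
        # Derived attribute: current year - year of birth
        return datetime.date.today().year - self.birth_year

p = Person("Abebe", "Kebede", "M", 1990, ["BSc", "MSc"])
print(p.full_name, p.age)
```

Note that age is never stored; it is recomputed on demand, which is exactly why a derived attribute need not appear as a column in the database.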
Degree of a Relationship
An important point about a relationship is how many entities participate in it.
The number of entities participating in a relationship is called the DEGREE
of the relationship.
Cardinality of a Relationship
Another important concept about relationship is the number of
instances/tuples that can be associated with a single instance from one entity
in a single relationship. The number of instances participating or associated
with a single instance from an entity in a relationship is called the
CARDINALITY of the relationship. The major cardinalities of a
relationship are:
o ONE-TO-ONE: one tuple is associated with only one other tuple.
▪ E.g. Building – Location as a single building will be located
in a single location and as a single location will only
accommodate a single Building.
o ONE-TO-MANY, one tuple can be associated with many other
tuples, but not the reverse.
▪ E.g. Department-Student as one department can have
multiple students.
o MANY-TO-ONE, many tuples are associated with one tuple but not
the reverse.
▪ E.g. Employee – Department: as many employees belong to
a single department.
o MANY-TO-MANY: one tuple is associated with many other tuples,
and from the other side (with a different role name) one tuple is
likewise associated with many tuples.
Relational Integrity
Key constraints
If tuples need to be unique in the database, then we need to make each
tuple distinct. To do this we need relational keys that uniquely identify
each tuple.
Super Key: an attribute or set of attributes that uniquely identifies a tuple within a
relation.
Candidate Key: a super key such that no proper subset of it is itself a
super key within the relation. A candidate key has two properties:
1. Uniqueness
2. Irreducibility
If a super key has only one attribute, it is automatically a
candidate key.
Primary Key: the candidate key that is selected to identify tuples uniquely within
the relation.
In the worst case, the entire set of attributes in a relation may serve as
the primary key.
Foreign Key: an attribute, or set of attributes, within one relation that matches the
candidate key of some (possibly the same) relation.
A foreign key is a link between different relations, used to create views or
other derived (unnamed) relations.
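As a sketch of how a DBMS enforces key constraints, the following uses Python's sqlite3 with hypothetical DEPARTMENT and EMPLOYEE tables; note that SQLite only enforces foreign keys when the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this to enforce FKs

# department.dept_id is the primary key; employee.dept_id is a foreign key
# referencing it (hypothetical tables for illustration).
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT,
    dept_id INTEGER REFERENCES department(dept_id))""")

conn.execute("INSERT INTO department VALUES (10, 'Computer Science')")
conn.execute("INSERT INTO employee VALUES (1, 'Sara', 10)")   # OK: dept 10 exists

# Referential integrity: inserting an employee for a non-existent
# department violates the foreign key constraint.
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Biruk', 99)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True
```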
Relations are perceived as tables from the users' perspective. Actually, there are
two kinds of relations in a relational database. The two categories or types of
relations are Named and Unnamed relations. The basic difference lies in how the
relation is created, used and updated:
1. Base Relation
A Named Relation corresponding to an entity in the conceptual schema,
whose tuples are physically stored in the database.
2. View (Unnamed Relation)
A virtual relation derived from one or more base relations; its tuples are not
physically stored in the database.
Purpose of a view
➢ Hides unnecessary information from users: since only part of the base
relation (Some collection of attributes, not necessarily all) are to be
included in the virtual table.
➢ Provide powerful flexibility and security: since unnecessary
information will be hidden from the user there will be some sort of
data security.
➢ Provide customized views of the database for users: each user is
presented with his or her own preferred data set and format by
making use of views.
➢ A view of one base relation can be updated.
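A minimal sketch of these purposes, using Python's sqlite3 and a hypothetical employee table: the view hides the salary column from users who should only see id and name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Sara", 9000.0), (2, "Biruk", 7500.0)])

# The view exposes only id and name, hiding the sensitive salary column.
conn.execute("CREATE VIEW employee_public AS SELECT id, name FROM employee")

rows = conn.execute("SELECT * FROM employee_public ORDER BY id").fetchall()
print(rows)  # [(1, 'Sara'), (2, 'Biruk')]
```

The view's tuples are not stored; they are derived from the base relation each time the view is queried, so it always reflects the current state of employee.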
Schemas
◼ A schema describes how data is to be structured; it is defined at setup/design time (also
called "metadata")
◼ Since it is defined during the database development phase, the schema rarely
changes unless system maintenance demands a change to the definition of a
relation.
⚫ Database Schema (Intension): specifies the name of the relation and the collection of its
attributes (specifically, the names of the attributes).
➢ refer to a description of database (or intention)
➢ specified during database design
➢ should not be changed unless during maintenance
⚫ Schema Diagrams
➢ convention to display some aspect of a schema visually
⚫ Schema Construct
➢ refers to each object in the schema (e.g. STUDENT)
E.g.: STUDENT (FName, LName, Id, Year, Dept, Sex)
Instances
Instance: the collection of data in the database at a particular point in time (a
snapshot).
Also called the State, Snapshot, or Extension of the database
➢ Refers to the actual data in the database at a specific point in time
➢ State of database is changed any time we add, delete or update an item.
➢ Valid state: the state that satisfies the structure and constraints specified in
the schema and is enforced by DBMS
◼ Since an instance is the actual data of the database at some point in time, it changes rapidly
◼ To define a new database, we specify its database schema to the DBMS; at this point the
database is empty
◼ The database is initialized when we first load it with data
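The schema/instance distinction can be demonstrated directly; the student table here is hypothetical. Defining the schema leaves the database in an empty state, and each insert moves it to a new state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Defining the schema (intension) leaves the database in an empty state.
conn.execute("CREATE TABLE student (Id TEXT PRIMARY KEY, FName TEXT, Year INTEGER)")
empty = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]

# Each insert/delete/update moves the database to a new state (instance).
conn.execute("INSERT INTO student VALUES ('CS/001', 'Sara', 2)")
after = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(empty, after)  # 0 1
```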
CHAPTER THREE
DATABASE DESIGN
Database design is the process of coming up with different kinds of specification for
the data to be stored in the database. The database design part is one of the middle
phases we have in information systems development where the system uses a
database approach. Design is the part on which we would be engaged to describe
how the data should be perceived at different levels and finally how it is going to be
stored in a computer system.
Information System with Database application consists of several tasks which include:
➢ Planning of Information systems Design
➢ Requirements Analysis,
➢ Design (Conceptual, Logical and Physical Design)
➢ Tuning
➢ Implementation
➢ Operation and Support
From these different phases, the prime interest of a database system will be the Design
part which is again sub divided into other three sub-phases.
These sub-phases are:
1. Conceptual Design
2. Logical Design, and
3. Physical Design
➢ In general, one has to go back and forth between these tasks to refine a database
design, and decisions in one task can influence the choices in another task.
➢ In developing a good design, one should answer such questions as:
▪ What are the relevant Entities for the Organization
▪ What are the important features of each Entity
▪ What are the important Relationships
▪ What are the important queries from the user
Conceptual Design
Logical Design
Physical Design
◼ Designing a conceptual model for the database is not a linear process but
an iterative activity in which the design is refined again and again.
◼ To identify the entities, attributes, relationships, and constraints on the data,
different methods are used during the analysis phase.
These include information gathered by…
➢ Interviewing end users individually and in a group
➢ Questionnaire survey
➢ Direct observation
➢ Examining different documents
◼ The basic E-R model is graphically depicted and presented for review.
◼ The process is repeated until the end users and designers agree that the ER
diagram is a fair representation of the organization’s activities and functions.
◼ Checking for redundant relationships in the ER diagram: relationships
between entities indicate access from one entity to another. It is therefore
possible to reach one entity occurrence from another entity occurrence even
if other entities and relationships separate them; this is often
referred to as 'navigation' of the ER diagram.
◼ The last phase in ER modeling is validating an ER Model against requirement
of the user.
(ER diagram example: entities STUDENTS and COURSE connected by the relationship Enrolled_In, drawn as a diamond. Attributes, drawn as ovals, include Id (key), Gpa, and Age for STUDENTS, and Semester, Academic Year, and Grade on the relationship.)
One-to-one relationship:
➢ A customer is associated with at most one loan via the relationship borrower
➢ A loan is associated with at most one customer via borrower
One-To-Many Relationships
➢ In the one-to-many relationship, a loan is associated with at most one customer
via borrower, while a customer is associated with several (including 0) loans via
borrower
Many-To-Many Relationship
➢ A customer is associated with several (possibly 0) loans via borrower
➢ A loan is associated with several (possibly 0) customers via borrower
➢ Partial participation: some tuples in the entity (relation) may not participate
in the relationship. This means there is at least one tuple from that relation not
taking any role in that specific relationship. The entity with partial participation in
a relationship is connected to the relationship using a single line.
Problem in ER Modeling
The Entity-Relationship Model is a conceptual data model that views the real world as
consisting of entities and relationships. The model visually represents these concepts by
the Entity-Relationship diagram. The basic constructs of the ER model are entities,
relationships, and attributes. Entities are concepts, real or abstract, about which information
is collected. Relationships are associations between the entities. Attributes are properties
which describe the entities.
While designing an ER model one could face design problems called
connection traps. Connection traps are problems arising from misinterpreting certain
relationships.
There are two types of connection traps;
1. Fan trap:
Occurs where a model represents a relationship between entity types, but the pathway
between certain entity occurrences is ambiguous.
May exist where two or more one-to-many (1:M) relationships fan out from an
entity. The problem can be avoided by restructuring the model so that no
1:M relationships fan out from a single entity while all the semantics of the
relationship are preserved.
Example: a BRANCH has many CARs and employs many EMPLOYEEs (two 1:M
relationships fanning out from BRANCH).
Problem: which car (Car1, Car3, or Car5) is used by employee Emp6 working in
branch Bra1? From this ER model one cannot tell which car is used by which
staff member, since a branch can have more than one car and is also staffed by more
than one employee. Thus, we need to restructure the model to avoid the connection trap.
To avoid the Fan Trap problem we can go for restructuring of the E-R Model. This will result
in the following E-R Model.
(Occurrence diagram: cars Car1-Car7, branches Bra1-Bra4, and employees Emp1-Emp7, showing which employee works in which branch and uses which car under the restructured CAR-EMPLOYEE-BRANCH model.)
2. Chasm Trap:
Occurs where a model suggests the existence of a relationship between entity types,
but the pathway does not exist between certain entity occurrences.
Generalization
➢ Generalization occurs when two or more entities represent categories of the same real-world
object.
➢ Generalization is the process of defining a more general entity type from a set of more
specialized entity types.
➢ A generalization hierarchy is a form of abstraction that specifies that two or more entities
that share common attributes can be generalized into a higher-level entity type.
➢ Is considered as bottom-up definition of entities.
➢ Generalization hierarchy depicts relationship between higher level superclass and lower-
level subclass.
Generalization hierarchies can be nested. That is, a subtype of one hierarchy can be a supertype of
another. The level of nesting is limited only by the constraint of simplicity.
Example: Account is a generalized form for Saving and Current Accounts
Specialization
➢ Is the result of subset of a higher-level entity set to form a lower-level entity set.
➢ The specialized entities will have additional set of attributes (distinguishing characteristics)
that distinguish them from the generalized entity.
➢ Is considered as Top-Down definition of entities.
➢ Specialization process is the inverse of the Generalization process. Identify the distinguishing
features of some entity occurrences, and specialize them into different subclasses.
➢ Reasons for Specialization:
o Attributes only partially applying to the superclass
o Relationship types only partially applicable to the superclass
➢ In many cases, an entity type has numerous sub-groupings of its entities that are meaningful
and need to be represented explicitly. This need requires the representation of each subgroup
in the ER model. The generalized entity is a superclass and the set of specialized entities will
be subclasses for that specific Superclass.
Example: Saving Accounts and Current Accounts are Specialized entities for the generalized
entity Accounts. Manager, Sales, Secretary: are specialized employees.
Subclass/Subtype
➢ An entity type whose tuples have attributes that distinguish its members from
tuples of the generalized or Superclass entities.
➢ When one generalized Superclass has various subgroups with distinguishing
features and these subgroups are represented by specialized form, the groups are
called subclasses.
Attribute Inheritance
➢ An entity that is a member of a subclass inherits all the attributes of the
entity as a member of the superclass.
➢ The entity also inherits all the relationships in which the superclass
participates.
➢ An entity may have more than one subclass categories.
➢ All entities/subclasses of a generalized entity or superclass share a
common unique identifier attribute (primary key). i.e. The primary key
of the superclass and subclasses are always identical.
• Consider the EMPLOYEE supertype entity shown above. This entity can
have several different subtype entities (for example, HOURLY and
SALARIED), each with distinct properties not shared by other subtypes. But
whether the employee is HOURLY or SALARIED, the same attributes
(EmployeeId, Name, and DateHired) are shared.
• The Supertype EMPLOYEE stores all properties that subclasses have in
common. And HOURLY employees have the unique attribute Wage (hourly
wage rate), while SALARIED employees have two unique attributes,
StockOption and Salary.
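One way to picture attribute inheritance is with programming-language inheritance; the Python classes below are only an analogy, with the HOURLY and SALARIED subtypes inheriting the common EMPLOYEE attributes and adding their own.

```python
# EMPLOYEE supertype with HOURLY and SALARIED subtypes; the subclasses
# inherit the common attributes (employee_id, name, date_hired).
class Employee:
    def __init__(self, employee_id, name, date_hired):
        self.employee_id = employee_id   # shared identifier attribute
        self.name = name
        self.date_hired = date_hired

class Hourly(Employee):
    def __init__(self, employee_id, name, date_hired, wage):
        super().__init__(employee_id, name, date_hired)
        self.wage = wage                 # unique to HOURLY

class Salaried(Employee):
    def __init__(self, employee_id, name, date_hired, salary, stock_option):
        super().__init__(employee_id, name, date_hired)
        self.salary = salary             # unique to SALARIED
        self.stock_option = stock_option

h = Hourly(1, "Sara", "2020-01-15", wage=25.0)
print(h.name, h.wage)  # inherited attribute and subtype-specific attribute
```

Just as in the ER model, an Hourly instance is still an Employee, so it carries the superclass attributes alongside its own.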
◼ The Partial Specialization Rule specifies that it is not necessary for all entity
occurrences in the superclass to be a member of one of the subclasses. Here we have an
optional participation on the specialization. Partial Participation of superclass instances
on subclasses is diagrammed with a single line from the Supertype to the circle.
◼ Disjointness Constraints.
• Specifies the rule whether one entity occurrence can be a member of more than one
subclasses. i.e., it is a type of business rule that deals with the situation where an entity
occurrence of a Superclass may also have more than one Subclass occurrence.
• The Disjoint Rule restricts one entity occurrence of a superclass to membership in only
one of the subclasses. Example: an EMPLOYEE can be either SALARIED or PART-
TIME, but not both at the same time.
• The Overlap Rule allows one entity occurrence to be a member of more than one subclass.
Example: an EMPLOYEE working at the university can be both a STUDENT and an
EMPLOYEE at the same time.
CHAPTER FOUR
LOGICAL DATABASE DESIGN
Logical design is the process of constructing a model of the information used in an
enterprise based on a specific data model (e.g., relational, hierarchical or network or
object), but independent of a particular DBMS and other physical considerations.
◼ Normalization process
◼ Collection of Rules to be maintained
◼ Discover new entities in the process
◼ Revise attributes based on the rules and the discovered Entities
The first step before applying the rules in relational data model is converting the conceptual
design to a form suitable for relational logical model, which is in a form of tables.
Converting ER Diagram to Relational Tables
Three basic rules to convert ER into tables or relations:
1. For a relationship with One-to-One Cardinality:
⚫ All the attributes are merged into a single table, or one can
post the primary key or candidate key of one of the relations into
the other as a foreign key.
2. For a relationship with One-to-Many Cardinality:
⚫ Post the primary key or candidate key from the “one” side as a
foreign key attribute to the “many” side. E.g.: For a relationship
called “Belongs To” between Employee (Many) and Department
(One)
3. For a relationship with Many-to-Many Cardinality:
⚫ Create a new table (which is the associative entity) and post primary
key or candidate key from each entity as attributes in the new table
along with some additional attributes (if applicable)
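Rule 3 can be sketched in SQL (run here through Python's sqlite3); the student/course/enrolled_in tables are hypothetical, with the new enrolled_in table holding both posted keys plus the additional grade attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (stud_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (course_no TEXT PRIMARY KEY, title TEXT);
-- New table for the many-to-many relationship: both primary keys are
-- posted here as foreign keys, plus an additional attribute (grade).
CREATE TABLE enrolled_in (
    stud_id   INTEGER REFERENCES student(stud_id),
    course_no TEXT    REFERENCES course(course_no),
    grade     TEXT,
    PRIMARY KEY (stud_id, course_no));
INSERT INTO student VALUES (1, 'Sara'), (2, 'Biruk');
INSERT INTO course  VALUES ('CS101', 'Databases');
INSERT INTO enrolled_in VALUES (1, 'CS101', 'A'), (2, 'CS101', 'B');
""")
rows = conn.execute("""SELECT s.name, c.title, e.grade
                       FROM enrolled_in e
                       JOIN student s ON s.stud_id = e.stud_id
                       JOIN course  c ON c.course_no = e.course_no
                       ORDER BY s.name""").fetchall()
print(rows)
```

The composite primary key (stud_id, course_no) guarantees a student enrolls in a given course at most once, which is exactly what the M:N rule requires.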
After converting the ER diagram in to table forms, the next phase is implementing the
process of normalization, which is a collection of rules each table should satisfy.
4.1. Normalization
A relational database is merely a collection of data, organized in a particular manner. As
the father of the relational database approach, Codd created a series of rules, called normal
forms, that help define that organization.
One of the best ways to determine what information should be stored in a database is to
clarify what questions will be asked of it and what data would be included in the answers.
1. Insertion Anomalies
2. Deletion Anomalies
3. Modification Anomalies
Normalization may reduce system performance since data will be cross referenced from
many tables. Thus, denormalization is sometimes used to improve performance, at the cost
of reduced consistency guarantees.
All the normalization rules aim to remove the update anomalies that may appear
during data manipulation after implementation. The types of problems that occur
in an insufficiently normalized table are called update anomalies and include:
(1) Insertion anomalies
An "insertion anomaly" is a failure to place information about a new database entry
into all the places in the database where information about that new entry needs to be
recorded.
Deletion Anomalies:
If the employee with ID 16 is deleted, then all information about the skill C++ and the
type of skill is deleted from the database. We would then have no information
about C++ and its skill type.
Insertion Anomalies:
What if we have a new employee with a skill called Pascal? We cannot decide
whether Pascal is allowed as a value for skill, and we have no clue about the type
of skill that Pascal should be categorized as.
Modification Anomalies:
What if the address for Helico is changed from Piazza to Mexico? We would need to look
for every occurrence of Helico and change the value of School_Add from Piazza
to Mexico, which is prone to error.
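The modification anomaly can be made concrete; the table and data below are hypothetical, mirroring the Helico example: one real-world change (the school's address) forces updates to every row that repeats it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Unnormalized table: the school address is repeated on every row for
# that school (hypothetical data mirroring the Helico example).
conn.execute("CREATE TABLE emp_skill (emp_id INTEGER, school TEXT, school_add TEXT)")
conn.executemany("INSERT INTO emp_skill VALUES (?, ?, ?)",
                 [(12, "Helico", "Piazza"),
                  (16, "Helico", "Piazza"),
                  (28, "AAU",    "Sidist Kilo")])

# Modification anomaly: changing Helico's address means updating every
# occurrence; missing one row would leave the database inconsistent.
n = conn.execute("UPDATE emp_skill SET school_add = 'Mexico' "
                 "WHERE school = 'Helico'").rowcount
print(n)  # 2 rows had to change for one real-world fact
```

In a normalized design the address would live in a separate school table, so the same change would touch exactly one row.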
Data Dependency
The logical associations between data items that point the database designer in the direction
of a good database design are referred to as determinant or dependent relationships.
Two data items A and B are said to be in a determinant or dependent relationship if certain
values of data item B always appear with certain values of data item A. If data item A
is the determinant data item and B the dependent data item, then the direction of the
association is from A to B, and not vice versa.
The essence of this idea is that if the existence of something, call it A, implies that B must
exist and have a certain value, then we say that "B is functionally dependent on
A." We also often express this idea by saying that "A determines B," or that "B is a function
of A," or that "A functionally governs B." Often, the notions of functionality and functional
dependency are expressed briefly by the statement, "If A, then B." It is important to note
that the value B must be unique for a given value of A, i.e., any given value of A must
imply just one and only one value of B, in order for the relationship to qualify for the name
"function." (However, this does not necessarily prevent different values of A from implying
the same value of B.)
X → Y holds if, whenever two tuples have the same value for X, they must have the same
value for Y.
Example
Dinner Course   Type of Wine
Meat            Red
Fish            White
Cheese          Rose
Since the type of Wine served depends on the type of Dinner, we say Wine is functionally dependent
on Dinner.
Dinner → Wine
Since both Wine type and Fork type are determined by the Dinner type, we say Wine is
functionally dependent on Dinner and Fork is functionally dependent on Dinner.
Dinner → Wine
Dinner → Fork
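A functional dependency X → Y can be checked mechanically over a sample relation; the fd_holds helper and the meals data below are hypothetical illustrations extending the dinner example, where the extra Meat tuple deliberately breaks Dinner → Fork to show how a violation is detected.

```python
# Check whether a functional dependency X -> Y holds in a relation,
# represented as a list of dicts (one dict per tuple).
def fd_holds(relation, X, Y):
    seen = {}
    for t in relation:
        x_val = tuple(t[a] for a in X)
        y_val = tuple(t[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False  # same X value paired with a different Y value
        seen[x_val] = y_val
    return True

meals = [
    {"Dinner": "Meat",   "Wine": "Red",   "Fork": "Meat fork"},
    {"Dinner": "Meat",   "Wine": "Red",   "Fork": "Salad fork"},
    {"Dinner": "Fish",   "Wine": "White", "Fork": "Fish fork"},
    {"Dinner": "Cheese", "Wine": "Rose",  "Fork": "Cheese fork"},
]
print(fd_holds(meals, ["Dinner"], ["Wine"]))  # True:  Dinner -> Wine holds
print(fd_holds(meals, ["Dinner"], ["Fork"]))  # False: Meat maps to two forks
```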
Partial Dependency
If an attribute that is not a member of the primary key depends on only part of the
primary key (when we have a composite primary key), then that attribute is partially
functionally dependent on the primary key.
Let {A, B} be the primary key and C a non-key attribute. Then, if A → C or B → C,
C is partially functionally dependent on {A, B}.
Full Dependency
If an attribute that is not a member of the primary key depends not on part of
the primary key but on the whole key (when we have a composite primary key), then that
attribute is fully functionally dependent on the primary key.
Let {A, B} be the primary key and C a non-key attribute. Then, if {A, B} → C but
neither A → C nor B → C, C is fully functionally dependent on {A, B}.
Transitive Dependency
In mathematics and logic, a transitive relationship is a relationship of the following form: "If
A implies B, and if also B implies C, then A implies C."
Example:
If Mr X is a Human, and if every Human is an Animal, then Mr X must be an Animal.
Generalized way of describing transitive dependency is that:
If A functionally governs B, AND
If B functionally governs C
THEN A functionally governs C
Provided that neither C nor B determines A, i.e. B ↛ A and C ↛ A. In the usual
notation: A → B → C.
Steps of Normalization:
We have various levels or steps in normalization, called Normal Forms. The level of
complexity, the strength of the rules, and the amount of decomposition increase as we move
from a lower-level Normal Form to a higher one. Each normal form represents a stronger
condition than the previous one.
EMP_PROJ rearranged
EmpID ProjNo EmpName ProjName ProjLoc ProjFund ProjMangID Incentive
Business rule: whenever an employee participates in a project, he/she will be entitled to an
incentive.
This schema is in its 1NF since we don’t have any repeating groups or attributes with
multi-valued properties. To convert it to 2NF we need to remove all partial dependencies
of non-key attributes on part of the primary key.
{EmpID, ProjNo} → EmpName, ProjName, ProjLoc, ProjFund, ProjMangID, Incentive
But in addition to this we have the following dependencies
FD1: {EmpID} → EmpName
FD2: {ProjNo} → ProjName, ProjLoc, ProjFund, ProjMangID
FD3: {EmpID, ProjNo} → Incentive
As we can see, some non-key attributes are partially dependent on some part of the
primary key. This can be witnessed by analyzing the first two functional dependencies
(FD1 and FD2). Thus, each functional dependency, together with its dependent attributes,
should be moved to a new relation in which the determinant becomes the primary key.
EMPLOYEE
EmpID EmpName
PROJECT
ProjNo ProjName ProjLoc ProjFund ProjMangID
EMP_PROJ
EmpID ProjNo Incentive
The non-primary-key attributes that depend on each other will be moved to another table
and linked with the main table using a Candidate Key–Foreign Key relationship.
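The 2NF decomposition described above amounts to projecting the original relation onto each determinant together with its dependent attributes (one relation per FD). A minimal Python sketch; the sample row values are invented for illustration:

```python
def project(relation, attrs):
    """Project a relation (list of dicts) onto attrs, eliminating duplicates."""
    seen, out = set(), []
    for t in relation:
        row = tuple((a, t[a]) for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(row))
    return out

# Unnormalized EMP_PROJ rows (invented sample data).
emp_proj = [
    {"EmpID": 1, "ProjNo": 10, "EmpName": "Abebe", "ProjName": "Payroll",
     "ProjLoc": "Bonga", "ProjFund": 5000, "ProjMangID": 7, "Incentive": 200},
    {"EmpID": 1, "ProjNo": 11, "EmpName": "Abebe", "ProjName": "Billing",
     "ProjLoc": "Addis", "ProjFund": 8000, "ProjMangID": 9, "Incentive": 150},
]

# One relation per functional dependency; the determinant is the key of each.
employee     = project(emp_proj, ["EmpID", "EmpName"])                     # FD1
project_rel  = project(emp_proj, ["ProjNo", "ProjName", "ProjLoc",
                                  "ProjFund", "ProjMangID"])               # FD2
emp_proj_2nf = project(emp_proj, ["EmpID", "ProjNo", "Incentive"])         # FD3

print(employee)  # duplicates of the same employee collapse to one tuple
```

The projection keeping FD1 collapses the repeated (EmpID, EmpName) pair into a single tuple, which is exactly the redundancy 2NF removes.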
A mnemonic for remembering the rationale for normalization up to 3NF is that every
non-key attribute must depend on "the key, the whole key, and nothing but the key."
The correct solution, to put the model in fourth normal form (4NF), is to ensure that all M:M
relationships are resolved independently if they are indeed independent.
Pitfalls of Normalization
CHAPTER FIVE
Physical Database Design Methodology for Relational Database
We have established that there are three levels of database design:
• Conceptual: producing a data model which accounts for the relevant entities and
relationships within the target application domain;
• Logical: ensuring, via normalization procedures and the definition of integrity rules, that the
stored database will be non-redundant and properly connected;
• Physical: specifying how database records are stored, accessed and related to ensure
adequate performance.
We can consider the topic of physical database design from three aspects:
• What techniques for storing and finding data exist
• Which are implemented within a particular DBMS
• Which might be selected by the designer for a given application knowing the properties
of the data
Thus, the purpose of physical database design is to describe:
1. How to map the logical database design to a physical database design.
2. How to design base relations for target DBMS.
3. How to design enterprise constraints for target DBMS.
4. How to select appropriate file organizations based on analysis of transactions.
5. When to use secondary indexes to improve performance.
6. How to estimate the size of the database.
7. How to design user views.
8. How to design security mechanisms to satisfy user requirements.
Physical database design is the process of producing a description of the implementation of the
database on secondary storage.
Physical design describes the base relation, file organization, and indexes used to achieve
efficient access to the data, and any associated integrity constraints and security measures.
◼ Sources of information for the physical design process include global logical data model and
documentation that describes model.
◼ Logical database design is concerned with the what; physical database design is concerned
with the how.
◼ The process of producing a description of the implementation of the database on secondary
storage.
◼ Describes the storage structures and access methods used to achieve efficient access to the
data.
Knowledge of the target DBMS includes how to create base relations and whether the system
supports:
definition of Primary keys
definition of Foreign keys
definition of Alternate keys
definition of Domains
Referential integrity constraints
definition of enterprise-level constraints
Derived data: a way to represent any derived data present in the global logical data model
in the target DBMS should be devised.
Examine logical data model and data dictionary, and produce list of all derived attributes.
Most of the time derived attributes are not expressed in the logical model but will be
included in the data dictionary. Whether to store derived attributes in a base relation or
calculate them when required is a decision to be made by the designer considering the
performance impact.
Option selected is based on:
• Additional cost to store the derived data and keep it consistent with operational
data from which it is derived;
• Cost to calculate it each time it is required.
Less expensive option is chosen subject to performance constraints.
The representation of derived attributes should be fully documented.
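As an illustration of the calculate-when-required option, a derived attribute can be defined in a view so that it is computed at query time rather than stored in a base relation. A Python sqlite3 sketch; the table, column, and view names are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, salary REAL)")
con.execute("INSERT INTO staff VALUES (1, 1000), (2, 2000)")

# The derived attribute annual_salary is computed on each query, so it can
# never become inconsistent with the operational data it is derived from.
con.execute("CREATE VIEW staff_pay AS "
            "SELECT id, salary, salary * 12 AS annual_salary FROM staff")

print(con.execute("SELECT annual_salary FROM staff_pay").fetchall())
# [(12000.0,), (24000.0,)]
```

Storing the attribute instead would save the multiplication on every read but add the cost of keeping the stored copy consistent whenever salary changes, which is precisely the trade-off described above.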
Data security
CHAPTER SIX
RELATIONAL QUERY LANGUAGES
◼ Query languages: Allow manipulation and retrieval of data from a database.
◼ Query languages are not programming languages!
◼ QLs not intended to be used for complex calculations.
◼ QLs support easy, efficient access to large data sets.
◼ Relational model supports simple, powerful query languages.
Formal Relational Query Languages
◼ There are varieties of Query languages used by relational DBMS for manipulating
relations.
◼ Some of them are procedural
◼ User tells the system exactly what and how to manipulate the data
◼ Others are non-procedural
◼ User states what data is needed rather than how it is to be retrieved.
Two mathematical query languages form the basis for relational languages:
◼ Relational Algebra
◼ Relational Calculus
◼ We may describe the relational algebra as procedural language: it can be used to tell
the DBMS how to build a new relation from one or more relations in the database.
◼ We may describe relational calculus as a non-procedural language: it can be used to
formulate the definition of a relation in terms of one or more database relations.
◼ Formally the relational algebra and relational calculus are equivalent to each other. For
every expression in the algebra, there is an equivalent expression in the calculus.
◼ Both are non-user-friendly languages. They have been used as the basis for other, higher-
level data manipulation languages for relational databases.
A query is applied to relation instances, and the result of a query is also a relation instance.
◼ Schemas of input relations for a query are fixed
◼ The schema for the result of a given query is also fixed! Determined by definition
of query language constructs.
Relational Algebra
The basic set of operations for the relational model is known as the relational algebra. These
operations enable a user to specify basic retrieval requests.
The result of the retrieval is a new relation, which may have been formed from one or more
relations. The algebra operations thus produce new relations, which can be further manipulated
using operations of the same algebra.
A sequence of relational algebra operations forms a relational algebra expression, whose result
will also be a relation that represents the result of a database query (or retrieval request).
◼ Relational algebra is a theoretical language with operations that work on one or
more relations to define another relation without changing the original relation.
◼ The output from one operation can become the input to another operation (nesting
is possible)
◼ There are different basic operations that could be applied on relations on a
database based on the requirement.
◼ Selection ( σ ) Selects a subset of rows from a relation.
◼ Projection ( π ) Deletes unwanted columns from a relation.
◼ Renaming ( ρ ): assigning an intermediate relation name for a single operation.
◼ Cross-Product ( x ) Allows us to combine two relations.
◼ Set-Difference ( - ) Tuples in relation1, but not in relation2.
Table1:
Sample table used to illustrate different kinds of relational operations. The relation contains
information about employees, IT skills they have and the school where they attend each skill.
Employee
EmpID FName LName SkillID Skill SkillType School SchoolAdd SkillLevel
1. Selection
◼ Selects subset of tuples/rows in a relation that satisfy selection condition.
◼ Selection operation is a unary operator (it is applied to a single relation)
◼ The Selection operation is applied to each tuple individually
◼ The degree of the resulting relation is the same as the original relation but the cardinality
(no. of tuples) is less than or equal to the original relation.
◼ The Selection operator is commutative.
◼ Sets of conditions can be combined using the Boolean operations ∧ (AND), ∨ (OR), and
¬ (NOT)
◼ No duplicates in result!
◼ Schema of result identical to schema of (only) input relation.
◼ Result relation can be the input for another relational algebra operation! (Operator
composition.)
◼ It is a filter that keeps only those tuples that satisfy a qualifying condition
(Those satisfying the condition are selected while others are discarded.)
Notation: σ<selection condition>(R)
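The Selection operation behaves like a filter over the tuples of a relation. A minimal Python sketch (the relation is a list of dicts; the sample rows and names are invented, loosely following the Employee table above):

```python
def select(relation, predicate):
    """sigma_<predicate>(relation): keep only the tuples satisfying the predicate.

    The schema of the result is identical to that of the input relation;
    only the number of tuples can shrink.
    """
    return [t for t in relation if predicate(t)]

employees = [
    {"EmpID": 12, "FName": "Abebe", "Skill": "SQL", "SkillLevel": 5},
    {"EmpID": 16, "FName": "Lemma", "Skill": "C++", "SkillLevel": 6},
]

# sigma_{SkillLevel >= 6}(Employee)
experts = select(employees, lambda t: t["SkillLevel"] >= 6)
print([t["FName"] for t in experts])  # ['Lemma']
```

Because the result is itself a relation of the same schema, it can be fed straight into another operation (operator composition), e.g. a further selection or a projection.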
2. Projection
◼ Selects certain attributes while discarding the others from the base relation.
◼ The PROJECT operation creates a vertical partitioning: one part with the needed columns
(attributes) containing the results of the operation, the other containing the discarded columns.
◼ Deletes attributes that are not in projection list.
◼ Schema of result contains exactly the fields in the projection list, with the same names
that they had in the (only) input relation.
◼ Projection operator has to eliminate duplicates!
◼ Note: real systems typically don’t do duplicate elimination unless the user explicitly asks
for it.
◼ If the Primary Key is in the projection list, then duplication will not occur
◼ Duplicate removal is necessary to ensure that the resulting table is also a relation.
Notation: π<attribute list>(R)
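Projection, including the duplicate elimination the definition requires, can be sketched in a few lines of Python (relation as a list of dicts; sample rows invented):

```python
def project(relation, attrs):
    """pi_<attrs>(relation): keep only attrs, eliminating duplicate tuples
    so that the result is still a relation (a set of tuples)."""
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attrs, row)))
    return result

employees = [
    {"EmpID": 12, "FName": "Abebe", "Skill": "SQL"},
    {"EmpID": 16, "FName": "Lemma", "Skill": "SQL"},
]

print(project(employees, ["Skill"]))  # [{'Skill': 'SQL'}] - duplicate removed
```

If the projection list contained the primary key (EmpID here), no duplicates could arise, which is why real systems can skip the elimination step in that case.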
3. Rename Operation
We may want to apply several relational algebra operations one after the other. The query
could be written in two different forms:
1. Write the operations as a single relational algebra expression by nesting the
operations.
2. Apply one operation at a time and create intermediate result relations. In the
latter case, we must give names to the relations that hold the intermediate
results Rename Operation
If we want to have the Name, Skill, and Skill Level of an employee with salary greater than 1500
and working for department 5, we can write the expression for this query using the two
alternatives:
Then the Result will be equivalent to the relation we get using the first
alternative.
4. Set Operations
The three main set operations are Union, Intersection, and Set Difference. The properties of
these set operations are similar to those in mathematical set theory. The difference is that, in
the database context, the elements of each set (each set being a relation in the database) are
tuples. The set operations are binary operations which require the two operand relations to
be type compatible.
Type Compatibility
Two relations R1 and R2 are said to be Type Compatible if:
1. The operand relations R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) have the same number
of attributes, and
2. The domains of corresponding attributes must be compatible; that is,
Dom(Ai)=Dom(Bi) for i=1, 2, ..., n.
To illustrate the three set operations, we will make use of the following two tables:
Employee
EmpID FName LName SkillID Skill SkillType School SkillLevel
12 Abebe Mekuria 2 SQL Database AAU 5
a. UNION Operation
The result of this operation, denoted by R ∪ S, is a relation that includes all tuples
that are either in R, in S, or in both R and S. Duplicate tuples are eliminated.
The two operands must be "type compatible"
Employees who attend Database in any School or who attend any course at AAU
b. INTERSECTION Operation
The result of this operation, denoted by R ∩ S, is a relation that includes all tuples
that are in both R and S. The two operands must be "type compatible".
Employees who attend Database in any School and who also attend a course at AAU
The resulting relation for R1 ∪ R2, R1 ∩ R2, or R1 − R2 has the same attribute names as
the first operand relation R1 (by convention).
Some Properties of the Set Operators
Notice that both union and intersection are commutative operations; that is,
R ∪ S = S ∪ R, and R ∩ S = S ∩ R
Both are also associative; that is,
R ∪ (S ∪ T) = (R ∪ S) ∪ T, and (R ∩ S) ∩ T = R ∩ (S ∩ T)
The minus operation is not commutative; that is, in general,
R − S ≠ S − R
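Because type-compatible relations are simply sets of tuples over the same schema, Python's built-in set operators model the three operations directly (the tuple values below are invented):

```python
# Two type-compatible relations over the schema (EmpID, FName),
# represented as sets of tuples.
r1 = {(12, "Abebe"), (16, "Lemma")}
r2 = {(12, "Abebe"), (22, "Chala")}

union        = r1 | r2   # R1 U R2: tuples in either relation, duplicates eliminated
intersection = r1 & r2   # R1 n R2: tuples in both relations
difference   = r1 - r2   # R1 - R2: tuples in R1 but not in R2

# The properties stated above hold for the set representation:
assert r1 | r2 == r2 | r1        # union is commutative
assert r1 & r2 == r2 & r1        # intersection is commutative
assert r1 - r2 != r2 - r1        # minus is not commutative (here)
```

Using a set rather than a list makes duplicate elimination automatic, mirroring the definition of UNION.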
5. CARTESIAN (cross product) Operation
This operation is used to combine tuples from two relations in a combinatorial fashion. That
means every tuple in Relation1 (R) will be combined with every tuple in Relation2 (S).
• In general, the result of R(A1, A2, . . ., An) x S(B1,B2, . . ., Bm) is a relation Q with degree
n + m attributes Q(A1, A2, . . ., An, B1, B2, . . ., Bm), in that order.
• Where R has n attributes and S has m attributes.
• The resulting relation Q has one tuple for each combination of tuples— one from R and
one from S.
• Hence, if R has n tuples and S has m tuples, then |R x S| will have n * m tuples.
Example:
Employee
ID FName LName
123 Abebe Lemma
Dept
DeptID DeptName MangID
2 Finance 567
3 Personnel 123
Then the Cartesian product between Employee and Dept relations will be of the form:
Employee X Dept:
ID FName LName DeptID DeptName MangID
123 Abebe Lemma 2 Finance 567
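The Employee × Dept example can be reproduced with a nested iteration over the two relations, giving one concatenated tuple per combination:

```python
from itertools import product

# The Employee and Dept relations from the example above, as lists of tuples.
employee = [(123, "Abebe", "Lemma")]
dept = [(2, "Finance", 567), (3, "Personnel", 123)]

# Cartesian product: every Employee tuple paired with every Dept tuple,
# each result tuple having degree 3 + 3 = 6.
cross = [e + d for e, d in product(employee, dept)]

print(len(cross))  # 1 * 2 = 2 tuples
print(cross[0])    # (123, 'Abebe', 'Lemma', 2, 'Finance', 567)
```

Note how the result pairs Abebe with both departments regardless of any relationship between them; making the combination meaningful is the job of the JOIN operation described next in the text.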
6. JOIN Operation
The sequence of a Cartesian product followed by a selection is used so commonly to identify
and select related tuples from two relations that a special operation, called JOIN, was created.
Thus, in the JOIN operation, the Cartesian product and Selection operations are used together.
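The JOIN-as-Cartesian-product-plus-selection idea can be sketched directly, reusing the Employee and Dept tuples from the Cartesian product example (the join condition chosen here, employee ID equals the department's manager ID, is just an illustration):

```python
def theta_join(r, s, condition):
    """JOIN = a selection applied to the Cartesian product of R and S:
    keep only the concatenated tuples that satisfy the join condition."""
    return [rt + st for rt in r for st in s if condition(rt, st)]

employee = [(123, "Abebe", "Lemma")]                      # (ID, FName, LName)
dept = [(2, "Finance", 567), (3, "Personnel", 123)]       # (DeptID, DeptName, MangID)

# Join condition: the employee's ID equals the department's manager ID.
managers = theta_join(employee, dept, lambda e, d: e[0] == d[2])

print(managers)  # [(123, 'Abebe', 'Lemma', 3, 'Personnel', 123)]
```

Of the two tuples the Cartesian product would produce, only the one satisfying the condition survives, which is exactly the σ-over-× definition of JOIN.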
d. SEMIJOIN Operation
SEMIJOIN is another version of the JOIN operation in which the resulting relation contains
only the attributes of one of the relations that are related with tuples in the other relation.
The following notation depicts the inclusion of only those attributes from the first relation
(R) in the result which actually participate in the relationship:
R ⋉<Join Condition> S
Relational Calculus
A relational calculus expression creates a new relation, which is specified in terms of
variables that range over rows of the stored database relations (in tuple calculus) or over
columns of the stored relations (in domain calculus).
In a calculus expression, there is no order of operations to specify how to retrieve the
query result; a calculus expression specifies only what information the result should contain.
In Relational calculus, there is no description of how to evaluate a query; this is the main
distinguishing feature between relational algebra and relational calculus.
Relational calculus is considered to be a nonprocedural language. This differs from
relational algebra, where we must write a sequence of operations to specify a retrieval
request; hence relational algebra can be considered as a procedural way of stating a query.
When applied to relational databases, the calculus is not the differential and integral calculus
of mathematics, but a form of first-order logic or predicate calculus, in which a predicate is a
truth-valued function with arguments.
When we substitute values for the arguments in the predicate, the function yields an
expression, called a proposition, which can be either true or false.
A variable x ranges over some set of values. When we substitute some values of this range
for x, the proposition may be true; for other values, it may be false.
If COND is a predicate, then the set of all tuples evaluated to be true for the predicate
COND will be expressed as follows:
{t | COND(t)}
Where t is a tuple variable and COND (t) is a conditional expression involving t. The
result of such a query is the set of all tuples t that satisfy
COND (t).
If we have a set of predicates to evaluate for a single query, the predicates can be combined
using the Boolean operators.
Tuple-oriented Relational Calculus
➢ The tuple relational calculus is based on specifying a number of tuple variables. Each
tuple variable usually ranges over a particular database relation, meaning that the variable
may take as its value any individual tuple from that relation.
➢ Tuple relational calculus is interested in finding tuples for which a predicate is true for a
relation. Based on use of tuple variables.
➢ Tuple variable is a variable that ‘ranges over’ a named relation: that is, a variable whose
only permitted values are tuples of the relation.
➢ If E is a tuple variable that ranges over the relation EMPLOYEE, then this is
represented as EMPLOYEE(E), i.e. the range of E is EMPLOYEE.
➢ Then, to extract all tuples that satisfy a certain condition, we represent this as: all
tuples E such that COND(E) evaluates to true.
{E | COND(E)}
The predicates can be connected using the Boolean operators:
∧ (AND), ∨ (OR), ¬ (NOT)
COND(t) is a formula, and is called a Well-Formed-Formula (WFF) if:
➢ Where COND is composed of n-ary predicates (a formula composed of
n single predicates) and the predicates are connected by any of the Boolean
operators.
➢ And each predicate is of the form A θ B, where θ is one of the comparison
operators { <, ≤, >, ≥, ≠, = }, which can be evaluated to either true or
false, and A and B are either constants or variables.
➢ Formulae should be unambiguous and should make sense.
Department of Computer Science Fundamentals of Database Systems (CoSc2041)
Bonga University College of Engineering & Technology_________
➢ To find only the EmpId, FName, LName, Skill and the School where the skill is
attended, for employees with skill level greater than or equal to 8, the tuple-based
relational calculus expression will be:
{E.EmpId, E.FName, E.LName, E.Skill, E.School | Employee(E) ∧ E.SkillLevel >= 8}
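A tuple-calculus expression of this shape maps naturally onto a comprehension: the range declaration Employee(E) becomes the iteration, COND(E) becomes the filter, and the listed attributes form each result tuple. A Python sketch with invented sample rows:

```python
# The Employee relation as a list of dicts (sample rows are invented).
employees = [
    {"EmpId": 12, "FName": "Abebe", "LName": "Mekuria",
     "Skill": "SQL", "School": "AAU", "SkillLevel": 5},
    {"EmpId": 16, "FName": "Lemma", "LName": "Alemu",
     "Skill": "IP", "School": "Jimma", "SkillLevel": 9},
]

# {E.EmpId, E.FName, E.LName, E.Skill, E.School | Employee(E) and E.SkillLevel >= 8}
result = [
    (E["EmpId"], E["FName"], E["LName"], E["Skill"], E["School"])
    for E in employees          # Employee(E): E ranges over the Employee relation
    if E["SkillLevel"] >= 8     # COND(E)
]

print(result)  # [(16, 'Lemma', 'Alemu', 'IP', 'Jimma')]
```

Notice that the expression says only what the result must contain; how the tuples are found is left to the evaluator, which is the nonprocedural character of the calculus described above.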
CHAPTER SEVEN
ADVANCED CONCEPTS IN DATABASE SYSTEMS
Database Security and Integrity
Distributed Database Systems
Data warehousing
1. Database Security and Integrity
A database represents an essential corporate resource that should be properly secured using
appropriate controls.
Database security encompasses hardware, software, people and data
Multi-user database system - DBMS must provide a database security and authorization
subsystem to enforce limits on individual and group access rights and privileges.
Database security and integrity are about protecting the database from becoming inconsistent
and from being disrupted; such disruption is referred to as database misuse.
Database misuse could be Intentional or accidental, where accidental misuse is easier to cope
with than intentional misuse. Accidental inconsistency could occur due to:
➢ System crash during transaction processing
➢ Anomalies due to concurrent access
➢ Anomalies due to redundancy
➢ Logical errors
Likewise, even though there are various threats that could be categorized in this group,
intentional misuse could be:
➢ Unauthorized reading of data
➢ Unauthorized modification of data or
➢ Unauthorized destruction of data
Most systems implement good Database Integrity to protect the system from accidental misuse
while there are many computer based measures to protect the system from intentional misuse,
which is termed as Database Security measures.
Database security is considered in relation to the following situations:
➢ Theft and fraud
• Physical control
• Policy issues regarding privacy of individual level at enterprise and national level
• Operational consideration on the techniques used (password, etc)
• System-level security, including operating system and hardware control
• Security levels and security policies at the enterprise level
• Database security - the mechanisms that protect the database against intentional
or accidental threats. And Database security encompasses hardware, software,
people and data
• Threat – any situation or event, whether intentional or accidental, that may
adversely affect a system and consequently the organization
• A threat may be caused by a situation or event involving a person, action, or
circumstance that is likely to bring harm to an organization
• The harm to an organization may be tangible or intangible
Tangible – loss of hardware, software, or data
Intangible – loss of credibility or client confidence
Examples of threats:
✓ Using another person’s means of access
✓ Unauthorized amendment/modification or copying of data
✓ Program alteration
✓ Inadequate policies and procedures that allow a mix of confidential and normal
output
✓ Wire-tapping
4. Database System: concerned with data-access limits enforced by the database system,
such as passwords, isolated transactions, etc.
Even though we can have different levels of security and authorization on data objects
and users, who accesses which data is a policy matter rather than a technical one.
These policies
➢ should be known by the system: should be encoded in the system
➢ should be remembered: should be saved somewhere (the catalogue)
• An organization needs to identify the types of threat it may be subjected to and initiate
appropriate plans and countermeasures, bearing in mind the costs of implementing
them
Authorization
▪ The granting of a right or privilege that enables a subject to have legitimate access
to a system or a system’s object
▪ Authorization controls can be built into the software, and govern not only what
system or object a specified user can access, but also what the user may do with
it
▪ Authorization controls are sometimes referred to as access controls
▪ The process of authorization involves authentication of subjects (i.e. a user or
program) requesting access to objects (i.e. a database table, view, procedure,
trigger, or any other object that can be created within the system)
Views
▪ A view is the dynamic result of one or more relational operations on the base
relations to produce another relation
▪ A view is a virtual relation that does not actually exist in the database, but is
produced upon request by a particular user
▪ The view mechanism provides a powerful and flexible security mechanism by
hiding parts of the database from certain users
▪ Using a view is more restrictive than simply having certain privileges granted to
a user on the base relation(s)
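As an illustration of views as a security mechanism, a view can expose only a subset of a base relation's columns, hiding the rest from users who query it. A Python sqlite3 sketch; the table, column, and view names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (id INTEGER, name TEXT, salary REAL)")
con.execute("INSERT INTO employee VALUES (1, 'Abebe', 9000)")

# The view hides the salary column; users querying the view never see it.
con.execute("CREATE VIEW employee_public AS SELECT id, name FROM employee")

cur = con.execute("SELECT * FROM employee_public")
cols = [d[0] for d in cur.description]
print(cols)  # ['id', 'name'] - salary is not exposed
```

A WHERE clause in the view definition would hide rows in the same way, so a single mechanism restricts both the columns and the tuples a given user can reach.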
Integrity
▪ Integrity constraints contribute to maintaining a secure database system by
preventing data from becoming invalid and hence giving misleading or incorrect
results
▪ Domain Integrity
▪ Entity integrity
▪ Referential integrity
▪ Key constraints
Encryption
▪ The encoding of the data by a special algorithm that renders the data
unreadable by any program without the decryption key
▪ If a database system holds particularly sensitive data, it may be deemed
necessary to encode it as a precaution against possible external threats or
attempts to access it
▪ The DBMS can access data after decoding it, although there is a
degradation in performance because of the time taken to decode it
▪ Encryption also protects data transmitted over communication lines
▪ To transmit data securely over insecure networks requires the use of a
cryptosystem, which includes an encryption key, an encryption algorithm, a
decryption key, and a decryption algorithm
Authentication
➢ All users of the database will have different access levels and permission for
different data objects, and authentication is the process of checking whether the
user is the one with the privilege for the access level.
➢ Authentication is the process of checking that users are who they say they are.
➢ Each user is given a unique identifier, which is used by the operating system to
determine who they are
➢ Thus, the system will check whether the user with a specific username and
password is trying to use the resource.
➢ Associated with each identifier is a password, chosen by the user and known to
the operating system, which must be supplied to enable the operating system to
authenticate who the user claims to be
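In practice the username/password check described above is implemented by storing a salted hash of the password rather than the password itself, and comparing hashes at login. A minimal Python sketch using the standard library, not tied to any particular DBMS or operating system:

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Derive a salted hash of the password; the salt defeats precomputed tables."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def authenticate(password, salt, stored_digest):
    """Recompute the hash and compare in constant time against the stored one."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

salt, stored = hash_password("s3cret")
print(authenticate("s3cret", salt, stored))  # True
print(authenticate("wrong", salt, stored))   # False
```

Only the salt and digest need to be stored by the system, so even if the credential table leaks, the original passwords are not directly revealed.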
Any database access request will have the following three major components:
1. Requested Operation: what type of operation (retrieve, update, insert, delete) is sought
to be performed?
2. Requested Object: on which resource or data of the database is the operation sought to be
applied?
3. Requesting User: who is the user requesting the operation on the specified object?
The database should be able to check for all the three components before processing any
request. The checking is performed by the security subsystem of the DBMS.
◼ Replication: System maintains multiple copies of data, stored in different sites, for
faster retrieval and fault tolerance.
Advantages of DDBMS
1. Data sharing and distributed control:
➢ A user at one site may be able to access data that is available at another site.
➢ Each site can retain some degree of control over local data
➢ We will have local as well as global database administrator
2. Reliability and availability of data
➢ If one site fails, the rest can continue operation as long as a transaction does not demand
data from the failed site, or the data it needs is replicated at other sites
3. Speedup of query processing
➢ If a query involves data from several sites, it may be possible to split the query into
sub-queries that can be executed in parallel at several sites (parallel processing)
Disadvantages of DDBMS
1. Software development cost
3. Data warehousing
Data warehouse is an integrated, subject-oriented, time-variant, nonvolatile
database that provides support for decision making.
✓ Integrated: a centralized, consolidated database that integrates data derived from
the entire organization.
➢ Consolidates data from multiple and diverse sources with diverse formats.
➢ Helps managers to better understand the company’s operations.
✓ Subject-Oriented: the data warehouse contains data organized by topics, e.g.
sales, marketing, finance, etc.
✓ Time-Variant: the data warehouse contains data that reflect what happened last
week, last month, the past five years, and so on.
✓ Non-Volatile: once data enter the data warehouse, they are never removed,
because the data in the warehouse represent the company’s entire history.
◼ The data found in data warehouse is analyzed to discover previously unknown data
characteristics, relationships, dependencies, or trends.