DB Lecture 4-7
Chapter Four
The purpose of normalization is to find the suitable set of relations that supports
the data requirements of an enterprise.
A suitable set of relations has the following characteristics;
Itec 222 1
Database Systems
The first step before applying the rules of the relational data model is converting the
conceptual design into a form suitable for the relational logical model, that is, into
tables.
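[Figure: ER diagram of the entities Employee (EID, Name composed of FName and LName, Salary, Tel), Department (DID, DName, DLoc) and Project (PID, PName, PFund), connected by the relationships Manages, WorksFor, Leads and Participate; StartDate, EndDate and PBonus appear as relationship attributes.]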
After we have drawn the ER diagram, the next step is to map it into a
relational schema so that the rules of the relational data model can be tested for each
relation. The mapping is done for the entities first, followed by the
relationships, based on the mapping rules. The mapping has been done as follows.
side (Employee table). This will require adding the PK of Department (DID)
to the Employee table as a foreign key. We can give the foreign key another
name, EDID, to mean "Employee's Department id". This will increase
the degree of the Employee table by one.
Employee
EID FName LName Salary EDID
At the end of the mapping we will have the following relational schema (tables)
for the logical database design phase.
Department
DID DName DLoc MEID
Project
PID PName PFund
Telephone
EID Tel
Employee
EID FName LName Salary EDID
Emp_Partc_Project
EID PID
Emp_Lead_Project
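The mapped tables can be written down as table definitions. The following is a minimal sketch using Python's built-in sqlite3 module. The column types, the choice of primary keys for Telephone and Emp_Partc_Project, and the in-memory database are assumptions made for illustration; the notes give only table and attribute names. Emp_Lead_Project is left out because its attributes are not listed above.

```python
import sqlite3

# Assumption: types and composite keys are illustrative, not taken from the notes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Department (
    DID   INTEGER PRIMARY KEY,
    DName TEXT,
    DLoc  TEXT,
    MEID  INTEGER              -- manager's EID, kept as a plain column here
);
CREATE TABLE Employee (
    EID    INTEGER PRIMARY KEY,
    FName  TEXT,
    LName  TEXT,
    Salary REAL,
    EDID   INTEGER REFERENCES Department(DID)   -- Employee's Department id
);
CREATE TABLE Project (
    PID   INTEGER PRIMARY KEY,
    PName TEXT,
    PFund REAL
);
CREATE TABLE Telephone (
    EID INTEGER REFERENCES Employee(EID),
    Tel TEXT,
    PRIMARY KEY (EID, Tel)
);
CREATE TABLE Emp_Partc_Project (
    EID INTEGER REFERENCES Employee(EID),
    PID INTEGER REFERENCES Project(PID),
    PRIMARY KEY (EID, PID)
);
""")

# The degree of Employee is the number of its attributes, including the FK.
employee_columns = [row[1] for row in conn.execute("PRAGMA table_info(Employee)")]
print(employee_columns, len(employee_columns))
```

The foreign key EDID appears as an ordinary column of Employee, which is why the degree of the table grows when the relationship is mapped.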
After converting the ER diagram into table form, the next phase is applying
the process of normalization, which is a collection of rules each table should
satisfy.
Normalization
A relational database is merely a collection of data, organized in a particular
manner. As the father of the relational database approach, Codd created a series
of rules (tests) called normal forms that help define that organization.
One of the best ways to determine what information should be stored in a database
is to clarify what questions will be asked of it and what data would be included in
the answers.
Normalization may reduce system performance since data will be cross-referenced
from many tables. Thus denormalization is sometimes used to improve
performance, at the cost of reduced consistency guarantees.
All the normalization rules eventually remove the update anomalies that may
exist during data manipulation after implementation. Update anomalies are the
problems that can occur in an insufficiently normalized table, and they include:
1. Insertion Anomalies
2. Deletion Anomalies
3. Modification Anomalies
Deletion Anomalies:
If the employee with ID 16 is deleted, then every piece of information about the skill C++ and
its skill type is deleted from the database. We will then not have any
information about C++ and its skill type.
Insertion Anomalies:
What if we have a new employee with a skill called Pascal? We cannot
decide whether Pascal is allowed as a value for skill, and we have no clue
about the type of skill that Pascal should be categorized as.
Modification Anomalies:
What if the address for Helico is changed from Piazza to Mexico? We need
to look for every occurrence of Helico and change the value of School_Add
from Piazza to Mexico, which is prone to error.
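The deletion and modification anomalies can be reproduced on a few rows. This is a minimal sketch in Python; the three rows are copied from the sample Employee table (Table 1) used in the relational algebra chapter of these notes, cut down to the relevant attributes, and the helper name known_skills is hypothetical.

```python
# Three rows of the unnormalized Employee table (subset of attributes).
employee = [
    {"EmpID": 16, "Skill": "C++", "SkillType": "Programming",
     "School": "Unity", "SchoolAdd": "Gerji"},
    {"EmpID": 25, "Skill": "VB6", "SkillType": "Programming",
     "School": "Helico", "SchoolAdd": "Piazza"},
    {"EmpID": 65, "Skill": "SQL", "SkillType": "Database",
     "School": "Helico", "SchoolAdd": "Piazza"},
]

def known_skills(rows):
    """Skills the database still knows about (hypothetical helper)."""
    return {r["Skill"] for r in rows}

# Deletion anomaly: removing employee 16 also removes all knowledge of C++.
after_delete = [r for r in employee if r["EmpID"] != 16]
cpp_lost = "C++" not in known_skills(after_delete)

# Modification anomaly: one address change touches every Helico row.
rows_to_update = [r for r in employee if r["School"] == "Helico"]
for r in rows_to_update:
    r["SchoolAdd"] = "Mexico"

print(cpp_lost, len(rows_to_update))
```

Missing any one of the Helico rows during the update would leave the table with two different addresses for the same school.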
A database management system can work only with the information that we put
explicitly into its tables for a given database and into its rules for working with those
tables, where such rules are appropriate and possible.
Data Dependency
The logical associations between data items that point the database designer in the
direction of a good database design are referred to as determinant or dependent
relationships.
The essence of this idea is that if the existence of something, call it A, implies that
B must exist and have a certain value, then we say that "B is functionally
dependent on A." We also often express this idea by saying that "A functionally
determines B," or that "B is a function of A," or that "A functionally governs B."
Often, the notions of functionality and functional dependency are expressed
briefly by the statement, "If A, then B." It is important to note that the value of B
must be unique for a given value of A, i.e., any given value of A must imply just
one and only one value of B, in order for the relationship to qualify for the name
"function." (However, this does not necessarily prevent different values of A from
implying the same value of B.)
However, for the purpose of normalization, we are interested in finding 1..1 (one
to one) dependencies, lasting for all times (intension rather than extension of the
database), and the determinant having the minimal number of attributes.
X → Y holds if, whenever two tuples have the same value for X, they must have the
same value for Y.
FDs are derived from the real-world constraints on the attributes, and they are
properties of the database intension, not of its extension.
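The tuple-based definition of X → Y translates directly into a check over a table instance. The sketch below is in Python; the function name fd_holds is hypothetical, and the three rows are taken from the sample Employee table (Table 1) later in these notes. Since an FD is a property of the intension, such a check can only refute a proposed dependency on the given data; it cannot prove that the dependency holds for all time.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # setdefault stores y the first time x is seen, then returns the stored value.
        if seen.setdefault(x, y) != y:
            return False
    return True

rows = [
    {"Skill": "SQL", "SkillType": "Database", "School": "AAU", "SchoolAdd": "Sidist_Kilo"},
    {"Skill": "C++", "SkillType": "Programming", "School": "Unity", "SchoolAdd": "Gerji"},
    {"Skill": "Oracle", "SkillType": "Database", "School": "Unity", "SchoolAdd": "Gerji"},
]

school_determines_address = fd_holds(rows, ["School"], ["SchoolAdd"])
type_determines_skill = fd_holds(rows, ["SkillType"], ["Skill"])
print(school_determines_address, type_determines_skill)
```

Here School → SchoolAdd is not contradicted by the sample rows, while SkillType → Skill is refuted because Database appears with both SQL and Oracle.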
Example
Dinner Course    Type of Wine
Meat             Red
Fish             White
Cheese           Rose
Since the type of Wine served depends on the type of Dinner, we say Wine is
functionally dependent on Dinner.
Dinner → Wine
Since both Wine type and Fork type are determined by the Dinner type, we say
Wine is functionally dependent on Dinner and Fork is functionally dependent on
Dinner.
Dinner → Wine
Dinner → Fork
Partial Dependency
If an attribute which is not a member of the primary key is dependent on some
part of the primary key (if we have composite primary key) then that attribute is
partially functionally dependent on the primary key.
Transitive Dependency
In mathematics and logic, a transitive relationship is a relationship of the following
form: "If A implies B, and if also B implies C, then A implies C."
Example:
If Mr X is a Human, and if every Human is an Animal, then Mr X must be an Animal.
Steps of Normalization:
We have various levels or steps in normalization called Normal Forms. The level
of complexity, the strength of the rule and the degree of decomposition increase as we move from
a lower Normal Form to a higher one.
Each normal form listed below represents a stronger condition than the previous one.
UnNormalized Form(UNF):
Identify all data elements
First Normal Form(1NF):
Find the key with which you can find all data i.e. remove any repeating group
Second Normal Form(2NF):
Remove part-key dependencies (partial dependency). Make all data dependent on the
whole key.
Third Normal Form(3NF)
Remove non-key dependencies (transitive dependencies). Make all data dependent on
nothing but the key.
For most practical purposes, databases are considered normalized if they adhere
to the third normal form (there is no transitive dependency).
Remove all repeating groups. Distribute the multi-valued attributes into different
rows and identify a unique identifier for the relation, so that it can be said to be a
relation in a relational database. Flatten the table.
EMP_PROJ rearranged
EmpID ProjNo EmpName ProjName ProjLoc ProjFund ProjMangID Incentive
This schema is in its 1NF since we don’t have any repeating groups or attributes
with multi-valued property. To convert it to a 2NF we need to remove all partial
dependencies of non key attributes on part of the primary key.
FD1: {EmpID} → EmpName
FD2: {ProjNo} → ProjName, ProjLoc, ProjFund, ProjMangID
FD3: {EmpID, ProjNo} → Incentive
As we can see, some non-key attributes are partially dependent on some part of
the primary key. This can be seen by analyzing the first two functional
dependencies (FD1 and FD2). Thus, each of these functional dependencies, with its
dependent attributes, should be moved to a new relation where the determinant
will be the primary key.
EMPLOYEE
EmpID EmpName
PROJECT
ProjNo ProjName ProjLoc ProjFund ProjMangID
EMP_PROJ
EmpID ProjNo Incentive
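The reasoning behind this decomposition can be checked mechanically with an attribute closure. The sketch below is in Python; it encodes FD1 to FD3 of the EMP_PROJ example, and the names closure, FDS and KEY are hypothetical. It confirms that {EmpID, ProjNo} determines every attribute and lists the dependencies whose determinant is only part of that key.

```python
def closure(attrs, fds):
    """All attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

ALL_ATTRS = {"EmpID", "ProjNo", "EmpName", "ProjName",
             "ProjLoc", "ProjFund", "ProjMangID", "Incentive"}
FDS = [
    (frozenset({"EmpID"}), {"EmpName"}),                                       # FD1
    (frozenset({"ProjNo"}), {"ProjName", "ProjLoc", "ProjFund", "ProjMangID"}),  # FD2
    (frozenset({"EmpID", "ProjNo"}), {"Incentive"}),                           # FD3
]
KEY = {"EmpID", "ProjNo"}

is_key = closure(KEY, FDS) == ALL_ATTRS
# A partial dependency has a determinant that is a proper subset of the key.
partial = [(set(lhs), rhs) for lhs, rhs in FDS if lhs < KEY]

print(is_key)
for lhs, rhs in partial:
    print(sorted(lhs), "->", sorted(rhs))
```

Each partial dependency found this way corresponds to one of the new relations (EMPLOYEE and PROJECT), while FD3 stays with the full key in EMP_PROJ.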
This schema is in its 2NF since the primary key is a single attribute and
there are no repeating groups (multi valued attributes).
Let’s take StudID, Year and Dormitary and see the dependencies.
And Year cannot determine StudID, and Dormitary cannot determine
StudID. Then, transitively, StudID → Dormitary.
STUDENT
StudID  Stud_F_Name  Stud_L_Name  Dept     Year
125/97  Abebe        Mekuria      Info Sc  1
654/95  Lemma        Alemu        Geog     3
842/95  Chane        Kebede       CompSc   3
165/97  Alem         Kebede       InfoSc   1
985/95  Almaz        Belay        Geog     3
DORM
Year  Dormitary
1     401
3     403
BCNF is based on functional dependencies that take into account all the candidate
keys in a relation.
So, a table is in BCNF if it is in 3NF and if every determinant is a candidate key.
Violation of BCNF is very rare. The potential sources for violation of this rule are:
1. The relation contains two (or more) composite candidate keys
2. The candidate keys overlap, i.e. have a common attribute.
The correct solution, to cause the model to be in 4th normal form, is to ensure that all
M:M relationships are resolved independently if they are indeed independent, as
shown below.
A------>>B
A------->>C
Pitfalls of Normalization
Chapter Five
Conceptual design: producing a data model which accounts for the relevant
entities and relationships within the target application domain;
Logical design: ensuring, via normalization procedures and the definition
of integrity rules, that the stored database will be non-redundant and
properly connected;
Physical design: specifying how database records are stored, accessed and
related to ensure adequate performance.
We can consider the topic of physical database design from three aspects:
What techniques for storing and finding data exist
Which are implemented within a particular DBMS
Which might be selected by the designer for a given application knowing
the properties of the data
Examine the logical data model and the data dictionary, and produce a list of all
derived attributes. Most of the time derived attributes are not expressed in
the logical model but will be included in the data dictionary. Whether to store
derived attributes in a base relation or calculate them when required is a decision
to be made by the designer considering the performance impact.
The option selected is based on:
The additional cost to store the derived data and keep it consistent with
the operational data from which it is derived;
The cost to calculate it each time it is required.
The less expensive option is chosen subject to performance constraints.
The representation of derived attributes should be fully documented.
All the enterprise level constraints and the definition method in the target
DBMS should be fully documented.
This includes:
Adding an index record to every secondary index whenever tuple is
inserted;
Updating a secondary index when corresponding tuple is updated;
Increase in disk space needed to store the secondary index;
Possible performance degradation during query optimization to
consider all secondary indexes.
Guidelines for Choosing Indexes
(1) Do not index small relations.
(2) Index PK of a relation if it is not a key of the file organization.
(3) Add secondary index to a FK if it is frequently accessed.
(4) Add secondary index to any attribute that is heavily used as a
secondary key.
(5) Add secondary index on attributes that are involved in: selection or
join criteria; ORDER BY; GROUP BY; and other operations
involving sorting (such as UNION or DISTINCT).
(6) Add secondary index on attributes involved in built-in functions.
(7) Add secondary index on attributes that could result in an index-
only plan.
(8) Avoid indexing an attribute or relation that is frequently updated.
(9) Avoid indexing an attribute if the query will retrieve a significant
proportion of the tuples in the relation.
(10) Avoid indexing attributes that consist of long character strings.
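Guideline (3) can be tried out on a small scale. The following is a minimal sketch using Python's built-in sqlite3 module; the table contents, the index name idx_employee_edid and the choice of SQLite are assumptions for illustration, and other DBMSs expose their query plans differently. It adds a secondary index on the foreign key EDID and asks SQLite how it would run a lookup on that column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EID INTEGER PRIMARY KEY, FName TEXT, EDID INTEGER)")
# Hypothetical data: 1000 employees spread over 10 departments.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(i, f"emp{i}", i % 10) for i in range(1000)],
)

# Secondary index on the foreign key that the query filters on.
conn.execute("CREATE INDEX idx_employee_edid ON Employee(EDID)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT FName FROM Employee WHERE EDID = ?", (5,)
).fetchall()
# The last column of each plan row is the human-readable description.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The printed plan should mention the index by name. Every insert into Employee now also has to maintain that index, which is the overhead listed before the guidelines.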
Chapter Six
Relational Query Languages
In addition to the structural component of any data model, the manipulation
mechanism is equally important. This component of any data model is called the
“query language”.
Two mathematical Query Languages form the basis for Relational Query
Languages
Relational Algebra:
Relational Calculus:
A query is applied to relation instances, and the result of a query is also a relation
instance.
Schemas of input relations for a query are fixed
The schema for the result of a given query is also fixed! Determined
by definition of query language constructs.
Relational Algebra
The basic set of operations for the relational model is known as the relational
algebra. These operations enable a user to specify basic retrieval requests.
The result of the retrieval is a new relation, which may have been formed from one
or more relations. The algebra operations thus produce new relations, which can
be further manipulated using operations of the same algebra.
Table1:
Sample table used to illustrate different kinds of relational
operations. The relation contains information about employees, IT
skills they have and the school where they attend each skill.
Employee
EmpID  FName   LName    SkillID  Skill   SkillType    School  SchoolAdd    SkillLevel
12     Abebe   Mekuria  2        SQL     Database     AAU     Sidist_Kilo  5
16     Lemma   Alemu    5        C++     Programming  Unity   Gerji        6
28     Chane   Kebede   2        SQL     Database     AAU     Sidist_Kilo  10
25     Abera   Taye     6        VB6     Programming  Helico  Piazza       8
65     Almaz   Belay    2        SQL     Database     Helico  Piazza       9
24     Dereje  Tamiru   8        Oracle  Database     Unity   Gerji        5
51     Selam   Belay    4        Prolog  Programming  Jimma   Jimma City   8
94     Alem    Kebede   3        Cisco   Networking   AAU     Sidist_Kilo  7
18     Girma   Dereje   1        IP      Programming  Jimma   Jimma City   4
13     Yared   Gizaw    7        Java    Programming  AAU     Sidist_Kilo  6
1. Selection
Selects subset of tuples/rows in a relation that satisfy selection condition.
Selection operation is a unary operator (it is applied to a single relation)
The Selection operation is applied to each tuple individually
The degree of the resulting relation is the same as the original relation but
the cardinality (no. of tuples) is less than or equal to the original relation.
The Selection operator is commutative.
A set of conditions can be combined using the Boolean operations ∧ (AND), ∨ (OR),
and ~ (NOT).
No duplicates in result!
Schema of result identical to schema of (only) input relation.
Result relation can be the input for another relational algebra operation!
(Operator composition.)
It is a filter that keeps only those tuples that satisfy a qualifying condition
(those satisfying the condition are selected while others are discarded.)
Notation:
σ <Selection Condition> (<Relation Name>)
Example: Find all Employees with skill type of Database.
If the query is all employees with a SkillType Database and School Unity the
relational algebra operation and the resulting relation will be as follows.
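Both selections can be sketched in Python by modelling a relation as a set of tuples and the selection condition as a predicate. The rows are a cut-down copy of Table 1 (only the attributes the two queries need); the names Emp and select are hypothetical.

```python
from collections import namedtuple

# Cut-down copy of Table 1: only the attributes used by the two queries.
Emp = namedtuple("Emp", ["EmpID", "FName", "SkillType", "School"])
EMPLOYEE = {
    Emp(12, "Abebe", "Database", "AAU"),
    Emp(16, "Lemma", "Programming", "Unity"),
    Emp(28, "Chane", "Database", "AAU"),
    Emp(25, "Abera", "Programming", "Helico"),
    Emp(65, "Almaz", "Database", "Helico"),
    Emp(24, "Dereje", "Database", "Unity"),
    Emp(51, "Selam", "Programming", "Jimma"),
    Emp(94, "Alem", "Networking", "AAU"),
    Emp(18, "Girma", "Programming", "Jimma"),
    Emp(13, "Yared", "Programming", "AAU"),
}

def select(relation, condition):
    """Selection: keep the tuples for which condition(t) is true."""
    return {t for t in relation if condition(t)}

# SkillType = Database
database_emps = select(EMPLOYEE, lambda t: t.SkillType == "Database")
# SkillType = Database AND School = Unity
database_at_unity = select(
    EMPLOYEE, lambda t: t.SkillType == "Database" and t.School == "Unity"
)

print(sorted(t.EmpID for t in database_emps))      # [12, 24, 28, 65]
print(sorted(t.EmpID for t in database_at_unity))  # [24]
```

The result has the same attributes as the input (same degree) and never more tuples than the input, as stated in the properties of selection.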
2. Projection
Selects certain attributes while discarding the other from the base relation.
The PROJECT creates a vertical partitioning – one with the needed columns
(attributes) containing results of the operation and other containing the
discarded Columns.
Deletes attributes that are not in projection list.
Schema of result contains exactly the fields in the projection list, with the
same names that they had in the (only) input relation.
Projection operator has to eliminate duplicates!
Note: real systems typically don’t do duplicate elimination unless
the user explicitly asks for it.
If the Primary Key is in the projection list, then duplication will not occur
Duplicate removal is necessary to ensure that the resulting table is also a
relation.
Notation:
π <Selected Attributes> (<Relation Name>)
Example: To display Name, Skill, and Skill Level of an employee, the query and
the resulting relation will be:
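Projection can be sketched in Python as building a set of shortened tuples, which also gives duplicate elimination for free. The four rows are the first four rows of Table 1, limited to the relevant attributes; the function name project is hypothetical.

```python
# First four rows of Table 1, limited to the attributes used here.
EMPLOYEE = [
    {"FName": "Abebe", "Skill": "SQL", "SkillType": "Database", "SkillLevel": 5},
    {"FName": "Lemma", "Skill": "C++", "SkillType": "Programming", "SkillLevel": 6},
    {"FName": "Chane", "Skill": "SQL", "SkillType": "Database", "SkillLevel": 10},
    {"FName": "Abera", "Skill": "VB6", "SkillType": "Programming", "SkillLevel": 8},
]

def project(relation, attrs):
    """Projection: keep only attrs; the set removes duplicate tuples."""
    return {tuple(row[a] for a in attrs) for row in relation}

name_skill_level = project(EMPLOYEE, ["FName", "Skill", "SkillLevel"])
skill_types = project(EMPLOYEE, ["SkillType"])

print(sorted(name_skill_level))
print(sorted(skill_types))  # [('Database',), ('Programming',)]
```

Projecting on SkillType alone collapses four input rows into two result tuples, which is the duplicate elimination that the formal operator requires.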
3. Rename Operation
We may want to apply several relational algebra operations one after the
other. The query could be written in two different forms:
1. Write the operations as a single relational algebra expression by
nesting the operations.
2. Apply one operation at a time and create intermediate result
relations. In the latter case, we must give names to the relations
that hold the intermediate results.
If we want to have the Name, Skill, and Skill Level of an employee with salary
greater than 1500 and working for department 5, we can write the expression for
this query using the two alternatives:
Then Result will be equivalent with the relation we get using the first
alternative.
4. Set Operations
The three main set operations are Union, Intersection and Set Difference. Their
properties are similar to those of the corresponding operations in
mathematical set theory. The difference is that, in the database context, the elements
of each set, which is a relation, are tuples. The set operations are
binary operations that require the two operand relations to be type
compatible.
Type Compatibility
Two relations R1 and R2 are said to be Type Compatible if:
1. The operand relations R1(A1, A2, ..., An) and R2(B1, B2, ..., Bn) have the
same number of attributes, and
2. The domains of corresponding attributes must be compatible; that is,
Dom(Ai)=Dom(Bi) for i=1, 2, ..., n.
To illustrate the three set operations, we will make use of the following two tables:
Employee
EmpID FName LName SkillID Skill SkillType School SkillLevel
12 Abebe Mekuria 2 SQL Database AAU 5
16 Lemma Alemu 5 C++ Programming Unity 6
28 Chane Kebede 2 SQL Database AAU 10
25 Abera Taye 6 VB6 Programming Helico 8
65 Almaz Belay 2 SQL Database Helico 9
24 Dereje Tamiru 8 Oracle Database Unity 5
51 Selam Belay 4 Prolog Programming Jimma 8
94 Alem Kebede 3 Cisco Networking AAU 7
18 Girma Dereje 1 IP Programming Jimma 4
13 Yared Gizaw 7 Java Programming AAU 6
a. UNION Operation
The result of this operation, denoted by R ∪ S, is a relation that
includes all tuples that are either in R or in S or in both R and S.
Duplicate tuples are eliminated.
The two operands must be "type compatible"
b. INTERSECTION Operation
The result of this operation, denoted by R ∩ S, is a relation that
includes all tuples that are in both R and S. The two operands must
be "type compatible"
Eg: RelationOne ∩ RelationTwo
Employees who attend Database Course at AAU
The resulting relation for R1 ∪ R2, R1 ∩ R2, or R1 − R2 has the same attribute
names as the first operand relation R1 (by convention).
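The three set operations map directly onto Python sets of tuples. In the sketch below, relation_one and relation_two are assumptions: they are derived from the Employee table as the employees with a Database skill and the employees attending AAU, each reduced to (EmpID, FName). The helper type_compatible is hypothetical and uses Python types as a stand-in for attribute domains.

```python
# Assumed operands, derived from the Employee table as (EmpID, FName) pairs.
relation_one = {(12, "Abebe"), (28, "Chane"), (65, "Almaz"), (24, "Dereje")}  # Database skill
relation_two = {(12, "Abebe"), (28, "Chane"), (94, "Alem"), (13, "Yared")}    # attend AAU

def type_compatible(r, s):
    """Same degree and matching value types in each position."""
    if not r or not s:
        return True
    a, b = next(iter(r)), next(iter(s))
    return len(a) == len(b) and all(type(x) is type(y) for x, y in zip(a, b))

assert type_compatible(relation_one, relation_two)

union = relation_one | relation_two          # in R1 or R2, duplicates removed
intersection = relation_one & relation_two   # Database skill AND at AAU
difference = relation_one - relation_two     # Database skill but not at AAU

print(sorted(union))
print(sorted(intersection))  # [(12, 'Abebe'), (28, 'Chane')]
print(sorted(difference))    # [(24, 'Dereje'), (65, 'Almaz')]
```

The union has six tuples rather than eight because the two tuples common to both operands appear only once.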
Example:
Employee
ID FName LName
123 Abebe Lemma
567 Belay Taye
822 Kefle Kebede
Dept
DeptID DeptName MangID
2 Finance 567
3 Personnel 123
Then the Cartesian product between Employee and Dept relations will be of the
form:
Employee X Dept:
ID FName LName DeptID DeptName MangID
123 Abebe Lemma 2 Finance 567
123 Abebe Lemma 3 Personnel 123
567 Belay Taye 2 Finance 567
567 Belay Taye 3 Personnel 123
822 Kefle Kebede 2 Finance 567
822 Kefle Kebede 3 Personnel 123
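The same product can be computed with itertools.product from Python's standard library. This is a minimal sketch; the tuples are the Employee and Dept rows of the example above.

```python
from itertools import product

EMPLOYEE = [(123, "Abebe", "Lemma"), (567, "Belay", "Taye"), (822, "Kefle", "Kebede")]
DEPT = [(2, "Finance", 567), (3, "Personnel", 123)]

# Every Employee tuple is concatenated with every Dept tuple.
employee_x_dept = [e + d for e, d in product(EMPLOYEE, DEPT)]

for row in employee_x_dept:
    print(row)
print(len(employee_x_dept), len(employee_x_dept[0]))  # 6 6
```

The cardinality of the result is the product of the operand cardinalities (3 × 2 = 6) and its degree is the sum of the operand degrees (3 + 3 = 6).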
6. JOIN Operation
A Cartesian product followed by a selection is used quite commonly to
identify and select related tuples from two relations, so this sequence is defined
as a special operation called JOIN. Thus in the JOIN operation, the Cartesian
product and the Selection operations are used together.
JOIN Operation is denoted by the ⋈ symbol.
This operation is very important for any relational database with more than a
single relation, because it allows us to process relationships among relations.
The general form of a join operation on two relations
R(A1, A2, ..., An) and S(B1, B2, ..., Bm) is:
R ⋈ <join condition> S
which is equivalent to σ <selection condition> (R X S)
Where, R and S can be any relation that results from general relational algebra
expressions.
Since JOIN is an operation that needs two relations, it is a binary operation.
Example:
Thus, if in the above example we want to extract employee information about
the managers of the departments, the algebra query using the JOIN operation will
be:
Employee ⋈ ID=MangID Dept
a. EQUIJOIN Operation
The most common use of join involves join conditions with equality comparisons
only (=). Such a join, where the only comparison operator used is the equal sign is
called an EQUIJOIN. In the result of an EQUIJOIN we always have one or more
pairs of attributes (whose names need not be identical) that have identical values
in every tuple since we used the equality logical operator.
For example, the above JOIN expression is an EQUIJOIN since the logical
operator used is the equal to operator (=).
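A join can be sketched in Python exactly as it is defined: a Cartesian product followed by a selection. The tuples are the Employee and Dept rows of the Cartesian product example; the function name theta_join is hypothetical, and the join condition ID = MangID is an equality, so this is an EQUIJOIN.

```python
from itertools import product

EMPLOYEE = [(123, "Abebe", "Lemma"), (567, "Belay", "Taye"), (822, "Kefle", "Kebede")]
DEPT = [(2, "Finance", 567), (3, "Personnel", 123)]

def theta_join(r, s, condition):
    """Join = selection applied to the Cartesian product of r and s."""
    return [a + b for a, b in product(r, s) if condition(a, b)]

# Position 0 of Employee is ID, position 2 of Dept is MangID.
managers = theta_join(EMPLOYEE, DEPT, lambda e, d: e[0] == d[2])

for row in managers:
    print(row)
```

Only two of the six product tuples survive: Abebe paired with Personnel and Belay paired with Finance. In each result tuple the ID and MangID values are identical, as the text says of every EQUIJOIN.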
When two relations are joined by a JOIN operator, there could be some tuples in
the first relation that have no matching tuple in the second relation, and the
query may need to display these non-matching tuples from the first or second
relation. Such a query is represented by the OUTER JOIN.
d. SEMIJOIN Operation
SEMI JOIN is another version of the JOIN operation where the resulting relation
contains the attributes of only one of the relations, for those tuples that are related to tuples
in the other relation. The following notation depicts the inclusion of only the
attributes from the first relation (R) in the result, for the tuples that actually participate
in the relationship.
R ⋉ <Join Condition> S
Aggregate functions and Grouping statements
Some queries may involve aggregate function (scalar aggregates
like totals in a report, or Vector aggregates like subtotals in
reports)
b) GA ℑ AL (R):
Vector aggregate functions on relation R with
AL as list of (<aggregate function >, <attribute >) pairs with
a grouping attribute GA.
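The difference between scalar and vector aggregates can be sketched in Python. The (SkillType, SkillLevel) pairs are taken from the ten rows of Table 1; the choice of COUNT, MAX and AVG as aggregate functions and of SkillType as the grouping attribute is an assumption made for illustration.

```python
from collections import defaultdict

# (SkillType, SkillLevel) for the ten employees of Table 1.
ROWS = [
    ("Database", 5), ("Programming", 6), ("Database", 10), ("Programming", 8),
    ("Database", 9), ("Database", 5), ("Programming", 8), ("Networking", 7),
    ("Programming", 4), ("Programming", 6),
]

# Scalar aggregates: one value for the whole relation.
scalar_count = len(ROWS)
scalar_max = max(level for _, level in ROWS)

# Vector aggregates: one value per group of the grouping attribute.
groups = defaultdict(list)
for skill_type, level in ROWS:
    groups[skill_type].append(level)
vector = {g: (len(levels), sum(levels) / len(levels)) for g, levels in groups.items()}

print(scalar_count, scalar_max)  # 10 10
for g in sorted(vector):
    print(g, vector[g])
```

The scalar form corresponds to a report total, while the vector form produces one (COUNT, AVG) pair per SkillType, like subtotals in a report.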
Relational Calculus
A relational calculus expression creates a new relation, which is specified in
terms of variables that range over rows of the stored database relations (in
tuple calculus) or over columns of the stored relations (in domain calculus).
When we substitute values for the arguments in the predicate, the function
yields an expression, called a proposition, which can be either true or false.
If COND is a predicate, then the set of all tuples evaluated to be true for the
predicate COND will be expressed as follows:
{t | COND(t)}
Where t is a tuple variable and COND (t) is a conditional
expression involving t. The result of such a query is the set of all tuples
t that satisfy COND (t).
If we have a set of predicates to evaluate for a single query, the predicates can
be connected using ∧ (AND), ∨ (OR), and ~ (NOT)
To find only the EmpId, FName, LName, Skill and the School where
the skill is attended for employees with a skill level greater than
or equal to 8, the tuple based relational calculus expression will be:
{E.EmpId, E.FName, E.LName, E.Skill, E.School | Employee(E) ∧ E.SkillLevel >= 8}
E.FName means the value of the First Name (FName) attribute for the
tuple E.
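A tuple calculus expression of the form {t | COND(t)} reads very much like a set comprehension. The sketch below is in Python; the six rows are taken from Table 1, the name Emp is hypothetical, and the comprehension is only an analogy for the declarative expression, not an implementation of relational calculus.

```python
from collections import namedtuple

Emp = namedtuple("Emp", ["EmpID", "FName", "LName", "Skill", "School", "SkillLevel"])

# Six rows of Table 1, limited to the attributes used by the query.
EMPLOYEE = [
    Emp(12, "Abebe", "Mekuria", "SQL", "AAU", 5),
    Emp(28, "Chane", "Kebede", "SQL", "AAU", 10),
    Emp(25, "Abera", "Taye", "VB6", "Helico", 8),
    Emp(65, "Almaz", "Belay", "SQL", "Helico", 9),
    Emp(51, "Selam", "Belay", "Prolog", "Jimma", 8),
    Emp(13, "Yared", "Gizaw", "Java", "AAU", 6),
]

# E ranges over the tuples of Employee; the condition is E.SkillLevel >= 8.
result = {
    (E.EmpID, E.FName, E.LName, E.Skill, E.School)
    for E in EMPLOYEE
    if E.SkillLevel >= 8
}

print(sorted(result))
```

The tuple variable E plays the same role in the comprehension as in the calculus expression: it ranges over the rows of the relation, and only rows for which the condition is true contribute to the result.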
This means, for all tuples of relation employee where value for the
SkillLevel attribute is greater than or equal to 8.
Example:
Chapter Seven
Database security and integrity is about protecting the database from being
inconsistent and being disrupted. We can also call it database misuse.
Likewise, even though there are various threats that could be categorized
in this group, intentional misuse could be:
Unauthorized reading of data
Examples of threats:
Using another person's means of access
Unauthorized amendment/modification or copying of data
Program alteration
Inadequate policies and procedures that allow a mix of
confidential and normal output
Wire-tapping
Illegal entry by hacker
Blackmail
Creating ‘trapdoor’ into system
Theft of data, programs, and equipment
Failure of security mechanisms, giving greater access than
normal
Staff shortages or strikes
Inadequate staff training
Viewing and disclosing unauthorized data
Electronic interference and radiation
Data corruption owing to power loss or surge
Fire (electrical fault, lightning strike, arson), flood, bomb
Physical damage to equipment
Breaking cables or disconnection of cables
Introduction of viruses
These policies
should be known by the system: should be encoded in the system
should be remembered: should be saved somewhere (the catalogue)
Views
A view is the dynamic result of one or more relational operations
operating on the base relations to produce another relation
A view is a virtual relation that does not actually exist in the
database, but is produced upon request by a particular user
The view mechanism provides a powerful and flexible security
mechanism by hiding parts of the database from certain users
Using a view is more restrictive than simply having certain
privileges granted to a user on the base relation(s)
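The hiding effect of a view can be sketched with Python's built-in sqlite3 module. The view name EmployeePublic and the salary values are hypothetical. SQLite has no GRANT statement, so this example shows only how a view exposes part of a base relation; in a multi-user DBMS the view would be combined with privileges so that certain users can query the view but not the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (EID INTEGER PRIMARY KEY, FName TEXT, LName TEXT, Salary REAL)"
)
# Salary values are hypothetical.
conn.execute("INSERT INTO Employee VALUES (123, 'Abebe', 'Lemma', 1000.0)")

# The view exposes names but hides the Salary column.
conn.execute("CREATE VIEW EmployeePublic AS SELECT EID, FName, LName FROM Employee")

cur = conn.execute("SELECT * FROM EmployeePublic")
view_columns = [d[0] for d in cur.description]
view_rows = cur.fetchall()
print(view_columns, view_rows)

# The view is dynamic: it is evaluated on request, so new base rows show up.
conn.execute("INSERT INTO Employee VALUES (567, 'Belay', 'Taye', 2000.0)")
row_count_after = conn.execute("SELECT COUNT(*) FROM EmployeePublic").fetchone()[0]
print(row_count_after)
```

The view stores no data of its own, which is why the second insert into the base table is visible through it immediately.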
Integrity
Integrity constraints contribute to maintaining a secure database
system by preventing data from becoming invalid and hence giving
misleading or incorrect results
Domain Integrity
Entity integrity
Referential integrity
Key constraints
Encryption
The encoding of the data by a special algorithm that renders the
data unreadable by any program without the decryption key
If a database system holds particularly sensitive data, it may be
deemed necessary to encode it as a precaution against possible
external threats or attempts to access it
The DBMS can access data after decoding it, although there is a
degradation in performance because of the time taken to
decode it
Encryption also protects data transmitted over communication
lines
To transmit data securely over insecure networks requires the
use of a Cryptosystem, which includes:
Authentication
All users of the database will have different access levels and
permission for different data objects, and authentication is the process
of checking whether the user is the one with the privilege for the
access level.
It is the process of checking that the users are who they say they are.
Each user is given a unique identifier, which is used by the operating
system to determine who they are
Thus the system will check whether the user with a specific username
and password is trying to use the resource.
Associated with each identifier is a password, chosen by the user and
known to the operating system, which must be supplied to enable the
operating system to authenticate who the user claims to be
Any database access request will have the following three major
components
1. Requested Operation: what kind of operation is requested
by a specific query?
2. Requested Object: on which resource or data of the database
is the operation sought to be applied?
3. Requesting User: who is the user requesting the operation on
the specified object?
The database should be able to check for all the three components before
processing any request. The checking is performed by the security
subsystem of the DBMS.
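The three-component check can be sketched as a lookup in a privilege table. This is a minimal illustration in Python; the user names, the PRIVILEGES table and the function is_authorized are hypothetical, and a real DBMS security subsystem stores and evaluates privileges in its own catalogue.

```python
# Hypothetical privilege table: (user, object) -> set of allowed operations.
PRIVILEGES = {
    ("abebe", "Employee"): {"read", "insert"},
    ("belay", "Employee"): {"read"},
}

def is_authorized(user, operation, obj):
    """Check requesting user, requested operation and requested object together."""
    return operation in PRIVILEGES.get((user, obj), set())

print(is_authorized("abebe", "insert", "Employee"))  # True
print(is_authorized("belay", "insert", "Employee"))  # False
print(is_authorized("belay", "read", "Project"))     # False
```

A request is processed only when all three components match an entry; an unknown user or object falls through to the empty set and is refused.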
2. Insert Authorization: the user with this privilege is allowed only to insert
new records or items to the data object.
4. Delete Authorization: users with this privilege are only allowed to delete
a record and not anything else.
Different users, depending on the power of the user, can have one or the
combination of the above forms of authorization on different data objects.
Concepts in DDBMS
Replication: System maintains multiple copies of data, stored in
different sites, for faster retrieval and fault tolerance.
Fragmentation: Relation is partitioned into several fragments stored in
distinct sites
Data transparency: Degree to which system user may remain unaware
of the details of how and where the data items are stored in a distributed
system
Advantages of DDBMS
1. Data sharing and distributed control:
A user at one site may be able to access data that is available at another site.
Each site can retain some degree of control over local data
We will have local as well as global database administrator
Disadvantages of DDBMS
1. Software development cost
2. Greater potential for bugs (parallel processing may endanger
correctness)
3. Increased processing overhead (due to communication jargons)
4. Communication problems
3. Data warehousing
A data warehouse is an integrated, subject-oriented, time-variant,
non-volatile database that provides support for decision making.