Database Management System Complete Notes
Data: data means a known fact that can be recorded and that has some
implicit meaning. E.g., the names, telephone numbers, and addresses of people.
Database: a database is a collection of inter-related data with an implicit
meaning.
DBMS: stands for Database Management System and it consists of two major
components.
The collection of inter-related data, which is called the database.
A set of software packages or a set of software tools or programs that
can access and process the data.
FILE SYSTEM
In the traditional approach, information is stored in flat files which are maintained by
the file system of OS.
Application programs go through the file system to access these flat files.
Traditional Approach: (Challenges)
Data Security: The data as maintained in the flat file(s) is easily accessible and
therefore not secure.
Example: Consider the Banking System. The Customer transaction file has details about the total available balance of
all customers. A Customer wants information about his account balance. In a file system it is difficult to give the
Customer access to only his data. Thus enforcing security constraints for the entire file or for certain data items is
difficult.
Data Redundancy: Often the same information is duplicated in two or more files.
Example: Assume the same data is repeated in two or more files. If change is made to data in one file, it is required
that the change be made to the data in the other file as well. If this is not done, it will lead to error during access of
the data.
Data Isolation: Data isolation means that all the related data is not available in one file.
Generally, the data is scattered in various files, and the files may be in different
formats.
Concurrent Access Anomalies: Many traditional systems allow multiple users to access and
update the same piece of data simultaneously. But the interaction of concurrent updates
may result in inconsistent data.
In the database approach a single repository of data is maintained that is defined once
and is accessed by various users in many ways.
Characteristics: The following are the main characteristics of the database approach.
Self describing nature of a database system: The database system contains not only the database itself
but also a complete definition / description of the database structure and constraints. The descriptions include
the structure of each file, its type and storage format of each data item and various constraints imposed on
them. This information is called metadata and it is stored in the DBMS catalog.
Insulation between programs and data, and data abstraction: in the DBMS approach, if we change the
structure of a file, we do not need to change all the programs that access the file, because the structure of the
data files is stored in the DBMS catalog separately from the access programs. This property is called
program–data independence.
Data abstraction: the characteristic that allows program–data independence and program–operation
independence is called data abstraction.
Support of multiple views of data: different users of database require a different perspective or
view of the database.
A view may be a subset of the database or it may contain virtual data that is
derived from the database files but is not explicitly stored.
Shared data: a database allows the sharing of data under its control by any number of
application programs or users.
Restricting unauthorized access:
The database administrator (DBA) who has the ultimate responsibility for the data in
the DBMS can ensure that proper access procedures are followed including proper
authentication schemes for access to the DBMS and additional checks before permitting
access to sensitive data.
Different levels of security can be implemented for the various types of data and
operations.
Advantages of DBMS:
Providing backup and recovery: the backup and recovery subsystem of the DBMS
provides the facilities for recovering from hardware/software failures.
For e.g., if the computer system fails in the middle of a complex update transaction, the recovery
subsystem is responsible for making sure that the database is restored to the state it was in before the
transaction started executing.
Integrity: data integrity means that the data contained in the database is both accurate
and consistent. Therefore data values being entered for storage could be checked to
ensure that they fall within a specified range and are of the correct format.
For e.g., the value of the age of an employee may be in the range of 18 and 60. Another type of
constraint specifies the uniqueness of data item values such as “every student must have a unique value for
roll number”.
•Providing storage structures for efficient query processing: the DBMS provides specialized
data structures to speed up disk search for the required data. Indexes used for disk
search are typically based on tree or hash data structures.
The query processing and optimization module of the DBMS is responsible for choosing an efficient query
execution plan for each query based on the existing storage structure.
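As an illustrative sketch (the table and column names here are assumed, not from the notes), such a storage structure is requested in SQL with a single statement, and the DBMS then builds and maintains the tree- or hash-based structure automatically:
CREATE INDEX idx_employee_dept ON employee (dnumber);  -- speeds up searches on dnumber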
Representing complex relationships among data: a DBMS must have the capability to represent a variety
of complex relationships of data that are interrelated in many ways as well as to retrieve and update
related data easily and efficiently.
Database users are those who really use and take the benefits of database. There will
be different types of users depending on their need and way of accessing the database.
Application Programmers - They are the developers who interact with the database by means of
DML queries.
These DML queries are written in the application programs like C, C++, JAVA, Pascal etc.
These queries are converted into object code to communicate with the database. (For example,
writing a C program to generate the report of employees who are working in particular department
will involve a query to fetch the data from database. )
It will include an embedded SQL query in the C program.
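For illustration only (the employee table and its columns are hypothetical), the SQL statement such a C program would embed might look like:
SELECT ename, eno
FROM employee
WHERE dnumber = 5;  -- report of employees working in a particular department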
Sophisticated Users: They are database developers, who write SQL queries to
select/insert/delete/update data.
These users will be scientists, engineers, analysts who thoroughly study SQL and DBMS to apply the
concepts in their requirement. In short, we can say this category includes designers and developers of
DBMS and SQL.
Specialized Users - These are also sophisticated users, but they write special database application
programs. They are the developers who develop the complex programs to the requirement.
Stand-alone Users - These users will have stand-alone database for their personal use. These kinds of
database will have readymade database packages which will have menus and graphical interfaces.
Naive users - Any user who does not have any knowledge about databases can be in this category. Their
task is to just use the developed application and get the desired results. For example: clerical staff in any
bank is a naïve user. They don’t have any DBMS knowledge but they still use the database and perform
their given task.
Introduction: Earlier architectures used mainframe computers to provide the main processing
for all functions of the system, including user application programs as well as all the
DBMS functionality.
•Data models:
A data model is a collection of concepts that can be used to describe the structure of a database,
where by structure of a database we mean the data types, relationships, and constraints that should hold
for the data.
Most data models also include a set of basic operations, such as insert, delete, modify, or retrieve,
for any kind of object.
Data models are classified according to the types of concepts they use to describe
the database structure.
Logical / conceptual / high level data model: it provides the concepts that are close to the
way many users perceive data. It describes what data are stored in the database and what
relationship exists among the data. It uses the concept such as entities, attributes and relationships.
Entity: it represents a real world object / concept such as an employee, student or project….
Attribute: it represents some properties that describe the entity such as employee’s name, mobile etc…
Relationship: a relationship is the association among two or more entities.
Physical / low level data model: It provides the concept that describes the details of how data is
stored in the computer.
Representational / Implementation data model: between the above two data model, this level
provides concepts that may be understood by end users but are not too far removed from the way
data is organized within the computer.
Data abstraction: It provides user a conceptual representation with an abstract view of the data.
(The system hides certain details of how data are stored and maintained that are not needed by most
database users.)
Database schema: the description of the database or the overall design of the database is called as
the database schema, which is specified during database design and is not expected to change
frequently.
Database state / snapshot / Instance: the data in the database at a particular moment in time is
called as the database state.
Every time we insert or delete a record or change the value of a data item in a record, we change one
state of the database into another state.
Metadata: it is the data about data means the description of schema constructs and constraints.
Metadata is the information such as the structure of each file, the type and storage format of each data
item and various constraints on the data.
The process of transforming requests and results between levels is called mapping.
Data independence: It is the concept which can be defined as the capacity to change the
schema at one level of a database system without having to change the schema at the next
higher level. We can define two types of data independence:
•Logical data independence: It is the capacity to change the conceptual schema without having to
change the external schema or application programs. We may change the conceptual schema to expand
the database by adding a data item, to change constraints or to reduce the database and that will not
affect the external schema. Only the view definition and the mappings need to be changed.
•Physical data independence: It is the capacity to change the internal schema without having to change
the conceptual schema. For e.g., the storage structure or devices used for storing the data could be
changed without necessitating a change in the conceptual view / external view.
In the hierarchical database approach/model the data are in a hierarchical relationship,
analogous to the way data are stored in paper files: each customer section would contain folders for
individual orders, and the orders would list each item being purchased.
To store or retrieve data, the database system must start at the top of the hierarchy,
which makes it difficult to search for an item in the middle or at the bottom of the hierarchy.
The database is used to store information that is useful for an organization and represents
this information through some means of modeling.
The ER model was developed to facilitate the database design that represents the overall
logical structure of a database.
The ER model is very useful in mapping the meanings and interactions of real-world
enterprises onto a conceptual schema.
The database structure, employing the ER model, is usually shown pictorially using an entity-
relationship (ER) diagram, which shows how the schema for the database application can be
displayed by means of graphical notation.
The ER data model employs three basic concepts:
Entity sets
Relationship sets
Attributes
Entity sets:-
An entity is a “thing” or “object” in the real world that is distinguished from all other
objects. For example, each student in the student database is an entity.
An entity has a set of properties and the values of the set of properties may uniquely
identify an entity.
For example , a student may have a roll number property whose value uniquely identifies that
student.
Entity set:- An entity set is a set of entities of the same type that share the same
properties, or attributes.
For example the set of all students in the student database can be defined as an entity set
student.
A database thus includes a collection of entity sets, each of which contains any number
of entities of the same type.
Each entity has a value for each of its attributes. For example possible attributes for
the student entity are roll-no, name, age etc.
Domain/value set:- For each attribute, there is a set of permitted values, called the
domain or value set, of that attribute.
For example , the domain of the attribute roll-no might be the set of all integers from 0 to 9.
Derived attributes are attributes whose values are derived from other attributes.
For example, the age attribute can be derived from the attribute date_of_birth.
Complex attributes:-
Composite and multi-valued attributes can be nested to form a complex attribute.
For example, if a person can have more than one residence and each residence can have multiple
phones, then the attributes address and phone can be specified as a complex attribute.
Foreign key
Usually a foreign key is a “copy” of a primary key that has been exported from one relation into another to represent the existence of a relationship between them.
The function that an entity plays in a relationship is called the entity’s role. The role
name signifies the role that a participating entity from the entity type plays in each
relationship instance ,and helps to explain what the relationship means.
Recursive relationship set:- When the entity sets of a relationship set are not distinct, that
is, when the same entity set participates in the relationship set more than once in different roles,
the relationship set is called a recursive relationship set.
Degree of relationship:- The degree of a relationship type is the number of participating
entity types in the relationship.
Mapping cardinalities:- (Cardinality ratios): Mapping cardinalities express the number of
entities to which another entity can be associated via a relationship set.
For example, for a binary relationship set R between entity sets A and B, the mapping cardinality must
be one of the following: one-to-one, one-to-many, many-to-one, or many-to-many.
A weak entity type does not have sufficient attributes to form a key attribute.
Entities belonging to a weak entity type are identified by being related to another entity type, known as
the owner entity type.
A weak entity type always has total participation with respect to its identifying relationship, because a
weak entity cannot be identified without an owner entity.
For example, consider the entity type DEPENDENT, related to EMPLOYEE, which is used to keep the details of
the employee’s dependents. The attributes of DEPENDENT are Name, DOB, Sex, and relationship to the
employee.
Participation Constraints:-
A weak entity type (in our example, DEPENDENT) has a partial key (for example, Name), which is a set
of attributes that can uniquely identify the weak entities with the help of the strong entity
type (e.g., EMPLOYEE).
In ER-diagram, both a weak entity type and its identifying relationship are distinguished
by surrounding their boxes and diamonds with double lines.
Composite attributes are represented by an ellipse from which other ellipses are produced to represent the component attributes, e.g., Address.
Unary Relationship:
A unary relationship is represented as a diamond which connects one entity to itself as a loop. Such a relationship means that some instances of Employee manage other instances of Employee.
Role Names:
Role names may be added to make the meaning more explicit.
Binary Relationship:
A relationship between two entity types
Ternary Relationship:
A relationship connecting three entity
types.
Relationship Participation:
All instances of the entity type Employee don’t participate in the relationship Head-of.
•Every employee doesn’t head a department, so the Employee entity type is said to participate partially in the
relationship.
•But every department would be headed by some employee.
•So, all instances of the entity type Department participate in this relationship; we say that it is total
participation from the Department side.
Attributes of a Relationship:
These attributes best describe the relationship Prescription rather than any individual entity Doctor, Patient or Medicine.
Weak Entity:
The identifying relationship is the one which relates the weak entity (Dependent) with the strong entity (Employee) on which it depends. Id is underlined with a dotted line because it is used, along with E#, to form the composite key of the Dependent entity.
One course is enrolled in by multiple students and one student enrolls for multiple courses,
hence the cardinality between course and student is Many to Many.
The department offers many courses and each course belongs to only one department.
Hence the cardinality between the department and course is One to Many.
One department has multiple instructors and one instructor belongs to only one
department. Hence the cardinality between the department and Instructor is One to
Many.
In each department there is a “Head of department” and one instructor is “Head of
department”. Hence the cardinality is One to One
One course is taught by one instructor but the instructor teaches many courses. Hence
the cardinality between course and instructor is Many to One.
Converting ER Diagram to Relational Schema
The way relationships are represented depends on the cardinality and the
degree of the relationship.
The possible cardinalities are
1:1
1:M
M:N
The degrees are:
Unary
Binary
Ternary
Converting Relationship(Binary 1:1)
For many-to-many relationships and ternary relationships we create a separate table; otherwise, the existing table
(created for the entities) is modified to express the relationship.
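A minimal SQL sketch of this rule (table and column names are assumed, not from the original): an M:N relationship becomes its own table, while a 1:M relationship is absorbed into the table on the "many" side.
-- M:N relationship between STUDENT and COURSE gets a separate table
CREATE TABLE enrolls (
  sid INT REFERENCES student(sid),
  cid INT REFERENCES course(cid),
  PRIMARY KEY (sid, cid)  -- the combined key identifies each relationship instance
);
-- 1:M relationship: a foreign key is added to the existing COURSE table
ALTER TABLE course ADD COLUMN dnumber INT REFERENCES department(dnumber);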
Extended ER Modeling:-
ER modeling concepts are sufficient for representing many traditional database
applications, but there are more complex applications, such as
telecommunications, GIS (Geographic Information Systems), and CAD/CAM. These
types of applications have more complex data requirements.
The extended features are:
Specialization
Generalization
Aggregation.
This is the inheritance concept: an entity that is a member of a subclass inherits
all the attributes of the entity as a member of the superclass, and it also inherits
all the relationships in which the superclass participates.
For example, the entity type EMPLOYEE describes the type of each employee entity and also
refers to the COMPANY database. The entity type EMPLOYEE (superclass) has some sub-
groupings such as ENGINEER, TECHNICIAN, SECRETARY (subclasses).
SPECIALIZATION:-
A relational data model includes a set of operations to manipulate the database, for which
two formal languages are in use:
Relational Algebra
Relational calculus
It uses operators to perform queries. An operator can be either unary or binary.
They accept relations as their input and yield relations as their output.
It selects tuples that satisfy the given predicate from a relation.
Notation − σp(r)
Where σ stands for the selection operator, p for the selection predicate, and r for the relation. p is a
propositional logic formula which may use connectives like and, or, and not. These terms may use relational
operators like − =, ≠, ≥, < , >, ≤.
Example-1
σsubject = "database"(Books)
Output − Selects tuples from books where subject is 'database'.
Example-2
σsubject = "database" and price = "450"(Books)
Output − Selects tuples from books where subject is 'database' and 'price' is 450.
Example-3
σsubject = "database" and price = "450" or year > "2010"(Books)
Output − Selects tuples from books where subject is 'database' and 'price' is 450 or
those books published after 2010.
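For comparison (an illustrative sketch; the Books table and its columns are assumed), the same selections in SQL:
SELECT * FROM books WHERE subject = 'database';
SELECT * FROM books WHERE subject = 'database' AND price = 450;
SELECT * FROM books WHERE (subject = 'database' AND price = 450) OR year > 2010;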
Roll No.  Name    Dept.  Age  Address      Gender
101       Sachin  CSE    22   Gunupur      Male
102       Rahul   IT     24   Rayagada     Male
103       Sourav  ECE    23   Gunupur      Male
104       Laxman  CSE    24   Bhubaneswar  Male
Query:
Find all the tuples from Student relation of CSE department
σ Dept. = “CSE“ (STUDENT)
We can combine several conditions by using the connectives ∧ (and), ∨ (or), ¬ (not)
Query:
Find the tuples from Student relation of CSE department who are staying at Gunupur.
σ Dept. = “CSE” ∧ Address = “Gunupur” (STUDENT)
Find the tuples from Student relation whose age is greater than 22 and who are staying at
Gunupur.
σ Age > 22 ∧ Address = “Gunupur” (STUDENT)
The Project operation selects certain columns from the table/Relation and discards the other
column/attributes.
Projection is denoted by the upper case Greek letter Pi (∏)
Example-1
∏subject, author (Books)
Selects and projects columns named as subject and author from the relation Books.
Example-2
Find out the Roll no, name and address of the relation student
∏ Roll_no, Name, Address (STUDENT)
A combination of the Select and Project operations is required when a query involves a
condition and asks only for particular attributes of the tuples that satisfy the
condition.
Query: Retrieve the Roll and Name of the students having age greater than 23.
∏ Roll_no, Name (σ Age > 23 (STUDENT))
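In SQL (illustrative only; lowercase table and column names assumed), projection corresponds to the column list and selection to the WHERE clause:
SELECT roll_no, name FROM student WHERE age > 23;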
The results of relational algebra are also relations but without any name. The rename operation allows us to
rename the output relation. 'rename' operation is denoted with small Greek letter rho ρ.
Allows us to name, and therefore to refer to, the results of relational-algebra expressions, allows us to refer
to a relation by more than one name.
Example:
ρx (E)
Binary Operation
The relation R (A1, A2,…..An) and S(B1, B2,……Bn) are said to be union compatible if
they have the same degree n and if dom(Ai) = dom(Bi) for 1<=i<=n.
This means that two relations must have same number of attributes and should have
compatible pair of attributes with same type of domains.
The result of this operation, denoted by R ∪ S, between two relations R and S includes all tuples that are
either in R, or in S, or in both R and S. Duplicate tuples are eliminated.
Example
Query: find all the customer names of the bank who have either an account or a loan, or both.
∏ Cust_name (ACCOUNT) U ∏ Cust_name (LOAN)
ACCOUNT               LOAN                  Result (CUST_NAME)
AC_NO  CUST_NAME      LOAN_NO  CUST_NAME    Sachin
A1001  Sachin         L2001    Sachin       Rahul
A1002  Rahul          L2002    Rahul        Saurav
A1003  Saurav         L2003    Rahul        Laxman
A1004  Sachin         L2004    Saurav
A1005  Laxman         L2005    Saurav
Intersection Operation ( ∩ )
The intersection operation, denoted by R ∩ S, between two relations R and S includes all tuples that are in
both R and S.
Example
Query: find all the customer names of the bank who have both an account and a loan.
∏ Cust_name (ACCOUNT) ∩ ∏ Cust_name (LOAN)
ACCOUNT               LOAN                  Result (CUST_NAME)
AC_NO  CUST_NAME      LOAN_NO  CUST_NAME    Sachin
A1001  Sachin         L2001    Sachin       Rahul
A1002  Rahul          L2002    Rahul        Saurav
A1003  Saurav         L2003    Rahul
A1004  Sachin         L2004    Saurav
A1005  Laxman         L2005    Saurav
Set Difference Operation ( − )
Notation: r − s
Defined as: r − s = {t | t ∈ r and t ∉ s}
The result of this operation, denoted by R − S, is a relation that includes all tuples that are in R but not in S.
Query: find the customer names of the bank who have an account but no loan.
∏ Cust_name (ACCOUNT) − ∏ Cust_name (LOAN)
Result (CUST_NAME): Laxman
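For illustration (assuming tables ACCOUNT and LOAN each with a CUST_NAME column), the three set operations correspond to SQL's set operators:
SELECT cust_name FROM account UNION SELECT cust_name FROM loan;      -- account or loan, or both
SELECT cust_name FROM account INTERSECT SELECT cust_name FROM loan;  -- both
SELECT cust_name FROM account EXCEPT SELECT cust_name FROM loan;     -- account but no loan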
Binary Relational Operations
It’s a binary operation, but the relations on which it is applied need not be
union compatible.
This operation is used to combine tuples from two relations.
The Cartesian Product is denoted by a cross symbol (X)
Mathematically, for relations R(A1, A2, ….., An) and S(B1, B2, ……, Bm), the Cartesian Product
R X S is a relation Q with n + m attributes, containing one tuple for every combination of a tuple from R
and a tuple from S.
The Cartesian Product of CUSTOMER and LOAN contains a total of 9 tuples (3 × 3), which associates
every tuple of CUSTOMER with every tuple of LOAN.
It’s a binary operation which takes two relations as input and produces a resultant
relation as output.
The JOIN operation checks whether a common (key) attribute is present in both relations; if
present, it retrieves the required information from both tables by comparing the
values of the common attribute.
The following are the type of JOIN operations:
Natural Join Operation
The NATURAL JOIN operation first forms the Cartesian product, then performs a selection
forcing equality on the attributes that appear in both relations, and finally removes the
duplicate attributes.
Mathematically, for the relations R and S the join operation is written R ⋈ S.
Example 1: find the names of the customers who have an account at the Gunupur branch.
Query: R1 ← σ Branch = “Gunupur” (CUSTOMER ⋈ LOAN)
RESULT ← ∏ NAME (R1)
OR
∏ NAME (σ Branch = “Gunupur” (CUSTOMER ⋈ LOAN))
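The same query as an illustrative SQL sketch (CUSTOMER and LOAN tables assumed to share a common column such as AC_NO):
SELECT name
FROM customer NATURAL JOIN loan
WHERE branch = 'Gunupur';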
OUTER JOIN
The OUTER JOIN operation is an extension of the join operation to deal with missing
information.
In our previous example, we lost the branch name and the loan amount information
of Sourav in the LOAN relation. Similarly, the name for account number A4 is absent in
the ACCOUNT relation.
We can use the outer join operation to avoid this loss of information.
There are three types of outer joins:
•Left outer join
•Right outer join
•Full outer join
The LEFT OUTER JOIN takes all tuples in the left relation that did not match with any
tuple in the right relation, fills the tuples with NULL values for all other attributes from the
right relation.
The RIGHT OUTER JOIN takes all tuples in the right relation that did not match with any
tuple in the left relation, fills the tuples with NULL values for all other attributes from the left
relation.
The FULL OUTER JOIN takes all tuples from both the relation and fills the missing values
with NULL values.
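An illustrative SQL sketch of the three variants (column names assumed); unmatched rows appear with NULL values in the columns of the other relation:
SELECT * FROM account LEFT OUTER JOIN loan ON account.cust_name = loan.cust_name;
SELECT * FROM account RIGHT OUTER JOIN loan ON account.cust_name = loan.cust_name;
SELECT * FROM account FULL OUTER JOIN loan ON account.cust_name = loan.cust_name;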
The Aggregate function takes a collection of values and returns a single value as a result.
It is denoted by the calligraphic letter 𝓖.
For example, the aggregate function SUM takes a collection of values and returns their sum.
Example: 𝓖 SUM(SALARY) (EMPLOYEE) returns a single-column relation SUM(SALARY) with the value 70000.
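The SQL counterpart is the built-in aggregate function (illustrative):
SELECT SUM(salary) FROM employee;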
Practice queries (on a supplier–parts schema):
2. “Find the IDs of suppliers who supply some red or green part.”
OR
“Find the names of suppliers who supply some red part or are based at 21 George Street.”
Give the following queries in the relational algebra using the relational schema
student(id, name)
enrolledIn(id, code)
subject(code, lecturer)
1. What are the names of students enrolled in cs3020?
11. What are the names of students who are taking a subject not taught by Roger?
Solve the following queries in the relational algebra using the relational schema
lives(person-name, street, city)
works(person-name, company-name, salary)
located-in(company-name, city)
manages(person-name, manager-name)
1. Find the name of all employees (i.e., persons) who work for the City Bank Company (which is a specific company in the
database).
2. Find the name and city of all employees who work for City Bank. Similar to previous query, except we have to access
the lives table to extract the city of the employee. The join condition is the same person name in the two tables Lives
and Works.
3. Find the name, street and city of all employees who work for City Bank and earn more than $10,000. Similar to
previous query except an additional condition on salary attribute.
4. Find all persons who do not work for City Bank.
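As a hedged sketch of some solutions in relational algebra (these answers are not in the original notes; the standard approach is assumed):
1. π person-name (σ company-name = “City Bank” (works))
2. π person-name, city (lives ⋈ σ company-name = “City Bank” (works))
4. π person-name (works) − π person-name (σ company-name = “City Bank” (works))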
Display all the employee details or all the tuple of EMPLOYEE relation
{t | EMPLOYEE(t)}
Display all the employee details whose salary is above 30000
{t | EMPLOYEE(t) AND t.SALARY > 30000}
Retrieve the name of the employee whose salary is above 30000
{t.NAME | EMPLOYEE(t) AND t.SALARY > 30000}
In tuple relational calculus, we specify the requested attributes for each selected tuple t,
and then we specify the condition for selecting a tuple following the bar ( | ).
Query : find the name and account number of the relation CUSTOMER and LOAN who have loan account
{t.NAME, t.AC_NO | CUSTOMER (t) and (∃d) (LOAN(d)) and t.AC_NO = d.AC_NO}
Query : find the name and account number of the relation CUSTOMER and LOAN who
have no loan account
{t.NAME, t.AC_NO | CUSTOMER(t) and (∀d) (NOT (LOAN(d)) OR NOT (t.AC_NO = d.AC_NO))}
The domain relational calculus uses domain variables that take values from an
attribute's domain, rather than values for an entire tuple.
The domain relational calculus is in the form of
{<x1, x2, …, xn>|COND(x1, x2, ……, xn)}
Where,
x1, x2, …….., xn represents the domain variable
COND represents the formula composed of atoms.
CUSTOMER (AC_NO, NAME, AGE, GENDER, ADDRESS)    LOAN (AC_NO, BRANCH, AMOUNT)
Find the NAME, AC_NO, BRANCH and AMOUNT where BRANCH = “Gunupur”.
{a b y z | (∃c) (∃d) (∃e) (∃x) (CUSTOMER(a b c d e) AND LOAN(x y z) AND y = “Gunupur” AND a = x)}
QBE queries are expressed “by example” instead of giving a procedure for obtaining
the desire answer, the user gives an example of what is desired.
The queries in QBE are expressed by skeleton tables which show the relation schema.
The user selects the skeletons needed for a given query and fills in the skeletons with
example rows.
An example row consists of constants and example elements, which are domain
variables; QBE uses an underscore character before a domain variable, as in _x.
Query: Print the name of the customer whose age is greater than 50.
Query: find the account number and address where the account is jointly on Ram and
Shyam.
It is very difficult to express all the constraints on the domain variables within the
skeleton tables.
To overcome this difficulty, QBE includes a condition box feature that allows the logical
expression to appear in a condition box.
find the account numbers of the customers with an amount between 25000 and 30000
but not exactly 28000.
AC_NO    BRANCH    AMOUNT
P.                 _x
CONDITION BOX: _x = (>= 25000 AND <= 30000 AND ¬ 28000)
E.F. Codd (Edgar Frank Codd) of IBM had written an article “A relational model for
large shared data banks” in June 1970 in the Association of Computer Machinery (ACM)
Journal
One of the most significant implementations of the relational model was “System R,”
which was developed by IBM during the late 1970s. System R was intended as a “proof
of concept” to show that relational database systems could really be built and work
efficiently. It gave rise to major developments such as a structured query language called
SQL, which has since become an ISO standard and the de facto standard relational
language.
Various commercial relational DBMS products were developed during the 1980s such
as DB2, SQL/DS, and Oracle. In relational data model the data are stored in the form of
tables.
CODD’S 12 Rules:
In 1985, Codd published a list of rules that became a standard way of evaluating a
relational system. After publishing the original article, Codd stated that there are no
systems that satisfy every rule. Nevertheless, the rules represent the relational ideal and
remain a goal for relational database designers.
Rule-2: Guaranteed Access Rule: Each and every datum (atomic value) in a relational
database is guaranteed to be logically accessible by resorting to a combination of table
name, primary key value, and column name:
• Every data element should be unambiguously accessible.
CODD’S 12 Rules:
Rule-3: Systematic Treatment of Null Values: Null values (distinct from the empty character string or a string of
blank characters and distinct from zero or any other number) are supported in fully relational DBMS for
representing missing information and inapplicable information in a systematic way, independent of data type.
Rule-4: Dynamic On-line Catalogue Based on the Relational Model: The database description is represented
at the logical level in the same way as ordinary data, so that authorized users can apply the same relational
language to its interrogation as they apply to the regular data:
• The database description should be accessible to the users.
Rule-5: Comprehensive Data Sublanguage Rule: A relational system may support several languages and
various modes of terminal use (for example the fill-in-the-blanks mode). However, there must be at least one
language whose statements are expressible, per some well-defined syntax, as character strings and whose
ability to support all the following is comprehensive: data definition, view definition, data manipulation
(interactive and by program), integrity constraints, and transaction boundaries:
• A database supports a clearly defined language to define the database, view the definition, manipulate the
data, and restrict some data values to maintain integrity.
CODD’S 12 Rules:
Rule-6: View Updating Rule: All views that are theoretically updatable are also
updatable by the system:
• Data should be able to be changed through any view available to the user.
Rule-7: High-level Insert, Update, and Delete: The capacity of handling a base relation
or a derived relation as a single operand applies not only to the retrieval of data but
also to the insertion, update, and deletion of data:
• All records in a file must be able to be added, deleted, or updated with singular commands
There is one more rule, called Rule Zero, which states that “For any system that is claimed
to be a relational database management system, that system must be able to manage
data entirely through its relational capabilities.”
Example
Consider the employee relation, which is characterized by the attributes, employee ID, employee name,
employee age, employee experience, employee salary, etc. In this employee relation:
Superkeys can be employee ID, employee name, employee age, employee experience, etc.
Candidate keys can be employee ID, employee name, employee age (assuming each of these values is unique for every employee).
Primary key is employee ID.
Note: If we declare a particular attribute as the primary key, then that attribute value cannot be NULL.
Also it has to be distinct.
Foreign Key
A foreign key is a set of fields or attributes in one relation that is used to “refer” to a tuple in another relation.
Relational Integrity:
Data integrity constraints refer to the accuracy and correctness of data in the database. Data integrity
provides a mechanism to maintain data consistency for operations like INSERT, UPDATE, and DELETE. The
different types of data integrity constraints are Entity, NULL, Domain, and Referential integrity.
Entity Integrity:
Entity integrity implies that a primary key cannot accept null value. The primary key of the relation
uniquely identifies a row in a relation. Entity integrity means that in order to represent an entity in the
database it is necessary to have a complete identification of the entity’s key attributes.
Null Integrity:
Null implies that the data value is not known temporarily. Consider the relation PERSON. The attributes of
the relation PERSON are name, age, and salary. The age of the person cannot be NULL.
Domain Integrity Constraint:
The domain integrity constraints are used to specify the valid values that a column defined over the domain
can take. We can define the valid values by listing them as a set of values (such as an enumerated data
type in a strongly typed programming language), a range of values, or an expression that accepts the
valid values.
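These constraint types can be declared directly in SQL; a minimal sketch for the PERSON relation mentioned above (the data types and the 18–60 age range, borrowed from the earlier employee example, are assumptions):
CREATE TABLE person (
  name   VARCHAR(50) PRIMARY KEY,                     -- entity integrity: key value must exist and be unique
  age    INT NOT NULL CHECK (age BETWEEN 18 AND 60),  -- null integrity + domain integrity
  salary DECIMAL(10,2)                                -- may be NULL (value not known temporarily)
);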
Relational Integrity:
Referential Integrity:
In the relational data model, associations between tables are defined through the use
of foreign keys. The referential integrity rule states that a database must not contain any
unmatched foreign key values.
It is to be noted that referential integrity rule does not imply a foreign key cannot be
null.
There can be situations where a relationship does not exist for a particular instance, in
which case the foreign key is null.
A referential integrity is a rule that states that either each foreign key value must
match a primary key value in another relation or the foreign key value must be null.
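A minimal SQL sketch of this rule (names taken from the EMPLOYEE/DEPARTMENT example used later in these notes; data types assumed):
CREATE TABLE department (
  dnumber INT PRIMARY KEY,
  dname   VARCHAR(30)
);
CREATE TABLE employee (
  eno     INT PRIMARY KEY,
  ename   VARCHAR(30),
  dnumber INT REFERENCES department(dnumber)  -- must match a department, or be NULL
);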
Database design process
Efficiency: Efficiency is generally considered to be the most important. The design should
make full and efficient use of the facilities provided. If the database is made online,
then the users should interact with the database without any time delay.
Integrity: The term integrity means that the database should be as accurate as possible.
Privacy: The database should not allow unauthorized access to files.
Security: The database, once loaded, should be safe from physical corruption whether
from hardware or software failure or from unauthorized access.
Implementation: The conceptual model should be simple and effective so that mapping
from conceptual model to logical model is easy.
Flexibility: The database should not be implemented in a rigid way that assumes the
business will remain constant forever. Changes will occur and the database must be
capable of responding readily to such change.
EMPLOYEE (ENAME, ENO, DOB, ADDRESS, DNUMBER)    DEPARTMENT (DNAME, DNUMBER, DMGRENO)
Avoid relations that contain matching attributes that are not primary key or foreign key
combinations, because joining on such attributes may produce spurious tuples.
Functional dependencies are the relationships among the attributes within a relation. Functional dependencies
provide a formal mechanism to express constraints between attributes.
If attribute A functionally depends on attribute B, then for every instance of B you will know the respective
value of A.
Attribute B is functionally dependent upon attribute A (or a collection of attributes) if each value of A
determines exactly one value of B; functional dependency helps to identify how
attributes are related to each other.
Suppose a relation/table R contains two attributes X and Y. We say that Y is functionally dependent upon
attribute X if and only if X uniquely determines the value of Y.
Example
Let us consider the following example:
EMPLOYEE (ENAME, ENO, DOB, ADDRESS, DNUMBER)
Here, by using the key attribute ENO we can derive the other attributes ENAME, ADDRESS and DOB, and by
using ENO and DOB we can derive the attribute AGE.
It can be represented by
FD1: ENO → {ENAME, ADDRESS, DOB}
FD2: {ENO, DOB} → AGE
[FD diagram: EMPLOYEE(ENAME, ENO, ADDRESS, DOB, AGE) with FD1 and FD2 drawn as horizontal lines over the attributes]
Each FD is displayed as a horizontal line. The left hand side attributes of the FD are connected by vertical
lines to the lines represented the FD, while the right-hand side attributes are connected by arrows pointing
towards the attributes.
Definition :- A functional dependency, denoted by X→Y, between two sets of attributes X and Y that are
subsets of R, specifies the constraint that for any two tuples t1 and t2 with t1[X] = t2[X], they must
also have t1[Y] = t2[Y].
FULL FUNCTIONAL DEPENDENCY:-
A functional dependency (FD) X→Y is a full functional dependency (FFD) if removal of any attributes A
from X means that the dependency does not hold any more.
{ENO, DOB} → AGE
Here, the combination of ENO and DOB uniquely identifies the AGE of the employee; in this example, AGE cannot be
determined by the use of DOB or ENO only.
PARTIAL FUNCTIONAL DEPENDENCY:-
A Functional dependency (FD) X→Y is partial functional dependency if some attributes A ∈ X can be
removed from X and the dependency still holds.
π title, rating(Movies)
π actor(Actors) ∪ πdirector(Directors)
Solutions Using Relational Algebra
e1 = π title (σ actor = ‘McDormand’ (Acts))
e2 = π title (σ director = ‘Coen’ (Movies))
result = e1 ∩ e2
σ actor = ‘Maguire’ (Acts) − σ actor = ‘McDormand’ (Acts)
e1 = ρ T(title2) (π title (σ director = ‘Coen’ (Movies)))
Query: Find (director, actor) pairs where the director is younger than the actor
π title (σ director = ‘Coen’ (Movies))
Suppose F is a set of functional dependencies specified on relation schema R. It is not possible to
specify all possible functional dependencies for a given situation.
For example if each department has one manager so that DEPT_NO uniquely determines
MANAGER_ENO (DEPT_NO →MGR_ENO), and a manager has a unique phone number called
MGR_PHONE (MGR_ENO→MGR_PHONE), then these two dependencies together imply that
DEPT_NO →MGR_PHONE.
This is an inferred FD and need not be explicitly stated in addition to the two given FDs.
Therefore formally it is useful to define a concept called closure that includes all possible dependencies that
can be inferred from the given set F.
DEFINITION
The set of all dependencies that includes F as well as all dependencies that can be inferred from F is called
the closure of F; it is denoted by F+.
To determine a systematic way to infer dependencies, we must discover a set of inference rules that can be
used to infer new dependencies from a given set of dependencies F.
ARMSTRONG’S AXIOMS OR INFERENCE RULES: -
IR1(REFLEXIVE RULE): -
For a relation R having attributes X and Y.
If X ⊇ Y, then X →Y
Proof: - Suppose X ⊇ Y and two tuples t1 and t2 exist in some relation,
such that, t1[X]=t2[X] then, t1[Y]=t2[Y],
because X ⊇ Y
Hence, X →Y must hold in this relation.
IR2(AUGMENTATION RULE): -
For a relation R having attributes X and Y
If {X→Y} then XZ → YZ
Proof: - Assume t1[XZ]=t2[XZ] (1)
Then t1[X]=t2[X] and t1[Z]=t2[Z] (2)
Since X→Y, from t1[X]=t2[X] we get t1[Y]=t2[Y] (3)
From equations 2 and 3 we can say that t1[YZ]=t2[YZ]
So from the above we can say that if {X→Y}, then XZ → YZ.
IR3(TRANSITIVE RULE): -
For a relation R having attributes X, Y and Z,
If {X →Y, Y→Z} then, X→Z
Proof: -
Assume that X →Y and Y→Z both hold in relation R. Then for any two tuples t1 and t2 with t1[X]=t2[X],
we have t1[Y]=t2[Y] (by X→Y),
and hence t1[Z]=t2[Z] (by Y→Z).
Hence, X→Z must hold in relation R.
ARMSTRONG’S AXIOMS OR INFERENCE RULES: -
IR4 (DECOMPOSITION / PROJECTIVE RULE): -
For a relation R having attributes X and Y, if {X→YZ} then X→Y
Proof: -
Given, X→YZ (1)
YZ→Y from IR1 (2)
Thus from (1) and (2), by IR3, X→Y
IR5 (UNION / ADDITIVE RULE): -
For a relation R, if {X→Y, X→Z} then X→YZ
Proof: -
Given X→Y and X→Z (1)
X→XY from IR2, augmenting X→Y with X (2)
XY→YZ from IR2, augmenting X→Z with Y (3)
X→YZ using IR3 and equations (2) and (3)
ARMSTRONG’S AXIOMS OR INFERENCE RULES: -
A systematic way to determine the additional functional dependencies is first to determine each set of
attributes X that appears on the left-hand side of some functional dependency in F, and then to determine the
set of all attributes that are dependent on X.
Thus, for each set of attributes X, we determine the set X+ of attributes that are functionally determined by
X based on F, X+ is called the closure of X under F.
Example-
F = ENO → ENAME
DNO →{DNAME, DLOCATION}
X+ (CLOSURE SET) ={ ENO} +={ENO, ENAME}
{DNO} +={DNO, DNAME, DLOCATION}
Exercise 1: R(ABCDEF) with FDs:
AB→C
BC→AD
D→E
CF→B
Find (AB)+.
Exercise 2: R(ABCDEFGH) with FDs:
A→BC
CD→E
E→C
D→AEH
ABH→BD
DH→BC
Does BCD→H hold?
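A worked check (not in the original slides): for Exercise 1, start with {A, B}; AB→C adds C, BC→AD adds D, and D→E adds E, while CF→B never applies because F is missing. So (AB)+ = {A, B, C, D, E}. For Exercise 2, compute (BCD)+: CD→E adds E and D→AEH adds A and H, giving (BCD)+ = {A, B, C, D, E, H}; since H is in the closure, BCD→H holds.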
COVER
COVERED BY FUNCTIONAL DEPENDENCY: -
A set of functional dependencies F is said to cover another set of functional dependencies E
if every FD in E is also in F+; that is, if every dependency in E can be inferred from F, then
we say that E is covered by F.
EQUIVALENT: -
Two sets of functional dependencies E and F are said to be equivalent if E+ = F+. Hence
equivalence means that every FD in E can be inferred from F and every FD in F can be
inferred from E; that is, E is equivalent to F if both the conditions E covers F and F covers E
hold.
EQUIVALENCE OF FUNCTIONAL DEPENDENCY
Exercise 1: R(ACDEH)
F = {A→C, AC→D, E→AD, E→H}
G = {A→CD, E→AH}
Determine which holds: F ⊇ G, F ⊆ G, F = G, or F ≠ G.
Exercise 2:
F = {A→B, AB→C, D→AC, D→E}
G = {A→BC, D→AB}
Determine which holds: F ⊇ G, F ⊆ G, F = G, or F ≠ G.
Whenever a user updates the database, the system must check whether any of the functional dependencies
are getting violated in this process. If there is a violation of dependencies in the new database state, the
system must roll back. Working with a huge set of functional dependencies can cause unnecessary added
computational time. This is where the canonical cover comes into play.
A canonical cover of a set of functional dependencies F is a simplified set of functional dependencies that
has the same closure as the original set F.
Canonical cover: A canonical cover Fc of a set of functional dependencies F such that ALL the following
properties are satisfied:
Each left side of a functional dependency in Fc is unique. That is, there are no two dependencies α1→β1
and α2→β2 in Fc such that α1 = α2.
Problem-
The following functional dependencies hold true for the relational scheme R(W, X, Y, Z):
X → W
WZ → XY
Y → WXZ
Find the canonical cover.
Step-01:
Write each functional dependency with a single attribute on its right side:
X → W
WZ → X
WZ → Y
Y → W
Y → X
Y → Z
Step-02:
Check each functional dependency for redundancy by comparing the closure of its left side with and without it.
For WZ → X:
Considering WZ → X, (WZ)+ = { W , X , Y , Z }
Ignoring WZ → X, (WZ)+ = { W , X , Y , Z }
Clearly, the two results are the same.
Thus, we conclude that WZ → X is non-essential and can be eliminated.
Eliminating WZ → X, our set of functional dependencies reduces to: X → W, WZ → Y, Y → W, Y → X, Y → Z.
Now, we will consider this reduced set in further checks.
For WZ → Y:
Considering WZ → Y, (WZ)+ = { W , X , Y , Z }
Ignoring WZ → Y, (WZ)+ = { W , Z }
Clearly, the two results are different.
Thus, we conclude that WZ → Y is essential and can not be eliminated.
For Y → W:
Considering Y → W, (Y)+ = { W , X , Y , Z }
Ignoring Y → W, (Y)+ = { W , X , Y , Z }
Clearly, the two results are the same.
Thus, we conclude that Y → W is non-essential and can be eliminated. The set reduces to: X → W, WZ → Y, Y → X, Y → Z.
For Y → X:
Considering Y → X, (Y)+ = { W , X , Y , Z }
Ignoring Y → X, (Y)+ = { Y , Z }
Clearly, the two results are different.
Thus, we conclude that Y → X is essential and can not be eliminated.
For Y → Z:
Considering Y → Z, (Y)+ = { W , X , Y , Z }
Ignoring Y → Z, (Y)+ = { W , X , Y }
Clearly, the two results are different.
Thus, we conclude that Y → Z is essential and can not be eliminated.
From here, our essential functional dependencies are: X → W, WZ → Y, Y → X, Y → Z.
Step-03:
Consider the functional dependencies having more than one attribute on their left side and check if their left side can be reduced.
In our set, only WZ → Y contains more than one attribute on its left side.
Considering WZ → Y, (WZ)+ = { W , X , Y , Z }
Now, consider all the possible subsets of WZ and check if the closure result of any subset matches the closure result of WZ:
(W)+ = { W }
(Z)+ = { Z }
Clearly, none of the subsets has the same closure result as that of the entire left side.
Thus, we conclude that we can not write WZ → Y as W → Y or Z → Y.
Thus, the set of functional dependencies obtained in Step-02 is the canonical cover.
Finally, the canonical cover is-
X → W
WZ → Y
Y → X
Y → Z
Canonical Cover of Functional Dependencies:
Example1:
Consider the following set F of functional dependencies:
F= {
A→BC
B→C
A→B
AB→C
}
Example2:
Consider another set F of functional dependencies:
F={
A→BC
CD→E
B→D
E→A
}
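As a hedged check of these examples (solutions are not given on the original slide): in Example 1, A→B follows from A→BC, AB→C follows from B→C (so A is extraneous there), and C is extraneous in A→BC because A→C follows from A→B and B→C; this leaves Fc = {A→B, B→C}. In Example 2, one can verify that no FD is redundant and no attribute is extraneous, so F is already its own canonical cover.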
KEYS
A super key is a set of attributes that can determine every attribute of the relation.
If a proper subset of a super key is itself a super key, then the larger set is not a candidate key.
Example: R(ABCD)
FD       Super Key?  Candidate Key?
A→BCD    √           √
AB→CD    √           X (A alone is already a super key)
STUDENT
ROLL NO  NAME  AGE  BRANCH ID  BRANCH NAME  HOD NAME  HOD PH
1        A     20   121        CS           XYZ       123
2        B     21   121        CS           XYZ       123
3        C     19   121        CS           XYZ       123
Normalization is the process of analyzing the given relation schemas based on their functional
dependencies and primary keys to achieve the desired properties like
Minimizing redundancy
Minimizing the insertion, deletion and update anomalies.
A relation is said to be in BCNF if and only if all the determinants are candidate keys.
Example: R(A B C) with FDs AB→C and C→B is in 1NF, 2NF and 3NF, but not in BCNF, because the determinant C is not a candidate key.
Restrictions removed at each normal form (P = prime attribute, NP = non-prime attribute):
2NF: P→NP (partial dependencies)
3NF: NP→NP (transitive dependencies)
BCNF: P/NP→P (any dependency whose determinant is not a candidate key)
Example: R(A B C) with AB→C and C→B is decomposed into R1(A C) and R2(C B).
R:
A  B  C
a  1  x
b  2  y
c  2  z
c  3  w
d  3  w
e  3  w

R1:      R2:
A  C     C  B
a  x     x  1
b  y     y  2
c  z     z  2
c  w     w  3
d  w
e  w
Exercise: R(A B C D E F G H I J) with FDs:
AB→C
A→DE
B→F
F→GH
D→IJ
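The slide does not state the task, but a typical exercise here is to find a candidate key (a hedged worked answer): (AB)+ starts with {A, B}; A→DE adds D and E, B→F adds F, F→GH adds G and H, D→IJ adds I and J, and AB→C adds C. Thus (AB)+ = {A, B, C, D, E, F, G, H, I, J} = R, so AB is a candidate key.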
For a good database design the database schema must satisfy the normal form by decomposing the
relation schema and must satisfy certain properties: -
Dependency preservation property
Lossless / non additive join property
Dependency preservation property: -
By using the functional dependencies, the universal relation schema R is divided/decomposed into a set of relation
schemas D = {R1, R2, ……, Rn}.
This property states that each functional dependency X→Y specified in F must either appear directly in one of the
relation schemas in the decomposition D, or be inferable from the dependencies that appear there, so that no dependency is lost.
It is sufficient that the union of the dependencies that hold on the individual relations in D be equivalent to F on R.
Suppose for a relation R = {A1, A2, ….., An}, the set of FDs is F and the decomposition is D = {R1, R2, ……, Rn}.
Then the property says that
(R1(F) ∪ R2(F) ∪ …… ∪ Rn(F))+ = F+
Examples for Decomposition Property
If a table having FD set F is decomposed into two tables R1 and R2 having FD sets F1 and F2, then:
F1 ⊆ F+
F2 ⊆ F+
(F1 ∪ F2)+ = F+
Exercise 1: R(ABC) with FDs A→B, B→C, C→A; decomposition R1(AB), R2(BC).
Exercise 2: R(ABCD) with FDs AB→CD, D→A; decomposition R1(AD), R2(BC).
Check whether each decomposition is dependency preserving.
This property ensures that no spurious tuples are generated when a natural join operation is applied to the
relations in the decomposition.
Let R be the relation schema,
F be the set of functional dependencies on R. R1 and R2 be the decomposition of R
r be the relation instance with schema R
Then we can say that the decomposition is a loss less decomposition if
R1 ⋈ R2 = R
If we project r onto R1 and R2 and compute the natural join of the projection results, we get back exactly r.
The decomposition that is not a lossless decomposition is called as lossy decomposition.
This property ensures that the extra or less tuple generation problem doesn’t occur after decomposition.
If a relation R is decomposed into two relations R1 and R2, then the decomposition will be lossless iff:
Attribute(R1) ∪ Attribute(R2) = Attribute(R)
Attribute(R1) ∩ Attribute(R2) ≠ Φ
Attribute(R1) ∩ Attribute(R2) → Attribute(R1)
OR
Attribute(R1) ∩ Attribute(R2) → Attribute(R2)

Example: R(A B C D)
A  B  C  D
1  a  p  x
2  b  q  y

Example: R(A B C) decomposed into R1(A B) and R2(B C)
R:
A  B  C
1  a  p
2  b  q
3  a  r

R1:      R2:
A  B     B  C
1  a     a  p
2  b     b  q
3  a     a  r
Exercise 1: R(A B C D E) with tuples:
A  B    C  D  E
a  122  1  p  w
b  234  2  q  x
a  568  1  r  y
c  347  3  s  z
Check which of the following decompositions are lossless:
R1(AB), R2(CD)
R1(ABC), R2(DE)
R1(ABC), R2(CDE)
R1(ABCD), R2(ACDE)
Exercise 2: R(V W X Y Z) with FDs Z→Y, Y→Z, X→YV, VW→X.
Check which of the following decompositions are lossless:
R1(VWX), R2(XYZ)
R1(VW), R2(YZ)
R1(VWX), R2(YZ)
R1(VW), R2(WXYZ)
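A worked check for one case (not in the original): for R1(VWX), R2(XYZ), the common attribute is X. Since X→YV and Y→Z, we get X+ = {X, Y, V, Z}, which contains all of Attribute(R2) = {X, Y, Z}; the third condition holds, so this decomposition is lossless.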
Fourth normal form (4NF) is a level of database normalization where there are no non-trivial multivalued
dependencies other than a candidate key. It builds on the first three normal forms (1NF, 2NF and 3NF) and the
Boyce-Codd Normal Form (BCNF). It states that, in addition to a database meeting the requirements of BCNF, it
must not contain more than one multivalued dependency.
Properties – A relation R is in 4NF if and only if the following conditions are satisfied:
It should be in the Boyce-Codd Normal Form (BCNF).
the table should not have any Multi-valued Dependency.
A table with a multivalued dependency violates the normalization standard of Fourth Normal Form (4NF)
because it creates unnecessary redundancies and can contribute to inconsistent data. To bring it up to 4NF, it
is necessary to break this information into two tables.
Fourth normal form (4NF):
Example – Consider the database table of a class which has two relations: R1 contains student ID (SID) and
student name (SNAME), and R2 contains course ID (CID) and course name (CNAME).
R1:            R2:
SID  SNAME     CID  CNAME
S1   A         C1   C
S2   B         C2   D

R1 X R2:
SID  SNAME  CID  CNAME
S1   A      C1   C
S1   A      C2   D
S2   B      C1   C
S2   B      C2   D
A relation R is in 5NF if and only if every join dependency in R is implied by the candidate keys of R. A
relation decomposed into two relations must have loss-less join Property, which ensures that no spurious or extra
tuples are generated, when relations are reunited through a natural join.
• Primary Storage − The memory storage that is directly accessible to the CPU comes under this
category. CPU's internal memory (registers), fast memory (cache), and main memory (RAM) are directly
accessible to the CPU, as they are all placed on the motherboard or CPU chipset. This storage is typically
very small, ultra-fast, and volatile. Primary storage requires continuous power supply in order to maintain its
state. In case of a power failure, all its data is lost.
• Secondary Storage − Secondary storage devices are used to store data for future use or as
backup. Secondary storage includes memory devices that are not a part of the CPU chipset or
motherboard, for example, magnetic disks, optical disks (DVD, CD, etc.), hard disks, flash drives, and
magnetic tapes.
• Tertiary Storage − Tertiary storage is used to store huge volumes of data. Since such storage
devices are external to the computer system, they are the slowest in speed. These storage devices are
mostly used to take the back up of an entire system. Optical disks and magnetic tapes are widely used as
tertiary storage.
• Memory Hierarchy
A computer system has a well-defined hierarchy of memory. A CPU has direct access to its main memory as
well as its inbuilt registers. The access time of the main memory is obviously longer than the CPU cycle time. To
minimize this speed mismatch, cache memory is introduced. Cache memory provides the fastest access time
and it contains the data that is most frequently accessed by the CPU.
Note: The memory with the fastest access is the costliest one. Larger storage devices offer slow speed and
they are less expensive, however they can store huge volumes of data as compared to CPU registers or
cache memory.
Hard disk drives are the most common secondary storage devices in present computer systems. These are
called magnetic disks because they use the concept of magnetization to store information. Hard disks
consist of metal disks coated with magnetizable material. These disks are placed vertically on a spindle. A
read/write head moves in between the disks and is used to magnetize or de-magnetize the spot under it. A
magnetized spot can be recognized as 0 (zero) or 1 (one).
File Organization
The File is a collection of records. Using the primary key, we can access the records. The type and frequency of
access can be determined by the type of file organization which was used for a given set of records.
File organization is a logical relationship among various records. This method defines how file records are mapped
onto disk blocks.
File organization is used to describe the way in which the records are stored in terms of blocks, and the blocks are
placed on the storage medium.
The first approach to map the database to files is to use several files and store only fixed-length records
in any given file. An alternative approach is to structure the files so that they can contain records of multiple lengths.
Files of fixed-length records are easier to implement than files of variable-length records.
B+ tree file organization is the advanced method of an indexed sequential access method. It uses a tree-like
structure to store records in File.
It uses the same concept of key-index where the primary key is used to sort the records. For each primary key, the
value of the index is generated and mapped with the record.
The first column is the Search key that contains a copy of the primary key or candidate key of the table.
These values are stored in sorted order so that the corresponding data can be accessed quickly.
Note: The data may or may not be stored in sorted order.
The second column is the Data Reference or Pointer which contains a set of pointers holding the address of
the disk block where that particular key value can be found.
Indexing in Databases
Dense Index:
For every search key value in the data
file, there is an index record.
This record contains the search key and
also a reference to the first data record
with that search key value.
Indexing in Databases
Sparse Index:
An index record appears for only some of the search key values in the data file; each entry points to a
block of records.
To locate a record, we find the index record with the largest search key value less than or equal to the
search key value we are looking for.
We start at the record pointed to by that index entry and proceed along the pointers in the file (that is,
sequentially) until we find the desired record.
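A minimal Python sketch of this lookup procedure (the index entries and block contents are hypothetical, and for simplicity the sequential scan stays within one block):

import bisect

# Sparse index: one (first key, block number) entry per block, keys sorted.
index_keys   = [1, 100, 200, 300]
index_blocks = [0, 1, 2, 3]
blocks = [[(1, "a"), (50, "b")], [(100, "c"), (150, "d")],
          [(200, "e"), (250, "f")], [(300, "g")]]

def sparse_lookup(key):
    # Index record with the largest search key <= the key we want.
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None                       # key precedes every indexed value
    # Proceed sequentially from the block that entry points to.
    for k, record in blocks[index_blocks[i]]:
        if k == key:
            return record
    return None

print(sparse_lookup(150))                 # -> "d"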
There are primarily three methods of indexing:
Clustered Indexing
Non-Clustered or Secondary Indexing
Multilevel Indexing
Clustered Indexing
When related records are stored together in the same file, this kind of storage is known as clustered
indexing. By using clustered indexing we can reduce the cost of searching, since multiple records related
to the same thing are stored in one place; it also supports the frequent joining of two or more
tables (records).
A clustering index is defined on an ordered data file, where the data file is ordered on a non-key field.
In some cases, the index is created on non-primary-key columns, which may not be unique for each
record. In such cases, in order to identify the records faster, we group two or more columns
together to get unique values and create an index out of them. This method is known as a
clustering index. Basically, records with similar characteristics are grouped together and indexes
are created for these groups.
For example, students studying in each semester are grouped together: 1st semester students,
2nd semester students, 3rd semester students, and so on.
With the growth of the database, indices also grow. Since the index is kept in
main memory, a single-level index might become too large to fit, forcing multiple
disk accesses. Multilevel indexing segregates the index into smaller blocks:
the outer blocks are divided into inner blocks, which in turn point to the data blocks.
The small outer level can easily be held in main memory with little overhead.
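A minimal two-level sketch of this idea in Python (keys and block numbers are hypothetical, and the search key is assumed to be no smaller than the first indexed key): the small outer index stays in main memory and narrows the search to one inner index block, which in turn names the data block to fetch from disk.

# Outer index: (first key of inner block, inner block number) - kept in memory.
outer = [(1, 0), (500, 1)]
# Inner index blocks: lists of (key, data block number) - fetched from disk.
inner = [[(1, 0), (100, 1), (300, 2)],
         [(500, 3), (700, 4)]]

def find_data_block(key):
    # At each level, pick the last entry whose key <= the search key.
    oi = max(i for i, (k, _) in enumerate(outer) if k <= key)
    inner_block = inner[outer[oi][1]]
    ii = max(i for i, (k, _) in enumerate(inner_block) if k <= key)
    return inner_block[ii][1]             # the data block to read

print(find_data_block(350))               # -> data block 2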
Clustering Index:
[Figure: an index file of (non-key attribute value, block pointer) entries pointing into the main file.]
The main file is sorted on a non-key attribute.
There is one entry in the index file for each unique value of the non-key attribute.
If the number of blocks occupied by the index file is n, then the number of block accesses required is
≥ ⌈log₂ n⌉ + 1 (a binary search over the n index blocks plus one access to the pointed-to data block);
for example, with n = 64 index blocks, at least 6 + 1 = 7 block accesses are needed.
Secondary Indexing
[Figure: a dense index file of (search key, block pointer) entries, one per record, pointing into the
unsorted main file.]
The main file is unsorted.
Secondary indexing can be done on a key as well as on a non-key attribute.
It is called secondary because normally one index (the primary one) has already been built.
It is an example of dense indexing: the number of entries in the index file equals the number of entries
in the main file.
Number of block accesses = ⌈log₂ n⌉ + 1, where n is the number of blocks occupied by the index file.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into blocks and the blocks
are distributed among disks. Each disk receives a block of data to write/read in parallel. It enhances the
speed and performance of the storage device. There is no parity and backup in Level 0.
There is no duplication of data. Hence, a block once lost cannot be recovered.
RAID 4
In this level, an entire block of data is written onto data disks and then the parity is generated and stored on
a different disk. Note that level 3 uses byte-level striping, whereas level 4 uses block-level striping. Both level
3 and level 4 require at least three disks to implement RAID.
RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity generated for a data block stripe is
distributed among all the data disks rather than being stored on a separate dedicated disk.
RAID 6
RAID 6 is an extension of level 5. In this level,
two independent parities are generated and
stored in distributed fashion among multiple
disks. Two parities provide additional fault
tolerance. This level requires at least four disk
drives to implement RAID.
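The parity idea behind levels 4 and 5 can be shown with a small Python sketch (the stripe contents are hypothetical): the parity block is the byte-wise XOR of the data blocks in a stripe, so any single lost block can be rebuilt by XOR-ing the parity with the surviving blocks. RAID 6 adds a second, independently computed parity to survive two failures.

def xor_blocks(blocks):
    # Byte-wise XOR of equal-sized blocks.
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

stripe = [b"disk0data", b"disk1data", b"disk2data"]
parity = xor_blocks(stripe)               # stored on a parity disk/stripe

# Disk 1 fails: rebuild its block from the parity and the survivors.
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]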
Then, based on the query execution plan, the code generator generates the code to execute
the plan.
Finally, the runtime database processor has the task of running the query code, whether in
compiled or interpreted mode.
If a runtime error occurs, an error message is generated by the runtime database processor.
DEPARTMENT (DNO, DNAME, DMGRENO)
PROJECT (PNO, PNAME, PLOCATION, DNUM)
Q1. Find the employee name and the department name of every employee whose address is 'GUNUPUR' and who
works under the marketing department.
Q2. Find the employee name, department name, date of birth, and project name for every project located in
'MUMBAI'.
Examples: (Solution – Q1)
The corresponding query tree will be:
[Figure, Example 1: the query tree for Q1: a projection on [ENAME, DNAME] at the root, above the join
condition [E.ENO = D.DMGRENO] over EMPLOYEE and DEPARTMENT, with the selection
[E.ADDRESS = 'GUNUPUR'] applied to EMPLOYEE.]
[Figure, Example 2: a similar tree over STUDENT, projecting [NAME, ROLL, ADDRESS] for students with
address 'GUNUPUR' in the 'MCA' course.]
The query parser will typically generate a standard initial query tree corresponding to an SQL query,
without doing any optimization.
This initial query tree, or canonical query tree, represents a relational algebra expression that is very
inefficient if executed directly, because of the CARTESIAN PRODUCT (×) operations.
Now, the job of the query optimizer is to transform the initial query tree into a final query tree that is
efficient to execute.
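For instance, assuming the Q1 attributes shown earlier (and keeping only the address selection and the join condition for brevity), the canonical tree corresponds to the first expression below; pushing the selection down and merging the Cartesian product with the selection into a join gives the far cheaper second expression:

π[ENAME, DNAME] ( σ[E.ADDRESS = 'GUNUPUR' AND E.ENO = D.DMGRENO] ( EMPLOYEE × DEPARTMENT ) )
→ π[ENAME, DNAME] ( σ[E.ADDRESS = 'GUNUPUR'] (EMPLOYEE) ⋈[E.ENO = D.DMGRENO] DEPARTMENT )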
In order to maintain consistency in a database before and after a transaction, certain properties are
followed; these are called the ACID properties.
Atomicity
By this, we mean that either the entire transaction
takes place at once or doesn’t happen at all.
There is no midway i.e. transactions do not occur
partially. Each transaction is considered as one
unit and either runs to completion or is not
executed at all. It involves the following two
operations.
Abort: If a transaction aborts, the changes it made to the database are not visible.
Commit: If a transaction commits, the changes it made are visible.
Atomicity is also known as the ‘All or nothing rule’.
If the transaction fails after the completion of T1 but before the completion of T2 (say, after write(X) but
before write(Y)), then the amount has been deducted from X but not added to Y. This results in an inconsistent
database state. Therefore, the transaction must be executed in its entirety in order to ensure the correctness
of the database state.
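A minimal sketch of such a transfer using Python's sqlite3 module (the table, account names, and amounts are hypothetical, chosen to match the consistency example that follows): the with conn: block runs both updates inside one transaction, so either both commit or both are rolled back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("X", 500), ("Y", 200)])

try:
    with conn:  # one transaction: commits on success, rolls back on any error
        conn.execute("UPDATE account SET balance = balance - 100 WHERE name = 'X'")
        conn.execute("UPDATE account SET balance = balance + 100 WHERE name = 'Y'")
except sqlite3.Error:
    pass        # the rollback has already restored the consistent state

print(dict(conn.execute("SELECT name, balance FROM account")))  # X: 400, Y: 300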
Consistency
This means that integrity constraints must be maintained so that the database is consistent before and after
the transaction. It refers to the correctness of a database.
Referring to the example above, the total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, database is consistent. Inconsistency occurs in case T1 completes but T2 fails. As a result T is
incomplete.
Durability:
This property ensures that once the transaction has completed execution, the updates and modifications to
the database are stored in and written to disk and they persist even if a system failure occurs. These updates
now become permanent and are stored in non-volatile memory. The effects of the transaction, thus, are never
lost.
The ACID properties, in totality, provide a mechanism to ensure the correctness and consistency of a database
in such a way that each transaction is a group of operations that acts as a single unit, produces consistent
results, acts in isolation from other operations, and whose updates are durably stored.
[Schedule example, reconstructed: transaction T1 adds 50 to B; transaction T2 transfers 10% of A to B.]
T1: read(B); B := B + 50; write(B)
T2: read(A); temp := A * 0.1; A := A − temp; write(A); read(B); B := B + temp; write(B)
The database system must control concurrent execution of transactions, to ensure that the database
state remains consistent.
In serializability, there are mainly two concepts involved:-
Conflict Serializability
View Serializability
Conflict Serializability:-
Let us consider a schedule S in which there are two consecutive instructions, Ii and Ij, of transactions Ti and
Tj, respectively (i ≠ j).
If Ii and Ij refer to the same data item Q, and at least one of them is a write(Q) operation, then the two
instructions conflict and their order affects the result; otherwise they can be swapped freely.
View Serializability:-
Consider two schedules S and S′, where the same set of transactions participates in both schedules. The
schedules S and S′ are said to be view equivalent if three conditions are met:-
For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must,
in schedule S′, also read the initial value of Q.
For each data item Q, if transaction Ti executes read(Q) in schedule S, and if that value was produced
by a write(Q) operation executed by transaction Tj, then the read(Q) operation of transaction Ti must, in
schedule S′, also read the value of Q that was produced by the same write(Q) operation of transaction Tj.
For each data item Q, the transaction that performs the final write(Q) operation in schedule S must
perform the final write(Q) operation in schedule S′.
If a transaction Ti fails, we need to undo the effects of this transaction to ensure the atomicity property of
the transaction. In a system that allows concurrent execution, it is also necessary to ensure that any
transaction Tj that is dependent on Ti (that is, Tj has read a data item written by Ti) is also aborted.
A recoverable schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data
item previously written by Ti, the commit operation of Ti appears before the commit operation of Tj.
Even if a schedule is recoverable, to recover correctly from the failure of transaction Ti, we may have to roll
back several transactions.
If a single transaction failure leads to a series of transaction rollbacks, it is called a cascading rollback.
A cascadeless schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data item
previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
PRECEDENCE GRAPH:-
This is the simplest and most efficient method for determining the conflict serializability of a schedule.
A precedence graph is a directed graph of a schedule.
The graph consists of a pair G = (V, E), where V is the set of vertices and E is the set of edges.
The set of vertices consists of all the transactions participating in the schedule.
The set of edges consists of all edges Ti → Tj for which one of three conditions holds:-
Ti executes write(Q) before Tj executes read(Q).
Ti executes read(Q) before Tj executes write(Q).
Ti executes write(Q) before Tj executes write(Q).
If an edge Ti → Tj exists in the precedence graph, then in any serial schedule S′ equivalent to S, Ti must
appear before Tj. Consequently, a schedule S is conflict serializable if and only if its precedence graph
contains no cycle.
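A minimal Python sketch that builds the precedence graph from a schedule using the three conditions above, then tests it for a cycle with a depth-first search (the sample schedule is hypothetical):

# A schedule as a time-ordered list of (transaction, operation, data item).
schedule = [("T1", "R", "Q"), ("T2", "W", "Q"), ("T1", "W", "Q")]

def precedence_graph(schedule):
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # Edge Ti -> Tj when a later operation of Tj conflicts with this
            # one: same data item, different transactions, at least one write.
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    visiting, done = set(), set()
    def dfs(u):
        visiting.add(u)
        for v in graph.get(u, []):
            if v in visiting or (v not in done and dfs(v)):
                return True
        visiting.discard(u)
        done.add(u)
        return False
    return any(dfs(u) for u in graph if u not in done)

edges = precedence_graph(schedule)
print(edges, "cycle:", has_cycle(edges))  # T1->T2 and T2->T1: not serializable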
When several transactions execute concurrently in the database, the system must control the interaction among
the concurrent transactions; this control is achieved through one of a variety of mechanisms called
concurrency control schemes.
The following are the concurrency control schemes:-
LOCK-BASED PROTOCOLS:-
In the concurrent execution of transactions, the data item must be accessed in a mutually exclusive
manner; that is, while one transaction is accessing a data item, no other transaction can modify that data
item.
The most common method used to implement this requirement is to allow a transaction to access a data
item only if it is currently holding a lock on that item.
COMPATIBILITY FUNCTION:-
A transaction may lock a data item in shared (S) mode or in exclusive (X) mode. The compatibility function
states which lock modes may be held on the same data item at the same time: a shared lock is compatible
with other shared locks, but an exclusive lock is compatible with no other lock.
[Example from the slide: T1 issues lock-S(B); the system grants it (grant-S(B, T1)); T1 then executes
read(B), B := B + 100, and unlock(B).]
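A minimal Python sketch of such a compatibility check (the lock-table representation is hypothetical): a requested lock is granted only if its mode is compatible with every mode currently held on the item.

# Compatibility matrix: S is compatible only with S; X with nothing.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested_mode, held_modes):
    return all(COMPATIBLE[(requested_mode, m)] for m in held_modes)

print(can_grant("S", ["S", "S"]))  # True  - readers can share the item
print(can_grant("X", ["S"]))       # False - the writer must wait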
CONCURRENCY CONTROL:-
LOCK CONVERSION:-
Lock conversion is a mechanism for upgrading a shared lock to an exclusive lock and downgrading an
exclusive lock to a shared lock.
The conversion from shared to exclusive mode is denoted by upgrade, and from exclusive to shared by
downgrade.
In two-phase locking, upgrading can take place only in the growing phase, whereas downgrading can take
place only in the shrinking phase.
System crash –
A hardware, software, or network error comes under this category; these types of failures typically
occur during the execution of a transaction. Hardware crashes are usually media failures, for example,
main memory failure.
System error –
Some operation performed during the transaction is the reason for this type of error, such as
integer overflow or division by zero. This type of failure is also known as a transaction error; it may
also occur because of erroneous parameter values or because of a logical programming error. In addition,
the user may interrupt the execution of a transaction, which may lead to its failure.
Local error: This happens when, while executing a transaction, certain conditions occur that lead to
cancellation of the transaction. A simple example is a request to debit money from an account with
insufficient balance, which leads to cancellation of the request or transaction.
Concurrency control enforcement: The concurrency control method may decide to abort a transaction, to be
restarted later, because it violates serializability or because several transactions are in a deadlock.
Disk failure: This type of failure occurs when a disk loses some of its data because of a read or write
malfunction or because of a disk read/write head crash. This may happen during a read/write operation of
the transaction.
There are both automatic and non-automatic ways of backing up data and recovering from failure
situations. The techniques used to recover data lost due to system crashes, transaction errors, viruses,
catastrophic failures, incorrect command execution, etc. are called database recovery techniques. To prevent
data loss, recovery techniques based on deferred update and immediate update, or on backing up data, can
be used.
Recovery techniques are heavily dependent upon the existence of a special file known as the system log. It
contains information about the start and end of each transaction and any updates which occur within the
transaction.
The log keeps track of all transaction operations that affect the values of database items. This information is
needed to recover from transaction failure.
The log is kept on disk. The main log entry types are:
start_transaction(T): This log entry records that transaction T starts execution.
read_item(T, X): This log entry records that transaction T reads the value of database item X.
write_item(T, X, old_value, new_value): This log entry records that transaction T changes the value of database item
X from old_value to new_value. The old value is sometimes known as the before-image of X, and the new value as
the after-image of X.
commit(T): This log entry records that transaction T has completed all accesses to the database successfully and its
effects can be committed (recorded permanently) to the database.
abort(T): This records that transaction T has been aborted.
checkpoint: A checkpoint is a mechanism whereby all the previous logs are removed from the system and stored
permanently on a storage disk. A checkpoint declares a point before which the DBMS was in a consistent state, and all
the transactions were committed.
Database Recovery Techniques in DBMS
There are two major techniques for recovery from non-catastrophic transaction failures: deferred updates
and immediate updates.
Deferred update – This technique does not physically update the database on disk until a transaction has reached
its commit point. Before reaching commit, all transaction updates are recorded in the local transaction workspace. If a
transaction fails before reaching its commit point, it will not have changed the database in any way, hence UNDO is
not needed. It may be necessary to REDO the effects of the operations recorded in the local transaction workspace,
because their effects may not yet have been written to the database.
Immediate update – In the immediate update, the database may be updated by some operations of a transaction
before the transaction reaches its commit point. However, these operations are recorded in a log on disk before they
are applied to the database, making recovery still possible. If a transaction fails to reach its commit point, the effects
of its operations must be undone, i.e. the transaction must be rolled back; hence we require both UNDO and REDO,
as in the sketch below.
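A minimal Python sketch of log-based recovery under the immediate-update rule (the log follows the write_item(T, X, old_value, new_value) format described above; the sample values are hypothetical): committed transactions are REDOne scanning forward, and uncommitted ones are UNDOne scanning backward.

log = [
    ("start", "T1"), ("write", "T1", "X", 500, 400), ("commit", "T1"),
    ("start", "T2"), ("write", "T2", "Y", 200, 300),   # T2 never commits
]

def recover(log, db):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:                        # REDO forward: re-apply new values
        if rec[0] == "write" and rec[1] in committed:
            db[rec[2]] = rec[4]
    for rec in reversed(log):              # UNDO backward: restore old values
        if rec[0] == "write" and rec[1] not in committed:
            db[rec[2]] = rec[3]
    return db

print(recover(log, {"X": 400, "Y": 300}))  # -> {'X': 400, 'Y': 200}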
A deadlock prevention protocol is used to ensure that the system will never enter a deadlock state.
There are two approaches to deadlock prevention:-
The first approach ensures that no cyclic wait can occur by ordering the requests for locks: each
transaction locks all its data items before it begins execution, and either all are locked in one step or none
are locked.
The second approach for preventing deadlock is to use preemption and transaction rollbacks. In
preemption, when a transaction T2 requests a lock that transaction T1 holds, the lock granted to T1 may
be preempted by rolling back T1 and granting the lock to T2. To control the preemption, we assign a
unique timestamp to each transaction; the system uses these timestamps to decide whether a transaction
should wait or roll back.
Two different deadlock prevention schemes using timestamps have been proposed:-
The wait-die scheme is a non-preemptive technique. When a transaction Ti requests a
data item currently held by Tj, Ti is allowed to wait only if it has a timestamp smaller than
that of Tj (that is, Ti is older than Tj); otherwise, Ti is rolled back (dies).
The wound-wait scheme is a preemptive technique and is the counterpart of the wait-die
scheme. When a transaction Ti requests a data item currently held by Tj, Ti is allowed to
wait only if it has a timestamp larger than that of Tj (that is, Ti is younger than
Tj); otherwise, Tj is rolled back (wounded by Ti).
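A minimal Python sketch of the two decision rules (timestamps are hypothetical; a smaller timestamp means an older transaction):

def wait_die(ts_requester, ts_holder):
    # Non-preemptive: an older requester waits, a younger requester dies.
    return "wait" if ts_requester < ts_holder else "roll back requester"

def wound_wait(ts_requester, ts_holder):
    # Preemptive: an older requester wounds (rolls back) the holder,
    # a younger requester waits.
    return "roll back holder" if ts_requester < ts_holder else "wait"

# Ti (timestamp 5) requests an item held by Tj (timestamp 10): Ti is older.
print(wait_die(5, 10))     # -> wait
print(wound_wait(5, 10))   # -> roll back holder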
Deadlock detection:-
We can allow the system to enter a deadlock state and then try to recover by using a deadlock detection
and deadlock recovery scheme.
An algorithm that examines the state of the system is invoked periodically to determine whether a deadlock
has occurred. If one has, then the system must attempt to recover from the deadlock.
Deadlocks can be described in terms of a directed graph called a wait-for graph.
The wait-for graph consists of a pair G = (V, E), where V is a set of vertices and E is a set of edges. The set of
vertices consists of all the transactions in the system.
When transaction Ti requests a data item currently being held by transaction Tj, the edge Ti → Tj is
inserted into the wait-for graph.
A deadlock exists in the system if and only if the wait-for graph contains a cycle. Each transaction involved
in the cycle is said to be deadlocked.
To detect deadlocks, the system needs to maintain the wait-for graph and periodically invoke an
algorithm that searches for a cycle in the graph.
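A minimal Python sketch of detection over a wait-for graph (the graph is hypothetical, and for simplicity each transaction is assumed to wait on a single edge; a full detector would search every outgoing edge, as in the precedence-graph sketch earlier):

# waits_for[Ti] = transactions holding items that Ti is waiting for.
waits_for = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}

def find_cycle(waits_for):
    for start in waits_for:
        path, node = [], start
        while node in waits_for and node not in path:
            path.append(node)
            node = next(iter(waits_for[node]))   # follow one outgoing edge
        if node in path:
            return path[path.index(node):]       # the deadlocked transactions
    return None

print(find_cycle(waits_for))   # -> ['T1', 'T2', 'T3']: roll back a victim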