DBMS Notes 4 - TutorialsDuniya.COM
DataBase Management System Notes
Contributor: Abhishek Sharma [Founder at TutorialsDuniya.com]
UNIT-4
SCHEMA REFINEMENT and NORMAL FORMS
Syllabus: Problems caused by redundancy, Decompositions, problems related to decomposition, reasoning about FDs, FIRST, SECOND, THIRD Normal forms, BCNF, Lossless join Decomposition, Dependency preserving Decomposition, Schema refinement in Database Design, Multivalued Dependencies, FOURTH Normal Form.
Schema refinement is an approach based on decompositions. Redundant storage of information (i.e., duplication of data) is the main cause of problems; this redundancy is eliminated by decomposing the relation.
Redundancy means storing the same information repeatedly. Storing the same data in more than one place within a database can lead to several problems, such as:
1) Redundant Storage: Some tuples or pieces of information are stored repeatedly.
2) Update Anomalies: If updating one row (or record) requires the DBMS to update more than one similar row, an update anomaly can occur.
For example, if we update the department name for those who are getting a salary of 40000, more than one row of the employee table must be updated; otherwise an update anomaly results.
3) Insertion Anomalies: An insertion anomaly occurs when a record that already exists can be inserted again, duplicating the stored information.
4) Deletion Anomalies: A deletion anomaly occurs when deleting one record removes (or requires removing) more records than the specified one.
Decomposition: A relation schema R can be replaced by two or more relation schemas that each contain a subset of the attributes of R and together include all the attributes of R.
(or)
In simple words, "The process of breaking larger relations into smaller relations is known as decomposition".
Consider the Hourly_emps relation above, which can be decomposed into two relations.
Hourly_emps (eno, ename, salary, rating, hourly_wages, hours_worked)
This is decomposed into
Hourly_emps2 (eno, ename, salary, rating, hours_worked) and
Wages (rating, hourly_wages)
Hourly_emps2:
Eno | ename  | salary | rating | hours_worked
111 | suresh | 25000  | 8      | 40
222 | eswar  | 30000  | 8      | 30

Wages:
rating | hourly_wages
8      | 10
5      | 7
The answer to the second question is that a number of normal forms exist. Every relation schema is in one of these normal forms, and these normal forms help us decide whether to decompose a relation schema further or not.
The disadvantage of decomposition is that it forces us to join the decomposed relations in order to answer queries over the original relation.
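To make that trade-off concrete, here is a minimal SQL sketch of the decomposition above (the column data types are assumptions, not part of the notes); answering a query about an employee's wages now requires joining the two tables on rating:

CREATE TABLE Hourly_emps2 (
    eno          INT PRIMARY KEY,
    ename        VARCHAR(30),
    salary       DECIMAL(10,2),
    rating       INT,
    hours_worked INT
);

CREATE TABLE Wages (
    rating       INT PRIMARY KEY,
    hourly_wages DECIMAL(6,2)
);

-- Recovering the original Hourly_emps information needs a join:
SELECT e.eno, e.ename, e.salary, e.rating, w.hourly_wages, e.hours_worked
FROM   Hourly_emps2 e
JOIN   Wages w ON w.rating = e.rating;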
A relation is a named, two-dimensional table of data with named columns and unnamed rows. For example, a relation named Employee contains the attributes emp-id, ename, dept name and salary:

Emp-id | ename   | dept name    | salary
100    | Simpson | Marketing    | 48000
140    | Smith   | Accounting   | 52000
110    | Lucero  | Info-systems | 43000
190    | Davis   | Finance      | 55000
150    | Martin  | Marketing    | 42000
What are the Properties of relations?:
The properties of relations are defined on two-dimensional tables. They are:
Each relation (or table) in a database has a unique name.
An entry at the intersection of each row and column is atomic (single-valued); there can be no multivalued attributes in a relation.
Each row is unique; no two rows in a relation are identical.
Each attribute or column within a table has a unique name.
The sequence of columns (left to right) is insignificant; the columns of a relation can be interchanged without changing the meaning or use of the relation.
The sequence of rows (top to bottom) is insignificant. As with columns, the rows of a relation may be interchanged or stored in any sequence.
Q. Removing multi-valued attributes from tables: The "second property of relations" above is applied to this table: there can be no multivalued attributes in a relation. This rule is applied to the table or relation to eliminate the one or more multivalued attributes. Consider the following example; the employee table contains 6 records. In it, the course title has multiple values. Employee 100 has taken two courses, vc++ and ms-office. Record 150 did not take any course, so it is null.

Emp-id | name    | dept-name | salary | course_title
100    | Krishna | cse       | 20000  | vc++, msoffice
Now, these multi-valued attributes are eliminated, as shown in the following employee2 table.

Emp-id | name       | dept-name | salary | course_title
100    | Krishna    | cs        | 20000  | vc++
100    | Krishna    | cs        | 20000  | MSoffice
140    | Rajasekhar | cs        | 18000  | C++
140    | Rajasekhar | cs        | 18000  | DBMS
140    | Rajasekhar | cs        | 18000  | DS
Partial functional dependency: A partial functional dependency is a functional dependency in which one or more non-key attributes are functionally dependent on part (but not all) of the primary key. Consider the following graphical representation, in which some of the attributes depend only partially on the primary key.
In this example, Ename, Dept_name and Salary are fully functionally dependent on the primary key Emp_id, but Course_title and Date_completed are only partially functionally dependent on it. This partial functional dependency creates redundancy in the relation.
Q. What is Normal Form? What are the steps in Normalization?
NORMALIZATION: Normalization is the process of decomposing relations to produce smaller, well-structured relations.
To produce smaller and well-structured relations, the user needs to follow the six normal forms.
Steps in Normalization:
A normal form is a state of a relation that results from applying simple rules regarding functional dependencies (relationships between attributes) to that relation.
1) First Normal Form: Any multi-valued attributes (also called repeating groups) have been removed.
2) Second Normal Form: Any partial functional dependencies have been removed.
Differences between normalized and un-normalized relations:
1) A normalized relation (table) does not contain repeating groups, whereas an un-normalized relation (table) contains one or more repeating groups.
2) A normalized relation has a primary key; there is no primary key present in an un-normalized relation.
3) Normalization removes the repeating groups which occur many times in a table.
4) With the help of the normalization process, we can transform an un-normalized table into First Normal Form (1NF) by removing repeating groups from the un-normalized table.
5) Normalized relations (tables) give simpler results, whereas un-normalized relations give more complicated results.
6) Normalized relations improve storage efficiency, data integrity and scalability, whereas un-normalized relations do not improve storage efficiency and data integrity.
7) Normalization results in database consistency and flexible data access.
Q. FIRST NORMAL FORM (1NF): A relation is in first normal form (1NF) if it contains no multi-valued attributes. Consider the example employee relation, which contains multi-valued attributes that are removed and converted into single-valued attributes.

Multi-valued attributes in course_title:
Emp-id | name    | dept-name | salary | course_title
100    | Krishna | cse       | 20000  | vc++, msoffice
140    | Raja    | it        | 18000  | C++, DBMS, DS
Removing the multi-valued attributes and converting them into single-valued attributes puts the relation into First Normal Form.
SECOND NORMAL FORM (2NF): A relation is in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on the primary key. In a functional dependency X → Y, the left-hand side X is the primary key of the relation and the right-hand side Y is the non-key attributes. In some situations, some non-key attributes are only partially functionally dependent on the primary key. Consider the following example of a partial functional dependency.
In this example, Ename, Dept_name and Salary are fully functionally dependent on the primary key Emp_id, but Course_title and Date_completed are only partially functionally dependent on it. This partial functional dependency creates redundancy in the relation.
To avoid this, we convert the relation into Second Normal Form. The 2NF conversion decomposes the relation into two relations, shown in the graphical representation.
EMPLOYEE
Emp_id | Ename | Dept_name | Salary
COURSE
Course_title | Date_Completed | Emp_id

In the above graphical representation, both the EMPLOYEE relation and the COURSE relation are in Second Normal Form, because the partial functional dependency has been removed by decomposing the original relation into two relations.
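A minimal SQL sketch of this 2NF decomposition (the data types, the composite key of COURSE and the foreign key are assumptions added for illustration):

CREATE TABLE EMPLOYEE (
    Emp_id    INT PRIMARY KEY,
    Ename     VARCHAR(30),
    Dept_name VARCHAR(30),
    Salary    DECIMAL(10,2)
);

CREATE TABLE COURSE (
    Course_title   VARCHAR(40),
    Date_Completed DATE,
    Emp_id         INT REFERENCES EMPLOYEE(Emp_id),
    PRIMARY KEY (Emp_id, Course_title)
);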
THIRD NORMAL FORM (3NF): A relation is in 3NF if it is in Second Normal Form and has no transitive dependencies.
Transitive dependency: A transitive dependency is a functional dependency between two non-key attributes. For example, consider the relation SALES with attributes Cust_id, Cname, Sales_person and Region, shown in the following graphical representation.

Cust_id | Cname | Sales_person | Region
c) Modification Anomaly: If salesperson Smith is reassigned to the East region, several rows must be changed to reflect that fact. This causes an update anomaly.
To avoid these anomaly problems, the transitive dependency can be removed by decomposing SALES into two relations in 3NF.
Consider the following example, which removes the anomalies by decomposing SALES into two relations.
Sales_person | Region
Cust_id | Cname | Sales_person
Q. BOYCE/CODD NORMAL FORM (BCNF): A relation is in BCNF if it is in 3NF and every determinant is a candidate key.
Formally, a relation schema R is in BCNF if, for every FD X → A in F+ (where X ⊆ R and A ∈ R), X is a superkey of R.
Boyce-Codd normal form removes the remaining anomalies in 3NF that result from functional dependencies; by decomposing further, we obtain relations in BCNF.
For example, the STUDENT-ADVISOR relation below is in 3NF.
STUDENT-ADVISOR
Student-id | major-subject | faculty-advisor
1 | MATHS      | B
2 | MATHS      | B
3 | MATHS      | B
4 | STATISTICS | A
5 | STATISTICS | A

In the above relation the primary key is (student-id, major-subject). Here, part of the primary key (major-subject) is dependent upon a non-key attribute (faculty-advisor), so the relation is not in BCNF.
To convert a relation to BCNF, the first step is to modify the original relation so that the determinant (here the non-key attribute faculty-advisor) becomes a component of the primary key of the new relation. The attribute that is dependent on the determinant becomes a non-key attribute.
STUDENT-ADVISOR
The second step in the conversion process is to decompose the relation to eliminate the partial functional dependency. This results in two relations. These relations are in 3NF and BCNF, since each has only one candidate key, which is its determinant.
The two relations are in BCNF.
ADVISOR    STUDENT
In these two relations, the STUDENT relation has a composite key that contains the attributes student-id and faculty-advisor. Here faculty-advisor is a foreign key which references the primary key of the ADVISOR relation.
The two relations in BCNF, with sample data:

ADVISOR
Faculty_Advisor | Major_subject
B | MATHS
A | PHYSICS

STUDENT
Student_id | Faculty_Advisor
1 | B
2 | B
3 | A
4 | A
5 | A
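The same result expressed as a minimal SQL sketch (the types and constraints are assumptions; note the composite primary key on STUDENT and the foreign key to ADVISOR):

CREATE TABLE ADVISOR (
    Faculty_Advisor CHAR(1) PRIMARY KEY,
    Major_subject   VARCHAR(30)
);

CREATE TABLE STUDENT (
    Student_id      INT,
    Faculty_Advisor CHAR(1) REFERENCES ADVISOR(Faculty_Advisor),
    PRIMARY KEY (Student_id, Faculty_Advisor)
);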
Example: Consider a relation schema ABCD and suppose that the FD A → BCD and the MVD B →→ C are given, as shown in the table below.

B | C  | A  | D  | tuples
b | c1 | a1 | d1 | tuple t1
b | c2 | a2 | d2 | tuple t2
b | c1 | a2 | d2 | tuple t3

The table shows three tuples from relation ABCD that satisfy the given MVD B →→ C. From the definition of an MVD, given tuples t1 and t2, it follows that tuple t3 must also be included in the relation. Now consider tuples t2 and t3. From the given FD A → BCD and the fact that these tuples have the same A-value, we can conclude that c1 = c2. Therefore, we see that the FD B → C must hold over ABCD whenever the FD A → BCD and the MVD B →→ C hold. If B → C holds, the relation is not in BCNF but the relation is in 4NF.
The Fourth Normal Form is useful because it overcomes the problems of the various approaches that represent multi-valued attributes in a single relation.
A relation schema R is said to be in Fifth Normal Form (5NF) if, for every join dependency *(R1, ..., Rn) that holds over R, one of the following statements is true:
* Ri = R for some i, or
* the JD is implied by the set of those FDs over R in which the left side is a key for R.
5NF deals with the property of lossless joins.
Q. LOSSLESS-JOIN DECOMPOSITION:
Let R be a relation schema and let F be a set of FDs (Functional Dependencies) over R. A decomposition of R into two schemas with attribute sets X and Y is said to be a lossless-join decomposition with respect to F if, for every instance r of R that satisfies the dependencies in F:
πX(r) ⋈ πY(r) = r
In simple words, we can recover the original relation from the decomposed relations.
In general, if we take projections of a relation and recombine them using a natural join, we may obtain some additional tuples that were not in the original relation.
r (S P D):        SP = πSP(r):    PD = πPD(r):
S  | P  | D       S  | P          P  | D
s1 | p1 | d1      s1 | p1         p1 | d1
s2 | p2 | d2      s2 | p2         p2 | d2
s2 | p1 | d3      s2 | p1         p1 | d3

SP ⋈ PD:
S  | P  | D
s1 | p1 | d1
s1 | p1 | d3
s2 | p2 | d2
s2 | p1 | d1
s2 | p1 | d3
The decomposition of the relation schema r, i.e. SPD, into SP (πSP(r)) and PD (πPD(r)) is therefore not a lossless-join decomposition: the join returns all the original tuples of relation 'r' but also some additional tuples that were not in the original relation 'r', so r cannot be recovered exactly.
Q. Dependency-Preserving Decomposition:
The dependency-preserving decomposition property allows us to check integrity constraints efficiently. In simple words, a dependency-preserving decomposition allows us to enforce all FDs by examining a single relation on each insertion or modification of a tuple.
Let R be a relation schema that is decomposed into two schemas with attribute sets X and Y, and let F be a set of FDs over R. The projection of F on X is the set of FDs in the closure F+ that involve only attributes in X. We denote the projection of F on attributes X as FX. Note that a dependency U → V in F+ is in FX only if all the attributes in U and V are in X. The decomposition of relation schema R with FDs F into schemas with attribute sets X and Y is dependency-preserving if (FX ∪ FY)+ = F+.
That is, if we take the dependencies in FX and FY and compute the closure of their union, we get back all dependencies in the closure of F. To enforce FX, we need to examine only relation X (on inserts into that relation); to enforce FY, we need to examine only relation Y.
Example: Suppose that a relation R with attributes ABC is decomposed into relations with attributes AB and BC. The set F of FDs over R includes A → B, B → C and C → A. Here, A → B is in FAB and B → C is in FBC, so these two dependencies are preserved directly. At first glance, C → A does not appear to be implied by the dependencies of FAB and FBC, suggesting that it is not preserved.
However, FAB also contains B → A as well as A → B, and FBC contains C → B as well as B → C. Therefore FAB ∪ FBC contains A → B, B → C, B → A and C → B. Now the closure of the dependencies in FAB and FBC includes C → A (because, from C → B, B → A and the transitivity rule, we derive C → A), so the decomposition is dependency-preserving.
Transaction
A transaction can be defined as a group of tasks. A single task is the minimum processing unit, which cannot be divided further.
Let's take an example of a simple transaction. Suppose a bank employee transfers Rs 500 from A's account to B's account. This very simple and small transaction involves several low-level tasks.
A's Account
Open_Account(A)
Old_Balance = A.balance
New_Balance = Old_Balance - 500
A.balance = New_Balance
Close_Account(A)
B's Account
Open_Account(B)
Old_Balance = B.balance
New_Balance = Old_Balance + 500
B.balance = New_Balance
Close_Account(B)
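In SQL the same transfer is wrapped in a single transaction; a minimal sketch (the accounts table, its columns and the account identifiers are assumptions used only for illustration):

BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 500 WHERE account_no = 'A';
UPDATE accounts SET balance = balance + 500 WHERE account_no = 'B';

COMMIT;   -- or ROLLBACK; if any step fails, so that neither update takes effect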
ACID Properties
A transaction is a very small unit of a program and it may contain several low-level tasks. A transaction in a database system must maintain Atomicity, Consistency, Isolation and Durability, commonly known as the ACID properties.
Atomicity − This property states that a transaction must be treated as an atomic unit, that is, either all of its operations are executed or none. There must be no state in the database where a transaction is left partially completed. States should be defined either before the execution of the transaction or after its execution, abortion or failure.
Consistency − The database must remain in a consistent state after any transaction. No transaction should have any adverse effect on the data residing in the database. If the database was in a consistent state before the execution of a transaction, it must remain consistent after the execution of the transaction as well.
Durability − The database should be durable enough to hold all its latest updates even if the system fails or restarts. If a transaction updates a chunk of data in a database and commits, then the database will hold the modified data. If a transaction commits but the system fails before the data could be written on to the disk, then that data will be updated once the system springs back into action.
Isolation − In a database system where more than one transaction is being executed simultaneously and in parallel, the property of isolation states that all the transactions will be carried out and executed as if each is the only transaction in the system. No transaction will affect the existence of any other transaction.
Transaction Log
In the field of databases in computer science, a transaction log (also transaction journal, database log, binary log or audit trail) is a history of actions executed by a database management system, used to guarantee ACID properties over crashes or hardware failures. Physically, a log is a file listing changes to the database, stored in a stable storage format.
If, after a start, the database is found in an inconsistent state or has not been shut down properly, the database management system reviews the database logs for uncommitted transactions and rolls back the changes made by these transactions. Additionally, all transactions that are already committed but whose changes were not yet materialized in the database are re-applied. Both are done to ensure atomicity and durability of transactions.
A database log record is made up of:
Log Sequence Number (LSN): A unique ID for a log record. With LSNs, logs can be recovered in constant time. Most LSNs are assigned in monotonically increasing order, which is useful in recovery algorithms like ARIES.
Prev LSN: A link to the transaction's previous log record. This implies database logs are constructed in linked-list form.
Transaction ID number: A reference to the database transaction generating the log record.
Type: Describes the type of database log record.
Information about the actual changes that triggered the log record to be written.
All log records include the general log attributes above, and also other attributes depending on their type (which is recorded in the Type attribute, as above).
Update Log Record notes an update (change) to the database. It includes this extra information:
Before and After Images: Includes the value of the bytes of page before and after
the page change. Some databases may have logs which include one or both images.
Compensation Log Record notes the rollback of a particular change to the database.
Each corresponds with exactly one other Update Log Record (although the corresponding update log record is not typically stored in the Compensation Log Record). It includes this extra information:
undoNextLSN: This field contains the LSN of the next log record that is to be undone for the transaction that wrote the last Update Log.
Commit Record notes a decision to commit a transaction.
Abort Record notes a decision to abort and hence roll back a transaction.
Checkpoint Record notes that a checkpoint has been made. These are used to speed
up recovery. They record information that eliminates the need to read a long way into
the log's past. This varies according to checkpoint algorithm. If all dirty pages are
flushed while creating the checkpoint (as in PostgreSQL), it might contain:
redoLSN: This is a reference to the first log record that corresponds to a dirty page.
i.e. the first update that wasn't flushed at checkpoint time. This is where redo must
begin on recovery.
undoLSN: This is a reference to the oldest log record of the oldest in-progress
transaction. This is the oldest log record needed to undo all in-progress transactions.
Completion Record notes that all work has been done for this particular transaction.
(It has been fully committed or aborted)
Transaction Control
The following commands are used to control transactions.
COMMIT − to save the changes.
ROLLBACK − to roll back the changes.
SAVEPOINT − creates points within groups of transactions to which you can later ROLLBACK.
SET TRANSACTION − places a name on a transaction.
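For example, a transaction can be named before its DML begins; a small sketch in Oracle-style syntax (the transaction name and table are assumptions used only for illustration):

SET TRANSACTION NAME 'transfer_500';
UPDATE accounts SET balance = balance - 500 WHERE account_no = 'A';
COMMIT;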
Transactional Control Commands
Transactional control commands are only used with the DML commands, such as INSERT, UPDATE and DELETE. They cannot be used while creating tables or dropping them because those operations are automatically committed in the database.
The COMMIT Command
The COMMIT command is the transactional command used to save the changes invoked by a transaction to the database. It saves all the transactions to the database since the last COMMIT or ROLLBACK command.
The syntax for the COMMIT command is as follows.
COMMIT;
Example
Consider the CUSTOMERS table having the following records −
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
Following is an example which would delete those records from the table which have AGE = 25 and then COMMIT the changes in the database.
SQL> DELETE FROM CUSTOMERS
     WHERE AGE = 25;
SQL> COMMIT;
Thus, two rows from the table would be deleted and the SELECT statement would produce the following result.
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
The ROLLBACK Command
The ROLLBACK command is the transactional command used to undo transactions that have not already been saved to the database. It can only undo transactions since the last COMMIT or ROLLBACK command was issued.
The syntax for the ROLLBACK command is as follows.
ROLLBACK;
Example
Consider the CUSTOMERS table having the same records as above.
Following is an example which would delete those records from the table which have AGE = 25 and then ROLLBACK the changes in the database.
SQL> DELETE FROM CUSTOMERS
     WHERE AGE = 25;
SQL> ROLLBACK;
Thus, the delete operation would not impact the table and the SELECT statement would produce the following result.
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
The SAVEPOINT Command
A SAVEPOINT is a point in a transaction to which you can roll the transaction back without rolling back the entire transaction.
The syntax for a SAVEPOINT command is as follows.
SAVEPOINT SAVEPOINT_NAME;
This command serves only in the creation of a SAVEPOINT among the transactional statements. The ROLLBACK command is used to undo a group of transactions up to a SAVEPOINT.
The syntax for rolling back to a SAVEPOINT is as shown below.
ROLLBACK TO SAVEPOINT_NAME;
Following is an example where you plan to delete the three different records from the
CUSTOMERS table. You want to create a SAVEPOINT before each delete, so that you can
ROLLBACK to any SAVEPOINT at any time to return the appropriate data to its original
state.
Example
Consider the CUSTOMERS table having the same records as above.
The following code block contains the series of operations.
SQL> SAVEPOINT SP1;
Savepoint created.
SQL> DELETE FROM CUSTOMERS WHERE ID = 1;
1 row deleted.
SQL> SAVEPOINT SP2;
Savepoint created.
SQL> DELETE FROM CUSTOMERS WHERE ID = 2;
1 row deleted.
SQL> SAVEPOINT SP3;
Savepoint created.
SQL> DELETE FROM CUSTOMERS WHERE ID = 3;
1 row deleted.
Now that the three deletions have taken place, let us assume that you have changed your mind and decided to ROLLBACK to the SAVEPOINT that you identified as SP2. Because SP2 was created after the first deletion, the last two deletions are undone −
SQL> ROLLBACK TO SP2;
Rollback complete.
Notice that only the first deletion took place since you rolled back to SP2.
SQL> SELECT * FROM CUSTOMERS;
+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+
6 rows selected.
The RELEASE SAVEPOINT command is used to remove a SAVEPOINT that you have created.
The syntax for a RELEASE SAVEPOINT command is as follows.
RELEASE SAVEPOINT SAVEPOINT_NAME;
Concurrency Control
In concurrency control, multiple transactions can be executed simultaneously, which may affect the result of a transaction. It is therefore highly important to maintain the order of execution of those transactions so that they run in a controlled manner. Following are the three problems in concurrency control.
1. Lost updates
2. Dirty read
3. Unrepeatable read
1. Lost update problem
o When two transactions that access the same database items interleave their operations in a way that makes the value of some database item incorrect, the lost update problem occurs.
o If two transactions T1 and T2 read a record and then update it, the effect of the first update will be overwritten by the second update.
Example: (The timeline table for this example is not reproduced in these notes.)
2. Dirty Read Problem
o A transaction T1 updates a record which is read by T2. If T1 aborts, then T2 now has values which have never formed part of the stable database.
Example: (The timeline table for this example is not reproduced in these notes.)
o At time t2, transaction-Y writes A's value.
o At time t3, transaction-X reads A's value.
o At time t4, transaction-Y rolls back. So, it changes A's value back to what it was prior to t1.
o So, transaction-X now contains a value which has never become part of the stable database.
o Such a problem is known as the Dirty Read Problem, as one transaction reads a dirty value which has not been committed.
3. Inconsistent Retrievals Problem
o The Inconsistent Retrievals Problem is also known as unrepeatable read. When a transaction calculates some summary function over a set of data while other transactions are updating that data, the Inconsistent Retrievals Problem occurs.
o A transaction T1 reads a record and then does some other processing during which the transaction T2 updates the record. When transaction T1 reads the record again, the new value will be inconsistent with the previous value.
(The timeline table for this example is not reproduced in these notes.)
o Here, transaction-X produces the result of 550, which is incorrect. If we write this produced result to the database, the database will be in an inconsistent state because the actual sum is 600.
o Here, transaction-X has seen an inconsistent state of the database.
Concurrency Control Protocol
Concurrency control protocols ensure atomicity, isolation, and serializability of concurrent transactions. Concurrency control protocols can be divided into three categories:
1. Lock based protocols
2. Timestamp based protocols
3. Validation based protocols
Concurrency Control Problems
The coordination of the simultaneous execution of transactions in a multiuser database system is known as concurrency control. The objective of concurrency control is to ensure the serializability of transactions in a multiuser database environment. Concurrency control is important because the simultaneous execution of transactions over a shared database can create several data integrity and consistency problems. The three main problems are lost updates, uncommitted data, and inconsistent retrievals.
1. Lost Updates:
The lost update problem occurs when two concurrent transactions, T1 and T2, are updating the same data element and one of the updates is lost (overwritten by the other transaction). Consider the following PRODUCT table example.
Transaction              Operation
T1: Purchase 100 units   PROD_QOH = PROD_QOH + 100
T2: Sell 30 units        PROD_QOH = PROD_QOH - 30
But suppose that a transaction is able to read a product's PROD_QOH value from the table before a previous transaction (using the same product) has been committed.
The sequence depicted in the following table shows how the lost update problem can arise. Note that the first transaction (T1) has not yet been committed when the second transaction (T2) is executed. Therefore, T2 still operates on the value 35, and its subtraction yields 5 in memory. In the meantime, T1 writes the value 135 to disk, which is promptly overwritten by T2. In short, the addition of 100 units is "lost" during the process.
(The interleaved timeline table is not reproduced in these notes.)
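As a rough sketch of that interleaving, written as commented SQL (the PRODUCT table, the PROD_CODE value and the starting quantity of 35 follow the narrative above; everything else is an assumption for illustration):

-- Time t1 (T1): SELECT PROD_QOH FROM PRODUCT WHERE PROD_CODE = '1558-QW1';      -- reads 35
-- Time t2 (T2): SELECT PROD_QOH FROM PRODUCT WHERE PROD_CODE = '1558-QW1';      -- also reads 35 (T1 not yet committed)
-- Time t3 (T1): UPDATE PRODUCT SET PROD_QOH = 135 WHERE PROD_CODE = '1558-QW1'; COMMIT;
-- Time t4 (T2): UPDATE PRODUCT SET PROD_QOH = 5   WHERE PROD_CODE = '1558-QW1'; COMMIT;
-- Final PROD_QOH = 5 instead of the correct 105: the +100 written by T1 has been lost.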
2. Uncommitted Data:
The phenomenon of uncommitted data occurs when two transactions, T1 and T2, are executed concurrently and the first transaction (T1) is rolled back after the second transaction (T2) has already accessed the uncommitted data, thus violating the isolation property of transactions.
To illustrate that possibility, let's use the same transactions described in the lost updates discussion. T1 has two atomic parts, one of which is the update of the inventory, the other possibly being the update of the invoice total (not shown). T1 is forced to roll back due to an error during the updating of the invoice's total; hence, it rolls back all the way, undoing the inventory update as well. This time, the T1 transaction is rolled back to eliminate the addition of the 100 units. Because T2 subtracts 30 from the original 35 units, the correct answer should be 5.
Transaction              Operation
T1: Purchase 100 units   PROD_QOH = PROD_QOH + 100 (Rolled back)
T2: Sell 30 units        PROD_QOH = PROD_QOH - 30
The following table shows how, under normal circumstances, the serial execution of those transactions yields the correct answer. (The table is not reproduced in these notes.)
The following table shows how the uncommitted data problem can arise when the ROLLBACK is completed after T2 has begun its execution. (The table is not reproduced in these notes.)
3. Inconsistent Retrievals:
Inconsistent retrievals occur when a transaction accesses data before and after another transaction(s) finish working with such data. For example, an inconsistent retrieval would occur if transaction T1 calculated some summary (aggregate) function over a set of data while another transaction (T2) was updating the same data. The problem is that the transaction might read some data before they are changed and other data after they are changed, yielding inconsistent results. To illustrate the problem, assume the following:
1. T1 calculates the total quantity on hand of the products stored in the PRODUCT table.
2. At the same time, T2 updates the quantity on hand (PROD_QOH) for two of the PRODUCT table's products.
The two transactions are shown in the following table. (The table is not reproduced in these notes.)
While T1 calculates the total quantity on hand (PROD_QOH) for all items, T2 represents the correction of a typing error: the user added 10 units to product 1558-QW1's PROD_QOH but meant to add the 10 units to product 1546-QQ2's PROD_QOH. To correct the problem, the user adds 10 to product 1546-QQ2's PROD_QOH and subtracts 10 from product 1558-QW1's PROD_QOH. The initial and final PROD_QOH values are reflected in the following table. (The table is not reproduced in these notes.)
The following table demonstrates that inconsistent retrievals are possible during the transaction execution, making the result of T1's execution incorrect. The "After" summation reflects the fact that the value of 25 for product 1546-QQ2 was read after the WRITE statement was completed; therefore, the "After" total is 40 + 25 = 65. The "Before" total reflects the fact that the value of 23 for product 1558-QW1 was read before the next WRITE statement was completed to reflect the corrected update of 13; therefore, the "Before" total is 65 + 23 = 88. (The table is not reproduced in these notes.)
The computed answer of 102 is obviously wrong because you know from the previous table that the correct answer is 92. Unless the DBMS exercises concurrency control, a multiuser database environment can create havoc within the information system.
The Scheduler
You already know that a transaction is a series of operations that take the database from one consistent state to another. Finally, you know that database consistency can be ensured only before and after the execution of transactions.
co
A database always moves through an unavoidable temporary state of inconsistency
during a transaction’s execution if such transaction updates multiple tables/rows. (If
the transaction contains only one update, then there is no temporary inconsistency.)
a.
That temporary inconsistency exists because a computer executes the operations
serially, one after another. During this serial process, the isolation property of
iy
transactions prevents them from accessing the data not yet released by other
transactions.
The scheduler establishes the order in which the operations within concurrent transactions are executed. The scheduler interleaves the execution of database operations to ensure serializability. To determine the appropriate order, the scheduler bases its actions on concurrency control algorithms, such as locking or time-stamping methods. The scheduler also makes sure that the computer's CPU is used efficiently.
The DBMS determines which transactions are serializable and proceeds to interleave the execution of the transactions' operations. Generally, transactions that are not serializable are executed on a first-come, first-served basis by the DBMS. The scheduler's main job is to create a serializable schedule of the transactions' operations.
A serializable schedule is a schedule of the transactions' operations in which the interleaved execution of the transactions (T1, T2, T3, etc.) yields the same results as if they were executed in serial order, one after another.
The scheduler facilitates data isolation to ensure that two transactions do not update the same data element at the same time. Database operations might require READ and/or WRITE actions that produce conflicts. For example, the following table shows the possible conflict scenarios when two transactions, T1 and T2, are executed concurrently over the same data. (The table is not reproduced in these notes.) Note that two operations are in conflict when they access the same data and at least one of them is a WRITE operation.
CONCURRENCY CONTROL WITH LOCKING METHODS
A lock guarantees exclusive use of a data item to a current transaction. In other words, transaction T2 does not have access to a data item that is currently being used by transaction T1. A transaction acquires a lock prior to data access; the lock is released (unlocked) when the transaction is completed so that another transaction can lock the data item for its exclusive use.
Most multiuser DBMSs automatically initiate and enforce locking procedures. All lock information is managed by a lock manager.
Lock Granularity
Lock granularity indicates the level of lock use. Locking can take place at the following levels: database, table, page, row, or even field (attribute).
LOCK TYPES
Regardless of the level of locking, the DBMS may use different lock types:
1. Binary Locks
2. Shared/Exclusive Locks
An exclusive lock exists when access is reserved specifically for the transaction that locked the object. The exclusive lock must be used when the potential for conflict exists. A shared lock exists when concurrent transactions are granted read access on the basis of a common lock. A shared lock produces no conflict as long as all the concurrent transactions are read-only.
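As a small SQL illustration (behaviour and syntax vary by DBMS; the table name reuses PRODUCT from the earlier examples): reads normally take shared locks implicitly, while a transaction can also request stronger locks explicitly.

-- Explicitly request an exclusive table-level lock (Oracle / PostgreSQL style):
LOCK TABLE PRODUCT IN EXCLUSIVE MODE;

-- Lock just the rows being read, with the intent to update them later:
SELECT PROD_QOH FROM PRODUCT WHERE PROD_CODE = '1558-QW1' FOR UPDATE;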
DEADLOCKS
A deadlock occurs when two transactions wait indefinitely for each other to unlock data.
Deadlock prevention: A transaction requesting a new lock is aborted when there is the possibility that a deadlock can occur. If the transaction is aborted, all changes made by this transaction are rolled back and all locks obtained by the transaction are released. The transaction is then rescheduled for execution.
Deadlock detection: The DBMS periodically tests the database for deadlocks. If a deadlock is found, one of the transactions is aborted (rolled back and restarted) and the other transaction continues.
Deadlock avoidance: The transaction must obtain all of the locks it needs before it can be executed. This technique avoids the rollback of conflicting transactions by requiring that locks be obtained in succession.
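A rough sketch of how such a deadlock arises, written as a commented timeline of two SQL sessions (the accounts table and row identifiers are the same illustrative assumptions used earlier):

-- t1 (T1): UPDATE accounts SET balance = balance - 500 WHERE account_no = 'A';   -- T1 locks row A
-- t2 (T2): UPDATE accounts SET balance = balance - 200 WHERE account_no = 'B';   -- T2 locks row B
-- t3 (T1): UPDATE accounts SET balance = balance + 500 WHERE account_no = 'B';   -- T1 waits for T2's lock
-- t4 (T2): UPDATE accounts SET balance = balance + 200 WHERE account_no = 'A';   -- T2 waits for T1's lock
-- Neither transaction can proceed; the DBMS must detect the deadlock and roll one of them back.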
Two-Phase Locking (2PL) is a concurrency control method which divides the execution phase of a transaction into three parts.
It ensures conflict-serializable schedules.
If the first unlock operation appears only after the transaction has performed all of its read and write (lock) operations, the transaction is said to follow the Two-Phase Locking Protocol.
This protocol can be divided into two phases:
1. In the Growing Phase, a transaction obtains locks, but may not release any lock.
2. In the Shrinking Phase, a transaction may release locks, but may not obtain any lock.
1. Strict Two-Phase Locking
Strict Two-Phase Locking not only requires two-phase locking but also requires that all exclusive locks be held until the transaction commits or aborts.
It is not deadlock free.
It ensures that if data is being modified by one transaction, then other transactions cannot read it until the first transaction commits.
Most database systems implement the rigorous two-phase locking protocol.
2. Rigorous Two-Phase Locking
The Rigorous Two-Phase Locking Protocol avoids cascading rollbacks.
This protocol requires that all shared and exclusive locks be held until the transaction commits.
3. Conservative Two-Phase Locking Protocol
The Conservative Two-Phase Locking Protocol is also called the Static Two-Phase Locking Protocol.
This protocol is almost free from deadlocks, as all required items are listed in advance.
It requires locking of all the data items the transaction needs before the transaction starts.
Timestamp-based Protocols
The most commonly used concurrency protocol is the timestamp-based protocol. This protocol uses either system time or a logical counter as a timestamp.
Lock-based protocols manage the order between conflicting pairs among transactions at the time of execution, whereas timestamp-based protocols start working as soon as a transaction is created.
Every transaction has a timestamp associated with it, and the ordering is determined by the age of the transaction. A transaction created at clock time 0002 would be older than all other transactions that come after it. For example, any transaction 'y' entering the system at 0004 is two seconds younger, and priority is given to the older one.
In addition, every data item is given the latest read-timestamp and write-timestamp. This lets the system know when the last read and write operations were performed on the data item.
Concurrency control with timestamp ordering
The timestamp-ordering protocol ensures serializability among transactions in their conflicting read and write operations. It is the responsibility of the protocol system that the conflicting pair of operations is executed according to the timestamp values of the transactions.
The timestamp of transaction Ti is denoted as TS(Ti); the read-timestamp and write-timestamp of a data item X are denoted R-timestamp(X) and W-timestamp(X).
If a transaction Ti issues a read(X) operation:
o If TS(Ti) < W-timestamp(X): Operation rejected.
o Otherwise, operation executed.
If a transaction Ti issues a write(X) operation:
o If TS(Ti) < R-timestamp(X): Operation rejected.
o If TS(Ti) < W-timestamp(X): Operation rejected and Ti rolled back.
o Otherwise, operation executed.
Deadlocks
In a system where transactions wait for each other's locks, a state can be reached where no transaction can proceed because each is waiting for a lock held by another. This situation is known as a deadlock.
Deadlocks are not healthy for a system. In case a system is stuck in a deadlock, the transactions involved in the deadlock are either rolled back or restarted.
Deadlock Prevention
To prevent any deadlock situation in the system, the DBMS aggressively inspects all the operations where transactions are about to execute. The DBMS inspects the operations and analyzes whether they can create a deadlock situation. If it finds that a deadlock situation might occur, then that transaction is never allowed to be executed.
There are deadlock prevention schemes that use the timestamp ordering mechanism of transactions in order to predetermine a deadlock situation.
Wait-Die Scheme
In this scheme, if a transaction requests to lock a resource (data item) which is already held with a conflicting lock by another transaction, then one of two possibilities may occur −
If TS(Ti) < TS(Tj), that is, Ti, which is requesting a conflicting lock, is older than Tj, then Ti is allowed to wait until the data item is available.
If TS(Ti) > TS(Tj), that is, Ti is younger than Tj, then Ti dies. Ti is restarted later with a random delay but with the same timestamp.
This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme
In this scheme, if a transaction requests to lock a resource (data item) which is already held with a conflicting lock by another transaction, one of two possibilities may occur −
If TS(Ti) < TS(Tj), then Ti forces Tj to be rolled back, that is, Ti wounds Tj. Tj is restarted later with a random delay but with the same timestamp.
If TS(Ti) > TS(Tj), then Ti is forced to wait until the resource is available.
This scheme allows the younger transaction to wait; but when an older transaction requests an item held by a younger one, the older transaction forces the younger one to abort and release the item.
In both cases, the transaction that enters the system at a later stage is aborted.
Deadlock Avoidance
Aborting a transaction is not always a practical approach. Instead, deadlock avoidance mechanisms can be used to detect any deadlock situation in advance. Methods like the "wait-for graph" are available, but they are suitable only for systems where transactions are lightweight and have fewer instances of resources. In a bulky system, deadlock prevention techniques may work well.
Crash Recovery
DBMS is a highly complex system with hundreds of transactions being executed every
second. The durability and robustness of a DBMS depends on its complex architecture and
its underlying hardware and system software. If it fails or crashes amid transactions, it is
expected that the system would follow some sort of algorithm or techniques to recover lost
data.
Failure Classification
To see where the problem has occurred, we generalize a failure into various categories, as follows −
Transaction failure
A transaction has to abort when it fails to execute or when it reaches a point from where it can't go any further. This is called transaction failure, where only a few transactions or processes are hurt.
Reasons for a transaction failure could be −
Logical errors − Where a transaction cannot complete because it has some code error or any internal error condition.
System errors − Where the database system itself terminates an active transaction because the DBMS is not able to execute it, or it has to stop because of some system condition. For example, in case of deadlock or resource unavailability, the system aborts an active transaction.
System Crash
un
There are problems − external to the system − that may cause the system to stop abruptly
and cause the system to crash. For example, interruptions in power supply may cause the
failure of underlying hardware or software failure.
Examples may include operating system errors.
Disk Failure
sD
In early days of technology evolution, it was a common problem where hard-disk drives or
storage drives used to fail frequently.
Disk failures include formation of bad sectors, unreachability to the disk, disk head crash or
any other failure, which destroys all or a part of disk storage.
al
Storage Structure
We have already described the storage system. In brief, the storage structure can be divided into two categories −
Volatile storage − As the name suggests, volatile storage cannot survive system crashes. Volatile storage devices are placed very close to the CPU; normally they are embedded on the chipset itself. For example, main memory and cache memory are examples of volatile storage. They are fast but can store only a small amount of information.
Non-volatile storage − These memories are made to survive system crashes. They are huge in data storage capacity but slower in accessibility. Examples may include hard disks, magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.
Recovery and Atomicity
When a system crashes, it may have several transactions being executed and various files opened for them to modify the data items. Transactions are made of various operations which are atomic in nature. But according to the ACID properties of a DBMS, atomicity of transactions as a whole must be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
It should check the states of all the transactions which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the atomicity of the transaction in this case.
It should check whether the transaction can be completed now or it needs to be rolled back.
No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques which can help a DBMS in recovering as well as maintaining the atomicity of a transaction −
Maintaining the logs of each transaction and writing them onto some stable storage before actually modifying the database.
Maintaining shadow paging, where the changes are done on volatile memory, and later the actual database is updated.
Log-based Recovery
A log is a sequence of records which maintains a record of the actions performed by a transaction. It is important that the logs are written prior to the actual modification and stored on a stable storage medium, which is failsafe.
Log-based recovery works as follows −
The log file is kept on a stable storage medium.
When a transaction enters the system and starts execution, it writes a log about it:
<Tn, Start>
When the transaction modifies an item X, it writes a log as follows:
<Tn, X, V1, V2>
It reads: Tn has changed the value of X from V1 to V2.
When the transaction finishes, it logs:
<Tn, commit>
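For instance, the Rs 500 transfer from A's account to B's account seen earlier would produce a log like the following in this notation (the starting balances 1000 and 200 are assumed values, used only to show the record format):

<T1, Start>
<T1, A.balance, 1000, 500>
<T1, B.balance, 200, 700>
<T1, commit>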
The database can be modified using two approaches −
Deferred database modification − All logs are written on to the stable storage and the database is updated only when a transaction commits.
Immediate database modification − Each log record follows an actual database modification, that is, the database is modified immediately after every operation.
When more than one transaction is executed in parallel, the logs are interleaved; at the time of recovery, it becomes hard for the recovery system to backtrack all the logs and then start recovering. To ease this situation, most modern DBMSs use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in real environment may fill out all the
memory space available in the system. As time passes, the log file may grow too big to be
handled at all. Checkpoint is a mechanism where all the previous logs are removed from
the system and stored permanently in a storage disk. Checkpoint declares a point before
which the DBMS was in consistent state, and all the transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following manner −
The recovery system reads the logs backwards from the end to the last checkpoint.
It maintains two lists, an undo-list and a redo-list.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit>, or just <Tn, Commit>, it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log is found, it puts the transaction in the undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the transactions in the redo-list and their previous logs are removed and then redone before saving their logs.
Different Kinds of Memory in a Computer System (or) Memory Hierarchy: Memory in a computer system is arranged in a hierarchy, as follows.

CPU
Primary Storage:   CACHE MEMORY, MAIN MEMORY, FLASH MEMORY
Secondary Storage: MAGNETIC DISK, OPTICAL DISK
Tertiary Storage:  TAPE
At the top, there is primary storage, which has cache, flash and main memory to provide very fast access to the data.
The secondary storage devices are magnetic disks, which are slower but permanent devices.
Tertiary storage is a permanent and the slowest storage when compared with magnetic disk.
Cache memory: The cache is the fastest but costliest memory available. It is not a major concern for databases.
Main Memory: The processor requires the data to be stored in main memory. Although main memory contains gigabytes of storage capacity, it is not sufficient for databases.
Flash Memory: Flash memory stores data even if the power fails. Data can be retrieved as fast as from main memory; however, writing data to flash memory is a complex task and overwriting data cannot be done directly. It is used in small computers.
Magnetic Disk Storage: Magnetic disk is the permanent data storage medium. It enables random access of data and is therefore called "direct-access" storage. Data from disk is transferred into main memory for processing. After modification, the data is written back onto the disk.
Optical Disk: Optical disks are Compact Disks (CDs) and Digital Video Disks (DVDs). These are commonly used for permanent data storage. CDs are used for providing electronically published information and for distributing software such as multimedia data. They have a capacity of up to 640 MB and are relatively cheap. To store larger volumes of data, CDs are replaced with DVDs, which come in various capacities based on manufacturing.
Advantages:
1. Optical disks are less expensive.
2. A large amount of data can be stored.
3. CDs and DVDs have longer durability than magnetic disk drives.
4. They provide nonvolatile storage of data.
5. They can store any type of data, such as text, music, video etc.
Tape Storage (or) Tertiary Storage media: Tape (or) tertiary storage provides only sequential access to the data, and access to the data is much slower. It provides high-capacity removable tapes, which can have capacities of about 20 GB to 40 GB. These devices are also called "tertiary storage" or "off-line storage". In a larger database system, tape (tertiary) storage devices are used for backup storage of data.
Magnetic tapes are divided into vertical columns referred to as frames and horizontal rows referred to as tracks. The data is organized in the form of a column string with one byte of data per frame. Frames are in turn divided into rows or tracks. One frame can store one byte of data and an individual track can store a single bit. The remaining track is treated as a parity track.
Advantages:
1. Magnetic tapes are very inexpensive and durable compared with optical disks.
2. They are reliable, and a good tape drive system performs read/write operations successfully.
3. They are a very good choice for archival storage, and data can be erased and reused any number of times.
Disadvantages:
The major disadvantage of tapes is that they are sequential-access devices.
They work very slowly when compared to magnetic disks and optical disks.
2. The unit of data transfer between disk and main memory is a block; if a single item on a block is needed, the entire block is transferred. Reading or writing a disk block is called an I/O (input/output) operation.
3. The time to read or write a block varies, depending on the location of the data:
Access time = seek time + rotational delay + transfer time
4. The time for moving blocks to or from disk usually dominates the time taken for database operations. To minimize this time, it is necessary to locate data records strategically on disk, because of the geometry and mechanics of disks.
Buffer Manager: The buffer manager is the software layer that is responsible for bringing pages from physical disk to main memory as needed. The buffer manager manages the available main memory by dividing it into a collection of pages, which we call the buffer pool. The main memory pages in the buffer pool are called frames.
The goal of the buffer manager is to ensure that the data requests made by programs are satisfied by copying data from secondary storage devices into the buffer. If a program performs an input statement, it calls the buffer manager for an input operation, which satisfies the request by reading from existing buffers. Similarly, if a program performs an output statement, it calls the buffer manager for an output operation, which satisfies the request by writing to the buffers. Therefore, we can say that input and output operations occur between the program and the buffer area only.
In addition to the buffer pool itself, the buffer manager maintains two variables for each frame in the pool: 'pin-count' and 'dirty'. Each time a page in a frame is requested, the pin-count variable for that frame is incremented; each time a requestor releases the page, the pin-count for that frame is decremented. Thus, if a page is requested the pin-count is incremented, and when the request has been satisfied the pin-count is decremented.
In addition to this, if the page has been modified, the Boolean variable 'dirty' is set 'on'; otherwise it is set 'off'.
(Figure: the buffer pool in main memory, exchanging pages with the database on disk.)
Buffer Manager writing a page to disk: When a page is requested, the buffer manager does the following.
1. It checks the buffer pool to see whether some frame contains the requested page and, if so, increments the pin-count of that frame. If the page is not in the pool, the buffer manager brings it into main memory from disk and sets its pin-count value to 1.
2. If the 'dirty' variable of the frame chosen for replacement is set to 'on', that page has been modified and is first written back to disk before it is replaced, and the pin-count variable of the frame holding the newly requested page is set to 1. So, pin-count = 1 is called pinning the requested page in its frame. When the request of the requestor is fulfilled, the pin-count variable of that frame is decremented to 0. Thus, the buffer manager will not replace the page in a frame until its pin-count becomes 0 (zero).
Allocation of Records to Blocks: The buffer manager uses blocks of storage; when a current record is deleted, its space can be reused by the next record allocation. The buffer manager also provides a concurrency control system to execute more than one process. In this case, the records are mapped onto disk blocks.
Types of File Organizations: Data is organized on secondary storage in terms of files. Each file has several records.
Enormous amounts of data cannot be stored in main memory, so the data is stored on magnetic disk. During processing, the required data is brought into main memory from disk. The unit of information transferred between main memory and disk is called a page.
Tapes are also used to store the data in the database, but they can be accessed only sequentially, so most of the time is wasted transferring each page.
The buffer manager is software used for reading data into memory and writing data onto magnetic disks. Each record in a file is identified by a record id, or rid. Whenever a page needs to be processed, the buffer manager retrieves the page from the disk based on its record id.
The disk space manager is software that allocates space for records on the disk. When the DBMS requires additional space, it calls the disk space manager to allocate the space. The DBMS also informs the disk space manager when it is no longer going to use the space.
The most widely used file organizations are 1. Heap (unordered) files, 2. Sequential (ordered) files, 3. Hash files.
1. Heap file: This is the simplest file organization; it stores records in the order they arrive. It is also called an unordered file.
Inserting a record: Records are inserted in the same order as they arrive.
Deleting a record: To delete a record, first access that record and then mark it as deleted.
Accessing a record: A linear search is performed on the file, starting from the first record, until the desired record is found.
2. Ordered file: Records are arranged in sequential order. The main advantage of this file organization is that we can now use binary search, as the file is sorted.
Insertion of a record: This is a difficult task, because first we need to identify the place where the record must be inserted, since the file is arranged in order. If space is available there, the record can be inserted directly; if space is not sufficient, then the record is moved to the next page.
Deletion of a record: This task is also difficult. First find the record to be deleted and then remove the empty space left by the deleted record.
Accessing a record: We can use binary search on the file.
3. Hash files: Using this file organization, records are not organized sequentially; instead, they are arranged randomly. The address of the page where a record is to be stored is calculated using a 'hash function'.
Index: An index is a data structure which organizes data records on disk to optimize certain kinds of retrieval operations. Using an index, we can easily retrieve the records that satisfy search conditions on the search key fields of the index. The term 'data entry' is used to refer to the records stored in an index file. We can search an index efficiently to find the desired data entries and use them to obtain the data records. There are three alternatives for what to store as a data entry in an index:
1) A data entry k* is the actual data record with search key value k.
2) A data entry is a <k, rid> pair (here rid is the row id or record id and k is the key value).
3) A data entry is a <k, rid-list> pair (rid-list is a list of record ids of the data records with search key value k).
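A minimal sketch of alternative (2), where the index stores <k, rid> pairs sorted on the search key k (the student data and the lookup function are illustrative assumptions):

# Data entries as <k, rid> pairs; the rid is simply the record's position in the data file here.
import bisect

data_file = [(22, "sita"), (15, "ravi"), (19, "arun")]                    # rid = index in this list
entries = sorted((rno, rid) for rid, (rno, _) in enumerate(data_file))    # <k, rid> data entries

def lookup(rno):
    i = bisect.bisect_left(entries, (rno, -1))
    if i < len(entries) and entries[i][0] == rno:
        rid = entries[i][1]
        return data_file[rid]                 # follow the rid to the data record
    return None

print(lookup(19))   # -> (19, 'arun')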
Types of Index: There are two index techniques to organize the file: 1. Clustered. 2. Un-clustered.
Clustered: A file organization in which the data records are ordered in the same way as the data entries in the index is called clustered.
Un-clustered: A file organization in which the data records are ordered in a different way from the data entries in the index is called un-clustered.
Indexed Sequential Files: An indexed sequential file overcomes the disadvantage of a sequential file, in which it is not possible to access a particular record directly. In an indexed sequential file organization it is possible to access the records both sequentially and randomly. As in a sequential file, the records in an indexed sequential file are organized in sequence based on primary key values. In addition, an indexed sequential file has the following two features that distinguish it from a sequential file:
i) Index: It is used to support random access. It provides a lookup capability to reach the desired record quickly.
ii) Overflow file: The overflow file is similar to the log file used in the sequential file. An indexed sequential file greatly reduces the time required to access a single record without sacrificing the sequential nature of the file. In order to process the file sequentially, the records of the main file are processed in sequence until a pointer to the overflow file is found; accessing then continues in the overflow file until a null pointer is encountered.
Hash File Organization: Hash file organization helps us locate records very fast, given a search key value; for example, "find the sailor record for SAM" if the file is hashed on the name field. In hashed files, the pages are grouped into buckets. Every bucket has a bucket number, which allows us to find the primary page for that bucket. The bucket to which a record belongs is determined by applying a hash function to its search key field(s); the record is inserted into the appropriate bucket, and overflow pages for a bucket are maintained in a linked list. For searching a record with a given search key value, apply the hash function to identify the bucket to which such records belong and look there. This organization is called a static hashed file.
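A minimal sketch of a static hashed file with N buckets (the bucket count, hash function h(key) = key % N and the sample records are assumptions for illustration):

# Static hashed file sketch: N primary buckets, records placed by h(key) = key % N.
# Overflow is modelled simply by letting each bucket list grow.
N = 4
buckets = [[] for _ in range(N)]

def h(key):
    return key % N                        # the hash function picks the bucket

def insert(key, record):
    buckets[h(key)].append((key, record))

def search(key):
    # Only the one bucket identified by h(key) has to be examined.
    return [rec for k, rec in buckets[h(key)] if k == key]

for rno, name in [(15, "SAM"), (19, "RAVI"), (23, "DEVI")]:
    insert(rno, name)
print(search(19))     # -> ['RAVI']
print(search(99))     # -> []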
Records in a file may have multiple lengths; however, files of fixed-length records are easier to implement than files of variable-length records.
Fixed-length Records: As an example, consider a file of account records for our bank database, with fields Acc.no, branch name and balance. If each character occupies 1 byte and a real occupies 8 bytes, each record is 38 bytes long. The file might contain:
Acc.no     branch name     balance
Record 1   101   ongole        2000
Record 2   (the record deleted in the figures below)
Record 3   301   kandukur      4300
Record 4   102   ongole        2200
Record 5   222   chimakurthy   1200
Record 6   333   kandukur      2600
Record 7   343   kandukur      2200
There is a problem when we delete a record from this structure: the space occupied by the record to be deleted must be filled with some other record of the file, or we must have a way of marking deleted records so that they can be ignored. This is shown in the figures below.
[Figure: the account file of fixed-length records as stored on disk, before the deletion.]
Acc.no     branch name     balance
Record 7   343   kandukur      2200
Record 3   301   kandukur      4300
Record 4   102   ongole        2200
Record 5   222   chimakurthy   1200
Record 6   333   kandukur      2600
(The file after a record has been deleted; the freed space has been filled by another record of the file.)
An alternative is to keep a file header that points to the first deleted record; each deleted record stores a pointer to the next deleted record, so the deleted slots form a free list. On insertion of a new record, we use the record pointed to by the header and change the header pointer to point to the next available (deleted) record. If no such space is available, we add the new record to the end of the file.
Thus, insertion and deletion for files of fixed-length records are simple to implement, because the space made available by a deleted record is exactly the space needed to insert a record. (A free-list sketch follows.)
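A minimal sketch of the free-list idea for fixed-length records, assuming in-memory slots and the account values from the table above (the class and method names are illustrative):

# Fixed-length record file with a free list: the header stores the slot number of
# the first deleted record, and each deleted slot stores the next free slot.
class FixedLengthFile:
    def __init__(self):
        self.slots = []          # each slot holds a record tuple or a free-list link
        self.free_head = None    # header: first deleted slot, or None

    def delete(self, slot):
        self.slots[slot] = ("FREE", self.free_head)   # chain the freed slot
        self.free_head = slot                         # header now points to it

    def insert(self, record):
        if self.free_head is not None:                # reuse a deleted slot
            slot = self.free_head
            self.free_head = self.slots[slot][1]      # advance header to next free slot
            self.slots[slot] = record
        else:                                         # no free slot: append at the end
            slot = len(self.slots)
            self.slots.append(record)
        return slot

f = FixedLengthFile()
for rec in [(101, "ongole", 2000), (102, "ongole", 2200), (301, "kandukur", 4300)]:
    f.insert(rec)
f.delete(1)                                  # the freed slot joins the free list
print(f.insert((343, "kandukur", 2200)))     # -> 1 (the freed slot is reused)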
Variable-Length Records (slotted page structure): There is a header at the beginning of each block, containing the following information:
1) The number of record entries in the header.
2) The end of free space in the block.
3) An array whose entries contain the location and size of each record.
The actual records are allocated contiguously in the block, starting from the end of the block. The free space in the block is contiguous, b/w the final entry in the header array and the first record. If a record is inserted, space is allocated for it at the end of the free space, and an entry containing its size and location is added to the header.
If a record is deleted, the space that it occupies is freed and its header entry is marked as deleted. The records in the block before the deleted record can then be moved so that the free space again remains contiguous, with the header entries updated accordingly. (A small sketch follows.)
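A minimal sketch of the slotted-page layout just described, assuming a toy block size; compaction after deletion is omitted for brevity, and the SlottedPage name is an illustrative assumption:

# Slotted-page sketch: header holds the entry count, the end of free space and a
# (location, size) array; records grow from the end of the block.
class SlottedPage:
    def __init__(self, size=64):
        self.block = bytearray(size)
        self.free_end = size          # records are placed just before this offset
        self.slots = []               # header array of (location, size); size -1 = deleted

    def insert(self, data: bytes):
        loc = self.free_end - len(data)
        if loc < 0:
            raise ValueError("block full")
        self.block[loc:self.free_end] = data      # record goes at the end of free space
        self.free_end = loc
        self.slots.append((loc, len(data)))       # new header entry
        return len(self.slots) - 1                # the slot number acts as the record's id

    def delete(self, slot):
        loc, _ = self.slots[slot]
        self.slots[slot] = (loc, -1)              # mark the header entry as deleted

    def read(self, slot):
        loc, size = self.slots[slot]
        return None if size < 0 else bytes(self.block[loc:loc + size])

p = SlottedPage()
a = p.insert(b"101 ongole 2000")
b = p.insert(b"301 kandukur 4300")
p.delete(a)
print(p.read(b))     # -> b'301 kandukur 4300'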
Fixed-Length Representation: The fixed-length representation is another way to implement variable-length records efficiently in a file system: use one or more fixed-length records to represent one variable-length record.
There are two ways to do this: 1) Reserved space 2) List representation.
1) Reserved Space: If there is a maximum record length that is never exceeded, we can use fixed-length records of that length. Unused space is filled with a special null, or end-of-record, symbol. This is shown in figure (1).
2) List Representation: We can represent variable-length records by lists of fixed-length records, chained together by pointers. This is shown in fig (2).
The reserved-space method is useful when most records have a length close to the maximum.
The disadvantage of the list representation is that we waste space in all records except the first in a chain: only the first record needs the branch_name field, but the subsequent records in the chain still reserve space for it. To reduce this waste, an overflow block can be used.
Overflow block: a block which contains records other than those that are the first records of a chain. Thus, all records within a block have the same length, even though not all records in the file have the same length.
Types of Indexing: Indexes can improve the performance of the DBMS; using an index, the desired record can be located directly without scanning every record in the file.
An index can be defined as a data structure that allows faster retrieval of data. Each index is based on a certain attribute of the file, called its 'search key'.
An index can refer to the data based on several search keys. The term data entry refers to the records stored in an index file. A data entry can be:
1) the search key with the actual record, 2) the search key with a record id, 3) the search key with a list of record ids.
A file organization in which the records are stored and referred to in the same way as the data entries in the index is called "clustered".
A file organization in which the records are stored and referred to in a different way from the data entries in the index is called "un-clustered".
Differences b/w clustered and un-clustered:
In a clustered file organization the data records are stored in the same order as the data entries in the index, whereas in an un-clustered organization the data records are stored in a different order from the data entries in the index.
A clustered index is an index which uses alternative (1), whereas an un-clustered index uses alternative (2) or (3).
A clustered index needs to refer to only a few pages when retrieving the required records, whereas an un-clustered index, whose search key specifies an order different from the sequential order of the file, may need to read a separate page for each qualifying record.
Primary Index: An index on a set of fields that includes the primary key is called a primary index. A primary index is an ordered file whose records are of fixed length with two fields: the first field is the same as the primary key of the data file, and the second field is a pointer to a disk block.
There is one index entry in the index file for each block in the data file. Each index entry has the value of the primary key field of the first record in a block and a pointer to that block as its two field values. The two field values of index entry i are referred to as key[i] and pointer[i].
Primary indexes are further divided into dense index and sparse index.
1) Dense Index: An index record appears for every search key value in the file. The index record
contains the search-key value and a pointer to the first record with that search-key.
2) Sparse Index: An index record is created for only some of the search-key values. As in dense indexes, each index record contains a search-key value and a pointer to the first data record with that search-key value. To locate a record, we find the index entry with the largest search-key value that is less than or equal to the search-key value for which we are looking. We start at the record pointed to by that index entry and follow the pointers in the file until we find the desired record.
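A minimal sketch of a sparse-index lookup, assuming a toy data file of account blocks sorted on the account number (the block contents and the find function are illustrative):

# Sparse primary index sketch: one index entry per block, holding the search-key
# value of the block's first record and a pointer to (here, the number of) the block.
import bisect

blocks = [                                   # data file, sorted on account number
    [(101, 2000), (102, 2200)],
    [(222, 1200), (301, 4300)],
    [(333, 2600), (343, 2200)],
]
sparse_index = [(blk[0][0], i) for i, blk in enumerate(blocks)]   # (first key, block no)

def find(key):
    # Locate the last index entry whose key is <= the key we are looking for ...
    first_keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(first_keys, key) - 1
    if i < 0:
        return None
    # ... then scan forward inside that block.
    for k, balance in blocks[sparse_index[i][1]]:
        if k == key:
            return balance
    return None

print(find(301))   # -> 4300
print(find(200))   # -> None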
Secondary Index: An index that is not a primary index is called a secondary index; that is, an index on a set of fields that does not include the primary key is called a secondary index. A secondary index on a candidate key looks just like a dense primary index, except that the records pointed to by successive values in the index are not stored sequentially. In general, secondary indices are different from primary indices. If the search key of a primary index is not a candidate key, it suffices for the index to point to the first record with a particular value for the search key, since the other records can be fetched by a sequential scan of the file.
A secondary index must contain pointers to all the records, because if the search key of a secondary index is not a candidate key, it is not enough to point to just the first record with each search-key value; the records are ordered by the search key of the primary index, so records with the same secondary search-key value could be anywhere in the file.
[Figure: a secondary index on the balance field of the account file, using an extra level of indirection; buckets of record pointers group the accounts with the same balance (e.g. 4000, 3400, 2240, 4500, 4050, 4200).]
The above fig. shows the structure of a secondary index that uses an extra level of indirection on the account file, on the search key balance.
A sequential scan in primary index order is efficient because the records in the file are stored physically in the same order as the index order. We cannot store a file physically ordered both by the search key of the primary index and by the search key of a secondary index. Because secondary-key order and physical-key order differ, if we attempt to scan the file sequentially in secondary-key order, the reading of each record is likely to require the reading of a new block from disk. If a secondary index were to store only some of the search-key values, records with intermediate search-key values might be anywhere in the file and, in general, we could not find them without searching the entire file. Secondary indices must therefore be dense, with an index entry for every search-key value and a pointer to every record in the file; they cannot be sparse.
Secondary indices improve the performance of queries that use keys other than the search key of the primary index. However, they also impose a significant overhead on modification of the database. The designer of a database decides which secondary indices are desirable on the basis of an estimate of the relative frequency of queries and modifications.
Index Data Structures: The data entries of an index can be organized in two ways:
1) Hash-based indexing. 2) Tree-based indexing.
1) Hash-Based Indexing: This type of indexing is used to find records quickly, given a search key value.
In this technique, the file records are grouped into buckets of pages. Each bucket consists of a primary page and, possibly, additional pages chained together. In order to determine the bucket for a record, a special function called a hash function is applied to its search key. Given a bucket number, we can obtain the primary page in one or more disk I/O operations.
Record Insertion into the Bucket: Records are inserted into the bucket identified by the hash function, allocating "overflow" pages as needed.
Record Searching: The hash function is used to find, first, the bucket containing the records; then, by scanning the pages in that bucket, the record with a given search key can be found.
If the search condition does not specify the search key value, all the pages in the file need to be scanned.
Record Retrieval: By applying the hash function to the record's search key, the page containing the needed record can be identified and retrieved in one disk I/O.
Consider a student file hashed on the key rno. Applying the hash function to an rno gives the page that contains the needed record. The hash function 'h' uses the last two digits of the binary value of the rno as the bucket identifier. A search key index on marks obtained, i.e. mrks, contains <mrks, rid> pairs as data entries in an auxiliary index file, as shown in the fig. The rid (record id) points to the record whose search key value is mrks.
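Following this example, a minimal sketch of a hash function that uses the last two binary digits of rno as the bucket identifier (the student records are made-up values for illustration):

# Bucket id = last two bits of the binary value of rno, giving 4 buckets.
buckets = {0: [], 1: [], 2: [], 3: []}

def h(rno):
    return rno & 0b11            # last two binary digits of rno

def insert(rno, name, mrks):
    buckets[h(rno)].append((rno, name, mrks))

def find(rno):
    return [rec for rec in buckets[h(rno)] if rec[0] == rno]

for rno, name, mrks in [(15, "A", 90), (18, "B", 72), (22, "C", 65)]:
    insert(rno, name, mrks)
print(h(15), find(15))    # -> 3 [(15, 'A', 90)]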
2) Tree-based Indexing: In tree-based indexing the data entries are arranged in a tree-like structure. The data entries are sorted according to the search key values, and they are arranged in a hierarchy that directs the search to the correct page of data entries.
Examples:
1) Consider student records with search key rno arranged in a tree-structured index. To retrieve a desired record, the nodes on the path from the root (for example A'1, B'1 and a leaf such as L'11, L'12 or L'13) have to be read, and each node read needs one disk I/O.
The lowest (leaf) level contains the data records. Additional records with rno < 19 are added to the left of leaf node L'11, and records with rno > 42 to the right of leaf node L'13.
The root node is where a search starts; the search is then directed to the correct leaf page by the non-leaf pages, which contain node pointers separated by search key values. The data entries with key values smaller than a key value ki are found in the subtree reached through the node pointer to the left of ki, as shown in fig.
2) In order to find the students whose roll numbers lie b/w 19 and 24, the direction of the search is shown in the fig.
Suppose we want to find all the students with roll numbers lying b/w 17 and 40. We first direct the search to the node A'1 and, after analyzing its contents, forward the search to B'1, followed by the leaf node L'11, which actually contains qualifying data entries. The other leaf nodes L'12 and L'13 also contain data entries that fulfil our search criteria. For this, all the leaf pages must be maintained as a doubly linked list.
Thus, L'12 can be fetched using the next pointer on L'11, and L'13 can be obtained using the next pointer on L'12.
Number of disk I/Os = length of the path from the root to a leaf (incurred in the search) + the number of leaf pages containing qualifying data entries. (A simplified sketch of this leaf-level range scan follows.)
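A simplified sketch of the range search over chained leaf pages; the leaf contents, separator keys and names here are illustrative assumptions, not the exact nodes of the figure:

# Simplified tree-index range search: non-leaf level holds separator keys, leaf
# pages are chained by 'next' pointers so a range scan walks sibling leaves.
import bisect

class Leaf:
    def __init__(self, entries):
        self.entries = entries     # sorted (rno, rid) data entries
        self.next = None           # pointer to the right sibling leaf

leaves = [Leaf([(10, 'r10'), (17, 'r17')]),
          Leaf([(19, 'r19'), (24, 'r24')]),
          Leaf([(33, 'r33'), (40, 'r40'), (42, 'r42')])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
separators = [19, 33]              # smallest key of each leaf after the first

def range_search(low, high):
    leaf = leaves[bisect.bisect_right(separators, low)]   # descend to the first leaf
    out = []
    while leaf is not None:                               # then follow next pointers
        for key, rid in leaf.entries:
            if key > high:
                return out
            if key >= low:
                out.append((key, rid))
        leaf = leaf.next
    return out

print(range_search(17, 40))
# -> [(17, 'r17'), (19, 'r19'), (24, 'r24'), (33, 'r33'), (40, 'r40')]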
Closed and Open Hashing: File organization based on the technique of hashing allows us to avoid accessing an index structure; at the same time, hashing also provides a way of constructing indexes. There are two types of hashing techniques:
1) Static / open hashing. 2) Dynamic / closed hashing.
In a hash file organization, we obtain the address of the disk block containing a desired record directly by computing a function on the search key value of the record. In our description of hashing, we shall use the term bucket to denote a unit of storage that can store one or more records. A bucket is typically a disk block, but could be chosen to be smaller or larger than a disk block.
Static Hashing: Static hashing obtains the address of the disk block containing a desired record directly by computing a hash function on the search key value of the record. In static hashing, the number of buckets is static (fixed). The static hashing scheme is illustrated in the fig.
The pages containing the index data can be viewed as a collection of buckets, with one primary page and possibly additional overflow pages per bucket. A file consists of buckets 0 through N - 1, for N buckets. Buckets contain data entries, which can be any of the three choices: k*, <k, rid> pair, <k, rid-list> pair.
To search for a data entry, we apply a hash function 'h' to identify the bucket to which it belongs and then search this bucket.
To insert a data entry, we use the hash function to identify the correct bucket and then put the data entry there. If there is no space for this data entry, we allocate a new overflow page, put the data entry in it, and add the page to the overflow chain of the bucket.
To delete a data entry, we use the hash function to identify the correct bucket, locate the data entry by searching the bucket and then remove it. If this data entry is the last in an overflow page, the overflow page is removed from the overflow chain and added to a list of free pages.
Thus, the number of buckets in a static hashing file is known when the file is created, and the pages can be stored as successive disk pages.
The main problem with static hashing is that the number of buckets is fixed.
If a file shrinks greatly, a lot of space is wasted.
If a file grows a lot, long overflow chains develop, resulting in poor performance.
Dynamic Hashing: The dynamic hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinkage of the database, because most databases grow larger over time and static hashing techniques present serious problems in dealing with them.
Thus, if we are using static hashing on such growing databases, we have three options:
1) Choose a hash function based on the current file size. This option will result in performance
degradation as the database grows.
2) Choose a hash function based on predicted size of the file for future. This option will result in the
wastage of space.
3) Periodically reorganize the hash structure in response to file growth.
Thus, using a dynamic hashing technique is the best solution. There are two types:
1. Extensible Hashing Scheme: Uses a directory to support inserts and deletes efficiently, with no overflow pages.
2. Linear Hashing Scheme: Uses a clever policy for creating new buckets and supports inserts and deletes efficiently without the use of a directory.
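A rough sketch of the extensible hashing idea: a directory of bucket pointers is doubled when a full bucket with maximal local depth must split. The class names, the tiny bucket capacity and the use of Python's built-in hash are illustrative assumptions, not the scheme's exact on-disk layout:

# Extensible hashing sketch: directory indexed by the low global_depth bits of h(key).
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        b0, b1 = Bucket(1), Bucket(1)
        self.directory = [b0, b1]

    def _index(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        # Bucket overflows: double the directory if needed, then split the bucket.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory * 2
            self.global_depth += 1
        self._split(bucket)
        self.insert(key)                   # retry after the split

    def _split(self, bucket):
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        old_keys, bucket.keys = bucket.keys, []
        # Re-point directory entries whose new distinguishing bit is 1 to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = sibling
        for k in old_keys:                 # redistribute the old keys
            self.directory[self._index(k)].keys.append(k)

h = ExtendibleHash()
for k in [4, 12, 20, 5, 13, 9, 7]:
    h.insert(k)
print(h.global_depth, h.search(13), h.search(99))   # e.g. -> 4 True False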
Comparison of File Organizations: To compare file organizations, we consider the following operations that can be performed on a record:
1) Record Insertion: For inserting a record we need to identify and fetch the relevant page from the disk. The record is then added and the (modified) page is written back to the disk.
2) Record Deletion: It follows the same procedure as record insertion, except that after identifying and fetching the page, the record with the given rid is deleted and the changed page is written back to the disk.
3) Record Scanning: In this, all the file pages must be fetched from the disk and stored in a pool of buffers; the required records are then retrieved.
4) Record Searching Based on Equality Selection: In this, all the records that satisfy a given equality selection criterion are fetched from the disk.
Example: fetch the student record whose roll number (rno) is 15 and whose marks (mrks) are 90.
5) Record Searching Based on Range Selection: In this, all the records that satisfy a given range selection are fetched.
Example: find all the records of the students whose secured marks are greater than 50.
We estimate these costs in terms of the following parameters:
B = the number of pages in the file when the records are grouped into pages with no wasted space.
R = the number of records per page.
D = the average time needed to read or write (R/W) a disk page.
C = the average time needed to process a record.
H = the time needed to apply the hash function to a record.
For calculating I/O costs (which are the basis for the costs of the database operations) we take D = 15 ms and C = H = 100 ns.
Heap Files:
1) Cost of Scanning: The cost of scanning a heap file is B(D + RC). Scanning the R records of each of the B pages at time C per record takes BRC, and reading the B pages at time D per page takes BD; therefore the total cost of scanning is BD + BRC = B(D + RC).
2) Cost of Insertion: The cost to insert a record into a heap file is 2D + C. To insert a record, first we fetch the last page of the file, which takes time D; then we add the record, which takes time C; finally the page is written back to disk from main memory, which takes time D. So, the total cost is D + D + C = 2D + C.
3) Cost of Deletion: The cost to delete a record from a heap file is D + C + D = 2D + C. To delete a record, first the page containing it is fetched (time D), then the record is removed from the page (time C), and finally the modified page is written back to disk (time D).
4) Record Searching Based on an Equality Criterion: If exactly one record matches the equality selection, then on average half the file must be scanned before the record is found.
This takes time = 1/2 x scanning cost = 1/2 x B(D + RC).
If multiple records may match, the entire file needs to be scanned.
5) Record Searching with a Range Selection: The cost is the same as the cost of scanning, because it is not known in advance how many records satisfy the particular range. Thus, we need to scan the entire file, which takes B(D + RC). (A quick numerical illustration of these heap-file formulas follows.)
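Plugging the stated parameters (D = 15 ms, C = H = 100 ns) into the heap-file formulas above; the values of B and R are made-up figures for illustration:

# Heap-file cost formulas evaluated for an illustrative file of
# B = 100 pages with R = 50 records per page.
D = 15e-3        # average time to read/write a disk page (seconds)
C = 100e-9       # average time to process a record (seconds)
B, R = 100, 50   # made-up file size

scan      = B * (D + R * C)          # B(D + RC)
insert    = 2 * D + C                # 2D + C
delete    = 2 * D + C                # D + C + D
equality  = 0.5 * B * (D + R * C)    # on average half the file is scanned
range_sel = B * (D + R * C)          # the whole file must be scanned

print("scan:", scan, "insert:", insert, "equality:", equality)
# roughly: scan ~ 1.5 s, insert ~ 0.03 s, equality ~ 0.75 s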
Sorted Files:
1) Cost of Scanning: The cost of scanning a sorted file is B(D + RC), because all the pages need to be scanned in order to retrieve all the records; i.e. the cost of scanning a sorted file = the cost of scanning a heap file.
2) Cost of Insertion: The cost of insertion in a sorted file is given by search cost + B(D + RC).
It includes finding the correct position of the record, adding the record, and fetching and rewriting the pages that follow it.
3) Cost of Deletion: The cost of deletion in a sorted file is given by search cost + B(D + RC).
It includes searching for the record, removing the record and rewriting the modified pages.
Note: The record to be deleted is specified by an equality condition.
4) Cost of Searching with an Equality Selection Criterion: For a sorted file this cost is D log2 B, the time required to perform a binary search for the page that contains the record.
If many records qualify, the cost is D log2 B + C log2 R + the cost of sequentially reading all the qualifying records.
5) Cost of Searching with a Range Selection: It is the same as an equality search with several qualifying records: the first matching record is located by binary search, and the remaining qualifying records are then sequentially fetched.
Clustered Files:
1) Cost of Scanning: The cost of scanning a clustered file is computed the same way as for a sorted file, except that the file occupies a larger number of pages (about 1.5B, since the pages are typically left about two-thirds full). Scanning the pages at time D per page and the R records of each page at time C per record gives a total cost of 1.5B(D + RC).
2) Cost of Insertion: The cost of insertion in a clustered file is search + write = (D logF 1.5B + C log2 R) + D.
3) Cost of Deletion: It is the same as the cost of insertion and includes the cost of searching for the record, removing the record and rewriting the modified page, i.e. D logF 1.5B + C log2 R + D.
4) Equality Selection Search:
i) For a single qualifying record: The cost of finding a single qualifying record in a clustered file is the sum of the cost of finding the first page, D logF 1.5B, and the cost of finding the first matching record within the page, C log2 R, i.e. D logF 1.5B + C log2 R.
ii) For several qualifying records: If more than one record satisfies the selection criterion, the qualifying records are assumed to be located consecutively.
The cost required to find them is D logF 1.5B + C log2 R + the cost involved in sequentially reading all the qualifying records.
5) Range Selection Search: This cost is the same as that of an equality search with several matching records.
Heap File with Un-clustered Tree Index:
1) Scanning: For scanning a student file,
i) scan the index's leaf level,
ii) get the relevant record from the file for each data entry,
iii) obtain the data records sorted according to <rno, mrks>.
The cost of reading all the data entries is 0.15B(D + 6.7RC) I/Os; in addition, for each data entry a record has to be fetched from the file in one I/O.
2) Insertion: The record is first inserted in the students heap file at cost 2D + C, together with the associated entry in the index: the correct leaf page can be found in D logF 0.15B + C log2 6.7R, followed by the addition of the new entry and rewriting of the page in D.
5) Range Selection Search: This is the same as a range selection search in clustered files, except that the index pages contain data entries rather than data records, so the corresponding record must be fetched for each qualifying data entry.
Heap File with Un-clustered Hash Index:
1) Scan: The total cost is the sum of the cost of retrieving all data entries and one I/O cost for each data record. It is given as 0.125B(D + BRC) + BR(D + C).
2) Insertion: It involves the cost of inserting a record, i.e. 2D + C, in the heap file, plus the cost of finding the page, adding a new entry and rewriting the page; it is expressed as 2D + C + (H + 2D + C).
3) Deletion: It involves the cost of finding the data record and the data entry at H + 2D + 4RC and writing back the changed pages to the index and the file at 2D. The total cost is (H + 2D + 4RC) + 2D.
4) Equality Selection Search: The total cost of the search accounts for:
i) The page containing the qualifying entries is identified at cost H.
ii) Retrieval of that page, assuming it is the only page in the bucket, occurs at D.
iii) The cost of finding an entry after scanning half the records on the page is 4RC.
iv) Fetching the record from the file is D.
The total cost is H + D + 4RC + D = (H + 2D + 4RC).
In case of many matched records the cost is H + D + 4RC + one I/O for each record that qualifies.
5) Range Selection Search: The cost of this is B(D + RC).
File Organization                       Advantages                                        Disadvantages
1) Heap file                            Good storage efficiency; rapid scanning;          Slow searches; slow deletion.
                                        insertion is fast.
2) Sorted file                          Good storage efficiency; search is faster          Insertion is slow; slow deletion.
                                        than in a heap file.
3) Clustered file                       Good storage efficiency; fast searches;            Space overhead.
                                        efficient insertion and deletion.
4) Heap file with un-clustered          Fast insertion, deletion and searching.            Scanning and range searches are slow.
   tree index
5) Heap file with un-clustered          Fast insertion, deletion and searching.            Doesn't support range searches.
   hash index
Dangling Pointer: A dangling pointer is a pointer that does not point to a valid object of the appropriate type. Dangling pointers arise when an object is deleted or de-allocated without modifying the value of the pointer, so that the pointer still points to the memory location of the de-allocated object.
In an object-oriented database, a dangling pointer occurs if we move or delete a record to which another record contains a pointer; that pointer no longer points to the desired record.
Detecting Dangling Pointers in Object-Oriented Databases: Mapping objects to files is similar to mapping tuples to files in a relational system; object data can be stored using file structures. Objects are identified by an object identifier (OID), and the storage system needs a mechanism to locate an object given its OID.
Logical identifiers do not directly specify an object's physical location; the system must maintain an index that maps an OID to the object's actual location.
Physical identifiers encode the location of the object, so the object can be found directly. Physical OIDs have the following parts:
1) A volume or file identifier.
2) A page identifier within the volume or file.
3) An offset within the page.
Work Load Impact: Data entries that qualify under a particular selection criterion can be retrieved effectively by means of indexes. Two selection types are:
1) Equality. 2) Range selection.
1) Equality: An equality query on a composite search key is one in which each field of the search key is bound to a constant.
For example, data entries in a student file where rno = 15 and mrks = 90 can be retrieved by using an equality query.
2) Range Selection: A range query retrieves all data entries whose search key values fall within a given range, for example all students whose mrks are greater than 50.
Thus, tree-based indexing supports both selection criteria (equality and range), as well as inserts, deletes and updates, whereas hash-based indexing supports only equality selection, apart from insertion, deletion and updation.
Disadvantages: When the file pages are stored in accordance with the disk's order, sequential retrieval of those pages is quick; with tree-structured indexes the pages need not lie in sequential order on disk, so such fast sequential retrieval is not always possible.