Database Management Systems (R15a0509)
UNIT I:
Database System Applications, Purpose of Database Systems, View of Data – Data Abstraction –
Instances and Schemas – Data Models – the ER Model – Relational Model – Other Models –
Database Languages – DDL – DML – Database Access for Application Programs – Database
Users and Administrator – Transaction Management – Database Architecture – Storage Manager
– the Query Processor.
Introduction to the Relational Model – Structure – Database Schema, Keys – Schema Diagrams.
Database Design and ER Diagrams – ER Model – Entities, Attributes and Entity Sets –
Relationships and Relationship Sets – ER Design Issues – Conceptual Design with relevant
Examples. Relational Query Languages, Relational Operations.
UNIT II:
Relational Algebra – Selection and projection set operations – renaming – Joins – Division –
Examples of Algebra overviews – Relational calculus – Tuple Relational Calculus (TRC) –
Domain relational calculus (DRC).
Overview of the SQL Query Language – Basic Structure of SQL Queries, Set Operations,
Aggregate Functions – GROUP BY – HAVING, Nested Subqueries, Views, Triggers,
Procedures.
UNIT III:
Normalization – Introduction, Non-loss Decomposition and Functional Dependencies, First,
Second, and Third Normal Forms – Dependency Preservation, Boyce/Codd Normal Form.
Higher Normal Forms – Introduction, Multi-valued Dependencies and Fourth Normal Form, Join
Dependencies and Fifth Normal Form.
UNIT IV:
Transaction Concept – Transaction State – Implementation of Atomicity and Durability –
Concurrent Executions – Serializability – Recoverability – Implementation of Isolation – Testing
for Serializability – Lock-Based Protocols – Timestamp-Based Protocols – Validation-Based
Protocols – Multiple Granularity.
UNIT V:
Recovery and Atomicity – Log-Based Recovery – Recovery with Concurrent Transactions –
Checkpoints – Buffer Management – Failure with Loss of Nonvolatile Storage – Advanced
Recovery Systems – ARIES Algorithm, Remote Backup Systems.
File Organization – Various Kinds of Indexes – B+ Trees – Query Processing – Relational Query
Optimization.
TEXT BOOKS:
1. Database System Concepts, Silberschatz, Korth, McGraw-Hill, Sixth Edition (all units
except Unit III).
2. Database Management Systems, Raghu Ramakrishnan, Johannes Gehrke, Tata
McGraw-Hill, 3rd Edition.
REFERENCE BOOKS:
1. Fundamentals of Database Systems, Elmasri, Navathe, Pearson Education.
2. An Introduction to Database Systems, C.J. Date, A. Kannan, S. Swamynathan, Pearson,
Eighth Edition (for Unit III).
Outcomes:
Demonstrate the basic elements of a relational database management system.
Ability to identify the data models for relevant problems.
Ability to design entity-relationship diagrams, convert them into relational tables, and
formulate SQL queries on the respective data.
Apply normalization for the development of application software.
MALLA REDDY COLLEGE OF ENGINEERING & TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY
INDEX
3    I     DATABASE MODELS
4    I     DATABASE LANGUAGES
5    I     DATABASE ARCHITECTURE
6    II    RELATIONAL ALGEBRA AND CALCULUS
7    II    SQL
8    II    AGGREGATE FUNCTIONS
9    II    NESTED QUERIES
10   II    VIEWS
15   IV    LOCK-BASED PROTOCOLS
16   IV    MULTIPLE GRANULARITY
19   V     FILE ORGANIZATION
20   V     B+ TREES
UNIT-1
Introduction to Database Management System
A database is a collection of inter-related data which is used to retrieve, insert and delete data efficiently.
It is also used to organize the data in the form of tables, schemas, views, reports, etc.
For example: a college database organizes the data about the admin, staff, students, faculty, etc.
Using the database, you can easily retrieve, insert, and delete the information.
Characteristics of DBMS
o It uses a digital repository established on a server to store and manage the information.
o It can provide a clear and logical view of the process that manipulates data.
o DBMS contains automatic backup and recovery procedures.
o It contains ACID properties which maintain data in a healthy state in case of failure.
o It can reduce the complex relationship between data.
o It is used to support manipulation and processing of data.
o It is used to provide security of data.
o It can present the database from different viewpoints according to the requirements of the user.
Advantages of DBMS
o Controls database redundancy: It can control data redundancy because it stores all the data in one
single database file, and that recorded data is placed in the database.
o Data sharing: In DBMS, the authorized users of an organization can share the data among multiple
users.
o Easy maintenance: It can be easily maintained due to the centralized nature of the database system.
o Reduced time: It reduces development time and maintenance needs.
o Backup: It provides backup and recovery subsystems which create automatic backups of data
in case of hardware and software failures and restore the data if required.
Disadvantages of DBMS
o Cost of hardware and software: It requires a high-speed data processor and a large memory size to
run DBMS software.
o Size: It occupies a large amount of disk space and memory to run efficiently.
o Complexity: A database system creates additional complexity and requirements.
o Higher impact of failure: A failure has a high impact on the database because, in most organizations,
all the data is stored in a single database; if the database is damaged due to an electrical failure or
database corruption, the data may be lost forever.
VIEW OF DATA – DATA ABSTRACTION
Abstraction is one of the main features of database systems. Hiding irrelevant details from users and
providing an abstract view of data helps in easy and efficient user-database interaction. DBMS
architecture is commonly described in three levels; the top level of that architecture is the
"view level". The view level provides the "view of data" to the users and hides irrelevant details such
as data relationships, database schema, constraints, security, etc. from the user.
Database systems are made up of complex data structures. To ease user interaction with the database,
developers hide internal, irrelevant details from users. This process of hiding irrelevant details from users is
called data abstraction.
Physical level: This is the lowest level of data abstraction. It describes how the data is actually stored in the
database. You can get the complex data structure details at this level.
Logical level: This is the middle level of the 3-level data abstraction architecture. It describes what data is
stored in the database.
View level: This is the highest level of data abstraction. It describes the user's interaction with the database system.
Example: Let's say we are storing customer information in a customer table. At the physical level, these
records are described as blocks of storage (bytes, gigabytes, terabytes, etc.). These details are often
hidden from the programmers.
At the logical level, these records are described as fields and attributes along with their data types, and the
relationships among them can be logically implemented. Programmers generally work at this level because
they are aware of such details of database systems.
At the view level, users just interact with the system with the help of a GUI and enter the details on the
screen; they are not aware of how the data is stored or what data is stored, as such details are hidden from them.
Definition of schema: The design of a database is called the schema. A schema is of three types: physical
schema, logical schema and view schema.
For example: consider a schema that shows the relationship between three tables: Course, Student and
Section. The schema only shows the design of the database; it doesn't show the data present in those tables.
A schema is only a structural view (design) of a database.
DBMS Instance
Definition of instance: The data stored in the database at a particular moment of time is called an instance of
the database. The database schema defines the variable declarations in the tables that belong to a particular
database; the values of these variables at a moment of time form the instance of that database.
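To make the distinction concrete, here is a minimal SQL sketch (a hypothetical Student table; names and types are illustrative only): the CREATE TABLE statement fixes the schema, while the rows present at any moment form the instance.
CREATE TABLE Student (        -- schema: the variable declarations
    roll_no INT PRIMARY KEY,
    name    VARCHAR(50)
);
INSERT INTO Student VALUES (1, 'Ravi');   -- after this, the instance is one row
INSERT INTO Student VALUES (2, 'Asha');   -- inserts change the instance, not the schema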
A data model is the logical structure of a database. It describes the design of the database to reflect entities,
attributes, relationships among data, constraints, etc.
Object based logical Models – Describe data at the conceptual and view levels.
1. E-R Model
2. Object oriented Model
Record based logical Models – Like object-based models, they also describe data at the conceptual and view
levels. These models specify the logical structure of the database in terms of records, fields and attributes.
1. Relational Model
2. Hierarchical Model
3. Network Model – The network model is the same as the hierarchical model except that it has a graph-like
structure rather than a tree-based structure. Unlike the hierarchical model, this model allows each record
to have more than one parent record.
Physical Data Models – These models describe data at the lowest level of abstraction.
Components of the ER Model
1. Entity
Weak Entity:
An entity that cannot be uniquely identified by its own attributes and relies on a relationship with another
entity is called a weak entity. A weak entity is represented by a double rectangle. For example, a bank
account cannot be uniquely identified without knowing the bank to which the account belongs, so a bank
account is a weak entity.
2. Attribute
An attribute describes a property of an entity. An attribute is represented as an oval in an ER diagram. There
are four types of attributes:
1. Key attribute
2. Composite attribute
3. Multivalued attribute
4. Derived attribute
3. Relationship
A relationship is represented by a diamond shape in an ER diagram; it shows the relationship among entities.
There are four types of relationships:
1. One to One
2. One to Many
3. Many to One
4. Many to Many
Object-Based Data Model. Object-oriented programming (especially in Java, C++, or C#) has become
the dominant software-development methodology. This led to the development of an object-oriented data
model that can be seen as extending the E-R model with notions of encapsulation, methods (functions),
and object identity.
Semi-structured Data Model. The semi-structured data model permits the specification of data where
individual data items of the same type may have different sets of attributes. This is in contrast to the data
models mentioned earlier, where every data item of a particular type must have the same set of attributes.
The Extensible Markup Language (XML) is widely used to represent semi-structured data.
Historically, the network data model and the hierarchical data model preceded the relational data model.
These models were tied closely to the underlying implementation, and complicated the task of modeling data.
As a result they are used little now, except in old database code that is still in service in some places.
o A DBMS has appropriate languages and interfaces to express database queries and updates.
o Database languages can be used to read, store and update the data in the database.
DDL stands for Data Definition Language. DDL commands (such as CREATE, ALTER, DROP and
TRUNCATE) are used to define and update the database schema, which is why they come under the Data
Definition Language.
DML stands for Data Manipulation Language. It is used for accessing and manipulating data in a database. It
handles user requests.
DCL stands for Data Control Language. It is used to grant and take back authority over database objects;
its main commands are GRANT and REVOKE. (But in the Oracle database, the execution of Data Control
Language statements cannot be rolled back.)
TCL stands for Transaction Control Language. It is used to manage the changes made by DML statements;
TCL commands such as COMMIT, ROLLBACK and SAVEPOINT group DML operations into logical transactions.
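As a hedged illustration, one statement from each language category (the emp table and the clerk user are hypothetical):
CREATE TABLE emp (id INT PRIMARY KEY, name VARCHAR(50), salary DECIMAL(10,2));  -- DDL: defines schema
INSERT INTO emp VALUES (1, 'Kiran', 45000);                                     -- DML: manipulates data
GRANT SELECT ON emp TO clerk;                                                   -- DCL: grants authority
COMMIT;                                                                         -- TCL: makes the changes permanent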
Naive users: They are unsophisticated users who interact with the system by using existing application
programs. Examples: an online library management system, ATMs (Automated Teller Machines), etc.
Application programmers: They are the computer professionals who interact with system through
DML. They write application programs.
Sophisticated users: They interact with the system by writing SQL queries directly through the query
processor without writing application programs.
Specialized users: They are also sophisticated users who write specialized database applications that do
not fit into the traditional data processing framework. Example: Expert System, Knowledge Based
System, etc.
A database transaction is a logical unit of processing in a DBMS which entails one or more database access
operations. In a nutshell, database transactions represent real-world events of any enterprise. All database
access operations held between the begin-transaction and end-transaction statements are considered a single
logical transaction in DBMS. During the transaction the database is inconsistent; only once the transaction is
committed does the database move from one consistent state to another.
ACID Properties are used for maintaining the integrity of database during transaction processing. ACID in
DBMS stands for Atomicity, Consistency, Isolation, and Durability.
Atomicity: A transaction is a single unit of operation. You either execute it entirely or do not execute it
at all; there cannot be partial execution.
Consistency: Once the transaction is executed, it should move the database from one consistent state to another.
Isolation: A transaction should be executed in isolation from other transactions: during concurrent
execution, intermediate results from simultaneously executing transactions should not be made available
to each other. (Isolation is commonly graded in levels 0 to 3.)
Durability: After successful completion of a transaction, the changes in the database should persist,
even in the case of system failures.
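A minimal sketch of these properties in SQL, assuming a hypothetical accounts table (transaction delimiters vary slightly across RDBMSs): atomicity guarantees that either both updates below take effect or neither does.
BEGIN;
UPDATE accounts SET balance = balance - 800 WHERE owner = 'X';
UPDATE accounts SET balance = balance + 800 WHERE owner = 'Y';
COMMIT;   -- on any failure before this point, ROLLBACK restores the prior consistent state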
A database system is partitioned into modules that deal with each of the responsibilities of the overall
system. The functional components of a database system can be broadly divided into the storage manager
and the query processor components. The storage manager is important because databases typically
require a large amount of storage space. The query processor is important because it helps the database
system simplify and facilitate access to data.
Storage Manager :
Storage Manager is a program that provides an interface between the data stored in the database and the
queries received. It is also known as Database Control System. It maintains the consistency and integrity of
the database by applying the constraints and executes the DCL statements. It is responsible for updating,
storing, deleting, and retrieving data in the database.
Query processor:
The query processor in a database management system receives as input a query request in the form of SQL
text, parses it, generates an execution plan, and completes the processing by executing the plan and returning
the results to the client.
Query processing is the activity performed in extracting data from the database. Query processing involves
several steps to fetch the data from the database:
1. Parsing and translation
2. Optimization
3. Evaluation
Entity
Anything that has an independent existence and about which we collect data. It is also known as an entity type.
Entity instance
Entity instance is a particular member of the entity type.
Example for entity instance : A particular employee
Regular Entity
An entity which has its own key attribute is a regular entity.
Example for regular entity : Employee.
Weak entity
An entity which depends on other entity for its existence and doesn't have any key attribute of its own is a
weak entity.
Example for a weak entity : In a parent/child relationship, a parent is considered as a strong entity and the
child is a weak entity.
Domain of Attributes
The set of possible values that an attribute can take is called the domain of the attribute. For example, the
attribute day may take any value from the set {Monday, Tuesday ... Friday}. Hence this set can be termed
as the domain of the attribute day.
Key attribute
The attribute (or combination of attributes) which is unique for every entity instance is called key
attribute.
E.g., the employee_id of an employee, the pan_card_number of a person, etc. If the key attribute consists of
two or more attributes in combination, it is called a composite key.
Simple attribute
If an attribute cannot be divided into simpler components, it is a simple attribute.
Example for simple attribute : employee_id of an employee.
Composite attribute
If an attribute can be split into simpler components, it is a composite attribute.
Example for composite attribute : name of the employee, which can be split into First_name, Middle_name,
and Last_name.
Single valued attribute
If an attribute can take only a single value for each entity instance, it is a single valued attribute.
Example for single valued attribute : age of a student. It can take only one value for a particular student.
Multi-valued attribute
If an attribute can take more than one value for each entity instance, it is a multi-valued attribute.
Example for multi-valued attribute : telephone number of an employee; a particular employee may have
multiple telephone numbers.
Stored Attribute
An attribute which needs to be stored permanently is a stored attribute.
Example for stored attribute : name of a student
Derived Attribute
An attribute which can be calculated or derived based on other attributes is a derived attribute.
Example for derived attribute : age of an employee, which can be calculated from the date of birth and the
current date. In ER modeling, a derived attribute is represented by a dashed oval.
Relationships
Associations between entities are called relationships.
Example : An employee works for an organization. Here "works for" is a relationship between the entities
employee and organization. In ER modeling, a relationship is represented by a diamond.
However, in ER modeling, to connect a weak entity with others, you should use a weak (identifying)
relationship, represented by a double diamond.
Relationship cardinalities specify how many of each entity type are allowed. Relationships can have four
possible connectivities, as given below.
One employee is assigned only one parking space, and one parking space is assigned to only
one employee. Hence it is a 1:1 relationship and the cardinality is One-To-One (1:1).
One organization can have many employees, but one employee works in only one organization.
Hence it is a 1:N relationship and the cardinality is One-To-Many (1:N).
Disadvantages
1. A physical design derived from an E-R model may have some amount of ambiguity or inconsistency.
2. Sometimes diagrams may lead to misinterpretations.
Relational Model
The relational model is today the primary data model for commercial data processing applications. It
attained its primary position because of its simplicity, which eases the job of the programmer, compared to
earlier data models such as the network model or the hierarchical model.
Structure of Relational Databases:
A relational database consists of a collection of tables, each of which is assigned a unique name. For
example, consider the instructor table of Figure 1.5, which stores information about instructors. The table
has four column headers: ID, name, dept name, and salary. Each row of this table records information
about an instructor, consisting of the instructor’s ID, name, dept name, and salary.
When we talk about a database, we must differentiate between the database schema, which is the logical
design of the database, and the database instance, which is a snapshot of the data in the database at a
given instant in time. The concept of a relation corresponds to the programming-language notion of a
variable, while the concept of a relation schema corresponds to the programming-language notion of a type
definition.
Keys
A superkey is a set of one or more attributes that, taken collectively, allow us to identify uniquely a tuple
in the relation. For example, the ID attribute of the relation instructor is sufficient to distinguish one
instructor tuple from another. Thus, ID is a superkey. The name attribute of instructor, on the other hand,
is not a superkey, because several instructors might have the same name.
A superkey may contain extraneous attributes. For example, the combination of ID and name is a superkey
for the relation instructor. If K is a superkey, then so is any superset of K. We are often interested in
superkeys for which no proper subset is a superkey. Such minimal superkeys are called candidate keys.
It is customary to list the primary key attributes of a relation schema before the other attributes; for
example, the dept name attribute of department is listed first, since it is the primary key. Primary key
attributes are also underlined. A relation, say r1, may include among its attributes the primary key of
another relation, say r2. This attribute is called a foreign key from r1, referencing r2.
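A hedged SQL sketch of how these key notions are declared (a simplified version of the university schema; column types are assumptions):
CREATE TABLE department (
    dept_name VARCHAR(20) PRIMARY KEY            -- primary key: the chosen candidate key
);
CREATE TABLE instructor (
    ID        VARCHAR(5) PRIMARY KEY,            -- ID is a candidate key (a minimal superkey)
    name      VARCHAR(20),                       -- name alone is not a superkey
    dept_name VARCHAR(20) REFERENCES department(dept_name),  -- foreign key referencing department
    salary    NUMERIC(8,2)
);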
Schema Diagrams
A database schema, along with primary key and foreign key dependencies, can be depicted by
schema diagrams. Figure 1.12 shows the schema diagram for our university organization.
RELATIONAL ALGEBRA
Relational algebra is a procedural query language, which takes instances of relations as input and yields
instances of relations as output. It uses operators to perform queries. An operator can be
either unary or binary. They accept relations as their input and yield relations as their output. Relational
algebra is performed recursively on a relation and intermediate results are also considered relations.
The fundamental operations of relational algebra are as follows −
Select
Project
Union
Set difference
Cartesian product
Rename
Union: Notation − r ∪ s, where r and s are either database relations or relation result sets (temporary
relations). It is defined as
r ∪ s = { t | t ∈ r or t ∈ s }
For a union operation to be valid, the following conditions must hold −
r and s must have the same number of attributes.
Attribute domains must be compatible.
Duplicate tuples are automatically eliminated.
Besides the fundamental operations, relational algebra also provides additional operations −
Set intersection
Assignment
Natural join
Joins
The join operation is one of the most useful operations in relational algebra and is the most commonly used
way to combine information from two or more relations. Although a join can be defined as a cross-product
followed by selections and projections, joins arise much more frequently in practice than plain
cross-products. Joins have received a lot of attention, and there are several variants of the join operation.
Condition Joins
The most general version of the join operation accepts a join condition c and a pair of relation instances as
arguments, and returns a relation instance. The join condition is identical to a selection condition in form.
The operation is defined as follows:
R ⊲⊳c S = σc(R × S)
Thus ⊲⊳ is defined to be a cross-product followed by a selection. Note that the condition c can (and
typically does) refer to attributes of both R and S.
Equijoin
A common special case of the join operation R ⊲⊳ S is when the join condition consists solely of
equalities (connected by ∧) of the form R.name1 = S.name2, that is, equalities between two fields in R and
S; this is called an equijoin. In this case, obviously, there is some redundancy in retaining both attributes
in the result.
A further special case of the join operation, the natural join, is an equijoin in which equalities are specified
on all fields having the same name in R and S. In this case, we can simply omit the join condition; the
default is that the join condition is a collection of equalities on all common fields.
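A hedged SQL rendering of these join flavors, using the Sailors and Reserves relations that appear in the examples later in this unit:
-- Condition join R ⊲⊳c S: a cross-product filtered by condition c (here an equality, so an equijoin)
SELECT S.sname, R.bid
FROM   Sailors S JOIN Reserves R ON S.sid = R.sid;
-- Natural join: equality implied on all common column names (here: sid)
SELECT sname, bid
FROM   Sailors NATURAL JOIN Reserves;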
Division
The division operator is useful for expressing certain kinds of queries, for example: “Find the names of
sailors who have reserved all boats.” Understanding how to use the basic operators of the algebra to define
division is a useful exercise.
RELATIONAL CALCULUS
o Relational calculus is a non-procedural query language. In a non-procedural query language, the user
is concerned with what results to obtain, not with the details of how to obtain them.
o The relational calculus tells what to do but never explains how to do it.
Notation: { T | P(T) }
Where T is the resulting tuple variable and P(T) is the condition used to fetch T.
For example: { T.name | AUTHOR(T) AND T.article = 'database' }
OUTPUT: This query selects the tuples from the AUTHOR relation. It returns a tuple with 'name' from
AUTHOR who has written an article on 'database'.
TRC (tuple relation calculus) can be quantified. In TRC, we can use Existential (∃) and Universal
Quantifiers (∀).
For example: { R | ∃T ∈ AUTHOR (T.article = 'database' AND R.name = T.name) }
Output: This query will yield the same result as the previous one.
Notation: { a1, a2, ..., an | P(a1, a2, ..., an) }
Where a1, a2, ..., an are attributes of the relation and P is the condition or formula built using those attributes.
For example: { <name> | <name, article> ∈ AUTHOR ∧ article = 'database' }
Rules:
o Structured Query Language is not case-sensitive. Generally, keywords of SQL are written in uppercase.
o Statements of SQL are not dependent on text lines; a single SQL statement can be placed on one or
multiple text lines.
o Using SQL statements, you can perform most of the actions in a database.
o SQL is based on tuple relational calculus and relational algebra.
SQL process:
o When an SQL command is executed for any RDBMS, the system figures out the best way to carry
out the request, and the SQL engine determines how to interpret the task.
o Various components are included in this process. These components can be an optimization engine,
query engine, query dispatcher, classic query engine, etc.
o All the non-SQL queries are handled by the classic query engine, but the SQL query engine won't
handle logical files.
Set Operation
The SQL Set operation is used to combine the two or more SQL SELECT statements.
1. Union
o The SQL Union operation is used to combine the result of two or more SQL SELECT queries.
o In the union operation, the number of columns and their data types must be the same in both the tables
on which the UNION operation is applied.
o The union operation eliminates duplicate rows from its result set.
Syntax:
SELECT column_list FROM table1
UNION
SELECT column_list FROM table2;
Aggregate Functions
o SQL aggregation function is used to perform the calculations on multiple rows of a single column
of a table. It returns a single value.
o It is also used to summarize the data.
1. COUNT FUNCTION
Syntax
COUNT(*)
or
COUNT( [ALL|DISTINCT] expression )
2. SUM Function
The SUM function is used to calculate the sum of the values of a selected column. It works on numeric fields only.
Syntax
SUM()
or
SUM( [ALL|DISTINCT] expression )
3. AVG function
The AVG function is used to calculate the average value of the numeric type. AVG function returns the
average of all non-Null values.
Syntax
AVG()
or
AVG( [ALL|DISTINCT] expression )
4. MAX Function
MAX function is used to find the maximum value of a certain column. This function determines the
largest value of all selected values of a column.
Syntax
MAX()
or
MAX( [ALL|DISTINCT] expression )
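A short hedged example combining these functions (a hypothetical employees table with a salary column):
SELECT COUNT(*), SUM(salary), AVG(salary), MAX(salary)
FROM employees;   -- returns a single summary row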
The GROUP BY statement groups rows that have the same values into summary rows, like "find the
number of customers in each country".
The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG)
to group the result-set by one or more columns.
Syntax
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
ORDER BY column_name(s);
Examples
The following SQL statement lists the number of customers in each country:
Example
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country;
HAVING Clause
o The HAVING clause was added to SQL because the WHERE keyword could not be used with aggregate
functions.
Syntax
SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);
Examples
The following SQL statement lists the number of customers in each country, including only countries with
more than 5 customers:
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country
HAVING COUNT(CustomerID) > 5;
A subquery can be nested inside other subqueries; SQL has the ability to nest queries within one another.
A subquery is a SELECT statement that is nested within another SELECT statement and returns
intermediate results. SQL executes the innermost subquery first, then the next level out.
A Subquery or Inner query or a Nested query is a query within another SQL query and embedded
within the WHERE clause.
A subquery is used to return data that will be used in the main query as a condition to further restrict the
data to be retrieved.
Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along with the
operators like =, <, >, >=, <=, IN, BETWEEN, etc.
There are a few rules that subqueries must follow −
Subqueries must be enclosed within parentheses.
A subquery can have only one column in the SELECT clause, unless multiple columns are in the main
query for the subquery to compare its selected columns.
An ORDER BY command cannot be used in a subquery, although the main query can use an ORDER
BY. The GROUP BY command can be used to perform the same function as the ORDER BY in a
subquery.
Subqueries that return more than one row can only be used with multiple value operators such as the IN
operator.
The SELECT list cannot include any references to values that evaluate to a BLOB, ARRAY, CLOB, or
NCLOB.
A subquery cannot be immediately enclosed in a set function.
The BETWEEN operator cannot be used with a subquery. However, the BETWEEN operator can be
used within the subquery.
In the nested queries that we have seen thus far, the inner subquery has been completely independent of the
outer query. In general, the inner subquery may depend on the row currently being examined in the outer
query; such a query is called a correlated query, as in the following example:
(Q1) Find the names of sailors who have reserved boat number 103.
SELECT S.sname
FROM Sailors S
WHERE EXISTS ( SELECT *
FROM Reserves R
WHERE R.bid = 103
AND R.sid = S.sid )
Set-Comparison Operators
(Q2) Find sailors whose rating is better than some sailor called Horatio.
SELECT S.sid
FROM Sailors S
WHERE S.rating > ANY ( SELECT S2.rating
FROM Sailors S2
WHERE S2.sname = ‘Horatio’ )
(Q3) Find the sailors with the highest rating.
SELECT S.sid
FROM Sailors S
WHERE S.rating >= ALL ( SELECT S2.rating FROM
Sailors S2 )
(Q4) Find the names of sailors who have reserved both a red and a green boat.
SELECT S.sname
FROM Sailors S, Reserves R, Boats B
WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = ‘red’
AND S.sid IN ( SELECT S2.sid
FROM Sailors S2, Boats B2, Reserves R2
WHERE S2.sid = R2.sid AND R2.bid = B2.bid
AND B2.color = ‘green’ )
(Q5) Find the names of sailors who have reserved all boats.
SELECT S.sname
FROM Sailors S
WHERE NOT EXISTS (( SELECT B.bid
FROM Boats B ) EXCEPT
(SELECT R.bid
FROM Reserves R
WHERE R.sid = S.sid ))
NULL VALUES
We have assumed that column values in a row are always known. In practice column values can be
unknown. For example, when a sailor, say Dan, joins a yacht club, he may not yet have a rating assigned.
Since the definition for the Sailors table has a rating column, what row should we insert for Dan? What is
needed here is a special value that denotes unknown.
SQL provides a special column value called null to use in such situations. We use null when the column
value is either unknown or inapplicable. Using our Sailors table definition, we might enter the row
〈98, Dan, null, 39〉 to represent Dan. The presence of null values complicates many issues, and we consider
the impact of null values on SQL in this section.
Comparisons Using Null Values
Consider a comparison such as rating = 8. If this is applied to the row for Dan, is this condition true or false?
Since Dan’s rating is unknown, it is reasonable to say that this comparison should evaluate to the value
unknown.
SQL also provides a special comparison operator IS NULL to test whether a column value is null; for
example, we can say rating IS NULL, which would evaluate to true on the row representing Dan. We can
also say rating IS NOT NULL, which would evaluate to false on the row for Dan.
Now, what about boolean expressions such as rating = 8 OR age < 40 and rating = 8 AND age < 40?
Considering the row for Dan again, because age < 40, the first expression evaluates to true regardless of the
value of rating, but what about the second? We can only say unknown.
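A small sketch of these tests against the Sailors table (with Dan's row as described above):
SELECT sname FROM Sailors WHERE rating = 8;        -- rating = 8 is unknown for Dan, so his row is skipped
SELECT sname FROM Sailors WHERE rating IS NULL;    -- the correct test: returns Dan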
A view is a table whose rows are not explicitly stored in the database but are computed as needed from a
view definition. Consider the Students and Enrolled relations.
This view can be used just like a base table, or explicitly stored table, in defining new queries or views.
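A hedged sketch of such a view definition over Students and Enrolled (column names are assumptions in the spirit of the running example):
CREATE VIEW B_Students (name, sid, course)
AS SELECT S.sname, S.sid, E.cid
   FROM   Students S, Enrolled E
   WHERE  S.sid = E.sid AND E.grade = 'B';   -- rows are computed on demand, not stored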
ALTER TABLE modifies the structure of an existing table. To add a column called maiden-name
to Students, for example, we would use the following command:
ALTER TABLE Students ADD COLUMN maiden-name CHAR(10);
TRIGGERS
A trigger is a procedure that is automatically invoked by the DBMS in response to specified changes to the
database, and is typically specified by the DBA. A database that has a set of associated triggers is called an
active database. A trigger description contains three parts:
1.Event: A change to the database that activates the trigger.
2.Condition: A query or test that is run when the trigger is activated.
3.Action: A procedure that is executed when the trigger is activated and its con-dition is true.
The examples shown below, written using Oracle 7 Server syntax for defining triggers, illustrate the basic
concepts behind triggers. The trigger called init count initializes a counter variable before every execution of
an INSERT statement that adds tuples to the Students relation. The trigger called incr count increments the
counter for each inserted tuple that satisfies the condition age < 18.
Each row inserted into the statistics table has fields identifying the modified table (Students) and the
kind of modifying statement (an INSERT), and a third field holding the number of inserted Students
tuples with age < 18. (The above trigger only computes the count; an additional trigger is required to
insert the appropriate tuple into the statistics table.)
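A sketch of the two triggers along the lines described above, in Oracle-style syntax (identifier spellings such as init_count are assumptions):
CREATE TRIGGER init_count BEFORE INSERT ON Students   /* event */
    DECLARE count INTEGER;
    BEGIN                                             /* action */
        count := 0;
    END;

CREATE TRIGGER incr_count AFTER INSERT ON Students    /* event */
    WHEN (new.age < 18)                               /* condition */
    FOR EACH ROW
    BEGIN                                             /* action */
        count := count + 1;
    END;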
The process of breaking up or dividing a single relation into two or more sub-relations is called
decomposition of a relation.
Lossless Decomposition
If the information is not lost from the relation that is decomposed, then the decomposition will be
lossless.
The lossless decomposition guarantees that the join of relations will result in the same relation as it
was decomposed.
A decomposition is said to be lossless if the natural join of all the decomposed relations gives the
original relation.
Example:
Let 'E' be the relational schema, with instance 'e', decomposed into E1, E2, E3, ..., En, with
instances e1, e2, e3, ..., en. If e1 ⋈ e2 ⋈ e3 ... ⋈ en = e, then it is called a 'lossless join
decomposition'.
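A minimal SQL sketch of the idea (a hypothetical emp_dept relation; the shared attribute dept_id is a key of E2, which is what makes the join lossless):
-- Original relation: emp_dept(emp_id, emp_name, dept_id, dept_name)
CREATE VIEW E1 AS SELECT emp_id, emp_name, dept_id FROM emp_dept;
CREATE VIEW E2 AS SELECT DISTINCT dept_id, dept_name FROM emp_dept;
SELECT * FROM E1 NATURAL JOIN E2;   -- reproduces emp_dept exactly: lossless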
Dependency Preservation
Dependency is an important constraint on the database.
Every dependency must be satisfied by at least one decomposed table.
If {A → B} holds, then the two attribute sets are functionally dependent, and checking the
dependency is easier if both sets are in the same relation.
This decomposition property can only be achieved by maintaining the functional dependencies.
This property allows updates to be checked without computing the natural join of the decomposed
relations.
NORMALIZATION AND TYPES OF NORMALIZATION
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize the redundancy from a relation or set of relations. It is also used
to eliminate the undesirable characteristics like Insertion, Update and Deletion Anomalies.
o Normalization divides the larger table into the smaller table and links them using relationship.
o The normal form is used to reduce redundancy from the database table.
Types of normal forms: 1NF, 2NF, 3NF, BCNF (Boyce/Codd normal form), 4NF, and 5NF. Each is discussed below.
FUNCTIONAL DEPENDENCY
Functional Dependency (FD) determines the relation of one attribute to another attribute in a database
management system (DBMS). Functional dependency helps you to maintain the quality of data in the
database. A functional dependency is denoted by an arrow →; X → Y means that Y is functionally
dependent on X. Functional dependency plays a vital role in distinguishing good database design from
bad database design.
The main types of dependencies are:
1. Multivalued dependency
2. Trivial functional dependency
3. Non-trivial functional dependency
4. Transitive dependency
Multivalued dependency: Multivalued dependency occurs when there are more than
one independent multivalued attributes in a table.
A functional dependency X → Y is trivial if Y is a subset of X; such dependencies always hold.
Example:
Consider a table with two columns Employee_Id and Employee_Name.
{Employee_id, Employee_Name} → Employee_Id is a trivial functional dependency as
Employee_Id is a subset of {Employee_Id, Employee_Name}.
Also, Employee_Id → Employee_Id and Employee_Name → Employee_Name are trivial dependencies too.
Non-trivial functional dependency
o A → B is a non-trivial functional dependency if B is not a subset of A.
o When A ∩ B is empty, A → B is called a completely non-trivial functional dependency.
Example:
ID → Name,
Name → DOB
Partial Dependency – If the proper subset of candidate key determines non-prime attribute, it is called
partial dependency.
Example 1 – Consider table-3 below.
STUD_NO   COURSE_NO   COURSE_FEE
1         C1          1000
2         C2          1500
1         C4          2000
4         C3          1000
4         C1          1000
2         C5          2000
(Note that many courses have the same course fee.)
Here,
COURSE_FEE cannot alone decide the value of COURSE_NO or STUD_NO;
COURSE_FEE together with STUD_NO cannot decide the value of COURSE_NO;
COURSE_FEE together with COURSE_NO cannot decide the value of STUD_NO;
Hence,
COURSE_FEE would be a non-prime attribute, as it does not belong to the only candidate key
{STUD_NO, COURSE_NO};
But COURSE_NO → COURSE_FEE, i.e., COURSE_FEE is dependent on COURSE_NO, which
is a proper subset of the candidate key. The non-prime attribute COURSE_FEE is dependent on a proper
subset of the candidate key, which is a partial dependency, and so this relation is not in 2NF.
To convert the above relation to 2NF,
we need to split the table into two tables such as :
Table 1: STUD_NO, COURSE_NO
Table 2: COURSE_NO, COURSE_FEE
Table 1
STUD_NO   COURSE_NO
1         C1
2         C2
1         C4
4         C3
4         C1
2         C5

Table 2
COURSE_NO   COURSE_FEE
C1          1000
C2          1500
C3          1000
C4          2000
C5          2000
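A hedged SQL sketch of this 2NF decomposition (table and column names from the example; column types are assumptions):
CREATE TABLE course (
    course_no  VARCHAR(5) PRIMARY KEY,
    course_fee INT                          -- now depends on the whole key of its table
);
CREATE TABLE student_course (
    stud_no   INT,
    course_no VARCHAR(5) REFERENCES course(course_no),
    PRIMARY KEY (stud_no, course_no)        -- the partial dependency is gone
);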
Dependency-Preserving Decomposition
Consider the Contracts relation with attributes CSJDPQV from Section 15.4.1. The given FDs are
C → CSJDPQV, JP → C, and SD → P. Because SD is not a key, the dependency SD → P causes a
violation of BCNF.
Let R be a relation schema that is decomposed into two schemas with attribute sets X and Y, and let F
be a set of FDs over R. The projection of F on X is the set of FDs in the closure F+ (not just F itself!)
that involve only attributes in X. We will denote the projection of F on attributes X as FX. Note that a
dependency U → V in F+ is in FX only if all the attributes in U and V are in X.
The decomposition of relation schema R with FDs F into schemas with attribute sets X and Y is
dependency-preserving if (FX ∪ FY)+ = F+; in that case we need to enforce only the dependencies in
FX and FY, and all FDs in F+ are then sure to be satisfied. To enforce FX, we need to examine only
relation X (on inserts to that relation). To enforce FY, we need to examine only relation Y.
Boyce Codd Normal Form (BCNF)
o A table is in BCNF if, for every functional dependency X → Y, X is a super key of the table.
o For BCNF, the table should be in 3NF, and for every FD, the LHS must be a super key.
Example: Let's assume there is a company where employees work in more than one department
EMPLOYEE table:
EMP_ID   EMP_COUNTRY   EMP_DEPT     DEPT_TYPE   EMP_DEPT_NO
264      India         Designing    D394        283
264      India         Testing      D394        300
364      UK            Stores       D283        232
364      UK            Developing   D283        549
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone are keys.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID   EMP_COUNTRY
264      India
364      UK
EMP_DEPT table:
EMP_DEPT     DEPT_TYPE   EMP_DEPT_NO
Designing    D394        283
Testing      D394        300
Stores       D283        232
Developing   D283        549
EMP_DEPT_MAPPING table:
EMP_ID   EMP_DEPT
264      Designing
264      Testing
364      Stores
364      Developing
Functional dependencies:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because the left-hand side of both functional dependencies is a key.
Example 1: Find the highest normal form of a relation R(A, B, C, D, E) with FD set {BC → D, AC → BE, B → E}.
Step 1. As we can see, (AC)+ = {A, C, B, E, D}, but none of its subsets can determine all attributes of the
relation, so AC is a candidate key. Neither A nor C can be derived from any other attribute of the relation, so
there is only one candidate key, {AC}.
Step 2. Prime attributes are those attributes which are part of the candidate key ({A, C} in this example);
the others ({B, D, E} in this example) are non-prime.
Step 3. The relation R is in 1st normal form, as a relational DBMS does not allow multi-valued or
composite attributes.
The relation is in 2nd normal form because BC → D is in 2nd normal form (BC is not a proper subset of
candidate key AC), AC → BE is in 2nd normal form (AC is a candidate key), and B → E is in 2nd normal
form (B is not a proper subset of candidate key AC).
The relation is not in 3rd normal form because in BC → D (neither BC is a super key nor D is a prime
attribute) and in B → E (neither B is a super key nor E is a prime attribute); to satisfy 3rd normal form,
either the LHS of an FD should be a super key or the RHS should be a prime attribute.
So the highest normal form of the relation is 2nd normal form.
Example 2: Find the highest normal form of a relation R(A, B, C, D, E) with FD set {B → A, A → C, BC → D, AC → BE}.
Step 1. As we can see, (B)+ = {B, A, C, D, E}, so B is a candidate key. B can be derived from AC using
AC → B (decomposing AC → BE into AC → B and AC → E), so AC is a super key; but (C)+ = {C} and
(A)+ = {A, C, B, E, D}, so A (a subset of AC) is a candidate key. So there are two candidate keys,
{A, B}.
Step 2. Prime attributes are those attributes which are part of a candidate key ({A, B} in this example);
the others ({C, D, E} in this example) are non-prime.
Step 3. The relation R is in 1st normal form, as a relational DBMS does not allow multi-valued or
composite attributes.
The relation is in 2nd normal form because B → A is in 2nd normal form (B is a super key), A → C is in
2nd normal form (A is a super key), BC → D is in 2nd normal form (BC is a super key), and AC → BE is in
2nd normal form (AC is a super key).
The relation is in 3rd normal form because the LHS of all FDs are super keys, and it is in BCNF as the
LHS of all FDs are super keys. So the highest normal form is BCNF.
Multivalued Dependency
o Multivalued dependency occurs when two attributes in a table are independent of each other but,
both depend on a third attribute.
o A multivalued dependency consists of at least two attributes that are dependent on a third attribute
that's why it always requires at least three attributes.
Example: Suppose there is a bike manufacturing company which produces two colors (white and black) of
each model every year.
Here the columns COLOR and MANUF_YEAR are dependent on BIKE_MODEL and independent of each
other.
In this case, these two columns are said to be multivalued dependent on BIKE_MODEL. These
dependencies are represented as follows:
BIKE_MODEL →→ MANUF_YEAR
BIKE_MODEL →→ COLOR
o A relation will be in 4NF if it is in Boyce Codd normal form and has no multi-valued dependency.
o For a dependency A → B, if for a single value of A multiple values of B exist, then the relation has
a multi-valued dependency.
Example
STUDENT
STU_ID   COURSE      HOBBY
21       Computer    Dancing
21       Math        Singing
34       Chemistry   Dancing
74       Biology     Cricket
59       Physics     Hockey
The given STUDENT table is in 3NF, but COURSE and HOBBY are two independent entities; there is
no relationship between COURSE and HOBBY.
So to make the above table into 4NF, we can decompose it into two tables:
STUDENT_COURSE
STU_ID COURSE
21 Computer
21 Math
34 Chemistry
74 Biology
59 Physics
STUDENT_HOBBY
STU_ID HOBBY
21 Dancing
21 Singing
34 Dancing
74 Cricket
59 Hockey
Join Dependency
o If the join of R1 and R2 over C is equal to relation R, then we can say that a join dependency (JD)
exists.
o A JD ⋈ {R1, R2,..., Rn} is said to hold over a relation R if R1, R2,....., Rn is a lossless-join
decomposition.
o *(A, B, C, D), (C, D) will be a JD of R if the join of the join's attributes is equal to relation R.
o Here, *(R1, R2, R3) is used to indicate that relations R1, R2, R3, and so on are a JD of R.
Example
SUBJECT     LECTURER   SEMESTER
Computer    Anshika    Semester 1
Computer    John       Semester 1
Math        John       Semester 1
Math        Akash      Semester 2
Chemistry   Praveen    Semester 1
In the above table, John takes both Computer and Math classes for Semester 1, but he doesn't take a Math
class for Semester 2. In this case, a combination of all these fields is required to identify valid data.
Suppose we add a new semester, Semester 3, but do not yet know the subject or who will be taking it, so we
leave Lecturer and Subject as NULL. But all three columns together act as the primary key, so we can't
leave the other two columns blank.
So to make the above table into 5NF, we can decompose it into three relations P1, P2 & P3:
P1
SEMESTER     SUBJECT
Semester 1 Computer
Semester 1 Math
Semester 1 Chemistry
Semester 2 Math
P2
SUBJECT LECTURER
Computer Anshika
Computer John
Math John
Math Akash
Chemistry Praveen
P3
SEMESTER    LECTURER
Semester 1 Anshika
Semester 1 John
Semester 1 John
Semester 2 Akash
Semester 1 Praveen
Example: Suppose a bank employee transfers Rs 800 from X's account to Y's account. This small
transaction involves several low-level tasks:
X's Account
Open_Account(X)
Old_Balance = X.balance
New_Balance = Old_Balance - 800
X.balance = New_Balance
Close_Account(X)
Y's Account
Open_Account(Y)
Old_Balance = Y.balance
New_Balance = Old_Balance + 800
Y.balance = New_Balance
Close_Account(Y)
Operations of Transaction:
Read(X): Read operation is used to read the value of X from the database and stores it in a buffer in main
memory.
Write(X): Write operation is used to write the value back to the database from the buffer.
Let's take an example to debit transaction from an account which consists of following operations:
R(X);
X = X - 500;
W(X);
o The first operation reads X's value from the database (say, X = 4000) and stores it in the buffer.
o The second operation will decrease the value of X by 500. So the buffer will contain 3500.
o The third operation will write the buffer's value to the database. So X's final value will be 3500.
But it may be possible that, because of a hardware, software or power failure, the transaction fails before
finishing all the operations in the set.
For example: If in the above transaction, the debit transaction fails after executing operation 2 then X's
value will remain 4000 in the database which is not acceptable by the bank.
Transaction property
A transaction has four properties. These are used to maintain consistency in a database, before and
after the transaction.
The four properties of a transaction are:
1. Atomicity
2. Consistency
3. Isolation
4. Durability
1. Atomicity
o It states that all operations of the transaction take place at once if not, the transaction is aborted.
o There is no midway, i.e., the transaction cannot occur partially. Each transaction is treated as one unit
and either run to completion or is not executed at all.
Abort: If a transaction aborts then all the changes made are not visible.
Commit: If a transaction commits then all the changes made are visible.
Example: Let's assume the following transaction T, consisting of T1 and T2. Account A holds Rs 600 and
account B holds Rs 300; transfer Rs 100 from account A to account B.
T1: Read(A); A := A - 100; Write(A)
T2: Read(B); B := B + 100; Write(B)
2. Consistency
o The integrity constraints are maintained so that the database is consistent before and after the
transaction.
o The execution of a transaction will leave a database in either its prior stable state or a new stable
state.
o The consistent property of database states that every transaction sees a consistent database instance.
o The transaction is used to transform the database from one consistent state to another consistent state.
For example: The total amount must be maintained before or after the transaction.
Therefore, the database is consistent. In the case when T1 completes but T2 fails, inconsistency will
occur.
3. Isolation
o It shows that the data which is used at the time of execution of a transaction cannot be used by the
second transaction until the first one is completed.
o In isolation, if the transaction T1 is being executed and using the data item X, then that data item
can't be accessed by any other transaction T2 until the transaction T1 ends.
o The concurrency control subsystem of the DBMS enforces the isolation property.
4. Durability
o The durability property is used to indicate the performance of the database's consistent state. It states
that the changes made by a committed transaction are permanent.
that the transaction made the permanent changes.
o They cannot be lost by the erroneous operation of a faulty transaction or by the system failure. When
a transaction is completed, then the database reaches a state known as the consistent state. That
consistent state cannot be lost, even in the event of a system's failure.
o The recovery subsystem of the DBMS has the responsibility of Durability property.
States of Transaction
1. Active state
o The active state is the first state of every transaction. In this state, the transaction is being executed.
o For example: Insertion or deletion or updating a record is done here. But all the records are still not
saved to the database.
2. Partially committed
o In the partially committed state, a transaction executes its final operation, but the data is still not
saved to the database.
o In the total mark calculation example, a final display of the total marks step is executed in this state.
3. Committed
A transaction is said to be committed if it executes all of its operations successfully. Its effects are now
permanently established on the database system.
4. Failed state
o If any of the checks made by the database recovery system fails, then the transaction is said to be in
the failed state.
o In the example of total mark calculation, if the database is not able to fire a query to fetch the marks,
then the transaction will fail to execute.
5. Aborted
o If any of the checks fail and the transaction has reached a failed state then the database recovery
system will make sure that the database is in its previous consistent state. If not then it will abort or
roll back the transaction to bring the database into a consistent state.
o After aborting the transaction, the database recovery module will select one of two operations:
re-start the transaction, or kill the transaction.
o While working with database transactions, the database may need to be used by multiple users for
performing different operations; in that case, concurrent execution of transactions is performed.
o The simultaneous execution should be done in an interleaved manner, and no operation should affect
the other executing operations, thus maintaining the consistency of the database. On concurrent
execution of transaction operations, several challenging problems occur that need to be solved.
Serializability
o When multiple transactions are running concurrently then there is a possibility that the database may
be left in an inconsistent state.
o Serializability is a concept that helps us to check which schedules are serializable.
o A serializable schedule is one that always leaves the database in a consistent state.
What is a Serializable Schedule?
A schedule is serializable if it is equivalent to some serial schedule, i.e., one in which the transactions
execute one after another. Serializability is of two types:
1. Conflict Serializability
2. View Serializability
Recoverability in DBMS
A transaction may not execute completely due to hardware failure, system crash or software issues. In that
case, we have to roll back the failed transaction. But some other transaction may also have used values
produced by the failed transaction. So we have to roll back those transactions as well.
Recoverable Schedules:
Schedules in which transactions commit only after all transactions whose changes they read have committed
are called recoverable schedules. In other words, if some transaction Tj is reading a value updated or
written by some other transaction Ti, then the commit of Tj must occur after the commit of Ti.
Example 1:
S1: R1(x), W1(x), R2(x), R1(y), R2(y),
W2(x), W1(y), C1, C2;
The given schedule follows the order Ti → Tj => C1 → C2. Transaction T1 is executed before T2, hence
there is no chance of a conflict occurring. R1(x) appears before W1(x), and transaction T1 is committed
before T2, i.e., the transaction that performed the first update on data item x completes first; hence the
given schedule is recoverable.
Testing of Serializability
Example
Assume a schedule S. For S, we construct a graph known as a precedence graph. This graph has a pair G = (V,
E), where V consists of a set of vertices and E consists of a set of edges. The set of vertices contains all
the transactions participating in the schedule. The set of edges contains all edges Ti → Tj for which
one of the three conditions holds:
1. Ti executes Write(Q) before Tj executes Read(Q).
2. Ti executes Read(Q) before Tj executes Write(Q).
3. Ti executes Write(Q) before Tj executes Write(Q).
o If a precedence graph contains a single edge Ti → Tj, then all the instructions of Ti are executed
before the first instruction of Tj is executed.
o If the precedence graph contains a cycle, the schedule S is not conflict-serializable; if it contains no
cycle, S is conflict-serializable.
Lock-Based Protocol
In this type of protocol, no transaction can read or write data until it acquires an appropriate lock on it.
There are two types of lock:
1. Shared lock:
o It is also known as a read-only lock. With a shared lock, the data item can only be read by the transaction.
o It can be shared between transactions, because when a transaction holds a shared lock, it can't
update the data item.
2. Exclusive lock:
o With an exclusive lock, the data item can be both read and written by the transaction.
o This lock is exclusive: while it is held, multiple transactions cannot modify the same data
simultaneously.
Pre-claiming Lock Protocol:
o Before initiating execution of the transaction, it requests the DBMS for locks on all of those data
items.
o If all the locks are granted, this protocol allows the transaction to begin. When the transaction is
completed, it releases all of its locks.
o If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.
Two-Phase Locking (2PL):
o The two-phase locking protocol divides the execution phase of the transaction into three parts.
o In the first part, when the execution of the transaction starts, it seeks permission for the lock it
requires.
o In the second part, the transaction acquires all the locks. The third phase is started as soon as the
transaction releases its first lock.
o In the third phase, the transaction cannot demand any new locks. It only releases the acquired locks.
Growing phase: In the growing phase, a new lock on the data item may be acquired by the transaction, but
none can be released.
Shrinking phase: In the shrinking phase, existing lock held by the transaction may be released, but no new
locks can be acquired.
If lock conversion is allowed, then upgrading a lock (from S(a) to X(a)) is allowed in the growing phase,
and downgrading a lock (from X(a) to S(a)) must be done in the shrinking phase.
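A hedged SQL-level illustration of (strict) two-phase locking, in Oracle/PostgreSQL-style syntax (details vary by RDBMS): locks are acquired while statements execute (growing phase) and are all released at COMMIT (shrinking phase).
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;    -- growing phase: exclusive row lock acquired
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
COMMIT;                                                  -- shrinking phase: all locks released together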
o The Timestamp Ordering Protocol is used to order the transactions based on their Timestamps. The
order of transaction is nothing but the ascending order of the transaction creation.
o The priority of the older transaction is higher; that's why it executes first. To determine the timestamp
of a transaction, this protocol uses system time or a logical counter.
o The lock-based protocol is used to manage the order between conflicting pairs among transactions at
the execution time. But Timestamp based protocols start working as soon as a transaction is created.
o Let's assume there are two transactions, T1 and T2. Suppose transaction T1 entered the
system at time 007 and transaction T2 entered the system at time 009. T1 has the higher
priority, so it executes first, as it entered the system first.
o The timestamp ordering protocol also maintains the timestamp of last 'read' and 'write' operation on a
data.
1. Check the following condition whenever a transaction Ti issues a Read (X) operation:
o If TS(Ti) < W_TS(X), then the operation is rejected and Ti is rolled back; otherwise, the operation is
executed.
Where TS(Ti) denotes the timestamp of transaction Ti, R_TS(X) denotes the read timestamp of data item X
(the largest timestamp of any transaction that successfully executed Read(X)), and W_TS(X) denotes the
write timestamp of data item X (the largest timestamp of any transaction that successfully executed Write(X)).
The validation-based protocol is also known as the optimistic concurrency control technique. In this
protocol, the transaction is executed in the following three phases:
1. Read phase: In this phase, the transaction T is read and executed. It is used to read the value of
various data items and stores them in temporary local variables. It can perform all the write
operations on temporary variables without an update to the actual database.
2. Validation phase: In this phase, the temporary variable value will be validated against the actual
data to see if it violates the serializability.
3. Write phase: If the transaction passes validation, the temporary results are written to the
database or system; otherwise, the transaction is rolled back.
Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation phase.
o The serializability is determined during the validation process. It can't be decided in advance.
o While executing the transaction, it ensures a greater degree of concurrency and also less number of
conflicts.
Multiple Granularity
Multiple Granularity:
o It can be defined as hierarchically breaking up the database into blocks which can be locked.
o The Multiple Granularity protocol enhances concurrency and reduces lock overhead.
o It makes it easy to decide whether to lock or unlock a data item. This type of hierarchy can
be graphically represented as a tree.
o The highest level represents the entire database. The second level represents nodes of type area;
the database consists of exactly these areas.
o The area consists of child nodes known as files. No file can be present in more than one
area.
o Finally, each file contains child nodes known as records. A file has exactly those records that are
its child nodes. No record is present in more than one file.
o Hence, the levels of the tree starting from the top level are as follows:
1. Database
2. Area
3. File
4. Record
When a DBMS recovers from a crash, it should maintain the following −
It should check the states of all the transactions which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the atomicity of the
transaction in this case.
It should check whether the transaction can be completed now or it needs to be rolled back.
There are two types of techniques, which can help a DBMS in recovering as well as maintaining the
atomicity of a transaction −
Maintaining the logs of each transaction, and writing them onto some stable storage before actually
modifying the database.
Maintaining shadow paging, where the changes are made on volatile memory, and later the actual
database is updated.
Log-based Recovery
A log is a sequence of records that maintains a record of the actions performed by a transaction. It is
important that the log records are written prior to the actual modification and stored on a stable storage
medium, which is failsafe.
When a transaction enters the system and starts execution, it writes a log record about it:
<Tn, Start>
When the transaction modifies an item X, it writes a record of the form <Tn, X, V1, V2>, stating that Tn
has changed the value of X from V1 to V2. When the transaction finishes, it logs:
<Tn, commit>
The database can be modified using two approaches −
Deferred database modification − All logs are written on to the stable storage and the database is
updated when a transaction commits.
Immediate database modification − Each log follows an actual database modification. That is, the
database is modified immediately after every operation.
When more than one transaction is being executed in parallel, the logs are interleaved. At the time of
recovery, it would become hard for the recovery system to backtrack all the logs and then start recovering.
To ease this situation, most modern DBMSs use the concept of 'checkpoints'.
Checkpoint
Keeping and maintaining logs in real time and in a real environment may fill all the memory space
available in the system. As time passes, the log file may grow too big to be handled at all. Checkpoint is a
mechanism where all the previous logs are removed from the system and stored permanently on a storage
disk. A checkpoint declares a point before which the DBMS was in a consistent state and all the
transactions were committed.
Recovery
When a system with concurrent transactions crashes and recovers, it behaves in the following manner −
The recovery system reads the logs backwards from the end to the last checkpoint.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit>, it puts the transaction in the
redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log is found, it puts the
transaction in the undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the transactions in the
redo-list are redone and their logs are saved.
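A simplified Python sketch of this backward scan follows; the log format (a list of (transaction, action)
pairs in write order) is an assumption for illustration, and abort records are ignored for brevity:

    # Sketch: building undo/redo lists by reading the log backwards
    # from the end to the last checkpoint.
    def classify(log):
        undo_list, redo_list, committed = set(), set(), set()
        for txn, action in reversed(log):
            if action == "checkpoint":
                break                        # stop at the last checkpoint
            if action == "commit":
                committed.add(txn)
            elif action == "start":
                if txn in committed:
                    redo_list.add(txn)       # <Tn, Start> and <Tn, Commit> seen
                else:
                    undo_list.add(txn)       # started but never committed
        return undo_list, redo_list

    log = [("-", "checkpoint"), ("T1", "start"), ("T1", "commit"), ("T2", "start")]
    print(classify(log))                     # ({'T2'}, {'T1'})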
BUFFER MANAGEMENT
The buffer manager is the software layer responsible for bringing pages from physical disk into main
memory as needed. The buffer manager manages the available main memory by dividing it into a
collection of pages, which we call the buffer pool. The main memory pages in the buffer pool are
called frames.
Buffer Manager
o A buffer manager is responsible for allocating space in the buffer in order to store data in the
buffer.
o If a user requests a particular block and the block is available in the buffer, the buffer manager
provides the block's address in main memory.
o If the block is not available in the buffer, the buffer manager allocates a frame for the block in the
buffer.
o If free space is not available, it throws out some existing blocks from the buffer to allocate the
required space for the new block.
o The blocks that are thrown out are written back to the disk only if they have been modified since
they were last fetched from or written to the disk.
o The internal actions of the buffer manager are invisible to the programs that issue disk-block
requests; in this respect, the buffer manager is like a virtual machine.
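A minimal Python sketch of such a buffer pool follows, using LRU as an illustrative replacement policy (a
real buffer manager may use a different policy, and all names here are assumptions):

    # Minimal buffer-pool sketch with LRU eviction and write-back of dirty blocks.
    from collections import OrderedDict

    class BufferManager:
        def __init__(self, capacity, disk):
            self.pool = OrderedDict()            # block_id -> [data, dirty_flag]
            self.capacity, self.disk = capacity, disk

        def get_block(self, block_id):
            if block_id in self.pool:            # block already buffered
                self.pool.move_to_end(block_id)  # mark as most recently used
                return self.pool[block_id][0]
            if len(self.pool) >= self.capacity:  # no free frame: evict one block
                victim, (data, dirty) = self.pool.popitem(last=False)
                if dirty:                        # write back only if modified
                    self.disk[victim] = data
            self.pool[block_id] = [self.disk[block_id], False]
            return self.pool[block_id][0]

        def mark_dirty(self, block_id):
            self.pool[block_id][1] = True        # block was modified in memory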
Failure with Loss of Nonvolatile Storage
Until now, we have considered only the case where a failure results in the loss of information residing in
volatile storage while the content of the nonvolatile storage remains intact. Although failures in which the
content of nonvolatile storage is lost are rare, we nevertheless need to be prepared to deal with this type of
failure. In this section, we discuss only disk storage; our discussion applies as well to other nonvolatile
storage types.
The basic scheme is to dump the entire content of the database to stable storage periodically—say, once per
day. For example, we may dump the database to one or more magnetic tapes. If a failure occurs that results
in the loss of physical database blocks, the system uses the most recent dump in restoring the database to a
previous consistent state. Once this restoration has been accomplished, the system uses the log to bring the
database system to the most recent consistent state.
More precisely, no transaction may be active during the dump procedure, and a procedure similar to
checkpointing must take place:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all buffer blocks onto the disk.
3. Copy the contents of the database to stable storage.
4. Output a log record <dump> onto the stable storage.
Database systems, like any other computer systems, are subject to failures, but the data stored in them
must be available as and when required. When a database fails, it must possess facilities for fast recovery.
It must also have atomicity, i.e., either transactions are completed successfully and committed (the effect
is recorded permanently in the database) or the transaction should have no effect on the database.
There are both automatic and non-automatic ways of backing up data and recovering from failure
situations. The techniques used to recover data lost due to system crashes, transaction errors, viruses,
catastrophic failures, incorrect command execution, etc. are called database recovery techniques. So, to
prevent data loss, recovery techniques based on deferred update, immediate update, or data backup can be
used.
Recovery techniques are heavily dependent upon the existence of a special file known as a system log. It
contains information about the start and end of each transaction and any updates which occur in
the transaction. The log keeps track of all transaction operations that affect the values of database items.
This information is needed to recover from transaction failure.
The log is kept on disk. The main log entries are as follows:
start_transaction(T): This log entry records that transaction T starts execution.
read_item(T, X): This log entry records that transaction T reads the value of database item X.
write_item(T, X, old_value, new_value): This log entry records that transaction T changes the value
of database item X from old_value to new_value. The old value is sometimes known as the
before-image of X, and the new value is known as the after-image of X.
commit(T): This log entry records that transaction T has completed all accesses to the database
successfully and its effect can be committed (recorded permanently) to the database.
abort(T): This records that transaction T has been aborted.
checkpoint: Checkpoint is a mechanism where all the previous logs are removed from the system and
stored permanently in a storage disk. Checkpoint declares a point before which the DBMS was in
consistent state, and all the transactions were committed.
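To make the role of the before-image concrete, here is a small Python sketch in which log entries are plain
tuples (the tuple layout is an assumption for illustration) and an aborted transaction is undone by restoring
old values:

    # Sketch: undoing an aborted transaction using before-images from the log.
    log = [
        ("start_transaction", "T1"),
        ("write_item", "T1", "X", 100, 150),   # old_value=100, new_value=150
        ("abort", "T1"),
    ]

    db = {"X": 150}
    for entry in reversed(log):                # undo T1's writes backwards
        if entry[0] == "write_item" and entry[1] == "T1":
            _, _, item, old_value, _ = entry
            db[item] = old_value               # restore the before-image
    print(db)                                  # {'X': 100}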
Algorithm for Recovery and Isolation Exploiting Semantics (ARIES)
The Algorithm for Recovery and Isolation Exploiting Semantics (ARIES) is based on the Write-Ahead
Log (WAL) protocol. Every update operation writes a log record that is one of the following:
1. Undo-only log record:
Only the before image is logged. Thus, an undo operation can be done to retrieve the old data.
2. Redo-only log record:
Only the after image is logged. Thus, a redo operation can be attempted.
3. Undo-redo log record:
Both before images and after images are logged.
In ARIES, every log record is assigned a unique and monotonically increasing log sequence number (LSN). Every
data page has a page LSN field that is set to the LSN of the log record corresponding to the last update on
the page. WAL requires that the log record corresponding to an update make it to stable storage before the
data page corresponding to that update is written to disk. For performance reasons, each log write is not
immediately forced to disk. A log tail is maintained in main memory to buffer log writes. The log tail is
flushed to disk when it gets full. A transaction cannot be declared committed until the commit log record
makes it to disk.
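The WAL rule tying page LSNs to the flushed portion of the log can be sketched in Python as follows; the
log_tail list, LSN numbering, and page layout are illustrative assumptions:

    # Sketch of the WAL rule with LSNs.
    log_tail = []            # in-memory log tail
    flushed_lsn = 0          # highest LSN already on stable storage

    def write_log(record):
        log_tail.append(record)
        return len(log_tail)                  # LSN = position in the log

    def flush_log(up_to_lsn):
        global flushed_lsn
        flushed_lsn = max(flushed_lsn, up_to_lsn)   # force the log to disk

    def flush_page(page):
        # WAL: the log record for the last update on this page must reach
        # stable storage before the data page itself is written to disk.
        if page["page_lsn"] > flushed_lsn:
            flush_log(page["page_lsn"])
        # ... the data page may now safely be written to disk

    page = {"page_lsn": write_log("update X")}
    flush_page(page)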
Once in a while, the recovery subsystem writes a checkpoint record to the log. The checkpoint record
contains the transaction table and the dirty page table. A master log record is maintained separately, in
stable storage, to store the LSN of the latest checkpoint record that made it to disk. On restart, the recovery
subsystem reads the master log record, locates the latest checkpoint record, and continues recovery from
there, proceeding in three phases: analysis, redo, and undo.
Remote Backup Systems
Remote backup, as part of data protection in companies, is of vital importance, since it guarantees an
organization operational continuity and functionality in the event of loss of or damage to its critical data.
File Organization
o A file is a collection of records. Using the primary key, we can access the records. The type and
frequency of access can be determined by the type of file organization used for a given set of
records.
o File organization is a logical relationship among the various records. This method defines how file
records are mapped onto disk blocks.
o File organization is used to describe the way in which records are stored in terms of blocks and
how the blocks are placed on the storage medium.
o The first approach to mapping the database to files is to use several files, storing records of only
one fixed length in any given file. An alternative approach is to structure our files so that they can
accommodate records of multiple lengths.
o Files of fixed-length records are easier to implement than files of variable-length records.
o Insert, delete, and update operations on the records should be quick and easy.
File organization comprises various methods. These methods have pros and cons on the basis of access or
selection; the programmer decides the best-suited file organization method according to the requirements.
Commonly listed methods include:
o Sequential file organization
o Heap file organization
o Hash file organization
o B+ tree file organization
o Clustered file organization
What is Indexing?
Indexing is a data structure technique that allows you to quickly retrieve records from a database file. An
index is a small table having only two columns. The first column comprises a copy of the primary or
candidate key of a table. The second column contains a set of pointers holding the addresses of the disk
blocks where the specific key values are stored.
An index takes a search key as input and efficiently returns a collection of matching records.
Types of Indexing
Indexing in a database is defined based on its indexing attributes. The two main types of indexing
methods are:
Primary Indexing
Secondary Indexing
1. Primary Indexing
A primary index is an ordered file of fixed-length records with two fields. The first field is the same as the
primary key, and the second field points to the specific data block. In a primary index, there is always a
one-to-one relationship between the entries in the index table.
Primary indexing in DBMS is further divided into two types:
Dense Index
Sparse Index
Dense Index
o In a dense index, an index record is created for every search-key value in the database.
o This helps you to search faster but needs more space to store the index records.
o In this indexing method, each index record contains the search-key value and a pointer to the
actual record on the disk.
Sparse Index
o A sparse index holds index records for only some of the search-key values in the file.
o The sparse index helps you to resolve the issues of dense indexing in DBMS.
o In this indexing technique, a range of index columns stores the same data-block address, and when
data needs to be retrieved, that block address is fetched and the block is searched.
o Because it stores index records for only some search-key values, a sparse index needs less space
and less maintenance overhead for insertions and deletions, but it is slower than a dense index for
locating records.
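A minimal Python sketch of a lookup through a sparse index follows, assuming one index entry per data
block and blocks kept in sorted order (the data layout is an assumption for illustration):

    # Sketch: lookup via a sparse index (one entry per data block).
    import bisect

    def sparse_lookup(index, blocks, key):
        # index: sorted list of (first_key_in_block, block_no)
        keys = [k for k, _ in index]
        pos = bisect.bisect_right(keys, key) - 1   # last entry with key <= search key
        if pos < 0:
            return None                            # key precedes every block
        _, block_no = index[pos]
        for k, record in blocks[block_no]:         # sequential scan inside the block
            if k == key:
                return record
        return None

    index = [(1, 0), (40, 1)]
    blocks = [[(1, "a"), (20, "b")], [(40, "c"), (55, "d")]]
    print(sparse_lookup(index, blocks, 55))        # -> 'd'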
2. Secondary Index
o A secondary index in DBMS can be generated on a field that has a unique value for each record,
i.e., a candidate key. It is also known as a non-clustering index.
o This two-level database indexing technique is used to reduce the mapping size of the first level.
For the first level, a large range of numbers is selected; because of this, the mapping size always
remains small.
o For example, in a bank account database, data is stored sequentially by acc_no; you may want to
find all accounts of a specific branch of ABC bank.
o Here, you can have a secondary index for every search key. Each index record points to a bucket
that contains pointers to all the records with that specific search-key value.
Clustering Index
In a clustered index, the records themselves are stored in the index rather than pointers. Sometimes the
index is created on non-primary-key columns, which might not be unique for each record. In such a
situation, you can group two or more columns to obtain unique values and create an index; this is called a
clustered index. It also helps you to identify the records faster.
Example:
Let's assume that a company recruited many employees in various departments. In this case, a clustered
index should be created for all employees who belong to the same department.
All such employees are considered a single cluster, and the index points to the cluster as a whole. Here,
Department_no is a non-unique key.
Multilevel Index
A multilevel index is created when a primary index does not fit in memory. In this indexing method, you
can reduce the number of disk accesses needed to locate any record: the index itself is kept on disk as a
sequential file, and a sparse index is built on top of that file.
B-Tree Index
The B-tree index is the most widely used data structure for tree-based indexing in DBMS. It is a
multilevel tree-based indexing technique that keeps the tree balanced (a B-tree is a multiway search tree,
not a binary tree). All leaf nodes of the B-tree carry actual data pointers.
Moreover, all leaf nodes are interlinked in a linked list, which allows a B-tree to support both random and
sequential access.
Advantages of Indexing
It helps you to reduce the total number of I/O operations needed to retrieve data, since matching
rows can be located through the index structure rather than a full scan.
Offers faster search and retrieval of data to users.
Indexing also helps you to reduce tablespace: since there is no need to store the ROWID to link to a
row in the table, you are able to reduce the tablespace.
Data in the leaf nodes does not need a separate sort, as the primary key value keeps it ordered.
Disadvantages of Indexing
To perform indexing, the database management system needs a primary key on the table with a
unique value.
You can't perform any other indexing on the indexed data.
You are not allowed to partition an index-organized table.
SQL indexing decreases performance of INSERT, DELETE, and UPDATE queries.
B+ Tree
o The B+ tree is a balanced search tree in which a node may have more than two children (it is not a
binary tree); it follows a multi-level index format.
o In the B+ tree, leaf nodes denote actual data pointers. The B+ tree ensures that all leaf nodes
remain at the same height.
o In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support
random access as well as sequential access.
Structure of B+ Tree
o In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the order n
where n is fixed for every B+ tree.
Leaf node
o A leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
o Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.
Searching a Record in a B+ Tree
Suppose we have to search for 55 in the B+ tree structure below. First, we fetch the intermediary node,
which will direct us to the leaf node that can contain the record for 55.
So, in the intermediary node, we find the branch between the keys 50 and 75. At the end, we are
redirected to the third leaf node, where the DBMS performs a sequential search to find 55.
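This search can be sketched in Python under an assumed node layout (keys plus child pointers in
intermediary nodes, keys plus records in leaves); the Node class and layout are illustrative assumptions:

    # Sketch of a B+ tree search.
    class Node:
        def __init__(self, keys, children=None, records=None, is_leaf=False):
            self.keys, self.children = keys, children
            self.records, self.is_leaf = records, is_leaf

    def search(node, key):
        while not node.is_leaf:
            i = 0
            while i < len(node.keys) and key >= node.keys[i]:
                i += 1                     # e.g. 55 falls between keys 50 and 75
            node = node.children[i]        # follow that branch downwards
        for k, rec in zip(node.keys, node.records):
            if k == key:                   # sequential search within the leaf
                return rec
        return None

    leaf = Node([50, 55], records=["r50", "r55"], is_leaf=True)
    root = Node([50, 75], children=[Node([], records=[], is_leaf=True), leaf,
                                    Node([], records=[], is_leaf=True)])
    print(search(root, 55))                # -> 'r55'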
B+ Tree Insertion
o Suppose we want to insert the record 60 into the structure below. It will go into the 3rd leaf node,
after 55. Since this is a balanced tree and that leaf node is already full, we cannot insert 60 there.
o In this case, we have to split the leaf node so that the record can be inserted into the tree without
affecting the fill factor, balance, and order.
o The 3rd leaf node has the values (50, 55, 60, 65, 70), and the key for it in its parent (intermediary)
node is 50. We split the leaf node in the middle so that the tree's balance is not altered. So we can
group (50, 55) and (60, 65, 70) into two leaf nodes (a small sketch of this split follows the list
below).
o If these two are to be leaf nodes, the intermediary node cannot branch from 50 alone. It should
have 60 added to it, and then we can have a pointer to the new leaf node.
o This is how we insert an entry when there is overflow. In the normal scenario, it is very easy to
find the node where the new entry fits and place it in that leaf node.
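The split step described above can be sketched in Python, assuming the leaf's keys are kept in a sorted
list:

    # Sketch of the leaf split: the middle key starts the new right leaf
    # and is pushed up into the intermediary node as the separator.
    def split_leaf(keys):
        mid = len(keys) // 2
        left, right = keys[:mid], keys[mid:]
        return left, right, right[0]

    left, right, sep = split_leaf([50, 55, 60, 65, 70])
    print(left, right, sep)                # [50, 55] [60, 65, 70] 60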
B+ Tree Deletion
Suppose we want to delete 60 from the example above. In this case, we have to remove 60 from the
intermediary node as well as from the 4th leaf node. If we remove it only from the intermediary node, the
tree will no longer satisfy the rules of a B+ tree, so we need to modify it to obtain a balanced tree.
After deleting node 60 from the above B+ tree and re-arranging the nodes, the tree will appear as follows: