
DATABASE

MANAGEMENT SYSTEMS


Authors
Dr. S. Sathappan
Associate Professor, Department of Computer Science and Engineering,
St. Martin's Engineering College, Telangana 500100, India.

Mrs. M. Prasanna Lakshmi
Assistant Professor, Department of Computer Applications
Velagapudi Ramakrishna Siddhartha Engineering College,
Vijayawada, Andhra Pradesh, India.

Mr. B. Srinivas
Assistant Professor, Department of Computer Applications
Velagapudi Ramakrishna Siddhartha Engineering College,
Vijayawada, Andhra Pradesh, India.

Mr. Janardhana Rao Alapati
Assistant Professor, Department of CSE,
Narasaraopeta Engineering College,
Narasaraopet, Guntur District, Andhra Pradesh, India.

GCS PUBLISHERS
INDIA
Book Title: Database Management Systems
Authors: Dr. S. Sathappan, Mrs. M. Prasanna Lakshmi, Mr. B. Srinivas, Mr. Janardhana Rao Alapati

Book Subject: Database Management Systems

Copyright: © Authors
First Edition: April 2022
Book Size: B5
Price: Rs. 499/-

Published by
GCS PUBLISHERS
INDIA.

ISBN Supported by International ISBN Agency,


United House, North Road, London, N7 9DP, UK. Tel. + 44 207 503 6418 &
Raja Ram Mohan Roy National Agency for ISBN
Government of India, Ministry of Human Resource Development,
Department of Higher Education, New Delhi – 110066 (India)

ISBN: 978-93-94304-23-9
PREFACE
This book aims to provide a broad introduction to Database Management Systems, whose importance is well known in various engineering fields.
It explains complicated concepts in a logical, stepwise manner and covers the essential topics. Each chapter is well supported with the necessary illustrations, and all the chapters in the book are arranged in a sequence that permits each topic to build upon earlier studies.

DBMS is an important research area, and the techniques developed in this area so far need to be summarized appropriately. This book introduces the fundamental theories behind these techniques.
The brief content of this book is as follows:
CHAPTER 1 FUNDAMENTALS OF DBMS
CHAPTER 2 DATABASE DESIGN AND DATA MODELS
CHAPTER 3 RELATIONAL MODEL
CHAPTER 4 RELATIONAL ALGEBRA AND RELATIONAL
CALCULUS
CHAPTER 5 BASICS OF SQL
CHAPTER 6 PL/SQL AND ADVANCED SQL
CHAPTER 7 QUERY PROCESSING
CHAPTER 8 QUERY OPTIMIZATION
CHAPTER 9 SCHEMA REFINEMENT
CHAPTER 10 TRANSACTION MANAGEMENT
CHAPTER 11 CONCURRENCY CONTROL
CHAPTER 12 RECOVERY SYSTEM & DATA ON EXTERNAL
STORAGE

This book is original in style and method. No pains have been spared to make it as compact, accurate, and reliable as possible, and every attempt has been made to make the book a unique one.

In particular, this book can be handy for practitioners and engineers interested in this area. We hope the chapters presented in it achieve that goal.
ACKNOWLEDGMENTS
Writing a book takes time, patience, and motivation in equal measure. The challenges can sometimes be overwhelming, and it becomes easy to lose focus. However, analytics, patterns, and uncovering the hidden meaning behind data have always attracted us. When one considers the possibilities offered by comprehensive analytics and the inclusion of what may seem to be unrelated databases, the effort involved seems almost inconsequential.

We also have to acknowledge the many vendors in the Internet of Things arena who inadvertently helped us along our journey to expose the value contained in data.

Writing takes a great deal of energy and can quickly consume all of the hours in a day. With that in mind, we have to thank the numerous editors with whom we have worked on freelance projects while concurrently writing this book. Without their understanding and flexibility, we could never have written this book or any other.

When it comes to providing the ultimate encouragement and support, no one can compare with our families, who gave up time with us and were still willing to provide whatever we needed to complete this book. We are very thankful to have such wonderful and supportive families.
CONTENTS
TITLE PG NO.
1 - FUNDAMENTALS OF DBMS 1-33
1.1 INTRODUCTION TO DBMS
1.2 EVOLUTION OF DBMS
1.3 OBJECTIVES OF DBMS
1.4 OBJECT TYPES IN A DATABASE
1.5 RELATIONAL DATABASES
1.6 TYPES OF DATABASES
1.7 DATABASE APPLICATIONS – DBMS
1.8 DBMS OVER A FILE SYSTEM
1.9 DISADVANTAGES OF DBMS
1.10 DBMS ARCHITECTURE
1.11 DBMS – THREE-LEVEL ARCHITECTURE
2-DATABASE DESIGN AND DATA MODELS 34-67
2.1 INTRODUCTION
2.2 THE LIFE CYCLE OF DATABASE CREATION
2.3 TECHNIQUES FOR DATABASE DESIGN
2.4 DATABASE DESIGN: STEPS TO CREATING A DATABASE
2.5 INTRODUCTION OF ER MODEL
2.6 NOTATION OF ER DIAGRAM
2.7 KEYS IN RELATIONAL DATABASE MANAGEMENT SYSTEMS
2.8 ADDITIONAL E-R MODEL FEATURES
2.9 CONCEPTUAL DESIGN WITH E-R DATA MODEL
2.10 CONCEPTUAL DESIGN FOR LARGE ENTERPRISE
2.11 INTRODUCTION TO DATA MODELS IN DBMS
2.12 VIEW OF DATA IN DBMS
2.13 DATA INDEPENDENCE
2.14 THREE-SCHEMA ARCHITECTURE AND DATA INDEPENDENCE
3 - RELATIONAL MODEL 68-94
3.1 INTRODUCTION
3.2 CODD PRINCIPLES
3.3 CONSTRAINTS OVER RELATIONAL DATABASE MODEL
3.4 DATA INTEGRITY
3.5 DATA INTEGRITY IS ENFORCED BY DATABASE CONSTRAINTS
3.6 STRUCTURE OF RELATIONAL DATABASE
3.7 DATABASE SCHEME
3.8 TYPES OF KEY
3.9 RELATIONAL QUERY LANGUAGES
3.10 LOGICAL DATABASE DESIGN
3.11 ER MODEL TO RELATIONAL MODEL
4 - RELATIONAL ALGEBRA AND RELATIONAL CALCULUS 95-111
4.1 INTRODUCTION OF RELATIONAL ALGEBRA
4.2 RELATIONAL CALCULUS
5 - BASICS OF SQL 112-153
5.1 INTRODUCTION TO SQL
5.2 ROLE OF SQL IN RDBMS
5.3 PROCESSING SKILLS OF SQL
5.4 COMPONENTS OF SQL
5.5 SQL DATA TYPES
5.6 SQL FUNCTIONS
5.7 TYPE OF CONSTRAINTS
5.8 JOIN EXPRESSIONS
5.9 INTRODUCTION OF VIEWS
5.10 SQL – TRANSACTIONS
5.11 NESTED QUERIES
5.12 CORRELATED NESTED QUERIES
5.13 AGGREGATE OPERATORS
5.14 GROUP BY AND HAVING CLAUSES
5.15 NULL VALUES
5.16 COMPLEX INTEGRITY CONSTRAINTS IN SQL
6 - PL/SQL AND ADVANCED SQL 154-182
6.1 INTRODUCTION TO PL/SQL
6.2 PL/SQL FEATURES INCLUDE
6.3 STRUCTURE OF PL/SQL BLOCK
6.4 PL/SQL IDENTIFIERS
6.5 CURSORS IN PL/SQL
6.6 PL/SQL – PROCEDURES
6.7 OLAP QUERIES IN SQL
6.8 INTRODUCTION TO RECURSIVE SQL
6.9 TRIGGERS AND ACTIVE DATABASES
6.10 DESIGNING ACTIVE DATABASES
7 - QUERY PROCESSING 183-206
7.1 INTRODUCTION
7.2 QUERY PROCESSING
7.3 MEASURES OF QUERY COST
7.4 SELECTION OPERATION
7.5 COST ESTIMATION
7.6 HASH JOIN ALGORITHM
7.7 RECURSIVE PARTITIONING IN HASH JOIN
7.8 OVERFLOWS IN HASH JOIN
7.9 EVALUATION OF EXPRESSIONS
7.10 COST ESTIMATION OF MATERIALIZED EVALUATION
8 - QUERY OPTIMIZATION 207-225
8.1 INTRODUCTION
8.2 TYPES OF QUERY OPTIMIZATION
8.3 TRANSFORMING RELATIONAL EXPRESSIONS
8.4 EQUIVALENCE RULES
8.5 ESTIMATING STATISTICS OF EXPRESSION RESULTS IN DBMS
8.6 CHOICE OF EVALUATION PLANS IN DBMS
8.7 ADVANCED QUERY OPTIMIZATION
9 - SCHEMA REFINEMENT 226-247
9.1 PROBLEMS CAUSED BY REDUNDANCY
9.2 DECOMPOSITIONS
9.3 PROBLEMS RELATED TO DECOMPOSITION
9.4 FUNCTIONAL DEPENDENCIES
9.5 REASONING ABOUT FDS
9.6 ATTRIBUTE CLOSURE
9.7 NORMAL FORMS
9.8 SCHEMA REFINEMENT IN DATABASE DESIGN
9.9 4NF (FOURTH NORMAL FORM) RULES
9.10 5NF (FIFTH NORMAL FORM) RULES
9.11 6NF (SIXTH NORMAL FORM)
10 - TRANSACTION MANAGEMENT 248-269
10.1 INTRODUCTION
10.2 TRANSACTION CONCEPT
10.3 A SIMPLE TRANSACTION MODEL
10.4 STORAGE STRUCTURE
10.5 TRANSACTION ATOMICITY AND DURABILITY
10.6 TRANSACTION ISOLATION
10.7 SERIALIZABILITY
10.8 TRANSACTION ISOLATION LEVELS
10.9 IMPLEMENTATION OF ISOLATION LEVELS
11 - CONCURRENCY CONTROL 270-292
11.1 LOCK-BASED PROTOCOLS
11.2 LOCKS
11.3 STARVATION
11.4 THE TWO-PHASE LOCKING PROTOCOL
11.5 IMPLEMENTATION OF LOCKING
11.6 GRAPH-BASED PROTOCOLS
11.7 DEAD LOCK HANDLING
11.8 MULTIPLE GRANULARITIES
11.9 TIMESTAMP-BASED PROTOCOLS
11.10 VALIDATION-BASED PROTOCOLS
12 - RECOVERY SYSTEM & DATA ON EXTERNAL STORAGE 293-322
12.1 FAILURE CLASSIFICATION
12.2 STORAGE
12.3 RECOVERY AND ATOMICITY
12.4 RECOVERY ALGORITHM
12.5 BUFFER MANAGEMENT
12.6 FAILURE WITH NONVOLATILE STORAGE LOSS
12.7 ARIES
12.8 REMOTE BACKUP METHODS
12.9 FILE ORGANIZATIONS AND INDEXING
12.10 INDEX DATA STRUCTURES
12.11 COMPARISON OF FILE ORGANIZATIONS
12.12 INDEXING BASED ON TREE STRUCTURE

CHAPTER 1
FUNDAMENTALS OF DBMS

1.1 INTRODUCTION TO DBMS


The term "DBMS" refers to a database management system.
DBMS = Database + Management System is one way to break
things down. A database is a collection of data. A management
system is a set of programs to store and retrieve that data. Based
on this, we may describe DBMS: A database management
system (DBMS) collects corresponding data and programs for
storing and accessing that data simply and efficiently.
NEED FOR A DBMS
Database systems are developed for large amounts of data. When dealing with a large amount of data, two things must be optimized: data storage and data retrieval.
Storage: According to database system standards, data is stored so that it takes up much less space, because redundant data (duplicate data) is discarded before storage. To illustrate, consider the following case for non-technologists:
Assume a customer has two accounts in a banking system, one savings account and one salary account. Say the bank stores the savings account data in one place (these places are called tables, and we will learn about them later) and the salary account data in another. If the customer information, such as name and address, is stored in both places, that is simply a waste of storage (redundancy/duplication of data). The information should be stored only once, and a DBMS is designed to work this way.


Data Retrieval in a Short Time: Along with storing the data in an optimized and systematic fashion, it is also critical to be able to extract the data quickly when needed. Database systems ensure that the data can be accessed as quickly as possible.
The Function of Database Systems
The primary function of database systems is data management. Consider a university that stores data on students, professors, classes, books, etc. To handle this data, we need to store it somewhere and be able to add new data, delete unwanted data, update obsolete data, and retrieve data. To execute these operations, we need a database management system that allows us to organize the data so that all of these operations can be performed effectively.
Purpose of a Database
Databases are used to hold large amounts of data in an organized way, readily available to authenticated users. Companies use various databases depending on the nature of their data. Databases can benefit a company's growth in a variety of ways:
 Enables a company to make more intelligent strategic choices.
 Store and access similar information in an efficient manner.
 Aids in the analysis and aggregation of market data.
 Gather and archive critical consumer data from various
applications.
 Provides data-driven, timely, customized software as well as
real-time analytics.
 Ensures correct, dependable, and immediate access to critical
business data that various business units can use to understand
data dynamics, produce reports, and forecast future trends.


 Data is often mapped from hierarchical databases used by legacy applications to relational databases used in data warehouses.
1.2 EVOLUTION OF DBMS
Data refers to a collection of facts and figures. Data collection was increasing by the day, and it needed to be stored and processed in a secure system or application.
Charles Bachman was the first to create a DBMS, the Integrated Data Store (IDS), based on a network data model, in the early 1960s. He later received the Turing Award, the most prestigious award in computer science, roughly equivalent to a Nobel Prize.
IBM (International Business Machines Corporation) created the Information Management System (IMS) in the late 1960s using the hierarchical database model, and it is still in use in many installations today.
Edgar Codd proposed the relational database model in 1970. Since then, it has been regarded as the standard database format, and most of today's database systems are relational; most people in the industry have adopted the relational paradigm.
Later in the same decade, IBM created the Structured Query Language (SQL) as part of the System R project, and ISO and ANSI later proclaimed it a standard query language. James Gray also developed transaction management techniques for transaction processing, for which he received the Turing Award.
Several later versions added advanced features such as dynamic queries, data types for storing images, and so on. The Internet age may have had an even more significant impact on data models: data models were extended with object-oriented programming features, and languages such as Hypertext Markup Language (HTML) were used for presenting queries and results. With massive amounts of data accessible online, DBMS is becoming increasingly important.
1.3 OBJECTIVES OF DBMS
Large-Scale Storage
A database management system (DBMS) can hold a large
amount of data. As a result, DBMS is the right technology to use
for all large corporations. It can hold thousands of documents
and can be accessed at any time.
Removes Duplicity
If you have a large amount of data, data duplication is
unavoidable. DBMS ensures that there can be no data
duplication across all documents. DBMS ensures that the same
data was not previously inserted when storing new information.
Access by Several Users
Nobody manages the whole database by themselves. A large
number of people had access to the website. As a result, two or
three people may be accessing the database. They are free to
adjust whatever they wish, but DBMS ensures that they function
simultaneously.
Security of Data
Bank information, employee payroll information, and sale
purchase information should be confidential. Furthermore, all
businesses want their data to be protected from unauthorized
access. DBMS provides data protection at the master level. No
one can change or edit the data without the right to use it.
Backup and Retrieval of Data
If a database fails and there is no backup, there is no choice but to declare that all the data has been destroyed. A copy of the database should therefore be available in case of an outage: with a DBMS, the data in a database can be both backed up and recovered.
Anyone Can Use a DBMS
If you want to work on a DBMS, you do not have to be a master of a programming language. Even an accountant with limited technical expertise can work with a DBMS, since it contains all of the definitions and explanations necessary for someone with a non-technical background to work on it.
Integrity
Integrity implies that the data is genuine and consistent. A DBMS provides several validity checks to ensure the data is entirely correct and consistent.
Independent of Platform
DBMS can be operated on any device. Working on a
database management system does not necessitate using a
specific framework.
1.4 OBJECT TYPES IN A DATABASE
There are four categories of database objects that assist users in compiling, entering, storing, and analyzing critical data:
1. Tables
2. Queries
3. Forms
4. Reports
Database Structure Types
There are several kinds of database structures:
1. Hierarchical database: The data in a hierarchical database is organized in a ranked order using parent-child relationships.
2. Network database: Although there are specific differences, a network database is similar to a hierarchical database. The network database enables a child record to link to multiple parent records, allowing for two-way relationships.
3. Object-oriented database: The data in an object-oriented database is stored in the form of objects.
4. Relational database: A relational database is a table-oriented
database in which every piece of data is linked to every other
piece of data.
5. Non-relational or NoSQL database: A NoSQL database uses various formats, such as documents, key-value pairs, wide columns, and so on, giving the database architecture tremendous versatility and scalability.
However, databases are broadly classified into two groups or categories: relational or sequence databases, and non-relational or non-sequence databases, also known as NoSQL databases. Depending on the type of data and the features required, an enterprise can use these separately or in combination.
1.5 RELATIONAL DATABASES
People often ask, "What is the most popular kind of database?" A relational database is the answer. This type of database management system (DBMS) uses a schema, a template that specifies the structure of the data contained in the database.
For example, suppose a business sells goods to its consumers. In that case, it must retain information about where these products go, to whom, and in what quantity.
Various tables, and even different table forms, can be used for each purpose. For example, one table can display simple customer information, a second table can show product information, and a third table can show who ordered this product, how many times, and when.
In a relational database, keys are shared between the tables. They provide a short overview of the database or quick access to a specific row or column that you may wish to examine.
The tables, also known as entities, are all related. The table containing customer information may assign a unique ID to each customer, representing all that is known about that customer, such as their address, name, and contact information. Likewise, each product may be assigned a unique ID in the table with the product definition. Only these IDs and the quantities need to be recorded in the table where all orders are held. Any adjustment to these tables affects all of them consistently and systematically.
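As a sketch of the three tables described above (all names are illustrative, not from a specific system), the unique IDs become primary keys, and the orders table records only those IDs and a quantity:

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,            -- unique ID assigned to each customer
    name        VARCHAR(100),
    address     VARCHAR(200),
    contact     VARCHAR(50)
);

CREATE TABLE products (
    product_id  INT PRIMARY KEY,            -- unique ID assigned to each product
    description VARCHAR(200)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),  -- shared key into customers
    product_id  INT REFERENCES products(product_id),    -- shared key into products
    quantity    INT,
    order_date  DATE
);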
The constraints of a Relational Database Management System (RDBMS) guarantee data integrity: an RDBMS ensures that the data displayed is precise and reliable. There are, however, various kinds of relational databases. Microsoft SQL Server, Oracle Database, MySQL, and IBM Db2 are the names of some of them.
SQL databases are classified into many categories. These Structured Query Language (SQL) databases include:
Oracle
Oracle Database, a product of Oracle Corporation, functions as a multi-model database management system.
PostgreSQL
PostgreSQL, also known as Postgres, is an object-relational database management system that stresses standards compliance and extensibility.
MySQL
This open-source RDBMS is compatible with Windows, Linux, and UNIX systems.
SQL Server
SQL Server, a Microsoft product, is mainly used to store and retrieve data for software application systems.
Advantages and Disadvantages of Relational Databases
Relational databases have advantages and disadvantages
that should be considered before investing in them:
Advantages
 Relational databases adhere to a strict schema: each new entry must have components that fit that predefined specification. This makes the data predictable and easy to assess.
 All RDBMS databases must be ACID-compliant. It implies that
they must have Atomicity, Consistency, Isolation, and
Durability.
 They are well-structured and leave little or no room for a
mistake.
Disadvantages
 Because of the careful structure, rigid schemas, and constraints of relational databases, it is very difficult to store data at the volumes required by today's massive internet applications.
 Since relational databases adhere to a specific schema, horizontal scaling is complicated. Vertical scaling seems to be the logical solution, but it has a limit, and the amount of data gathered from the internet daily is too high for vertical scaling to work for long.


 Schema limitations often make data transfer to and from different RDBMSs difficult; the schemas must be similar, or the transfer will not work.

Databases That Are Not Relational
The Advantages and Disadvantages of Non-Relational Databases
Non-relational databases, like everything else, are not ideal
and have both benefits and drawbacks.
Advantages
 Since they are schema-free, they make it easy to process and
archive large amounts of data. They are also conveniently
scaleable horizontally.
 The data is not too abstract and can be spread among various
distinct nodes for improved usability.
Disadvantages
 Since they do not impose a particular format or schema on the stored data, you cannot rely on a specific field being present in every record.
 Having no relationships between records makes updating the data difficult, because every copy of a piece of information must be updated independently.

Types of Database Users


There are various types of database (DBMS) users, such as:
1. Database Administrator (DBA)
2. End-User
3. System Analyst
4. Application Programmer
5. Database Designer
The type of database a company uses should be aligned with
the company’s requirements and needs.

1.6 TYPES OF DATABASES


Depending upon the usage requirements, there are the
following types of databases available in the market −
 Centralized database.
 Distributed database.
 Personal database.
 End-user database.
 Commercial database.
 NoSQL database.
 Operational database.
 Relational database.
 Cloud database.
 Object-oriented database.
 Graph database.

FIG 1.1 Types of Databases


1. Centralized Database
The information (data) is stored at a centralized location, and
users from different locations can access this data. This type of
database contains application procedures that help users access
the data even from a remote location.
Various authentication procedures are applied for the verification and validation of end-users. Likewise, the application procedures provide a registration number to keep track of and record data usage; the local office handles this.

Fig 1.2 Centralized Database

2. Distributed Database
A distributed database has contributions from a common database: information is captured by local computers, which is just the opposite of the centralized database concept. The data is not kept in one place but is distributed at various sites of an organization. These sites are connected with the help of communication links, which allow them to access the distributed data.
You can imagine a distributed database in which various
portions of a database are stored in multiple locations(physical).
The application procedures are replicated and distributed
among various points in a network.
There are two kinds of distributed databases, viz., homogeneous and heterogeneous. A DDB in which all physical locations use the same underlying hardware and run the same operating systems and application procedures is known as a homogeneous DDB. When the operating systems, underlying hardware, and application procedures differ at the various sites, the DDB is known as a heterogeneous DDB.

Fig 1.3 Distributed Databases

3. Personal Database
Data is collected and stored on personal computers, which are small and easily manageable. The data is generally used by the same department of an organization and is accessed by a small group of people.

4. End-User Database
The end-user is usually not concerned with transactions or operations at the various levels and is only aware of the product, which may be a piece of software or an application. This is a shared database specifically designed for end-users, such as managers at different levels, and a summary of the relevant information is collected in it.
5. Commercial Database
These are paid versions of enormous databases designed for users who want to access the information for help. These databases are subject-specific, and an individual cannot afford to maintain such enormous information alone, so access to such databases is provided through commercial links.
6. NoSQL Database
These are used for large sets of distributed data. Relational databases cannot handle some significant big-data performance issues, which NoSQL databases manage easily. NoSQL databases are very efficient at analyzing large unstructured data sets stored across multiple virtual servers in the cloud.
7. Operational Database
Information related to the operations of an enterprise is
stored inside this database. Functional lines like marketing,
employee relations, customer service, etc., require such kinds of
databases.


Fig 1.4 Operational Database


8. Relational Databases
These databases are organized as tables in which data fits into predefined categories. A table consists of rows and columns: each column holds data for a specific category, and each row contains an instance of the data defined by those categories. The Structured Query Language (SQL) is the standard user and application program interface for a relational database.


Various simple operations can be applied over the tables, making these databases easy to extend, to join with another database through a common relation, and to modify together with all existing applications.

Fig 1.5 Relational Databases


9. Cloud Databases
Nowadays, data is often stored in the cloud, also known as a virtual environment, whether a hybrid, public, or private cloud. A cloud database is a database that has been optimized or built for such a virtualized environment. There are various benefits of a cloud database, such as the ability to pay for storage capacity and bandwidth on a per-user basis. Cloud databases provide scalability on demand, along with high availability. A cloud database also allows enterprises to support business applications in a software-as-a-service deployment.

Fig 1.6 Cloud Databases


10. Object-Oriented Databases

Fig 1.7 Object-Oriented Databases


An object-oriented database is a combination of object-oriented programming and relational database concepts. Various items are created using object-oriented programming languages like C++ and Java; they can be stored in relational databases, but object-oriented databases are better suited to hold them.
An object-oriented database is organized around objects rather than actions, and around data rather than logic. For example, a multimedia record in a relational database can be a definable data object instead of an alphanumeric value.
11. Graph Databases
A graph is a collection of nodes and edges, where each node represents an entity and each edge describes the relationship between entities. A graph-oriented database, or graph database, is a NoSQL database that uses graph theory to store, map, and query relationships.
Graph databases are mainly used for analyzing interconnections. For example, companies might use a graph database to mine data about customers from social media.

Fig 1.7 Graph Databases


The DBMS Environment
One of the main goals of a database is to provide users with an abstract view of data, hiding certain details of how the data is stored and manipulated. The starting point for database design should therefore be an overview and general summary of the information needs of the enterprise that will be reflected in the database. You will then need a place to store the data and have it function as a database.
1.7 DATABASE APPLICATIONS – DBMS
Database Management Systems are used in the following
applications:
• Telecommunications: A database tracks phone calls, network
usage, and customer information. It is difficult to keep track of
such a large volume of data that changes every millisecond
without database systems.
• Industry: Whether it is a manufacturing plant, a warehouse, or
a distribution center, each needs a database to track the ins and
outs. For example, a distribution center needs to keep track of
the product units brought into the center and the items
distributed daily; this is where a database management system
(DBMS) comes into play.
• Banking system: Databases are used to store customer information, track daily credit and debit transactions, and generate bank statements, among other things. All of this work is done using database management systems.
• Sales: Customers' information, manufacturing information,
and invoice details are stored here.
• Airlines: We make early airline reservations, and this
information, along with the flight schedule, is saved in a
database.
• Education sector: Database systems are widely used in schools and colleges to store and retrieve data related to student information, staff information, course information, test information, payroll data, attendance information, fees information, and so on. A massive quantity of interrelated data has to be stored and retrieved quickly.
• Online shopping: You are probably familiar with websites that offer online shopping, such as Amazon, Flipkart, and others. These sites store product information, addresses, preferences, and credit card details, and then present you with a relevant list of products based on your search. All of this necessitates the use of a database management system.
1.8 DBMS OVER A FILE SYSTEM
This section explores what a file processing system is and how database management systems are superior to file processing systems.
The disadvantages of the file system include:
• Data redundancy: The term "data redundancy" refers to the duplication of data. For example, if we are handling the data of a college where a student is enrolled in two courses, the same student details would be stored twice, requiring more storage than necessary. Data redundancy frequently results in increased storage costs and slower access times.
• Data inconsistency: Let us use the same example as before: a
student is enrolled in two courses, and we have the student's
address stored twice. Now, let us say the student requests to
change his address; if the address is changed in one place but
not on all the records, this can lead to data inconsistency.
• Data Isolation: It is difficult to write new application programs
to retrieve appropriate data because data is scattered across
multiple files, and files may be in different formats.
• Reliance on application programs: Changing the structure of the files requires changing the application programs as well.

• Transaction atomicity: Transaction atomicity refers to "all or nothing," meaning that either all of the operations in a transaction are executed or none of them are.
Let us say Steve makes a $100 transfer to Negan's account. This
transaction consists of many operations, including a $100 debit
from Steve's account and a $100 credit to Negan's account. A
computer system, like any other device, can fail. For example, if
it fails after the first operation, Steve's account would have been
debited by $100. However, the amount would not have been
credited to Negan's account. In this case, the operation should
be rolled back to maintain the atomicity of the transaction. In file
processing systems, achieving atomicity is difficult.
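In SQL, the transfer can be wrapped in a transaction so that either both operations happen or neither does. The accounts table below is hypothetical, and the exact syntax for starting a transaction varies slightly between systems:

START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE owner = 'Steve';   -- debit Steve
UPDATE accounts SET balance = balance + 100 WHERE owner = 'Negan';   -- credit Negan
COMMIT;   -- both updates become permanent together
-- If a failure occurs before COMMIT, the DBMS rolls the transaction back,
-- so the database never shows only half of the transfer.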
• Data Security: Data should be protected from unauthorized
access; for example, a student in a college should not be able to
see the teachers' payroll details; however, such security
constraints are difficult to implement in file processing systems.
1.9 DISADVANTAGES OF DBMS
• Cost: DBMS implementation costs are higher than file system implementation costs.
• Complexity: Database systems are complex to understand and maintain.
• Performance: Because database systems are general-purpose, they can be used in many applications; however, for certain applications this generality affects their performance.
1.10 DBMS ARCHITECTURE
We learned the fundamentals of database management systems in the previous sections. In this section, we will look at DBMS architecture. The architecture of database management systems will help us understand the components of a database system and the relationships among them.


The architecture of a database management system depends on the computer system on which it runs. For example, in a client-server DBMS architecture, the database system on the server machine can handle multiple requests from client machines. We shall understand this communication with the help of diagrams.
TYPES OF DBMS ARCHITECTURE
There are three types of DBMS architecture:
1. Single-tier architecture
2. Two-tier architecture
3. Three-tier architecture
1. Single-tier architecture
The database is easily accessible on the client machine with this
architecture. Therefore any request made by the client does not
need a network connection to conduct the action on the
database.
Let us imagine you wish to get an employee's records from a
database, and the database is on your computer system. In this
case, the request for employee information will be processed by
your computer, and the records will be acquired from your
computer's database. A local database system is a name given to
this sort of system.
2. Two-tier architecture
In a two-tier architecture, the database system is present on the server machine and the DBMS application is present on the client machine, as illustrated in the figure below. These two machines are linked to each other by a reliable network.


FIG 1.7 Two-tier architecture


When a client machine uses a query language like SQL to
request access to a database on the server, the server executes
the request on the database and delivers the result to the client.
Application connection interfaces such as JDBC and ODBC
interact between server and client.


3. Three-tier architecture

FIG 1.8 Three-tier architecture


Another layer exists between the client and server machines in a
three-tier architecture. In this architecture, the client application
does not communicate directly with the database systems
present on the server machine; rather, the client application
communicates with the server application, and the server
application communicates internally with the database systems
present on the server.

1.11 DBMS – THREE-LEVEL ARCHITECTURE


DBMS Three-Level Architecture Diagram

FIG 1.9 DBMS – Three-Level Architecture

This architecture has three levels:


1. External level
2. Conceptual level
3. Internal level
1. External level
It is also known as the view level. Different users can see the data they want at this level, which is derived from the database through conceptual-level and internal-level mappings. Each such presentation is called a "view."
The user is not required to understand database schema details such as data structures, table definitions, etc. The user is concerned only with the data that the view level presents after it has been retrieved from the database.
The external level is the "top level" of the three-level DBMS architecture.
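The external level can be realized with SQL views. The sketch below assumes a hypothetical employee table and hides its confidential salary column from view-level users:

-- Base table: defined at the conceptual level, stored at the internal level.
CREATE TABLE employee (
    emp_id INT PRIMARY KEY,
    name   VARCHAR(100),
    dept   VARCHAR(50),
    salary DECIMAL(10, 2)
);

-- External level: each user group gets a tailored view of the data;
-- this view omits the salary column entirely.
CREATE VIEW employee_directory AS
    SELECT emp_id, name, dept
    FROM employee;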
2. Conceptual level
It is also known as the logical level. This level describes the
whole database architecture, including data relationships,
schema, etc.
Database constraints and security are also implemented at this level of the architecture. A database administrator (DBA) maintains this level.
3. Internal level
This level is often referred to as the physical level. This level
describes the actual storage of data in storage devices. This level
is also in charge of allocating data storage space. This is the most
fundamental level of architecture.
View of Data in DBMS
Abstraction is a key aspect of database systems. Hiding irrelevant details from users and presenting them with an abstract view of data facilitates quick and effective user-database interaction. In the previous section, we reviewed the three levels of DBMS architecture. The "view level" is the highest level of that architecture; it gives users a "view of data" while hiding irrelevant details such as data relationships, database schema, constraints, security, etc.
To completely comprehend the view of data, you must first understand data abstraction, instances, and schemas. These are discussed below:
1. Data abstraction
2. Instance and schema
Data Abstraction in DBMS

FIG 1.10 Data Abstraction in DBMS


Database systems are composed of intricate data structures. To facilitate user interaction with the database, developers conceal internal, irrelevant details from users. Data abstraction refers to this process of concealing irrelevant details from the user.
We have three different levels of abstraction:
Physical level: This is the lowest level of data abstraction. It describes how the data is actually stored in the database; at this level, complex data structure details are visible.
Logical level: This is the intermediate level of the three-level data abstraction architecture. It describes what data is stored in the database.
View level: This is the highest level of abstraction. It represents the interaction of the user with the database system.
Assume we are keeping customer information on a customer
table. Physically, these records may be characterized as memory
storage blocks (bytes, gigabytes, terabytes, etc.). These details
are often concealed from programmers.
At the logical level, these records may be specified as fields and
attributes with data types, and their relationships with one
another can be logically implemented. Programmers often work
at this level since they are familiar with database systems.
At the view level, the user only interacts with the system
through the GUI and enters the details on the screen; they are
unaware of how and what data is being kept; such details are
concealed.
In DBMS, there are two closely related concepts: instances and schemas. This section describes the instance and the schema of a relational database.


DBMS Schema
Schema definition: the schema is the design of a database. There
are three sorts of schema: physical, logical, and view schema.
As an example, the schema in the diagram below shows the
relationship between three tables: course, student, and section.
The diagram simply depicts the database's design; it does not
depict the data contained in the tables. The diagram below
demonstrates that a schema is just a database's structural view
(design).

FIG 1.11 DBMS Schema


A physical schema describes how the data is stored in blocks of storage; it is the design of the database at the physical level.
A logical schema is the design of the database at the logical level. Programmers and database administrators operate at this level, where data may be characterized as particular sorts of records stored in data structures; however, underlying details such as the implementation of those data structures are concealed at this level (they are available at the physical level).
A view schema is the design of the database at the view level. It generally describes end-user interaction with the database system.
DBMS Instance
The data stored in a database at a certain time is called an
instance of a database. The database schema defines the variable
declarations in tables that belong to a specific database. The
value of these variables at any one time is called the database
instance.
Let us imagine we have a single table student in the database,
and the table has 100 entries today. Hence the database instance
has 100 records today. Let us imagine we are planning to add
another 100 entries to this table by tomorrow, resulting in a
database instance with 200 records. In a nutshell, the data stored
in a database at a given time is called an instance, which varies
over time as we add or remove data from the database.
1.12 DBMS LANGUAGES
Database languages are employed to read, update, and save
data in a database. Several languages may be used for this, one
of which is SQL (Structured Query Language).
Types of DBMS languages:
Data Definition Language (DDL)
The database schema is specified using DDL. It is used in
databases to create tables, schemas, indexes, and constraints,
among other things. Let us look at some of the actions that DDL
may do on a database:
 To create the database instance – CREATE
 To alter the structure of the database – ALTER
 To drop database instances – DROP
 To remove all records from a table – TRUNCATE
 To rename database instances – RENAME
 To drop objects from a database, such as tables – DROP
 To add comments to the data dictionary – COMMENT

FIG 1.12 DBMS LANGUAGES

All of these commands either define or update the database schema, which is why they come under the Data Definition Language.
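As a brief sketch (the student table and its columns are illustrative, and exact syntax varies slightly between products), the main DDL commands look like this in SQL:

CREATE TABLE student (               -- define a new table and its schema
    student_id INT PRIMARY KEY,
    name       VARCHAR(100)
);

ALTER TABLE student ADD email VARCHAR(100);   -- change the table structure

TRUNCATE TABLE student;              -- remove every row but keep the table definition

DROP TABLE student;                  -- remove the table itself from the schema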
Data Manipulation Language (DML)
DML is used for accessing and manipulating data in a database.
The following operations on the database come under DML:
 To read records from a table(s) – SELECT
 To insert record(s) into the table(s) – INSERT
 Update the data in the table(s) – UPDATE
 Delete all the records from the table – DELETE
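A short sketch of these four commands, reusing the illustrative student table from the DDL example above:

INSERT INTO student (student_id, name) VALUES (1, 'Asha');    -- add a new record
SELECT student_id, name FROM student WHERE student_id = 1;    -- read records
UPDATE student SET name = 'Asha Rao' WHERE student_id = 1;    -- modify an existing record
DELETE FROM student WHERE student_id = 1;                     -- remove the record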

Data Control Language (DCL)


DCL is used for granting and revoking user access to a database.
To grant access to user – GRANT
To revoke access from user – REVOKE
In practice, the data definition language, data manipulation language, and data control language are not separate languages; rather, they are parts of a single database language such as SQL.
Transaction Control Language (TCL)
The changes made to the database using DML commands are either committed or rolled back using TCL.
 To persist the changes made by DML commands in the database
– COMMIT
 To rollback the changes made to the database – ROLLBACK
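A minimal sketch of DCL and TCL together; the user name exam_cell and the student table are hypothetical, and the syntax for starting a transaction varies slightly between systems:

GRANT SELECT, INSERT ON student TO exam_cell;   -- DCL: give the user read and insert access
REVOKE INSERT ON student FROM exam_cell;        -- DCL: withdraw the insert privilege

START TRANSACTION;
UPDATE student SET name = 'A. Rao' WHERE student_id = 1;
COMMIT;                                         -- TCL: make the change permanent
-- ROLLBACK; would undo the change instead, if issued before COMMIT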
1.13 VENDORS OF DBMS SYSTEMS AND THEIR
PRODUCTS
Database products can be classified as relational (R), extended-relational (X), object-relational (OR), object-oriented (OO), network (N), or hierarchical (H). It is worth noting that some vendors claim that their database management system belongs to more than one of these categories; several designations then define the DBMS type. Centura Software, for example, claims that their Velocis database is built on both relational and network structures, so the notation "RN" is used in this instance. Some leeway has been taken with the term "Enterprise" for the main market: when a vendor does not specify a primary market for their database management system, it is listed as "Enterprise."
DBMS Vendor
Access (Jet, MSDE) Microsoft


Adabas D Software AG
Adaptive Server Anywhere Sybase
Adaptive Server Enterprise Sybase
Advantage Database Server Extended Systems
Datacom Computer Associates
DB2 Everyplace IBM
Filemaker FileMaker Inc.
IDMS Computer Associates
Ingres ii Computer Associates
Interbase Inprise (Borland)
MySQL Freeware
NonStop SQL Tandem
Pervasive.SQL 2000 (Btrieve) Pervasive Software
Pervasive.SQL Workgroup Pervasive Software
Progress Progress Software
Quadbase SQL Server Quadbase Systems, Inc.
R: Base R:Base Technologies
Rdb Oracle
Red Brick Informix (Red Brick)
SQL Server Microsoft
SQLBase Centura Software
SUPRA Cincom
Teradata NCR
YARD-SQL YARD Software Ltd.
TimesTen TimesTen Performance Software
Adabas Software AG
Model 204 Computer Corporation of America


UniData Informix (Ardent)


UniVerse Informix (Ardent)
Cache' InterSystems
Cloudscape Informix
DB2 IBM
Informix Dynamic Server 2000 Informix
Informix Extended Parallel Server Informix
Oracle Lite Oracle
Oracle 8I Oracle
PointBase Embedded PointBase
PointBase Mobile PointBase
PointBase Network Server PointBase
PostgreSQL Freeware
UniSQL Cincom
Jasmine ii Computer Associates
Object Store Exceleron
Objectivity DB Objectivity
POET Object Server Suite Poet Software
Versant Versant Corporation
Raima Database Manager Centura Software
Velocis Centura Software
Db.linux Centura Software
Db.star Centura Software
IMS DB IBM


CHAPTER 2
DATABASE DESIGN AND DATA MODELS

2.1 Introduction
A database collects bulk data stored in a framework that
makes it easy to find and explore related data. A well-designed
database provides reliable and up-to-date information, making
data retrieval quick and easy. We should appreciate the value of
a database to an organization that deals with large amounts of
data daily. However, this requires a database design capable of handling all kinds of data quickly and accurately.
Database Design
Database design refers to a set of activities that aid in designing, developing, implementing, and maintaining a company's data management systems. The primary goal of database design is to create physical and conceptual representations of the proposed database structure.
Designing a Good Database
Basic rules guide a successful database design process. The first rule states that duplicate data should be avoided, because it consumes space and increases the chance of errors and anomalies. The next rule is that the consistency and completeness of the information are critical: if a database contains incorrect information, all reports that retrieve data from it will also contain incorrect information. As a result, all conclusions based on such reports would be incorrect, which emphasizes the value of a database design that follows these guidelines.
So, how do you make sure the database design is up to par?
A well-designed database satisfies the following criteria:
• To reduce data complexity, divide the data into tables based on
particular subject areas.

• Provides the database with the information required to connect the tables' results.
• Assists and ensures data accuracy and reliability.
• Meets the needs for information collection and publishing
• As far as possible, works in tandem with database owners.
The Importance of the Database Design Process
Database design defines the database system used for preparing, storing, and handling information. Data accuracy is achieved only when the database contains nothing but valuable and essential information.
A well-designed database is critical for ensuring data
integrity, removing redundant data, effectively running queries,
and optimizing database efficiency. Taking the time to design a
database carefully would save you time and frustration during
the database construction process. A robust database design also
makes it simple to view and retrieve information anytime you
need it.
The table arrangement determines the data's durability, while creating primary and unique keys ensures that the data is stored uniquely. Data duplication can be avoided by creating a table of possible values and referring to each value by a key; if the value changes, the adjustment is made only once, in the main table.
A good database design yields quick queries and faster execution, since the design determines the database's overall performance, and it is simple to manage and upgrade. Repairing even minor glitches in a shoddy database design, by contrast, can result in the loss of stored events, views, and utilities.


2.2 The Life Cycle of Database Creation


The construction of a database goes through many phases, although it is not always necessary to complete every step in strict order. Requirement analysis, database design, and implementation are the three stages of the life cycle.
1-Analysis of Requirements
There are two stages to requirement analysis:
 Planning: The overall schedule for the Database Development
Life Cycle is determined. It also necessitates an examination of
the company's information technology approach.
 Defining the system: This stage clarifies and defines the scope
of the proposed database system.
2- Database development
When it comes to database design, there are two main
models to consider:
 Logical model: This is concerned with creating a database
model based on the specified specifications. The whole design is
sketched out on paper, with little consideration for any
particular DBMS requirements or physical implementation.
 Physical model: This step follows the logical model and entails
physically putting the logical model into action. It considers the
database management system (DBMS) and other physical
implementation considerations.
3-Implementation
The database architecture life cycle's implementation stage is
concerned with:
 Data loading and conversion: importing and converting data from the existing system to the new database.
 Testing: Finally, this step detects defects in the new system and confirms that it fulfills all database requirements.

2.3 Techniques for Database Design


The following are the two most popular database design
techniques:
 Normalization: Tables are ordered such that data duplication
and dependence are reduced. Larger tables are subdivided into
smaller tables, which are then interconnected using
relationships.
 Entity-Relationship (ER) Modeling: This is a graphical database design approach that represents real-life objects by modeling entities and their attributes and defining relationships among them. Any real-world item that is distinct or separate from its environment is referred to as an entity.
2.4 Database Design: Steps to Creating a Database
Identifying the purpose of the database is usually the first
step in database design. The pertinent information is then
gathered and sorted into tables. Then, for a more effective data
design, you define primary keys and evaluate relationships
between tables. The final step in table standardization is
introducing normalization rules after optimizing the tables.
Let us take a closer look at these database design steps:
 Decide what your database's goal is.
The first step is to figure out what your database's purpose is. For example, if you run a small home-based company, you could create a customer database to keep track of customer information and produce e-mails and reports. It is therefore essential to understand the purpose of the database.
After this phase, you will have a clear mission statement to
relate to in the database design process. It will assist you in
focusing on your goals while making critical choices.

 Locate and consolidate the required information.


The next move is to gather all of the data you need to store
in the database. Begin with the most up-to-date material.
Consider the questions you want your database to address to
determine which details should be recorded.
 Divide the information into tables.
Once you have gathered all relevant information, you will need to split it into major entities or subject areas. For example, if you are a merchant, some essential entities may be goods, consumers, manufacturers, and orders. Each entity would then be assigned its own table.
 Turn information items into columns.
Within each table, data is organized into columns, with each data item being a field shown as a column. A Customer table, for example, might contain fields such as name, address, e-mail address, and city.
Once you have determined the initial collection of columns
for each table, you can fine-tune them. For example, a
customer's name may be split into two columns: first and last.
You may also divide the address into five columns based on the
address, town, state, zip code, and area. You would be able to
filter facts more efficiently due to this.
 Determine the primary keys
Selecting a primary key for each table is the next step in improving the database design. The primary key is a column (or set of columns) that uniquely identifies each row. Customer ID, for example, may be the primary key in your customer table; based on the customer ID, you can identify each row uniquely.

A primary key may also consist of several columns, in which case it is called a composite key. For example, order ID and product ID together may be the primary key of your Order Details table. Fields of identical or different data types may be combined to form a composite key.
Similarly, you can recognize the product ID from the
Products table and the order number or ID from the Orders
table and get an idea of the product sales.
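A minimal sketch of such a composite key (the table and column names are illustrative):

CREATE TABLE order_details (
    order_id   INT,
    product_id INT,
    quantity   INT,
    PRIMARY KEY (order_id, product_id)   -- composite key: the pair identifies each row
);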
 Work out how the tables are linked
After dividing the data into columns, the information must be pulled together logically. Look at each table and see whether the data in one table relates to the data in another. You can add fields or create new tables to clarify the relationships, if necessary.
2.5 Introduction of ER Model
The ER Model is used to model the system's conceptual view
from a data perspective and consists of the following
components:
Components of ER Diagram

Fig 2.1 Components of ER Diagram



2.5.1 Entity
Any object, class, person, or place may be considered an entity. Rectangles are used to represent entities in the ER diagram. Consider an organization as an example: a manager, a product, an employee, a department, and so on may all be considered separate entities.

Fig 2.2 Entity

a. Weak Entity
A weak entity is dependent on another entity and does not have a key attribute of its own. A double rectangle represents a weak entity.

Fig 2.3. Weak Entity

2.5.2 Attribute
The attribute is used to describe a property of an entity. An ellipse is the symbol for an attribute.

A student's ID, age, phone number, and name, for example, are all attributes.

Fig 2.4 Attribute


a. Key Attribute
The key attribute is used to represent the main characteristic of an entity; it corresponds to a primary key. An ellipse with the text underlined represents the key attribute.


Fig 2.5 Key Attribute

b. Composite Attribute
A composite attribute is an attribute that is made up of several other attributes. The composite attribute is represented by an ellipse, and its component attributes are represented by ellipses linked to it.

Fig 2.6 Composite Attribute

c. Multivalued Attribute
There can be several values for an attribute. A multivalued
attribute is a kind of attribute that has several values. A double
oval represents a multivalued attribute.
A student, for example, can have several phone numbers.


Fig 2.6 Multivalued Attribute

d. Derived Attribute
A derived attribute is an attribute whose value can be derived from another attribute. A dotted ellipse is used to represent it. A person's age, for example, changes over time and can be determined from another attribute, the date of birth.

Fig 2.7 Derived Attribute


2.5.3 Relationship
The term "relationship" describes the connection between
two or more individuals. A diamond or rhombus symbolizes the
partnership.

Fig 2.8 Relationship

Types of relationships are as follows:


a. One-to-One Relationship
A one-to-one relationship exists when a single instance of one entity is associated with a single instance of the other entity. For example, a female can marry only one male, and a male can marry only one female.

Fig 2.9. One-to-One Relationship


b. One-to-many relationship
A one-to-many relationship exists when one instance of the entity on the left is associated with multiple instances of the entity on the right. A scientist, for example, may create a large number of inventions, but each invention is created by a single scientist.

Fig 2.10 One-to-many relationship

c. Many-to-one relationship
A many-to-one relationship exists when more than one instance of the entity on the left is associated with only one instance of the entity on the right. A student, for example, enrolls in only one course, but a course may contain a large number of students.

Fig 2.11 Many-to-one relationship



d. Many-to-many relationship
A many-to-many relationship exists when more than one instance of the entity on the left is associated with more than one instance of the entity on the right. Employees, for example, can be assigned to a variety of projects, and a project can involve a large number of employees.

Fig 2.12 Many-to-many relationship
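In a relational schema, this kind of relationship is usually implemented with a separate linking table; the sketch below uses illustrative names:

CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    name   VARCHAR(100)
);

CREATE TABLE projects (
    project_id INT PRIMARY KEY,
    title      VARCHAR(100)
);

-- The linking table breaks the many-to-many relationship into
-- two one-to-many relationships.
CREATE TABLE works_on (
    emp_id     INT REFERENCES employees(emp_id),
    project_id INT REFERENCES projects(project_id),
    PRIMARY KEY (emp_id, project_id)
);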

2.5.4 Participation Constraint


A participation constraint is a constraint imposed on the entities participating in a relationship set.
1. Total Participation – Every entity in the entity set must participate in the relationship. If each student is required to enroll in a course, the participation of the Student entity set is total. In the ER diagram, a double line represents total participation.
2. Partial Participation – An entity in the entity set may or may not participate in the relationship. If some students do not enroll in any of the courses offered, the participation of the Student entity set is partial.
The 'Enrolled in' relationship set is shown in the diagram. The Student entity set has total participation, and the Course entity set has partial participation.


Fig 2.13 relationship set

Using set, it can be represented as,

Fig 2.14 relation set multivalued

Every student in the Student entity set participates in the
relationship, but one course, C4, does not participate.
An entity type normally has a key attribute that uniquely
identifies each entity in the entity set. However, some entity
types have no key attribute of their own. Such an entity type is
called a Weak Entity.
For example, an organization may keep track of an
employee's dependents (parents, children, and spouse). The
dependents cannot exist without the employee. As a result,
Dependent is a weak entity type, and Employee is the
identifying entity type for Dependent.

A double rectangle represents a weak entity type. The
participation of the weak entity type in the identifying
relationship is always total. The identifying relationship, drawn
as a double diamond, is the relationship between a weak entity
type and its identifying strong entity type.

Fig. 2.15 weak entity form
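A weak entity type is commonly mapped to a table whose primary key combines the owner's key with the weak entity's partial key. A minimal SQL sketch, assuming hypothetical EMPLOYEE and DEPENDENT tables and column names:

CREATE TABLE EMPLOYEE (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(50)
);

CREATE TABLE DEPENDENT (
    emp_id   INT,                          -- key of the identifying (owner) entity
    dep_name VARCHAR(50),                  -- partial key of the weak entity
    relation VARCHAR(20),
    PRIMARY KEY (emp_id, dep_name),        -- the weak entity has no key of its own
    FOREIGN KEY (emp_id) REFERENCES EMPLOYEE (emp_id)
);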

2.6 Notation of ER Diagram


Notations can be used to describe a database; in an ER
diagram they are often used to express cardinality. The
following are the notations:

Fig. 2.16: Notations of ER diagram



2.7 Keys in Relational Database Management Systems


 In a relational database, keys are very significant.
 A key is used to uniquely identify every record or row of data in a
table. Keys are also used to establish and identify relationships between tables.
E.g., since ID is unique for each student, it is used as a key
in the Student table. Passport number, license number, and SSN
are keys in the Individual table since they are unique for each
person.

Fig. 2.17 keys
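For instance, such keys are declared when the tables are created; a minimal SQL sketch, assuming simple Student and Individual tables with hypothetical column types:

CREATE TABLE STUDENT (
    id   INT PRIMARY KEY,        -- unique for each student, so it serves as the key
    name VARCHAR(50)
);

CREATE TABLE INDIVIDUAL (
    passport_number VARCHAR(20) PRIMARY KEY,  -- any of these unique columns could serve as the key
    license_number  VARCHAR(20) UNIQUE,
    ssn             VARCHAR(11) UNIQUE,
    name            VARCHAR(50)
);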

2.8 Additional E-R Model Features


• Specialization – Specialization is the process of designating
subgroupings within an entity set. In the diagram above, a
"person" is specialized as either an "employee" or a "customer."
 In the diagram, specialization is represented by a triangle
component labeled ISA ("is a"), indicating, for example, that a
customer "is a" person.
 This ISA relationship is sometimes called a superclass-subclass
relationship. It is often used to stress the importance of creating
distinct lower-level entity sets.
• Generalization – Generalization is an association between a
higher-level entity set and one or more lower-level entity sets.
The process of generalization combines these entity sets into a
single, higher-level entity set.
• Higher- and lower-level entity sets – Specialization and
generalization give rise to these levels. Lower-level entity sets
inherit the properties of higher-level entity sets.
 E.g., "customer" and "employee" inherit the characteristics of
"person" in the diagram above.
• Attribute inheritance: Single attribute inheritance occurs when
a lower-level entity set participates in just one "ISA" (is a)
relationship. Multiple attribute inheritance occurs when a
lower-level entity set participates in more than one ISA (is a)
relationship.
• Aggregation: The E-R model has one limitation: it cannot
express relationships among relationships. Aggregation is
therefore an abstraction that treats relationships as higher-level
entities.

Reducing the ER diagram to tables

The database can be represented using ER notations, and
these notations can be reduced to a collection of tables.


Any entity set or relationship set in the database can be
expressed in tabular form.
The ER diagram is given below.

Fig. 2.18 Example ER Diagram

There are a few things to consider when translating the ER
diagram into tables:
 The entity type is transformed into a table.
 LECTURE, STUDENT, SUBJECT, and COURSE in the ER
diagram each form a separate table.
 Each single-valued attribute becomes a column of its table. For
example, STUDENT_NAME and STUDENT_ID are columns of
the STUDENT table for the STUDENT entity, and
COURSE_NAME and COURSE_ID are columns of the COURSE
table.
 The key attribute of an entity type becomes the primary key of
its table. In the ER diagram, the key attributes are COURSE_ID,
STUDENT_ID, SUBJECT_ID, and LECTURE_ID.
 A multivalued attribute is represented by a separate table.
Hobby is a multivalued attribute of STUDENT, so its values
cannot be stored in a single column of the STUDENT table.
Instead, we create a table STUD_HOBBY with the columns
STUDENT_ID and HOBBY, and the two columns together form
a composite key.
 A composite attribute is represented by its components. Student
address is a composite attribute in the ER diagram; it includes
CITY, PIN, DOOR#, STREET, and STATE. These components are
included as individual columns in the STUDENT table.
 Derived attributes are not stored in the table. Age is the derived
attribute in the STUDENT table; it can be computed whenever
needed from the date of birth and the current date.
Using these rules, you can translate the ER diagram into
tables and columns and then define the mappings between the
tables. The following is the table structure for the given ER
diagram:


Figure: 2.19 Table structure
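A minimal SQL sketch of part of this table structure, with assumed column names and types, illustrating the rules for composite, multivalued, and derived attributes:

CREATE TABLE STUDENT (
    student_id   INT PRIMARY KEY,
    student_name VARCHAR(50),
    -- components of the composite address attribute become individual columns
    door_no      VARCHAR(10),
    street       VARCHAR(50),
    city         VARCHAR(30),
    state        VARCHAR(30),
    pin          VARCHAR(10)
    -- the derived attribute AGE is not stored; it is computed from the date of birth
);

-- the multivalued attribute HOBBY gets its own table with a composite key
CREATE TABLE STUD_HOBBY (
    student_id INT,
    hobby      VARCHAR(30),
    PRIMARY KEY (student_id, hobby),
    FOREIGN KEY (student_id) REFERENCES STUDENT (student_id)
);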


2.9 Conceptual Design with E-R Data Model
There are four stages in creating a conceptual data model
using the E-R diagram.
 Recognize entity sets.
 Define the value sets, attributes, and primary key for each entity
set.
 For each relationship, identify the relationship sets and their
semantic details (cardinality, subtype/supertype).
 Combine the various views of entities, attributes, and relationships
into a single view.

Entity and attribute identification guidelines


 Entities provide descriptive details; purely identifying
attributes do not.

 Multivalued attributes should be modeled as entities.


 A descriptor of one entity should be modeled as an entity if it
has a many-to-one relationship with another entity.
 Assign attributes to the entities they most directly describe.
 As far as possible, avoid composite identifiers.
 Reattach attributes to the relevant entities when a generalization
or subset hierarchy of entities is identified.
Relationship definition guidelines
 Eliminate relationships that are redundant.
 Define a ternary relationship only when the association among
the entities cannot be expressed by several binary relationships.
Examples of potentially redundant relationships:
• An employee who works in a city is a member of a professional
society with offices in various locations.
• An employee's job involves working on various projects in
different cities.
• A student may be a member of several clubs at his or her school.
Evaluating a conceptual data model
• Extensibility/flexibility — the ease with which the model can be
modified to meet evolving needs.
• Expressiveness — the capacity to represent different abstractions
and relationships naturally, without further explanation.
• Simplicity — easy to use and comprehend.
• Formality — defined precisely enough to leave no room for
individual interpretation.
• Self-explanatory — no extra annotations are needed.
• Clarity — no doubt or guesswork.
• Completeness — all relevant features of the application domain
are represented.

• Correctness — syntactic (concepts are correctly defined) and
semantic (concepts such as entities and relationships are used
according to their definitions).
• Minimality — no concept can be removed from the schema
without losing information.
• Readability — presented pleasingly (symmetry, minimal line
crossings and bends, etc.).
2.10 Conceptual Design for Large Enterprise
The conceptual design process involves describing small
fragments of the application in terms of ER diagrams. For a
large enterprise, the design may require the collaboration of
several designers and span the data and application code used
by many different user groups. An essential part of the design
process is the methodology used to structure the overall design
and ensure that it takes all user requirements into account and
remains consistent.
The usual approach is to gather the requirements of the
different user groups, resolve any conflicts, and produce a single
set of global requirements after the requirements analysis phase.
An alternative is to create separate conceptual schemas for the
different user groups and then merge these conceptual schemas.
To merge several conceptual schemas, we must establish
correspondences between entities, relationships, and attributes
and resolve the various conflicts.
2.11 Introduction to Data Models in DBMS
A data model explains the logic behind the structure of a
database system. It typically covers all the tables (represented as
entities in the ER model), the relationships between the tables
and objects, and the requirements provided by the project team
to decide how data should be stored and accessed, given the
needs of the target database system.
Different types of Data Models in DBMS
The different types of data models in use are given below:
 Flat Data Model
 Entity-Relationship Model
 Relational Model
 Record-based Model
 Network Model
 Hierarchical Model
 Object-oriented Data Model
 Object-Relational Model
 Semi-structured Model
 Associative Model
 Context Data Model
Detailed descriptions of these data models are given below.
1. Flat Data Model
The flat data model was the first data model to be used; all
data was kept in the same plane (a single two-dimensional
structure). It is an outdated model and lacks a rigorous basis.
2. Entity-Relationship Data Model
The entity-relationship data model is based on the perception
of real-world entities and the relationships among them. When
a real-world scenario is modeled into a database, entity sets are
created first, and the model is then built on the two vital aspects
mentioned below: entities composed of attributes, and the
relationships that exist among the entities.


An attribute is a real-world property that an entity has, and a
set of permitted values, known as a domain, defines each
attribute. For example, an employee is an entity in an office, the
office is the database, and the employee's ID and name are
attributes. A relationship is the logical association between the
various entities.
3. Relational Data Model
The relational data model is the most common and widely
used. In this model, data is stored in relational tables. The
relations are normalized, so the values stored in them are
atomic. Each row of a relation is called a tuple, and each cell of a
tuple holds a single value. The values of a column all belong to
the same domain and form an attribute.
4. Network Data Model
In the network data model, the entities are represented as a
graph, so an entity can be reached through several different
paths.
5. Hierarchical Data Model
The hierarchical model is based on the parent-child
relationship. There is one parent entity with multiple child
entities in this model, and at the top there is only one entity,
called the root. An organization, for example, is the root entity
and has many child entities such as clerks, officers, and several
others.
6. Object-oriented Data Model
The object-oriented data model is one of the most developed
data models; it can hold video, graphical images, and audio. It
comprises the data together with the methods (processes),
stored as database management system instructions.


7. Record-based Data Model


The record-based data model determines the overall design
of the database. The data is organized into several record types,
and each record type has a fixed number of fields, each of a
predetermined length.
8. Object-relational Data Model
The object-relational data model is powerful, but designing
object-relational databases is challenging. Since this model
produces efficient results and has a broad range of applications,
some of its complexity can be tolerated. It also provides facilities
for working with other data models; for example, we can still
work with the relational model through the object-relational
data model.
9. Semi-structured Data Model
A semi-structured data model is a self-describing data model:
the schema information is embedded within the data itself
rather than stored separately.
10. Associative Data Model
The associative data model is based on the principle of
dividing data into two categories: entities and associations. The
model therefore represents every real-world scenario in terms of
entities and the associations between them.
11. Context Data Model
Context data models are very versatile because they combine
several data models: hierarchical, network, semi-structured, and
object-oriented. Because of this modular design, the model can
be used for a variety of tasks, and it can therefore support
different kinds of users, depending on how they interact with
the database.

2.12 View of Data in DBMS


Abstraction is one of the core features of database systems,
and the view of data in a DBMS is based on it. Hiding trivial
details from users and offering an abstract view of the data
makes user-database interaction easier and more effective. We
covered the three levels of DBMS architecture in the previous
section; the "view level" is the highest level of that architecture.
The view level gives users a "view of data" while hiding
non-essential information such as data relationships, the
database schema, constraints, and security details.
To understand the view of data, you need a basic
understanding of data abstraction and of instances and schemas.
These two topics are covered below.
1. Data abstraction
2. Schema and instance
Database systems are built with complex data structures. To
make interaction with the database more user-friendly,
developers hide the irrelevant internal details from users. Data
abstraction is this process of hiding non-essential details from
the user.


Fig. 2.20 View of Data in DBMS

2.12.1 Abstraction is divided into three levels


The physical level is the lowest level of data abstraction. It
describes how data is actually stored in the database and
involves complex low-level data structures.
The logical level is the middle level of the three-level data
abstraction architecture. It describes what data is stored in the
database and the relationships among the data.


The view level is the highest level of abstraction. The user's
interaction with the database system is defined at this level.
Suppose we use a table to store customer records. Physically,
these records are stored as blocks of memory (bytes, kilobytes,
and so on); most programmers are unaware of these details.
Logically, these records are described as fields or attributes
with data types, and the relationships among them are defined.
Programmers who work with database systems usually work at
this level.
At the view level, users simply interact with the system
through a graphical user interface (GUI) and enter data on the
screen; they do not know how or where the data is stored, since
such details are hidden from them.

2.12.2 Data model Schema and Instance


An instance of a database is the data contained in the
database at a particular time.
 Schema refers to the overall design of a database.
 A database schema is the skeleton structure of the database. It
represents the logical view of the entire database.
 A schema comprises schema objects such as tables, foreign
keys, primary keys, views, columns, data types, stored
procedures, and so forth.
 A database schema can be described by a diagram that depicts
the database objects and their relationships.
 Database designers create database schemas to assist the
programmers who will interact with the database. The process
of creating a database schema is called data modeling.

A schema diagram shows only certain aspects of a schema,
such as the names of record types, data items, and constraints;
other aspects cannot be specified in it. The diagram shown, for
example, does not indicate the data type of each data item or the
relationships among the different files.
The actual data in the database changes frequently; as seen in
the diagram, the database changes whenever we add a new
student or a new grade. The data in the database at a particular
moment in time is called a database instance.

Fig 2.21 Data model Schema


2.13 Data Independence


 The three-schema architecture can be used to illustrate data
independence.
 Data independence refers to the ability to change the schema of
a database system at one level without affecting the schema at
the next higher level.
Data independence can be divided into two categories:
1. Logical Data Independence
 Logical data independence is the ability to alter the conceptual
schema without changing the external schema.
 Logical data independence separates the external level from the
conceptual level.
 Modifications to the logical view of the data have no impact on
the user view of the data.
 Logical data independence applies at the user interface level.

2. Physical Data Independence


 Physical data independence is the ability to modify the internal
schema without affecting the logical (conceptual) schema.
 If the storage device of the database server is changed, the
conceptual structure of the database is not affected.
 Physical data independence separates the conceptual level from
the internal level.
 Physical data independence applies at the logical interface level.
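As a small illustration of logical data independence, an external view can shield application programs from changes to the conceptual schema. A sketch, assuming a hypothetical EMPLOYEE table:

-- external schema: programs query the view, not the base table
CREATE VIEW EMP_CONTACT AS
SELECT emp_id, emp_name, phone
FROM   EMPLOYEE;

-- if EMPLOYEE is later split or gains new columns, only the view definition
-- is adjusted; programs that use EMP_CONTACT remain unchanged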

2.14 Three-Schema Architecture and Data Independence


Three of the four main characteristics of the database
approach are (1) the use of a catalogue to store the database
description (schema) so that it is self-describing, (2) insulation of
programs and data (program-data and program-operation
independence), and (3) support of multiple user views. In this
section, we describe a database system architecture called the
three-schema architecture, which was proposed to help achieve
and visualize these characteristics. The principle of data
independence is then discussed further.

Fig. 2.22 Three-Schema Architecture and Data Independence


As shown in Figure 2.22, the goal of the three-schema
architecture is to separate the user applications from the
physical database. In this architecture, schemas can be defined at
the following three levels:
An internal schema describes the physical storage structure
of the database at the internal level. The internal schema uses a
physical data model and describes the complete details of data
storage and the access paths for the database.
A conceptual (logical) schema describes the structure of the
whole database for a community of users at the conceptual level.
The conceptual schema hides the details of physical storage
structures and concentrates on describing entities, data types,
relationships, user operations, and constraints. A
representational data model is usually used to describe the
conceptual schema when a database system is implemented, and
this implementation conceptual schema is often based on a
conceptual schema design in a high-level data model.
Various external schemas or user views are used at the
external or view level. Each external schema specifies the
portion of the database that a particular user group is interested
in while keeping most of the database hidden from that user
group. Each external schema is usually implemented using a
representational data model, which might be based on an
external schema design in a high-level data model, as in the
previous level.
The three-schema architecture is a useful tool for visualizing
the different schema levels in a database system. Most DBMSs
do not separate the three levels completely and explicitly, but
they support the three-schema architecture to some degree;
some older DBMSs, for example, may include physical-level
details in the conceptual schema. Because it clearly separates the
users' external level, the database's conceptual level, and the
internal storage level, the three-level ANSI architecture has
played a significant role in the advancement of database
technology and is still widely used in the design of database
management systems.

In most DBMSs that support user views, external schemas are
defined in the same data model that describes the conceptual-
level information (for example, a relational DBMS like Oracle
uses SQL for both). Some DBMSs, however, allow different data
models at the conceptual and external levels. Universal Database
(UDB), an IBM DBMS, uses the relational model to describe the
conceptual schema but may use an object-oriented model to
describe an external schema.
It is worth noting that the three schemas are only
descriptions of data; the actual data exists only at the physical
level. In a DBMS based on the three-schema architecture, each
user group refers only to its own external schema. The DBMS
must therefore transform a request expressed on an external
schema into a request on the conceptual schema, and then into a
request on the internal schema for processing over the stored
database. If the request is a database retrieval, the data extracted
from the stored database must be reformatted to match the
user's external view. These processes of transforming requests
and results between levels are called mappings. Because these
mappings can be time-consuming, some DBMSs, especially
those designed to support small databases, do not support
external views. Even in such systems, however, some mapping
is needed to transform requests between the conceptual and
internal levels.
In a multiple-level DBMS, the catalogue must be expanded to
include information on how to map requests and data among
the various levels. The DBMS uses additional software to
accomplish these mappings by referring to the mapping
information in the catalogue. When the schema at one level is
changed, the schema at the next higher level can remain
unchanged; only the mapping between the two levels is altered.
As a result, application programs that refer to the higher-level
schema do not need to be modified.
The three-schema architecture makes it easier to achieve true
data independence, both physical and logical. The two levels of
mappings, however, add overhead to the compilation or
execution of a query or program, which reduces DBMS
efficiency. As a result, only a few DBMSs have implemented the
full three-schema architecture.


CHAPTER 3
RELATIONAL MODEL

3.1 Introduction
In the relational model, data is represented by tables, also
called relations.
Relational Schema: A schema describes the structure of a
relation. For instance, the relational schema of the STUDENT
relation looks like this:
Student (Stud_No, Stud_Name, Stud_Phone, Stud_State,
Stud_Country, Stud_Age)
Relational Instance: Table 1 and Table 2 below are relational
instances, i.e., the sets of values present in a relation at a
particular time.

Table 1: Student

Stu_id  Stu_name  Stu_phone  Stu_state   Stu_country  Stu_age
1       Raju      123456     Telangana   India        20
2       Rani      987456     AP          India        21
3       Rama      654123     Delhi       India        20
4       Sitha     963258     Punjab      India        21

Table 2: Student_Course

Stu_No  Course_Id  Course_Name
1       C1         DWDM
2       C2         BDA
1       C2         BDA


Attribute: Each relation is described by a set of properties,
each of which is called an attribute. For example, STUD_NO and
STUD_NAME are attributes of the relation STUDENT.
The domain of an attribute: The domain of an attribute is the
set of permitted values for that attribute. For example, the
domain of STUD_AGE can be the integers from 18 to 40.
Tuple: A tuple is a single row of a relation. For example, the
STUDENT relation given above has four tuples.
NULL values: NULL values represent attribute values that
are unknown, missing, or not applicable for some tuples. Two
NULL values in a relation are regarded as distinct.
Table 1 and Table 2 together form a relational model with the
two relations STUDENT and STUDENT_COURSE.
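The STUDENT relation above could be declared in SQL roughly as follows (a sketch; the data types are assumptions):

CREATE TABLE STUDENT (
    Stud_No      INT PRIMARY KEY,
    Stud_Name    VARCHAR(50),
    Stud_Phone   VARCHAR(15),
    Stud_State   VARCHAR(30),
    Stud_Country VARCHAR(30),
    Stud_Age     INT
);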

3.2 CODD Principles


E. F. Codd proposed a set of rules (Codd's rules) that a
relational database management system should satisfy.
1. The Foundation Rule: Any system marketed or claimed to be a
relational database management system must be able to
manage databases entirely through its relational capabilities.
2. Information Rule: All data in a relational model must be
represented as values in table cells.
3. Guaranteed Access Rule: Every data value must be accessible
through a combination of table name, primary key value, and
the name of the attribute whose value is required.
4. Systematic Treatment of NULL Values: A NULL value in a
database may represent only missing, unknown, or
inapplicable information.


5. Active Online Catalog Rule: The database description
(structure) must be stored in an online catalog that authorized
users can query.
6. Comprehensive Data Sub-language Rule: The database must be
accessible through a language that supports data definition,
data manipulation, and transaction management operations.
7. View Updating Rule: The system should automatically update
the different views created for various purposes.
8. High-level Insert, Update, and Delete Rule: The relational
model must support insert, delete, and update operations at
the level of sets of tuples. Set operations such as union,
intersection, and minus should also be supported.
9. Physical Data Independence: A change in the physical location
of a table must not require changes at the application level.
10. Logical Data Independence: Changes to a table's logical
(conceptual) schema must not require changes at the
application level. For example, merging two tables into one
should have no effect on how an application accesses the data;
this rule is challenging to achieve in practice.
11. Integrity Independence: Integrity constraints must be definable
and changeable at the database level, without having to be
enforced at the application level.
12. Distribution Independence: End users should not be able to see
how data is distributed across multiple sites.
13. Non-Subversion Rule: Users with low-level access to the data
should not be able to alter the data by bypassing the integrity
rules.

3.3 Constraints Over Relational Database Model


When designing a relational database schema, we must
specify constraints: which values may be inserted into a relation
and which updates and deletions are permitted. These are the
constraints we place on the relational database.
Models such as the ER model did not provide these features.
Database constraints can be divided into three categories:
1. Constraints that are built into the data model itself, known as
implicit (inherent) constraints.
2. Constraints that are expressed explicitly in the data model
schemas by defining them in the DDL (Data Definition
Language), known as schema-based or explicit constraints.
3. Constraints that cannot be expressed directly in the data model
schemas, known as application-based or semantic constraints.
In what follows, we deal with the implicit constraints.
There are four kinds of constraints on a relational database:
1. Domain constraints
2. Key constraints
3. Entity Integrity constraints
4. Referential integrity constraints
Let's discuss each of the above constraints in detail.

3.3.1. Domain constraints


Each attribute value must be atomic (an indivisible unit),
which means composite and multivalued attributes are not
permitted. The data type acts as the check here: when we assign
a data type to a column, we restrict the values that can be stored
in it. For example, if we assign the data type int to the attribute
age, we cannot store values of any other data type in it.

Example:
EID NAME PHONE
0010 GUPTHA 9492004956
Explanation:
In the above relation, NAME is treated as a composite
attribute and PHONE as a multivalued attribute (an employee
may have more than one number), which violates the domain
constraint.
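A domain constraint is enforced mainly through the column data types; a sketch, assuming a simple EMPLOYEE table:

CREATE TABLE EMPLOYEE (
    eid   CHAR(4),
    name  VARCHAR(50),     -- one atomic value, not a composite of parts
    phone VARCHAR(15),     -- only one phone number per row
    age   INT              -- only integer values are accepted
);

-- the following insert would be rejected: 'twenty' is not in the domain of INT
-- INSERT INTO EMPLOYEE (eid, name, phone, age) VALUES ('0010', 'GUPTHA', '9492004956', 'twenty');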
3.3.2 Key Constraints or Uniqueness Constraints
These are known as uniqueness constraints because they
guarantee that every tuple in the relation is unique. A relation
may have several candidate keys (minimal super keys), from
which we choose one as the primary key. There is no restriction
on which candidate key is chosen as the primary key, but
choosing the candidate key with the fewest attributes is
recommended.
Since NULL values are not permitted in the primary key, the
NOT NULL restriction is also part of the key constraint.
Example:
EID    NAME     PHONE
0010   GURU     9492004956
0112   RAJ      123456987
0010   NARESH   897456123

Explanation:
EID is the primary key of the above table, and the first and
last tuples have the same value of EID (0010), so the key
constraint is violated.
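In SQL, the key constraint (including NOT NULL on the primary key) is declared when the table is created; a sketch with assumed column types:

CREATE TABLE EMPLOYEE (
    eid   CHAR(4) PRIMARY KEY,     -- must be unique and non-null in every tuple
    name  VARCHAR(50),
    phone VARCHAR(15) UNIQUE       -- a further candidate key, if phone numbers are unique
);

-- inserting a second row with eid = '0010' would violate the key constraint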


3.3.3 Entity Integrity Constraints


Since the primary key uniquely identifies each tuple in a
relation, the entity integrity constraint states that no primary
key value may be NULL.
Example
EID NAME PHONE
0010 GURU 9492004956
0112 RAJ 123456987
NULL VENU 88854898989
Explanation:
EID is the primary key of the above relation, and a primary
key cannot accept NULL values. However, the primary key is
NULL in the third tuple, so the entity integrity constraint is
violated.

3.3.4 Referential Integrity Constraints


Referential integrity constraints are defined between two
relations (tables) and are used to keep the tuples of the two
relations consistent. If a foreign key attribute of relation R1 has
the same domain as the primary key of relation R2, the foreign
key of R1 is said to reference the primary key of R2. The foreign
key value in a tuple of R1 must either equal the primary key
value of some tuple in R2 or be NULL; no other value is allowed.

Example:

EID    NAME     DNO
010    GURU     10
011    GAJA     10
012    RAJA     11
013    RANGA    11

DNO    PLACE
10     CHENNAI
11     HYDERABAD
Explanation:
DNO is the foreign key in the first relation and the primary
key in the second relation. A value such as DNO = 22 would not
be permitted in the foreign key column of the first table, because
DNO = 22 does not exist as a primary key value in the second
relation; it would violate the referential integrity constraint.
Database constraints should be used to enforce such data
integrity.
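In SQL, referential integrity is declared with a FOREIGN KEY clause; a sketch with assumed table and column definitions:

CREATE TABLE DEPT (
    dno   INT PRIMARY KEY,
    place VARCHAR(30)
);

CREATE TABLE EMP (
    eid  CHAR(3) PRIMARY KEY,
    name VARCHAR(50),
    dno  INT,
    FOREIGN KEY (dno) REFERENCES DEPT (dno)   -- dno must match a DEPT row or be NULL
);

-- this insert would be rejected because DEPT has no row with dno = 22
-- INSERT INTO EMP (eid, name, dno) VALUES ('014', 'KIRAN', 22);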
3.4 Data Integrity
Data integrity is the accuracy, consistency, and reliability of
data in a database. It is enforced within one or more related
databases by both database designers and database developers.
For example, in the Northwind Categories table, the
CategoryName must be unique regardless of how many records
the table contains. If this rule were not enforced, the Seafood
category could be stored twice in the table, which would break
our business rules.
3.4.1 Types of Data Integrity
There are four types of data integrity:
1. Row integrity
2. Column integrity
3. Referential integrity
4. User-defined integrity
Row integrity


Row integrity is the requirement that every row in a table
have a unique identifier that distinguishes each record. This
unique identifier is commonly referred to as the table's primary
key. A primary key may consist of a single column or a
combination of columns. For example, the Categories table's
unique identifier is a single column named CategoryID.

The Order Details table's unique identifier combines OrderID
and ProductID; the row with OrderID 10248 and ProductID 11
can appear only once in the table.

Column integrity
Column integrity is the requirement that all data contained in
a column follow the same format and meaning. It covers the
data type, data length, default value, range of possible values,
whether duplicate values are permitted, and whether NULL
values are permitted.
For example, in the Employees table, LastName must be
varchar, no more than 20 characters long, default to an empty
string, and cannot be null.


Referential integrity
How can you tell who supplied Longlife Tofu in the Products
table? Referential integrity guarantees that the supplier exists.
 You find the row for Longlife Tofu in the Products table and see
that the value in its SupplierID column is 4.

Products table:

 You then look in the Suppliers table for the record with
SupplierID 4 and find that the CompanyName is Tokyo Traders.

Suppliers table:


1. SupplierID 4 (Tokyo Traders) cannot be deleted from or changed
in the Suppliers table because it is linked to the product Longlife
Tofu through the SupplierID column in the Products table.
2. We cannot add a new product with SupplierID 30 to the
Products table, because that supplier does not exist in the
Suppliers table.

Referential integrity is established during database design
and enforced by defining relationships between tables. Once the
referential relationship is established, the database engine
applies the two rules above to maintain data integrity and raises
errors when they are violated.

Integrity specified by the user


Some applications require dynamic business logic that cannot
be enforced simply by setting parameters in the three data
integrity categories discussed so far (row integrity, column
integrity, and referential integrity). In such cases, we write our
own code logic to ensure that data is saved correctly and reliably
across all business domains. This code logic can be implemented
using database triggers, stored procedures, or functions, or with
resources outside the database engine, such as embedding
non-SQL languages (for example, VBScript or C# in SQL Server)
in the database, or using scripting or programming languages in
the application's middle or front tier.
Here is an example of a user-defined data integrity
constraint.
When a customer places an order in our Northwind
database, we first determine whether the customer is new to our
company. If so, we add this customer to the Customers table. We
then check whether we have enough stock of each product the
customer ordered. If so, we add the product to the Order Details
table for this order and decrement the quantity in the Products
table. Four tables are involved in this transaction:
1. customers
2. products
3. orders
4. order_details
To ensure data integrity, we wrap the work on these four
tables (customers, products, orders, and order details) in a single
database transaction:
 If adding the new customer to the Customers table fails, we roll
back the transaction.
 If adding the new order to the Orders table fails, we roll back the
transaction.
 If inserting the record into the Order Details table fails, we roll
back the transaction.
 If decrementing the product quantity in the Products table fails,
we roll back the transaction.
When a transaction is rolled back, the four tables are returned
to their initial state (as they were before the transaction started).
The logic in this case is our user-defined data integrity. The
mechanisms you use to maintain data integrity depend on your
system requirements.
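A minimal SQL sketch of this pattern, with assumed table and column names (transaction syntax varies slightly between database systems):

START TRANSACTION;   -- BEGIN TRANSACTION in some systems

INSERT INTO customers (customer_id, company_name) VALUES ('NEWC1', 'New Customer Ltd');
INSERT INTO orders (order_id, customer_id) VALUES (11000, 'NEWC1');
INSERT INTO order_details (order_id, product_id, quantity) VALUES (11000, 11, 2);
UPDATE products SET units_in_stock = units_in_stock - 2 WHERE product_id = 11;

-- in the application code: if any statement above fails, issue ROLLBACK so that
-- all four tables return to their state before the transaction began; otherwise:
COMMIT;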
3.5 Database Constraints
Database constraints are declarative integrity rules defined
as part of the table structures. They comprise the seven
constraint types described below:
Data type constraint:

This specifies the type of data, its size, and a few other
properties directly associated with the data stored in a column.
Default constraint:
This specifies the value a column takes when no value is
explicitly supplied for it while inserting a record.
Nullability constraint:
This specifies whether a column is NOT NULL or allows
NULL values to be stored.
Primary key constraint:
This is the table's unique identifier; each row must have its
own value. The primary key may be either a sequentially
incremented integer or a natural piece of data that reflects the
real world (e.g., a Social Security Number). NULL values are not
permitted in primary key columns.
Unique constraint:
This specifies that the values in a column must be unique and
that no duplicates may be stored. Even when a column is not the
table's primary key, the data in that column must sometimes be
unique. For example, the CategoryName column is unique in the
Categories table, but it is not the primary key.
Foreign key constraint:
This determines how referential integrity is applied between
two tables.
Check constraint:
This defines a validation rule for the data values in a column,
so it is a user-defined data integrity constraint. The user defines
this rule when designing the column of a table. Not every
database engine supports check constraints; as of version 6.0,
MySQL did not enforce them, although the ENUM and SET data
types can provide some of the same functionality, and other
relational database management systems (Oracle, SQL Server,
etc.) support check constraints directly.
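Most of these constraints can be declared together in a table definition; a sketch with assumed names and types (check-constraint syntax varies by DBMS):

CREATE TABLE Categories (
    CategoryID    INT PRIMARY KEY,                    -- primary key constraint
    CategoryName  VARCHAR(30) NOT NULL UNIQUE,        -- nullability and unique constraints
    Description   VARCHAR(200) DEFAULT '',            -- default constraint
    DisplayOrder  INT CHECK (DisplayOrder >= 0)       -- check constraint
);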

Data integrity type       Enforced by database constraint
Row integrity             Primary key constraint; Unique constraint
Column integrity          Data type constraint; Default constraint; Nullability constraint; Check constraint; Foreign key constraint
Referential integrity     Foreign key constraint
User-defined integrity    Check constraint

Use database constraints whenever possible


There are two key reasons for using database constraints as
the preferred way of enforcing data integrity.
First, because constraints are built into the database engine,
they use fewer system resources to do their dedicated work. We
resort to external user-defined integrity enforcement only where
constraints are insufficient to do the job properly.
Second, the database engine checks database constraints
before insert, update, or delete operations are performed, so an
invalid operation is canceled before it begins. As a result,
constraints are more dependable and robust for upholding data
integrity.


3.6 Structure of Relational Database


1. A relational database consists of a collection of tables, each with a
unique name. A row in a table represents a relationship among a
set of values.
A table therefore represents a set of such relationships.
2. There is a close correspondence between the concept of a table
and the mathematical concept of a relation, and a substantial
theory has been developed for relational databases.
Basic Structure
1. Figure 1 shows the deposit and customer tables for our
banking example.

Figure: The deposit and customer relations.


 For each attribute there is a permitted set of values, called the
domain of that attribute.
2. Mathematicians define a relation as a subset of a Cartesian
product of a list of domains. You can see how our tables
correspond: we simply use the terms relation and tuple in place
of table and row.
We also require that all attribute domains be indivisible units.
 A domain is said to be atomic if its elements are indivisible.
 The set of integers, for example, is an atomic domain.
 The set of all sets of integers is not.
 If we think of each integer as an ordered list of digits, then we
should consider integers non-atomic.


3.7 Database Scheme


1. A database scheme (the logical design) is distinguished from a
database instance (the data in the database at a point in time).
2. A relation scheme is a list of attributes and their associated
domains.
The following conventions are used in the text (and in these notes):
 All names are italicized.
 Relation and attribute names are written in lowercase.
 Relation scheme names begin with an uppercase letter.
For instance, consider the relation scheme of the deposit relation:
 Deposit-scheme = (bname, account#, cname, balance)
By writing deposit(Deposit-scheme), we say that deposit is a
relation on the scheme Deposit-scheme.
If we want to specify domains, we can use the syntax:
 (bname is a string, account# is an integer, cname is a string, and
balance is an integer).
Note that customers are identified by name. This would not be
permitted in the real world, since two or more customers could
have the same name.
The E-R diagram for the banking enterprise is shown in Fig 3.1 below.

The following are the relation schemes for the banking
example used in the text:
 Branch-scheme = (bname, assets, bcity)
 Customer-scheme = (cname, street, ccity)
 Deposit-scheme = (bname, account#, cname, balance)
 Borrow-scheme = (bname, loan#, cname, amount)
Putting all attributes into one relation:


 Assume we replace the customer and deposit relations with a
single large relation:

Fig 3.1: E-R diagram for the banking enterprise


 Account-scheme = (bname, account#, cname, balance,
street, ccity)
 If a customer has several accounts, we must duplicate his or her
address for each account.
 If a customer has an account but no current address, we cannot
create a tuple, since there are no values for the address fields.
 We would have to use null values for these fields.
 Null values create problems in the database.
 We can avoid them by using the two separate relations.


3.8 Types of Key

Fig. 3.2 Key in DBMS

1. Primary key
A primary key is the key chosen to identify one and only one
instance of an entity uniquely. An entity, such as one in the
PERSON table, may have several keys; the most suitable of
those candidates is designated as the primary key.
 Since each employee's ID is unique, ID can be used as the
primary key of the EMPLOYEE table. We could also use the
license number or passport number as the primary key, since
they too are unique.
 The primary key of each entity is chosen based on the
requirements and the developers' judgment.


2. Candidate key
A candidate key is an attribute or set of attributes that can
uniquely identify a tuple.
 All keys other than the chosen primary key are called candidate
keys. Candidate keys are just as strong as the primary key.
For example, if ID is the primary key in the EMPLOYEE
table, the remaining keys, such as SSN, passport number, and
license number, are candidate keys.

3. Super key
A super key is a set of attributes that can uniquely identify a
tuple. A super key is a superset of a candidate key.


For example, in the EMPLOYEE table above, consider
(EMPLOYEE_ID, EMPLOYEE_NAME): two employees may
have the same name, but their EMPLOYEE_ID cannot be the
same, so this combination can also identify a tuple uniquely.
Super keys therefore include EMPLOYEE_ID, (EMPLOYEE_ID,
EMPLOYEE_NAME), and so on.
4. Foreign key
 A foreign key is a column in one table that refers to the primary
key of another table.
 Every employee in a company works in some department; the
employee and the department are two distinct entities, so we
cannot store department details in the employee table. Instead,
we link the two tables through the primary key of one of them.
 We add the DEPARTMENT table's primary key, Department_Id,
as a new attribute of the EMPLOYEE table.
 Department_Id is now a foreign key in the EMPLOYEE table,
and the two tables are related, as sketched below.
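A sketch of this step in SQL, assuming the EMPLOYEE table already has a Department_Id column (ALTER TABLE syntax varies slightly between database systems):

ALTER TABLE EMPLOYEE
    ADD CONSTRAINT fk_emp_dept
    FOREIGN KEY (Department_Id) REFERENCES DEPARTMENT (Department_Id);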


3.9 Relational Query Languages


Relational query languages use relational algebra to break
down user requests and tell the database management system
(DBMS) how to perform them. A query language is the language
through which the user interacts with the database. There are
two types of relational query languages: procedural and
non-procedural.
A procedural query language consists of a series of queries
that direct the DBMS to execute operations in the order specified
by the user. For example, a 'get CGPA' procedure would use
different queries to obtain a student's marks in each subject,
compute the total marks, and then determine the CGPA from
the total. A procedural query language tells the database both
what information is needed and how to retrieve it. Relational
algebra is a procedural, formal query language.
A non-procedural query language specifies only what data is
required, not how to obtain it. A non-procedural query can use a
single query on one or more tables to obtain results from the
database. For example, obtaining a student's name and address
for a specific ID requires a single query on the STUDENT table.
Relational calculus is a non-procedural language: it tells you
what to retrieve from the tables but not how to retrieve it.
These query languages are used to perform queries on
database tables. In a relational database, a table is called a
relation, the records or rows of a table are called tuples, and the
table columns are called attributes; these pairs of terms are used
interchangeably.
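As a small illustration, the non-procedural request described above could be written in SQL as follows (a sketch; the column names are assumptions):

SELECT name, address
FROM   STUDENT
WHERE  student_id = 101;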


3.10 Logical Database Design


A logical database is an ABAP (Advanced Business
Application Programming) construct used to access data from
several tables that are linked to one another. A logical database
offers a read-only view of the data.

3.10.1 Structure of Logical Database


A logical database uses a hierarchical arrangement of tables,
which means the data is organized in a tree-like structure and
stored as records linked by edges (links). Open SQL statements
are used to read data from the database in the logical database.
The logical database reads the data, buffers it if necessary, and
then passes it line by line to the application program.

Fig. 3.3 Structure of Logical Database




Features of Logical Database:
Let us take a look at some of the characteristics of a logical
database in this section:
 We can select only the data that we need.
 Data authentication (authorization checking) is performed to
ensure reliability.
 Logical databases use a hierarchical structure, which helps
maintain data integrity.

3.10.2 Goal of Logical Database


The purpose of a logical database is to create well-structured
tables that reflect the user's needs. The tables of a logical
database store data in a non-redundant way, and foreign keys
can be used in the tables to support relationships between tables
and entities.

3.10.3 Tasks of Logical Database


The following are some of the most important tasks of a
logical database:
• The same data can be read from different programs using the
logical database.
• A logical database provides a common user interface for various
programs.
• The logical database ensures that the central authorization
checks for database access are carried out.
• Using a logical database improves performance: for example,
joins are used instead of multiple SELECT statements, which
improves response time.

3.10.4 Logical Database Data View


A logical database offers a particular view of its tables. When
the database structure is large, a logical database is convenient:
it follows the simple flow
SELECT
 Read
 Process
 Display
to work reliably with the database. The data in a logical database
is hierarchical, and the tables are connected through foreign key
relationships.
The data view of a logical database is depicted
diagrammatically as:

Fig. 3.4 Logical Database Data View


3.10.5 Advantages of Logical Database
 We can select meaningful data from a vast volume of data in a
logical database.
 Logical databases have central authorization checks, which
determine whether database access is permitted.
 The portion of code needed to retrieve data from the database is
smaller than with other approaches.


 Reading data from the database's hierarchical structure is
efficient.
 The user interfaces are easy to understand.
 Logical databases start with check functions that ensure user
input is complete, valid, and plausible.

3.10.6 Disadvantages of Logical Database


This section discusses the drawbacks of a logical database:
• A logical database can be slow when the required data lies deep
in the hierarchy: if a table at the lowest level is needed, all the
upper-level tables must be read first, which takes time and slows
down results.
• Since there is no ENDGET command in a logical database, the
code block associated with an event ends only with the next
event statement.

3.11 ER Model to Relational Model


When visualized as diagrams, the ER model gives a clear
picture of entity relationships that is easy to understand. ER
diagrams can be mapped to a relational schema, which means a
relational schema can be constructed from them. Not every ER
constraint can be imported into the relational model, but an
approximate schema can be generated.
ER diagrams can be converted to a relational schema using
several methods and algorithms; some are automated, while
others are manual. Here we concentrate on the contents of the
mapping in terms of relational fundamentals.
ER diagrams mainly comprise −
 Entity and its attributes

 A relationship is an association among entities.


Mapping Entity

An entity is a real-world object with some attributes.

Fig. 3.5 Mapping Process (Algorithm)

 Create a table for each entity.


 Entity's attributes should become fields of tables with their
respective data types.
 Declare primary key.
Mapping Relationship
A relationship is an association among entities.
 Create a table for a relationship.
 Add the primary keys of all participating Entities as fields of the
table with their respective data types.
 If a relationship has any attribute, add each attribute as a field of
a table.
 Declare a primary key composed of all the primary keys of the
participating entities.

 Declare all foreign key constraints.

Fig. 3.6 Mapping Process
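For instance, a many-to-many 'Enrolled_in' relationship between STUDENT and COURSE could be mapped to a table as follows (a sketch; the column names and the optional enroll_date attribute are assumptions):

CREATE TABLE ENROLLED_IN (
    stud_no     INT,
    course_id   CHAR(3),
    enroll_date DATE,                      -- an attribute of the relationship, if it has one
    PRIMARY KEY (stud_no, course_id),      -- composed of the participating entities' keys
    FOREIGN KEY (stud_no)   REFERENCES STUDENT (stud_no),
    FOREIGN KEY (course_id) REFERENCES COURSE (course_id)
);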


Mapping Weak Entity Sets
A weak entity set does not have any primary key associated
with it.

Fig. 3.7 Mapping Weak Entity Sets

 Create a table for the weak entity set.
 Add all its attributes to the table as fields.
 Add the primary key of the identifying entity set.
 Declare all foreign key constraints.


Mapping Hierarchical Entities


ER, specialization or generalization comes in the form of
hierarchical entity sets.

Fig 3.8 Mapping Hierarchical Entities

 Create tables for all higher-level entities.
 Create tables for all lower-level entities.
 Add the primary keys of the higher-level entities to the tables of
the lower-level entities.
 In the lower-level tables, add all other attributes of the
lower-level entities.
 Declare the primary key of the higher-level table and the
primary key of each lower-level table.
 Declare foreign key constraints.

CHAPTER 4
RELATIONAL ALGEBRA AND
RELATIONAL CALCULUS

4.1 Introduction of Relational Algebra


Relational algebra is a formal query language. It takes one or
more relations (tables), performs an operation on them, and
returns the result, which is itself a new table or relation. Suppose
we need to retrieve a student's name, address, and class for a
given ID. Relational algebra can filter the name, address, and
class from the STUDENT table for the input ID; the result is a
subset of the STUDENT table for that ID.
Operations in relational algebra are denoted by operators. An
operation may apply to a single relation (unary) or to two
relations (binary). When an operation is applied to a relation,
the resulting subset is treated as a new relation. Some operations
require several steps, and the intermediate results are also
relations. This will become clearer as we look at the various
operations below.
In DBMS, there are six basic operations in relational algebra.
Several other operations are based on these basic operations.

Select (σ)
Select (σ) is a unary operation. It retrieves the horizontal
subset (the subset of rows) of a relation that satisfies the given
condition. The condition can use comparison operators such as
>, <, >=, <=, =, and != to filter data from the relation, and it can
combine different filtering conditions using the logical AND,
OR, and NOT operators. This operation is defined as follows:
σ p (r)
Here σ is the symbol for the select operation, r is the relation
(table), and p is the logical formula or filtering condition used to
obtain the subset. Consider these examples:
σ STD_NAME = “James” (STUDENT) – selects the records of students
named James.
σ dept_id = 20 AND salary >= 10000 (EMPLOYEE) – selects the records
from the EMPLOYEE table for department ID 20 and employees
whose salary is at least 10000.
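For comparison, the second selection corresponds roughly to the following SQL (a sketch with assumed column names):

SELECT *
FROM   EMPLOYEE
WHERE  dept_id = 20 AND salary >= 10000;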

Project (∏)
Project (∏) – This unary operator is similar to the select
operation mentioned above in that it also produces a subset of a
relation. However, it selects only the chosen columns (attributes)
from the relation – a vertical subset – whereas the select
operation above produces a subset of rows with all of the
attributes. It is written as follows:

∏a1, a2, a3 (r)


Here ∏ is the projection operator, r is the relation, and a1, a2,
and a3 are the attributes that appear in the resulting subset.
∏ std_name, address, course (STUDENT) – this selects all records
from the STUDENT table but only the columns std_name,
address, and course. If we want these three columns for one
particular student only, we combine the project and select
operations.


∏ STD_ID, address, course (σ STD_NAME = “James” (STUDENT))
– this selects the record for 'James' and shows only the columns
STD_ID, address, and course. Two unary operators are
combined here, and two operations are performed. First, the
tuple for 'James' is selected from the STUDENT table; the
resulting subset of STUDENT is an intermediate relation, which
is temporary and lasts only until the query is completed. The
three columns are then filtered from this temporary relation.
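In SQL, this combined select-and-project operation corresponds roughly to (a sketch with assumed column names):

SELECT std_id, address, course
FROM   STUDENT
WHERE  std_name = 'James';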
Rename (ρ)
Rename (ρ) – This unary operator renames relations (tables)
and their columns. When performing a self-join, we must
distinguish between two copies of the same table; in this
scenario, the rename operator on tables comes into play. When
we join two or more tables that have identical column names, it
is also easier to rename the columns to distinguish them; this
happens, for example, when we perform a Cartesian product.
ρ R(E)

Here E is the original relation name, R is the new relation
name, and ρ is the rename operator.
ρ STUDENT (STD_TABLE) – renames the STD_TABLE table to
STUDENT.
Let us look at another case: renaming a table's columns. If the
STUDENT table contains the columns ID, NAME, and
ADDRESS, which need to be renamed to STD_ID, STD_NAME,
and STD_ADDRESS, we write
ρ STD_ID, STD_NAME, STD_ADDRESS (STUDENT) – it renames
the columns in the order in which the names appear in the table.
Cartesian product (X)


Cartesian product (X) – a binary operator. It combines the
tuples of two relations into a single relation.
RXS
R and S are two relationships, and X is the operator. If
relation R contains m tuples and relation S contains n tuples, the
resulting relation contains mn tuples. E.g., suppose we perform
a cartesian product on the EMPLOYEE (5 tuples) and DEPT (3
tuples) relations. In that case, we can get a new tuple of 15
tuples.
Employee X Dept
This operator essentially pairs every tuple of one relation with
every tuple of the other. In other words, each employee in the EMPLOYEE
table will be paired with every department in the DEPT table; that
pairing is the outcome of the Cartesian product.
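In SQL, the Cartesian product can be written either by listing both tables in the FROM clause or with CROSS JOIN. A minimal sketch, assuming the EMPLOYEE and DEPT tables above:

-- every employee paired with every department (5 x 3 = 15 rows)
SELECT *
FROM EMPLOYEE CROSS JOIN DEPT;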

Union (U)
Union (U) – Union is a binary operator that combines the
tuples of two relations. It is represented by
R U S
Where R and S are relations and U is the operator.
Design_Employee U Testing_Employee
It differs from the Cartesian product in the following ways:

 The Cartesian product combines two relations' attributes into one, while union combines two relations' tuples into one.
 All relations in a union should have the same number of columns.
Assume we need to list the workers who serve in the design and the
testing department. Then we take the union over the employee tables.
Since it is a union over relations with the same structure, it has
the same set of attributes. The Cartesian product places no such
restriction on the number of attributes or rows; it simply combines
the attributes.
 All relations in a union should have the same types of attributes
in the same order. Since the union in the preceding case is over
employee relations, it has the same attributes in the same order.

The relations need not have the same number of tuples. If the
union produces duplicate tuples, only one copy is kept. If a tuple
exists in either relation, it is retained in the resulting relation.
In the preceding case, the number of


employees in the design department does not have to be the
same as the number of employees in the testing department.
The result incorporates the data from each table in the order in
which it appears in that table. We would be unable to take the
union of these tables if the column order or the number of
columns differed.
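The corresponding SQL operator is UNION, which also removes duplicate rows by default. A minimal sketch, assuming Design_Employee and Testing_Employee tables with identical structures; the column names EMP_ID and EMP_NAME are illustrative:

SELECT EMP_ID, EMP_NAME FROM Design_Employee
UNION
SELECT EMP_ID, EMP_NAME FROM Testing_Employee;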

Set-difference (-)
Set-difference (-) – This is a binary operator. It generates a new
relation containing the tuples that appear in one relation but not in the
other. The ‘-’ symbol represents it.
R – S
Where R and S denote the relations.
Assume we want to find staff working in the Design
department but not in the Testing department:
Design_Employee – Testing_Employee
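In SQL this is the EXCEPT operator (called MINUS in Oracle). A minimal sketch, reusing the two illustrative tables from the union example:

SELECT EMP_ID, EMP_NAME FROM Design_Employee
EXCEPT   -- use MINUS in Oracle
SELECT EMP_ID, EMP_NAME FROM Testing_Employee;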

Assignment
The assignment operator ‘←' is used to assign the result of
a relational operation to a temporary relation variable, as the
name implies. This is useful because a relational operation can
involve several phases, and it is not always convenient to express it
in a single statement. Assigning the outcome to a temporary relation and
then using this temporary relation in the next operation
simplifies the job.
T ← S – denotes that relation S is assigned to the temporary relation T
A relational operation ∏a1, a2 (σ p (E)) with selection and
projection can be divided as below.

T ← σ p (E)
S ← ∏a1, a2 (T)


Our projection example above for obtaining STD ID,


ADDRESS, and COURSE for the Student ‘James' can be
rewritten.
∏STD_ID, address, course (σ STD_NAME = “James”(STUDENT))

T ← σ STD_NAME = “James”(STUDENT)
S ← ∏STD_ID, address, course (T)

4.1.1 JOINS
Natural join – As previously said, the
Cartesian product essentially blends the attributes of two
relations into one. However, the resulting relation does not contain
only valid tuples; it contains every combination of tuples. We must
perform a selection on the Cartesian product result to
get the right tuples. This sequence of operations – Cartesian
product followed by selection – is combined into a single operation
known as natural join. It is denoted by
R∞S
Assume we want to choose staff from Department 10. Then
we do a Cartesian product on EMPLOYEE and DEPT and keep the rows
where the DEPT_ID in both relations matches and equals 10. Using the
Cartesian product, this is written as
σ EMPLOYEE.DEPT_ID = DEPT.DEPT_ID AND EMPLOYEE.DEPT_ID = 10 (EMPLOYEE X DEPT)

The same can be written using the natural join as


EMPLOYEE ∞ DEPT
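In SQL, the same result as the selection over the Cartesian product above is obtained with a join condition, either in the WHERE clause or with the JOIN ... ON syntax. A minimal sketch, assuming EMPLOYEE and DEPT tables that share a DEPT_ID column:

SELECT *
FROM EMPLOYEE E
JOIN DEPT D ON E.DEPT_ID = D.DEPT_ID
WHERE E.DEPT_ID = 10;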


In the preceding case, we can see that only the


corresponding data from both relations are preserved in the
final relation. Assume we want to keep all of the information from the
first relation along with the related information from the second
relation, whether or not a match exists. In this case, we
use an outer join. Unlike the Cartesian product, this join produces a
combined tuple from both tables where a correct match
occurs, and where there is no match, nulls are placed in the
unmatched attributes.
There are three types of outer joins
Left Outer Join

Left outer join – This operation keeps all the tuples of the left-
hand side relation. The matching attributes from the right-hand
relation are displayed with their values; those that have no
match are displayed as NULL.

The following left outer join on the DEPT and EMPLOYEE
tables shows the matching combination for DEPT_ID = 10
with values. However, DEPT_ID = 30 currently has no staff, so
NULL is shown for the employee attributes of that department.
Combining two relations in this way is therefore more meaningful
with an outer join than with a Cartesian product.
Right outer join
Right outer join – This is the inverse of the
left outer join. All tuples on the right-hand side are
preserved, and the corresponding attributes from the left-hand side
are shown where they exist. If no match is found, the value NULL is
shown. The preceding example has been rewritten to help you
understand this:

Notice the order and column difference in both cases.


Full Outer Join


Full outer join – A combination of the left and
right outer joins. It keeps all of the tuples of both
relations. If the corresponding attributes appear in the other
relation, they are displayed; otherwise, they are null.


Division
Division – This operation is used to find tuples that satisfy a
condition "for all" tuples of another relation. It is denoted by the
symbol ‘÷'. Assume we want to see all the workers who serve in all
departments. What are the steps involved in finding this?


 First, we find all the department IDs – T1 ← ∏DEPT_ID (DEPARTMENT)
 The next step is to list all the employees and their departments – T2 ← ∏EMP_ID, DEPT_ID (EMPLOYEE)
In the third step, we find the employees in T2 who are associated
with every department ID in T1. This is achieved by using
the division operation – T2 ÷ T1. A SQL formulation of this "for all"
pattern is sketched below.
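SQL has no direct division operator, so the "for all departments" requirement is usually expressed with a double NOT EXISTS. A minimal sketch, assuming EMPLOYEE(EMP_ID, DEPT_ID) and DEPARTMENT(DEPT_ID) tables as in the steps above:

-- employees for whom no department exists that they do not work in
SELECT DISTINCT E.EMP_ID
FROM EMPLOYEE E
WHERE NOT EXISTS (
        SELECT D.DEPT_ID
        FROM DEPARTMENT D
        WHERE NOT EXISTS (
                SELECT 1
                FROM EMPLOYEE E2
                WHERE E2.EMP_ID  = E.EMP_ID
                  AND E2.DEPT_ID = D.DEPT_ID));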

4.2 Relational Calculus


Relational calculus is an alternative formal query language that differs
from relational algebra. In comparison to the algebra, which is
procedural, relational calculus is declarative or non-procedural.
It allows the user to describe the set of answers without
specifying the method for computing them. The architecture of

commercial query languages such as SQL and QBE (Query-by-Example) is heavily
influenced by relational calculus.

Introduction to Relational Calculus in DBMS


In RDBM, relational calculus refers to a non-procedural
query language that stresses the idea of what to do with data
management rather than how to do it. The relational calculus
uses mathematical predicates calculus notations to provide
descriptive detail about queries to obtain the desired outcome. It
is an essential component of the relational data model. The
relational calculus in DBMS employs basic terminology such as
tuple and domain to define requests. Variables, constants,
comparison operators, logical connectives, and quantifiers are
some of the other related general terminologies for relational
calculus. It generates expressions of unbound formal variables,
also known as formulas.

Types of Relational Calculus in DBMS


This section will explore the different forms of relational
calculus in DBMS based on the terms and methods of
mathematically describing query functionalities. The key
components of relational calculus are the tuple and the domain.
A result tuple is an assignment of constants to these variables that
makes the formula evaluate to true. In DBMS, there
are two kinds of relational calculus.
 Tuple relational calculus (TRC)
 Domain relational calculus (DRC)


Fig 4.1 Relational Calculus

For data retrieval in a DBMS, both forms of
relational calculus are semantically equivalent. We will detail each
form of relational calculus, using database table examples to
demonstrate the syntax and applications.

4.2.1 TRC
Tuple relational calculus (TRC) is a simple subset of first-order
logic that filters tuples based on defined conditions. TRC
treats tuples as variables, and field referencing can select
individual tuple components. A tuple variable such as 'T' is used,
with the conditions written after the pipe sign and the whole
expression enclosed in curly braces.

Syntax of TRC:
{T | Conditions}
The TRC syntax allows you to denote table names or
reference names and define tuple variables and column names.


It specifies the column names with the table name using the ‘.'
operator symbol.
The Tuple variable name, such as 'T,' is used to specify the
reference names in TRC. TRC Relationship Specification Syntax:
Relation(T)
E.g., if the relation name is Product, it can be denoted as
Product (T). Similarly, TRC allows you to decide the parameters.
The condition applies to a certain attribute or column.
For example, suppose data for a certain product id of value
10 must be represented. In that case, it can be denoted as
T.product id=10, where T is the tuple variable representing the
row of the table. Let us assume the Product table in the database
as follows:

Product_id   Product Category   Product Name   Unit Price
8            New                TV Unit 1      $100
10           New                TV Unit 2      $120
12           Existing           TV Cabinet     $77

To write the relational calculus that returns the product name
with the product id value of 10 from the Product table, use the
tuple variable T:

{T.Product_Name | Product(T) AND T.Product_id = 10}


This relational calculus predicate describes what to do to get
the resultant tuple from the database. The result of the tuple
relational calculus for the Product table will be:


Product_id Product Name


10 TV Unit 2
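For comparison, the same request written in SQL is a plain selection and projection over the Product table above (the column is written here as Product_Name, without the space, so that it is a legal identifier):

SELECT Product_Name
FROM Product
WHERE Product_id = 10;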

4.2.2 DRC
The domain relational calculus (DRC) filters on domains and
attributes. In DRC, the variables range over the domain
elements, i.e., the field values. Like TRC, it is a simple subset of
first-order logic, but it is domain-oriented as opposed to TRC,
which is tuple-oriented. In DRC the formal variables are explicit:
the domain variables are denoted as c1, c2, ..., cn, and the condition
relevant to the attributes is written as a formula F(c1, c2, ..., cn)
specifying the condition for fetching them.
Syntax of DRC in DBMS
{c1, c2,...,cn| F(c1, c2,... ,cn)}
Let us assume the same Product table in the database as
follows:
Product_id   Product Category   Product Name   Unit Price
8            New                TV Unit 1      $100
10           New                TV Unit 2      $120
12           Existing           TV Cabinet     $77

The DRC query for the product name attribute from the Product table
where the product id is 10 will be denoted as:
{<Product_Name, Product_id> | <Product_Name, Product_id> ∈ Product ∧ Product_id = 10}
The result of the domain relational calculus for the Product
table will be


Product_id Product Name


10 TV Unit 2

Some commonly used logical operator notations in DRC are
∧ (AND), ∨ (OR), and ¬ (NOT).

Expressive Power of Algebra and Calculus


For the relational model, we have introduced two formal query
languages. Are they equally expressive? Can every query
expressed in relational algebra also be expressed in relational
calculus? The answer is yes. Conversely, it can be shown that any
query expressed as a safe relational calculus query can also
be expressed as a relational algebra query. The expressive
power of relational algebra is often used as a yardstick for the power
of a relational database query language. A query language is
considered relationally complete if it can express all of the
queries expressible in relational algebra. A practical query
language is expected to be at least relationally complete. Commercial
query languages usually have additional features that enable them to
express queries that cannot be expressed using relational
algebra.


CHAPTER 5
BASIC OF SQL

5.1 Introduction to SQL


The Structured Query Language (SQL) is a programming
language that is widely used to access and manage databases.
SQL helps the user build, retrieve, modify, and migrate data
between databases. It is a programming language used to
manage and access data in a Relational Database Management
System (RDBMS).
SQL comes in a variety of flavors. In the early 1970s, the
initial version was developed at IBM's research centre and was
known as Sequel; the language was later renamed SQL.
ANSI (the American National Standards Institute) published a
SQL standard in 1986, which was revised in 1992. The most recent
SQL standard described here was launched in 2008 and is named SQL:2008.

5.2 Role of SQL in RDBMS


RDBMS is an abbreviation for Relational Database
Management System. RDBMS packages include Oracle, MySQL,
MS SQL Server, IBM DB2, and Microsoft Access. SQL is a
programming language used to access data in such databases.
A database, in general, is a list of tables that store sets of data
that can be queried for use in other applications. A database
management system aids in developing, administering, and
using database platforms. RDBMS is a database management
system with a row-based table structure that links related data
elements and includes functions related to Create, Read,
Update, and Delete (CRUD) operations.


In RDBMS, data is stored in database objects known as


Tables. A table is a grouping of similar data entries made up of
rows and columns.
A field is a column in a table used to store unique
information about each record in the table. A vertical object
includes all information related to a given field in a table. A
student table can include fields of the following types: Admn
No, StudName, StudAge, StudClass, Location, etc.
A record is a row, a grouping of related fields or columns in
a table. A record is a horizontal entity in a table; in a student
table, for example, it contains the information about a particular student.

5.3 Processing Skills of SQL


The various processing skills of SQL are:
Data Definition Language (DDL): SQL DDL provides
commands for defining relation schemas (structure), deleting
relations, constructing indexes, and changing relation schemas.
Data Manipulation Language (DML): The SQL DML contains commands for
inserting, deleting, and modifying tuples in the database.
Embedded Data Manipulation Language: This form of SQL is embedded
in high-level programming languages.
View Definition: The SQL language also contains commands
for defining table views.
Authorization: The SQL language contains commands for
controlling table relations and views access.
Integrity: The SQL language includes constructs for specifying and checking
integrity constraints using conditions.
Transaction control: The SQL language provides commands for defining the
beginning and end of transactions and for controlling transaction processing.


Creating Database
1. To build a database, enter the following command at the prompt:
CREATE DATABASE database_name;
For example, to create a database to store the tables: CREATE DATABASE stud;
2. Enter the following command to work with a database: USE database_name;
For example, to use the stud database created above, give the
command USE stud;

5.4 Components of SQL


SQL commands are divided into five categories:

Fig 5.1 Components of SQL


5.4.1 Data Definition Language


The Data Definition Language (DDL) is made up of SQL
statements used to describe the structure or schema of a
database. It deals with database schema definitions and is used
to construct and alter the configuration of database objects in
databases.
The DDL is a collection of definitions that define the
database system's storage structure and access methods.

A DDL performs the following functions:


1. It should specify the unit of data definition, such as a data item,
column, record, or database file.
2. Each data item type, record type, file type, and database is
given a unique name.
3. It must define the appropriate data type of each item.
4. It should specify the size of each data item.
5. It may specify the values that a data item is allowed to have.
6. It can provide privacy safeguards to prevent unauthorized data
access.
The SQL commands which come under the Data Definition
Language are:

Fig 5.2 Data Definition Language



Create: To create tables in the database.


Alter: Alters the structure of the database.
Drop: Delete tables from a database.
Truncate: Remove all records from a table and release the space
occupied by those records.
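A few illustrative DDL statements, assuming a simple student table (the table and column names below are only examples):

CREATE TABLE Student (
    Admno  INTEGER PRIMARY KEY,
    Name   CHAR(20) NOT NULL,
    Age    INTEGER
);

ALTER TABLE Student ADD Place CHAR(10);   -- add a new column
TRUNCATE TABLE Student;                   -- remove all rows and release the space
DROP TABLE Student;                       -- remove the table itself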
5.4.2 Data Manipulation Language
A Data Manipulation Language (DML) is a computer
programming language that is used to add (insert), remove
(delete), and alter (update) data in a database. The SQL-data
change statements, which alter stored data but not the database
table schema, are part of the data manipulation language.
After specifying the database schema and creating the
database, the data can be manipulated using DML-expressed
procedures.
We mean the following when we say "Data Manipulation."
 Data entry of new information into the database
 Information retrieval from a database.
 Information from the database is deleted.
 Modification of data contained in a database

The DML is divided into two types:


Procedural DML – This type of DML requires the user to
determine what data is required and how to obtain it.
Non-Procedural DML - Requires the user to decide what
data is required but not how to obtain it.
SQL commands that fall under Data Manipulation Language
include:
Insert: This function inserts data into a table.
Update: This function updates the current data in a table.

Delete: Removes records from a table but does not release the
space they occupy.
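A few illustrative DML statements on the same hypothetical Student table used above:

INSERT INTO Student (Admno, Name, Age) VALUES (101, 'Anu', 18);
UPDATE Student SET Age = 19 WHERE Admno = 101;
DELETE FROM Student WHERE Admno = 101;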

5.4.3 Data Control Language


A Data Control Language (DCL) is a programming language
used to control data access in a database. It is used in the
database to manage rights (Authorization). All database
operations, such as generating sequences and views of tables,
necessitate the use of privileges.
SQL commands that fall under Data Control Language
include:
Grant: Gives specified users permission to perform particular
tasks on database objects.
Revoke: Cancels the access authorization granted by the GRANT
statement.
5.4.4 Transactional Control Language
Transactional control language (TCL) commands are used to
handle database transactions. These are used to handle changes
made to a table's data through DML statements.
The SQL commands that fall under Transfer Control
Language are as follows:
Commit: Permanently saves the work of a transaction in the database.
Rollback: Reverts the database to the state of the previous commit.
Savepoint: Marks a point within a transaction so that the transaction can
later be rolled back to it.
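A short illustrative sequence showing how these commands work together, continuing the hypothetical Student table:

UPDATE Student SET Age = 20 WHERE Admno = 101;
SAVEPOINT before_second_change;
UPDATE Student SET Age = 99 WHERE Admno = 101;
ROLLBACK TO before_second_change;   -- undo only the second update
COMMIT;                             -- make the remaining change permanent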
5.4.5 Data Query Language
The Data Query Language is a collection of commands for
querying or retrieving data from a database. In Data Query
Language, one such SQL command is
Select: It retrieves records from a table.

5.5 SQL Data Types

A database stores data depending on the type of value


stored in it. This is known as the data type or assigning a data
type to each sector. The values in a given field must all be of the
same kind.
The ANSI SQL standard recognizes only text and number
data types, while some commercial implementations add
other data types such as Date and Time. Fig 5.3 contains a
list of the ANSI data types.

Fig 5.3 SQL Data Types


Data Type: Description

Char (Character): Fixed-width string value. Values of this
type are enclosed in single quotes; for example, Anu's will be written
as 'Anu''s' (a quote inside the value is doubled).

Varchar: Variable width character string. This is similar to char,


except that the data entry size varies considerably.
dec (Decimal): It represents a fractional number of 15.12,
0.123, etc. Here the size argument consists of two parts:
precision and scale. The precision indicates how many digits the
number may have. The scale indicates the maximum number of
digits to the right of the decimal point. The size (5, 2) indicates a
precision as 5 and scale as 2. The scale cannot exceed the
precision.
Numeric: It is the same as a decimal except that the
maximum number of digits may not exceed the precision
argument.
Int (Integer): It represents a number without a decimal
point. Here the size argument is not used.
Smallint: It is the same as integer, but the default size may be
smaller than Integer.
Float: It represents a floating-point number in base-10
exponential notation and may define a precision up to a
maximum of 6.
Real: It is the same as float, except that the size argument is not
used and the precision is fixed at up to 6.
Double: Same as real, except that the precision may exceed 6.


5.6 SQL Functions


Since tables are the only way to store data, all information
must be organized in tables. SQL contains a predefined series of
commands for working with databases.
Keywords: In SQL, they have a specific name. They are
recognized as directions.
Commands are instructions provided to the database by the
user, also known as statements.
Clauses begin with a keyword and are made up of a
keyword and an argument.
SQL has a plethora of built-in features for performing data
operations. These functions come in handy when doing
mathematical equations, line concatenations, sub-strings, etc.
SQL functions are classified into two types:
 Aggregate Functions
 Scalar Functions

5.6.1 Aggregate Functions


After conducting measurements on a set of values, these
functions return a single value. The below are some of the more
often used Aggregate functions.
AVG () Function
Average yields the average value calculated from the values
in a numeric column.
Its general syntax is,
SELECT AVG(column_name) FROM table_name
Using AVG() function
Consider the following Emp table


Eid name age salary


401 Anu 22 9000
402 Shane 29 8000
403 Rohan 34 6000
404 Scott 44 10000
405 Tiger 35 8000
SQL query to find average salary will be,

SELECT avg(salary) from Emp;


The result of the above query will be,
avg(salary)
8200

COUNT() Function
Count returns the number of rows in the table, either with or
without a condition.
Its general syntax is as follows:

SELECT COUNT(column_name) FROM table-name


Using COUNT () function
Consider the following Emp table
SQL query to count employees, satisfying the specified
condition is,

SELECT COUNT (name) FROM Emp WHERE salary = 8000;


The result of the above query will be,
count(name)
2

Example of COUNT (distinct)



Consider the following Emp table


SQL query is,

SELECT COUNT (DISTINCT salary) FROM EMP;


The result of the above query will be,
count(distinct salary)
4

FIRST () Function
The first function returns the first value of a selected
column.
Using FIRST () function

SELECT FIRST(column_name) FROM table-name;


Consider the following Emp table
SQL query will be,
SELECT FIRST (salary) FROM EMP;
And the result will be,
first(salary)
9000

LAST () Function
The LAST function returns the last value of the chosen
column.
Syntax of the LAST function is,
Using LAST () function

SELECT LAST(column_name) FROM table-name;


Consider the following Emp table
SQL query will be,


SELECT LAST(salary) FROM emp;


Result of the above query will be,

last(salary)
8000

MAX() Function
The MAX function returns the highest value from a table
column.

Syntax of MAX function is,


SELECT MAX (column_name) from table-name;
Using MAX () function
Consider the following Emp table
SQL query to find the Maximum salary will be,
SELECT MAX (salary) FROM emp;

Result of the above query will be,


MAX(salary)
10000
MIN() Function
The MIN function returns the table's minimum value from a
specified column.
The syntax for the MIN function is as follows:
SELECT MIN(column_name) from table-name;

Using MIN() function


Have a look at the accompanying Emp table.
SQL query to find a minimum salary is,
SELECT MIN(salary) FROM emp;

Result will be,


MIN(salary)
6000

SUM() Function
The SUM function returns the absolute sum of the numeric
values in a given column.
Syntax for SUM is,
SELECT SUM(column_name) from table-name;
Using SUM() function
Consider the following Emp table
SQL query to find sum of salaries will be,
SELECT SUM(salary) FROM emp;
Result of the above query is,
SUM(salary)
41000

5.6.2 SCALAR FUNCTIONS


The UCASE function translates the value of a string column
to uppercase characters.
UCASE() Function
UCASE function is used to convert the value of the string
column to Uppercase characters.

Syntax of UCASE,
SELECT UCASE(column_name) from table-name;
Using UCASE() function
Consider the following Emp table
SQL query for using UCASE is,

SELECT UCASE(name) FROM emp;


Result is,
UCASE(name)
ANU
SHANE
ROHAN
SCOTT
TIGER

LCASE() Function
The LCASE function translates the values of string columns
to lowercase characters.

Syntax for LCASE is,


SELECT LCASE(column_name) FROM table-name;
Using LCASE() function
Consider the following Emp table
SQL query for converting string value to Lower case is,
SELECT LCASE(name) FROM emp;
Result will be,
LCASE(name)
anu
shane
rohan
scott
tiger


MID() Function
The MID function derives substrings from string-style
column values in a table.
Syntax for the MID function is,
SELECT MID(column_name, start, length) from table-name;
Using MID() function
Consider the following Emp table
SQL query will be,
SELECT MID(name,2,2) FROM emp;
Result will come out to be,

MID(name,2,2)
nu
ha
oh
co
ig

ROUND() Function
The ROUND function is used to round a numeric field to
the nearest integer. It is applied to decimal point values.
Syntax of Round function is,
SELECT ROUND(column_name, decimals) from table-name;
Using ROUND() function
Consider the following Emp table
SQL query is,
SELECT ROUND(salary) from emp;
Result will be,

ROUND(salary)
9000
8000
6000
10000


8000

5.7 Type of Constraints


Constraints ensure database integrity, therefore known as
database integrity constraints. The various types of constraints
are :

Fig 5.4 Type of Constraints

(i)Unique Constraint
This restriction ensures that no two rows have the same
value in the stated columns. For example, using the UNIQUE
constraint on the Admno of student table ensures that no two
students have the same admission number, and the constraint
can be used as follows:

CREATE TABLE Student


(
Admno integer NOT NULL UNIQUE, → Unique constraint
Name char (20) NOT NULL,
Gender char (1),
Age integer,

Place char (10),


);
The UNIQUE restriction can only be extended to fields that
are also NOT NULL.
Multiple constraints occur when two constraints are applied
to a single field. Multiple constraints NOT NULL and UNIQUE
are applied on a single field Admno. A space separates the
constraints, and a comma(,) is added at the end of the field
definition. Combining these two constraints, the field Admno
must have a value, i.e. it cannot be NULL and cannot be
duplicated.
(ii) Primary Key Constraint
This constraint declares a field as a primary key, which aids
in the unique identification of a record. It is similar to the unique
constraint, except that only one field in a table can be designated
as the primary key. Since the primary key does not allow NULL
values, any field declared as the primary key must have the
NOT NULL restriction.
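For example, the Student table shown earlier can be rewritten with Admno declared as the primary key; this is only a sketch of the idea:

CREATE TABLE Student
(
    Admno  integer NOT NULL PRIMARY KEY,   -- primary key constraint
    Name   char (20) NOT NULL,
    Gender char (1),
    Age    integer,
    Place  char (10)
);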

5.8 Join Expressions


The need to merge information from several tables into a single
coherent query result arises very often. For example, we might use
the EMP and DEPT tables to
show employee numbers and names and the department's name
in which employees work. We would need to merge
information from both tables, as employee information is stored
in the EMP table and department name information is stored in
the DEPT table (in the DNAME attribute).
The first thing to remember is that this will result in the EMP
and DEPT tables being included in the table-list after the

query's FROM keyword. The table-list will include all of the
tables that must be accessed during query execution. So far, the
table-list has only included one table, since our queries have only
ever accessed one table. However, if we want to list
employee numbers and names alongside department names, the
FROM clause would look like this:

FROM EMP, DEPT
However, listing both the EMP and DEPT tables after the
FROM keyword is insufficient to achieve the desired
performance. We do not just want the tables to be accessed in
the query; we want the way they are accessed to be coordinated
in a specific way. We would like to link the display of a
department name to the display of employee numbers and
names who work in that department. As a result, we need to
link employee records in the EMP table to department records in
the DEPT table. The Relational operator JOIN is used in SQL to
accomplish this. The JOIN is a fundamental principle in
relational databases and, by extension, the SQL language. This
logical combining or relating data from different tables is a
standard and requirement in almost all applications, so it is such
a central concept. The ability to consistently connect data from
various tables has been a key factor in the widespread adoption
of relational database systems.
A peculiar feature of performing JOINs, or relating
information from different tables logically as required in the
above query, is that, although the method is universally referred
to as performing a JOIN, the way it is represented in SQL does
not always involve the use of the word JOIN. This can be
particularly perplexing for newcomers to JOINs. To satisfy the
question above, for example, we will code the WHERE clause as
follows:

Where Emp.Deptno = Dept.Deptno


We want rows in the EMP table to be compared to rows in
the DEPT table by comparing rows from the two tables whose
department numbers (DEPTNOs) are the same. So we are
linking each employee record in the EMP table with the
department record for that employee in the DEPT table using
the DEPTNO column from each employee record in the EMP
table.
The full query would therefore be:

SELECT EMPNO,ENAME,DNAME

FROM EMP,DEPT

WHERE EMP.DEPTNO = DEPT.DEPTNO;


For our test data set, this query yields the expected result rows.
A few additional points should be made regarding the above query:
 Since we want to show DNAME attribute values in the result, it
must be included in the select-list.
 The DEPTNO attribute does not need to be included in the
select-list. We need the EMP.DEPTNO and DEPT.DEPTNO
columns to perform the JOIN, so we include them in the
WHERE clause, but we do not want to show any DEPTNO
detail, so it is not in the select-list.
 The order in which the EMP and DEPT tables appear after the
FROM keyword is unimportant, at least if we can overlook
performance response problems, which we can for tables of this
size.

Outer JOINs
In addition to the basic form of the JOIN, often called an inner or
natural join and used to connect rows in different tables, we
sometimes need a little more syntax than we have seen so far to get
all of the details we need. Assume we want to list all
departments with their employee numbers and names, including any
departments that do not have any employees.
As a first attempt, we might code:

SELECT DEPT. DEPTNO, DNAME, EMPNO, ENAME FROM


EMP, DEPT WHERE EMP.DEPTNO = DEPT.DEPTNO

ORDER BY DEPT.DEPTNO;
However, the findings of this first attempt do not provide a
full response to the original question. Department 40, titled
Operations, has no staff assigned to it, but it does not appear in
the results.
The issue here is that the simple JOIN extracts only matching
instances of records from the joined tables. Something else is
required to force into the result any record instances that do not
match a record in the other table. To do this,
We use a construct known as an OUTER JOIN in situations
where we want to force rows that match and do not match a
typical JOIN condition into our results set. There are three types
of OUTER JOINS: LEFT, RIGHT, and FULL OUTER JOINS. The
following tables will be used to illustrate the OUTER JOINS.
Person table


The person table holds the information of people. The ID is


the primary key.

A person can own a car or not.


Car table

The car table holds information about cars. The REG is the
primary key. A car can have an owner or not.

Left Outer Join


The LEFT OUTER JOIN syntax entails using the LEFT JOIN
keyword in the query. Here is an example: list all people
and their car registration and model, including anyone who
does not own a car. The requirement is that people who
do not own a car must still appear in the result set. To meet the
requirement, we write our query as follows:
SELECT ID, NAME, REG, MODEL

FROM Person LEFT JOIN car ON Person.ID = Car.OWNER;

Right Outer Join


The syntax of the RIGHT OUTER JOIN is similar to that of
the LEFT JOIN in that it requires the inclusion of the RIGHT


JOIN keyword in the question. For example, list all cars and
their owner's identification and name, including any cars that no
one owns.

SELECT REG,MODEL,ID,NAME

FROM Person RIGHT JOIN car ON Person.ID = Car.OWNER;

Full Outer Join


If you want to display both individual records of those who
do not own a car and car records that do not have an owner, use
the FULL OUTER JOIN:
SELECT REG,MODEL,ID,NAME

FROM Person FULL JOIN car ON Person.ID = Car.OWNER;

Using table aliases


Table aliasing entails defining aliases, or alternate names,
that can refer to a table during query processing. Following the
FROM keyword, the table aliases are listed in the table-list. For
example, the above FULL OUTER JOIN query can be written
using aliases:
SELECT REG,MODEL,ID,NAME
FROM Person p FULL JOIN car c ON p.ID = c.OWNER;

Self Joins
To compare records from the same table, it is often
necessary to JOIN a table to itself. For example, suppose we
want to compare the salaries of individual employees with one
another; a sketch of such a query follows.
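A minimal sketch of such a self-join, assuming the EMP table also has a SAL (salary) column; the aliases E1 and E2 refer to two copies of the same table:

SELECT E1.ENAME, E1.SAL, E2.ENAME, E2.SAL
FROM EMP E1, EMP E2
WHERE E1.SAL > E2.SAL;   -- each pair in which the first employee earns more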

5.8 Union, Intersect and Except


SQL extends the standard query structure with three set-
manipulation constructs. Since a query is a collection of rows, it
is logical to consider operations like union, intersection, and
difference. SQL supports these operations under the names
UNION, INTERSECT, and EXCEPT.
Other set operations in SQL include IN (to see if an element
is in a given set), op ANY, op ALL (to compare a value with the
elements in a given set using the comparison operator op), and
EXISTS (to check if a set is empty). IN and EXISTS can be
prefixed with NOT, with the apparent consequence of their
altered context.


5.9 Introduction of Views


Views in SQL are essentially virtual tables. Like a real table
in the database, a view has rows and columns. We may
construct a view by selecting fields from multiple database
tables. A View may contain either all the rows of a table or only
selected rows, depending on the situation.
This section explains how to create, delete, and update Views.

Sample Tables:
Student Details
S_Id Name Address
1 GURU CHENNAI
2 KUMAR GUNTUR
3 NARESH NELLORE
4 VENU GUDUR
Student Marks
S_ID Name Marks Age
1 GURU 86 19
2 KUMAR 91 20
3 NARESH 87 23
4 VENU 81 22


Creating Views
Using the CREATE VIEW statement, we can create a View. A
View can be built from a single table or from several tables.

Syntax:
CREATE VIEW view_name AS
SELECT column1, column2.....
FROM table_name
WHERE condition;

view_name: Name for the View


table_name: Name of the table
condition: Condition to select rows

Creating a View from Multiple Tables: In this case, we will


make a View called Marks View out of two tables: Student
Details and Student Marks. We may include multiple tables in
the SELECT statement to construct a View from multiple
tables. Question:
 CREATE VIEW Marks View AS
 SELECT Student Details.NAME, Student Details. ADDRESS,
Student Marks. MARKS
 FROM Student Details, Student Marks
 WHERE Student Details. NAME = Student Marks. NAME;
To display data of View Marks View:
SELECT * FROM MarksView;
Output:
Name Address Marks
Guru Chennai 86
Kumar Guntur 91


Naresh Nellore 87
Venu Gudur 81

Deleting Views
We have learned how to construct a View, but what if a View
we created is no longer required? We will want to get rid of it.
SQL gives us the ability to remove an existing View. Using the
DROP VIEW statement, we can delete a View.

Syntax:
DROP VIEW view_name;

view_name: Name of the View which we want to delete.


For example, if we want to delete the View Marks View, we
can do this as:
DROP VIEW Marks View;

Updating Views
Certain requirements must be met to update a view. If any
of these criteria are not met, we will be unable to update the
view.
1. The GROUP BY and ORDER BY clauses should not be included
in the SELECT statement used to build the view.
2. The DISTINCT keyword should not be included in the SELECT
statement.
3. The view should include all NOT NULL columns, so that it does not
produce NULL values for them.
4. Nested or complex queries should not be used to build the
view.


5. A single table should be used to construct the view. We would


not be able to change the view if it was generated using several
tables.
To add or delete fields from a view, we can use the CREATE
OR REPLACE VIEW expression.

Syntax:
CREATE OR REPLACE VIEW view_name AS
SELECT column1,coulmn2,..
FROM table_name
WHERE condition;
For example, if we want to update the view Marks
View and add the field AGE to this View from Student
Marks Table, we can do this as:
CREATE OR REPLACE VIEW Marks View AS
SELECT Student Details. NAME, Student Details.
ADDRESS, Student Marks. MARKS, Student Marks. AGE
FROM Student Details, Student Marks
WHERE StudentDetails.NAME = Student Marks.NAME;
If we fetch all the data from Marks View now as:
SELECT * FROM Marks View;

Output:
NAME ADDRESS MARKS AGE
GURU CHENNAI 86 19
KUMAR GUNTUR 91 20
NARESH NELLORE 87 23
VENU GUDUR 81 22


Inserting a row in a view:


In a View, we can insert a row in the same way as in a table.
To insert a row into a View, we can use the SQL INSERT INTO
statement.

Syntax:
INSERT INTO view_name (column1, column2 , column3,..)
VALUES (value1, value2, value.);
view_name: Name of the View

Example:
In the below example we will insert a new row in the View
Details View that we created above in the example of “creating
views from a single table”.
INSERT INTO Details View (NAME, ADDRESS)
VALUES ("Suresh", "Gurgaon");
If we fetch all the data from Details View now,
SELECT * FROM Details View;

Output:
NAME ADDRESS
GURU CHENNAI
KUMAR GUNTUR
NARESH NELLORE
VENU GUDUR
SURESH GURGAON

Deleting a row from a View:


Delete rows from a view in the same way you delete rows
from a table. We can use SQL's DELETE statement to delete
rows from a view. When a row is deleted from a view, it is removed

from the underlying base table, and the change is reflected in
the view.
DELETE FROM view_name
WHERE condition;
view_name: Name of view from where we want to delete
rows
condition: Condition to select rows

Example:
We will delete the last row from the view Details View in
this example, which we added in the above example of inserting
rows.
DELETE FROM Details View
WHERE NAME="Suresh";
If we fetch all the data from Details View now,
SELECT * FROM Details View;
Output:
Name Address
GURU CHENNAI
KUMAR GUNTUR
NARESH NELLORE
VENU GUDUR

With Check Option


In SQL, the WITH CHECK OPTION clause is useful for
views. It applies to a view that can be modified. If the view
cannot be modified, adding this clause in the CREATE VIEW
statement is pointless.


 The WITH CHECK OPTION clause is used to prevent a row
from being inserted into the view if the condition in the
WHERE clause of the CREATE VIEW statement is not met.
 If we used the WITH CHECK OPTION clause in the CREATE
VIEW statement, and the UPDATE or INSERT clause does not
meet the requirements, an error would be returned

Example:
In the below example, we create a View Sample View from
Student Details Table with the CHECK OPTION clause.
CREATE VIEW Sample View AS
SELECT S_ID, NAME
FROM Student Details
WHERE NAME IS NOT NULL
WITH CHECK OPTION;
In this view, if we now try to insert a new row with a null
value in the Name column, it will give an error because the view
is created with the condition for NAME column as NOT NULL.
For example, even though the View is updatable, the
query below on this View is not valid:
INSERT INTO Sample View(S_ID)
VALUES(6);
NOTE: The default value of NAME column is null.

Uses of a View :
Views should be present in a good database for the
following reasons:
1. Limiting data access –
Views provide an extra layer of protection to a table by
limiting access to a predefined collection of rows and columns.

2. Hiding data complexity –
A view can conceal the complexity of a multiple-table join.
3. Simplifying the user's commands –
Views allow users to select information from several tables
without knowing how to write a join.
4. Storing complex queries –
Views are useful for storing complex queries.
5. Columns can be renamed –
Views may also be used to rename columns without
affecting the base tables. The number of columns in the view
matches the number of columns defined in the select statement.
As a result, renaming aids in hiding the names of the columns in
the foundation tables.

6. The ability to view multiple views –


On the same table, different views can be generated for
different users.

5.10 SQL - Transactions


A transaction is a unit of work performed on a database.
Transactions are units or sequences of work completed in
sequential order, either manually by a user or automatically by a
database program.
A transaction is the propagation of one or more database
changes. For example, if you create a record, update a record, or
delete a record from a table, you are carrying out a transaction
on that table. Controlling these transactions is critical for
ensuring data integrity and dealing with database errors.


You can group several SQL queries and execute them all at
once as part of a transaction.

Transactional Properties
Transactions have the four standard properties mentioned
below, often referred to by the acronym ACID.
 Atomicity guarantees that all activities within the work unit are
completed. Otherwise, the transaction is aborted at the point of
failure, and all prior operations are reverted to their previous
state.
 Consistency ensures that the database switches states correctly
after a successfully committed transaction.
 Isolation allows transactions to run independently and
transparently to one another.
 Durability guarantees that the outcome or consequence of a
committed transaction is preserved in the event of a system
failure.
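A short illustrative transaction, assuming a hypothetical ACCOUNT table with ACC_NO and BALANCE columns; the funds transfer either commits as a whole or is rolled back:

UPDATE ACCOUNT SET BALANCE = BALANCE - 500 WHERE ACC_NO = 1001;
UPDATE ACCOUNT SET BALANCE = BALANCE + 500 WHERE ACC_NO = 1002;
COMMIT;        -- make both changes permanent
-- ROLLBACK;   -- alternatively, undo both changes instead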
5.11 Nested Queries
Nested queries are one of SQL's most powerful features. A
nested query contains another query; the embedded query is a
subquery. Of course, the embedded query can be a nested
query, allowing for queries with extremely deep nested
structures. We occasionally need to express a condition in a
query that refers to a table that must be computed. The query
used to create this subsidiary table is a subquery included in the
main query. A subquery is typically found in a query's WHERE
clause. Subqueries can occasionally appear in the FROM or
HAVING clauses.
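A minimal sketch of a nested query, using the Sailors and Reserves tables referred to in the next section (sid and bid are the sailor and boat identifiers): find the names of sailors who have reserved boat 103.

SELECT S.sname
FROM Sailors S
WHERE S.sid IN (SELECT R.sid
                FROM Reserves R
                WHERE R.bid = 103);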


Introduction to Nested Queries

5.12 Correlated Nested Queries


In the nested queries we have seen so far, the inner subquery has
been completely independent of the outer query. In general, the
inner subquery can depend on the row currently being examined in
the outer query (in terms of our conceptual evaluation strategy).


The EXISTS operator, like IN, is a set comparison operator. It


lets us determine whether a set is nonempty by comparing it to
the empty set. For each Sailors row S, we check whether the set of
Reserves rows R such that R.bid = 103 AND S.sid = R.sid is nonempty.
If it is, sailor S has reserved boat 103, and we retrieve that
sailor's name. The subquery must be re-evaluated for each row of
Sailors because it depends on the current row S; the occurrence of S
in the subquery (in the form S.sid) is called a correlation, and
such queries are known as correlated queries.
Set-Comparison Operators: EXISTS, IN, and UNIQUE, together
with their negated versions (NOT EXISTS, NOT IN, NOT UNIQUE), are
set-comparison operators. SQL also supports op ANY and op ALL,
where op is one of the arithmetic comparison operators
<, <=, =, <>, >=, >. (SOME is also allowed, but it is just a synonym for
ANY.)
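Following the Sailors/Reserves discussion above, the correlated form of the query can be sketched with EXISTS; the subquery refers to the outer row S:

SELECT S.sname
FROM Sailors S
WHERE EXISTS (SELECT *
              FROM Reserves R
              WHERE R.bid = 103
                AND R.sid = S.sid);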


5.13 Aggregate Operators


SQL allows aggregate values to be calculated using a powerful class of
constructs such as MIN and SUM. These features are a significant
extension of relational algebra. SQL has five aggregate
operations that can be applied to any column, say A, of a relation:
1. COUNT ([DISTINCT] A): The number of (distinct) values in the A column.
2. SUM ([DISTINCT] A): The sum of all (distinct) values in the A column.
3. AVG ([DISTINCT] A): The average of all (distinct) values in the A column.
4. MAX (A): The maximum value in the A column.
5. MIN (A): The minimum value in the A column.
It is worth noting that specifying DISTINCT in conjunction with
MIN or MAX has no effect.


5.14 Group by and Having Clauses


We frequently want to apply aggregate operations to each of
a relation's multiple groups of rows, the number of which varies
depending on the relation case (i.e., is not known in advance).
To write such queries, we need to add a significant extension
to the basic SQL query type, namely the GROUP BY clause. In
reality, the extension includes an optional HAVING clause that


allows you to specify qualifications over classes. The following


is a general form of a SQL query with these extensions:
SELECT [DISTINCT] select-list
FROM from-list
WHERE 'qualification’
GROUP BY grouping-list
HAVING group-qualification

A few key points to remember about the new clauses:


The select-list in the SELECT clause consists of
(1) a list of column names and
(2) a list of terms of the form aggop(column-name) AS new-name.
AS has been used to rename output columns before; because columns
created by aggregate operators do not otherwise have a name, naming
the column with AS is especially useful here.
Every column that appears in (1) must also appear in the
grouping-list. Each row in the query result corresponds to one
group, a set of rows that agree on the values of the grouping-list
columns. If a column appears in list (1) but not in the grouping-list,
several rows within a group may have different values in this
column, and it is unclear what value should be assigned to this
column in an answer row.


The expressions in the group-qualification of the HAVING clause must
have a single value per group. The HAVING clause
determines whether an answer row is to be generated for a
given group.
In SQL-92, a column appearing in the group-qualification must appear
either as the argument of an aggregation operator or in the
grouping-list. SQL:1999 introduced two new set functions that let us
check whether any or all of the rows in a group satisfy a condition,
allowing conditions similar to those found in a
WHERE clause.
The entire table is treated as a single group when GROUP
BY is not defined.
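A minimal sketch, assuming a hypothetical EMPLOYEE table with DEPT_ID and SALARY columns: compute the average salary per department, but only for departments with more than two employees.

SELECT DEPT_ID, AVG(SALARY) AS avg_salary
FROM EMPLOYEE
GROUP BY DEPT_ID
HAVING COUNT(*) > 2;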


5.15 Null Values


SQL includes a special column value called null if the value
is unknown. Null is used when the column value is unclear or
inapplicable.

5.15.1 Comparisons Using Null Values


When we use comparison operators such as > or = and one of the
values being compared is null, the result is unknown. If the rating
is null in two distinct rows of the Sailors relation, any comparison
between them returns unknown.
The IS NULL comparison operator in SQL can determine if a
column value is null; for example, rating IS NULL would
evaluate to true on the row representing Dan if his rating was
not provided. We can also assert that rating IS NOT NULL on
Dan's lines, which evaluates to false.
AND, OR, and NOT are logical connectives.
We must describe the logical operators AND, OR, and NOT
using a three-value logic.
Those expressions are true(T), false(F), or unknown (U).

And
conditon1 condition2 Result
F F F
F T F
F U F
T F F
T T T
T U U
U U U

OR
conditon1 condition2 Result
F F F
F T T
F U U
T F T
T T T
T U T
U U U

NOT unknown evaluates to unknown.


5.15.2 Disallowing Null Values: We can prevent null values by
using NOT NULL in the field description, such as sname CHAR
(20) NOT NULL; additionally, null values are not permitted in
fields in a primary key. As a result, any field specified in a
PRIMARY KEY constraint is forced to be NOT NULL.

5.16 Complex Integrity Constraints in SQL


5.16.1 Constraints over a Single Table: Constraints over a single
table are specified as table constraints, using a CHECK clause in the
CREATE TABLE statement; the CHECK condition can be an arbitrary
condition over the columns of that table.

To enforce the constraint that Interlake boats cannot be


reserved, we could use:
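One possible formulation is sketched below, assuming Boats(bid, bname, color) and Reserves(sid, bid, day) tables as in the Sailors examples of this chapter; note that many SQL systems restrict or reject subqueries inside CHECK, so this is illustrative only:

CREATE TABLE Reserves (
    sid  INTEGER,
    bid  INTEGER,
    day  DATE,
    CONSTRAINT noInterlakeRes
        CHECK ('Interlake' <> (SELECT B.bname
                               FROM Boats B
                               WHERE B.bid = Reserves.bid))
);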


5.16.2 Domain Constraints and Distinct Types


The CREATE DOMAIN statement, which uses CHECK
constraints, can be used to create a new domain.
CREATE DOMAIN ratingval INTEGER DEFAULT 1
    CHECK (VALUE >= 1 AND VALUE <= 10)
The domain ratingval's underlying, or source, form is
INTEGER, and all ratingval values must be of this type. A
CHECK constraint further restricts Ratingval values; in defining
this constraint, we use the keyword VALUE to refer to a domain
value. We can use the full power of SQL queries to constrain the
values that belong to a domain with this function. The name of a
domain can be used to restrict column values in a table once it
has been specified; for example, in a schema declaration, we can
use the following line:
rating ratingval
The DEFAULT keyword can link a domain to a default
value. The ratingval default value 1 is used if the domain
ratingval is used for a column in a relation and no value is
entered for this column in an inserted tuple.
Example: CREATE TYPE ratingtype AS INTEGER

5.16.3 Assertions: ICs over Several Tables



Table constraints are associated with a single table, although the
conditional expression in the CHECK clause can refer to other
tables. A table constraint is required to hold only if the associated
table is non-empty. As a result, table constraints can be awkward and
unsatisfactory when a constraint involves two or more tables.
To address such scenarios, SQL allows assertions, which are not
associated with any one table.
Assume we want to require that the number of boats plus the
number of sailors is less than 100. (This requirement might be
imposed, say, to qualify as a small sailing club.)
We can write the following assertion:

CREATE ASSERTION smallClub


CHECK (( SELECT COUNT (S.sid) FROM Sailors S )
+ ( SELECT COUNT (B. bid) FROM Boats B)< 100 )


CHAPTER 6
PL/SQL AND ADVANCED SQL
6.1 Introduction to PL/SQL
PL/SQL is a combination of SQL and the procedural
features of programming languages. Oracle Corporation
developed it in the early 90's to enhance the capabilities of SQL.
PL/SQL is one of the three key programming languages embedded
in the Oracle Database, along with SQL itself and Java.
SQL's disadvantages include
 SQL does not provide programmers with condition checking,
looping, or branching constructs.
 SQL statements are sent to the Oracle engine one at a time,
increasing traffic and decreasing speed.
 SQL does not support error checking when manipulating data.

6.2 PL/SQL features include:


 PL/SQL is a procedural language, which provides the
functionality of decision making, iteration and many more
features of procedural programming languages.
 PL/SQL can execute several queries in one block using single
command.
 One can create a PL/SQL unit such as procedures, functions,
packages, triggers, and types stored in the database for reuse
by applications.
 PL/SQL provides a feature to handle the exception in PL/SQL
block known as exception handling block.
 Applications written in PL/SQL are portable to computer
hardware or operating system where Oracle is operational.
 PL/SQL Offers extensive error checking.


6.3 Structure of PL/SQL Block:


PL/SQL extends SQL by adding constructs found in
procedural languages, resulting in a structural language that is
more powerful than SQL. The basic unit in PL/SQL is a block.
All PL/SQL programs are made up of blocks, which can be
nested within each other.

Fig 6.1 Structure of PL/SQL Block

Typically, each block performs a logical action in the program.


A block has the following structure:
DECLARE
declaration statements;

BEGIN
executable statements

EXCEPTIONS
exception handling statements

END;


 The declaration section starts with the DECLARE keyword. In it,
variables, constants, records, and cursors can be declared to
temporarily store data. It consists of the definitions of PL/SQL
identifiers. This part of the code is optional.
 The execution section starts with the BEGIN keyword and ends
with the END keyword. It is a mandatory section, and the
program logic, such as loops and conditional statements, is written
here. It supports DML commands, DDL commands, and SQL*Plus
built-in functions as well.
 The exception section starts with the EXCEPTION keyword. This
section is optional and contains statements executed when a
run-time error occurs. Any exceptions can be handled in this
section.

6.4 PL/SQL Identifiers


Several PL/SQL identifiers include variables, constants,
procedures, cursors, triggers etc.
1. Variables:
Like several other programming languages, variables in
PL/SQL must be declared before its use. They should have a
valid name and data type as well.
Syntax for declaration of variables:
variable_name datatype [NOT NULL := value ];
Example to show how to declare variables in PL/SQL :

SQL> SET SERVEROUTPUT ON;


SQL> DECLARE
var1 INTEGER;
var2 REAL;


var3 varchar2(20) ;
BEGIN
null;
END;
/

Output:
PL/SQL procedure completed.
Explanation:
 SET SERVEROUTPUT ON: It enables the output buffer used by
dbms_output to be displayed.
 var1 INTEGER: It declares a variable named var1 of integer
type. Many other data types can be used, such as float, int, real,
smallint, long, etc. PL/SQL also supports the data types used in
SQL, as well as NUMBER(prec, scale), varchar, varchar2, etc.
 PL/SQL procedure completed.: It is displayed when the code is
compiled and executed successfully.
 Slash (/) after END;: The slash (/) tells SQL*Plus to execute
the block.

INITIALISING VARIABLES:
The variables can also be initialised just like in other
programming languages. Let us see an example for the same:

SQL> SET SERVEROUTPUT ON;


SQL> DECLARE
var1 INTEGER := 2 ;
var3 varchar2(20) := 'I Love GeeksForGeeks' ;
BEGIN


null;
END;
/

Output:
PL/SQL procedure completed.
Explanation:
 Assignment operator (:=) : It assigns a value to a variable.

Displaying Output:
The outputs are displayed using DBMS_OUTPUT, a built-in
package that enables users to display output, debug
information, and send messages from PL/SQL blocks,
subprograms, packages, and triggers.
Let us see an example to see how to display a message using
PL/SQL :

SQL> SET SERVEROUTPUT ON;


SQL> DECLARE
var varchar2(40) := 'I love GeeksForGeeks' ;
BEGIN
dbms_output.put_line(var);
END;
/

Output:
I love GeeksForGeeks

PL/SQL procedure completed.


Explanation:


 dbms_output.put_line : This command directs the PL/SQL


output to a screen.

2. Using Comments:
Like in many other programming languages, comments can be
placed within PL/SQL code; they do not affect the execution of
the code. There are two syntaxes for creating comments in
PL/SQL:
 Single-line comment: The symbol -- is used to create a single-
line comment.
 Multi-line comment: To create comments that span several
lines, the symbols /* and */ are used.
Example to show how to create comments in PL/SQL :

SQL> SET SERVEROUTPUT ON;


SQL> DECLARE
-- I am a comment so that i will be ignored.
var varchar2(40) := 'I love GeeksForGeeks' ;
BEGIN
dbms_output.put_line(var);
END;
/

Output:
I love GeeksForGeeks

PL/SQL procedure completed.

3. Taking input from user:


Like other programming languages, in PL/SQL we can also
take input from the user and store it in a variable. Let us see an
example that shows how to take input from the user in PL/SQL:

SQL> SET SERVEROUTPUT ON;


SQL> DECLARE
-- taking input for variable a
a number := &a;
-- taking input for variable b
b varchar2(30) := &b;
BEGIN
null;
END;
/

Output:
Enter value for a: 24
old 2: a number := &a;
new 2: a number := 24;
Enter value for b: 'GeeksForGeeks'
old 3: b varchar2(30) := &b;
new 3: b varchar2(30) := 'GeeksForGeeks';

PL/SQL procedure completed.

Let us see a PL/SQL example that demonstrates all the above
concepts in one single block of code.

--PL/SQL code to print sum of two numbers taken from


the user.
SQL> SET SERVEROUTPUT ON;
SQL> DECLARE


-- taking input for variable a


a integer := &a ;
-- taking input for variable b
b integer := &b ;
c integer ;
BEGIN
c := a + b ;
dbms_output.put_line('Sum of '||a||' and '||b||' is =
'||c);
END;
/

Enter value for a: 2


Enter value for b: 3

Sum of 2 and 3 is = 5

PL/SQL procedure completed.


PL/SQL Execution Environment:
The PL/SQL engine resides within the Oracle engine. The Oracle
engine can process a single SQL statement as well as a block of
many statements. The call to the Oracle engine needs to be made
only once to execute any number of SQL statements if these SQL
statements are bundled inside a PL/SQL block.
6.5 Cursors in PL/SQL
Cursor in SQL
The Oracle engine uses a work area for its internal processing
and storing the information to execute SQL statements. This
work area is private to SQL’s operations. The ‘Cursor’ is the


PL/SQL construct that allows the user to name the work area
and access the stored information.
Use of Cursor
The major function is to retrieve data, one row at a time, from
a result set, unlike the SQL commands which operate on all the
rows in the result set at one time.
Cursors are used when the user needs to update records in a
singleton fashion or row by row in a database table.
The Data stored in the Cursor is called the Active Data Set.
The Oracle DBMS has another predefined area in the main
memory, within which cursors are opened. Hence the size of a
cursor is limited by the size of this predefined area.

fig 6.2 Cursors in PL/SQL

Cursor Actions
 Declare Cursor: A cursor is declared by defining the SQL
statement that returns a result set.
 Open: A Cursor is opened and populated by executing the SQL
statement defined by the cursor.


 Fetch: When the cursor is opened, rows can be fetched from the
cursor one by one or in a block to perform data manipulation.
 Close: After data manipulation, close the cursor explicitly.
 Deallocate: Finally, delete the cursor definition and release all
the system resources associated with the cursor.

6.5.1 Types of Cursors


Cursors are classified depending on the circumstances in
which they are opened.
 Implicit Cursor: If the Oracle engine opens a cursor for its
internal processing, it is an Implicit Cursor. It is created
“automatically” for the user by Oracle when a query is
executed and is simpler to code.
 Explicit Cursor: On demand, a Cursor can also be opened for
processing data through a PL/SQL block. Such a user-defined
cursor is known as an Explicit Cursor.
Explicit cursor
An explicit cursor is defined in the declaration section of
the PL/SQL block. It is created on a SELECT statement that
returns more than one row, and it must be given a suitable name.
The general syntax for declaring a cursor is:
CURSOR cursor_name IS select_statement;
cursor_name – an appropriate name for the cursor.
select_statement – a SELECT statement that returns several
rows.

Use Explicit Cursor


Using an explicit cursor consists of four stages:
DECLARE the cursor in the declaration section.
OPEN the cursor in the execution section.
FETCH data from the cursor into PL/SQL variables or
records in the execution section.
CLOSE the cursor in the execution section before ending the
PL/SQL block.

Syntax:
DECLARE variables;
records;
create a cursor;
BEGIN
OPEN cursor;
FETCH cursor;
process the records;
CLOSE cursor;
END;
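As a concrete sketch of these four stages (the EMPLOYEES table and its columns emp_id and emp_name are assumed here purely for illustration), an explicit cursor could be used as follows:

SQL> DECLARE
CURSOR emp_cur IS
SELECT emp_id, emp_name FROM employees;   -- declare the cursor
v_id   employees.emp_id%TYPE;
v_name employees.emp_name%TYPE;
BEGIN
OPEN emp_cur;                              -- open: execute the query
LOOP
FETCH emp_cur INTO v_id, v_name;           -- fetch one row at a time
EXIT WHEN emp_cur%NOTFOUND;
dbms_output.put_line(v_id || ' ' || v_name);
END LOOP;
CLOSE emp_cur;                             -- close: release the work area
END;
/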

6.6 PL/SQL - PROCEDURES


A subprogram is a program unit/module that performs a
particular task. These subprograms are combined to form larger
programs. This is called the 'Modular design'. A subprogram
can be invoked by another subprogram or program called
the calling program.
A subprogram can be created −
 At the schema level
 Inside a package
 Inside a PL/SQL block
At the schema level, a subprogram is a standalone subprogram.
It is created with the CREATE PROCEDURE or the CREATE
FUNCTION statement. It is stored in the database and can be
deleted with the DROP PROCEDURE or DROP FUNCTION
statement.
A subprogram created inside a package is a packaged
subprogram. It is stored in the database and can be deleted only
when the package is deleted with the DROP PACKAGE
statement. We will discuss packages in the chapter 'PL/SQL -
Packages'.
PL/SQL subprograms are named PL/SQL blocks that can be
invoked with a set of parameters. PL/SQL provides two kinds
of subprograms −
 Functions − These subprograms return a single value; mainly
used to compute and return a value.
 Procedures − These subprograms do not return a value directly;
they are mainly used to perform an action.
Parts of a PL/SQL Subprogram
Each PL/SQL subprogram has a name, and may also have a
parameter list. Like anonymous PL/SQL blocks, the named
blocks will also have the following three parts:
Declarative Part
It is an optional part. However, the declarative part for a
subprogram does not start with the DECLARE keyword. It
contains declarations of types, cursors, constants, variables,
exceptions, and nested subprograms. These items are local to the
subprogram and cease to exist when the subprogram completes
execution.
Executable Part
This is a mandatory part and contains statements that perform
the designated action.
Exception-handling Part
This is again an optional part. It contains the code that handles
run-time errors.

Creating a Procedure
A procedure is created with the CREATE OR REPLACE
PROCEDURE statement. The simplified syntax for the CREATE
OR REPLACE PROCEDURE statement is as follows −
CREATE [OR REPLACE] PROCEDURE procedure_name
[(parameter_name [IN | OUT | IN OUT] type [, ...])]
{IS | AS}
BEGIN
< procedure_body >
END procedure_name;

In this case,
 procedure-name specifies the name of the procedure.
 [OR REPLACE] option allows the modification of an existing
procedure.
 The optional parameter list contains name, mode and types of
the parameters. IN represents the value passed from outside
and OUT represents the parameter used to return a value
outside of the procedure.
 procedure-body contains the executable part.
 The AS keyword is used instead of the IS keyword for creating a
standalone procedure.

Example
The example below shows how to write a basic procedure
that displays the string 'Hello World!' on the screen when
executed.

CREATE OR REPLACE PROCEDURE greetings


AS
BEGIN
dbms_output.put_line('Hello World!');
END;
/

When the above code is run at the SQL prompt, the procedure is
created and stored in the database.
Executing a Standalone Procedure
A standalone procedure can be called in two ways −
 Using the EXECUTE keyword
 Calling the name of the procedure from a PL/SQL block
The above procedure named 'greetings' can be called with the
EXECUTE keyword as −
BEGIN
greetings;
END;
/
The above call will display −
Hello World!
PL/SQL procedure completed.
Deleting a Standalone Procedure
A standalone procedure is deleted with the DROP PROCEDURE
statement. The syntax for deleting a procedure is as follows:
DROP PROCEDURE procedure-name;
You can drop the greetings procedure by using the following
statement −
DROP PROCEDURE greetings;
PL/SQL Subprogram Parameter Modes

The parameter modes in PL/SQL subprograms are described
below.
1. IN
An IN parameter lets you pass a value to the subprogram. It is a
read-only parameter. Inside the subprogram, an IN parameter
acts like a constant: it cannot be assigned a value. You can pass a
constant, literal, initialized variable, or expression as an IN
parameter. You can also initialize it to a default value; in that
case, it can be omitted from the subprogram call. IN is the default
mode of parameter passing, and IN parameters are passed by
reference.
2. OUT
An OUT parameter returns a value to the calling program.
Inside the subprogram, an OUT parameter acts like a variable:
you can change its value and reference the value after assigning
it. The actual parameter must be a variable, and it is passed by
value.
3. IN OUT
An IN OUT parameter passes an initial value to a subprogram
and returns an updated value to the caller. It can be assigned a
value, and the value can be read. The actual parameter
corresponding to an IN OUT formal parameter must be a
variable, not a constant or an expression. The formal parameter
must be assigned a value, and the actual parameter is passed by
value.
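To illustrate the IN and OUT modes, the hedged sketch below defines a procedure that adds two numbers; the names add_numbers, a, b, and total are purely illustrative:

CREATE OR REPLACE PROCEDURE add_numbers (
a     IN  NUMBER,      -- IN: read-only input value
b     IN  NUMBER,
total OUT NUMBER )     -- OUT: value returned to the caller
AS
BEGIN
total := a + b;
END add_numbers;
/

The procedure can then be called from an anonymous block, passing a variable for the OUT parameter:

DECLARE
result NUMBER;
BEGIN
add_numbers(10, 20, result);
dbms_output.put_line('Total = ' || result);
END;
/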

6.7 OLAP queries in SQL:



The CUBE operator computes a union of GROUP BY’s on


every subset of the specified attribute types. Its result set
represents a multidimensional cube based upon the source
table. Consider the following SALESTABLE:
PRODUCT QUARTER REGION SALES
A Q1 Europe 10
A Q1 America 20
A Q2 Europe 20
A Q2 America 50
A Q3 America 20
A Q4 Europe 10
A Q4 America 30
B Q1 Europe 40
B Q1 America 60
B Q2 Europe 20
B Q2 America 10
B Q3 America 20
B Q4 Europe 10
B Q4 America 40

Example: SALESTABLE.


We can now formulate the following SQL query:

SELECT QUARTER, REGION, SUM (SALES)


FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)
Essentially, this query computes the union of 2² = 4
groupings of SALESTABLE, namely {(quarter, region), (quarter),
(region), ()}, where () represents an empty group list containing
the total over the entire SALESTABLE. In other words,
since quarter has four values and region has two, the resulting
multiset has 4*2 + 4*1 + 1*2 + 1 = 15 tuples, as shown in
Table 1. NULL values have been used in the dimension
columns Quarter and Region to denote the aggregation. If
required, they can easily be replaced by the more accurate
'ALL'. To be more precise, we should add two CASE clauses as
follows:
SELECT CASE WHEN grouping (QUARTER) = 1 THEN
'All' ELSE QUARTER END AS QUARTER, CASE WHEN
grouping (REGION) = 1 THEN 'All' ELSE REGION END AS
REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)
The grouping() function returns 1 if a NULL value was
generated during the aggregation and 0 otherwise. This
distinguishes the generated NULLs from any actual NULLs
present in the data. We will not do this in the following OLAP
queries to avoid overcomplicating them.
Also, observe the NULL value for Sales in the fifth row. This
represents an attribute combination not present in the original
SALESTABLE since no products were sold in Q3 in Europe.
Remark that besides SUM(), other SQL aggregate functions
such as MIN(), MAX(), COUNT(), and AVG() can also be used in
the SELECT statement.

Table 1: Result from SQL query with Cube operator


QUARTER REGION SALES
Q1 Europe 50
Q1 America 80


Q2 Europe 40
Q2 America 60
Q3 Europe NULL
Q3 America 40
Q4 Europe 20
Q4 America 80
Q1 NULL 130
Q2 NULL 100
Q3 NULL 40
Q4 NULL 90
NULL Europe 110
NULL America 250
NULL NULL 360

The ROLLUP operator computes the union on each prefix of
the list of specified attribute types, from the most detailed level
up to the grand total. It is particularly useful for creating reports
that contain both subtotals and totals. The primary distinction
between the ROLLUP and CUBE operators is that the former
creates a result set containing aggregates for a hierarchy of
values of the specified attribute types, whereas the latter
produces a result set containing aggregates for all combinations
of values of the selected attribute types. As a result, the order in
which the attribute types are mentioned is important for the
ROLLUP operator but not for the CUBE operator. Consider the
following query:
SELECT QUARTER, REGION, SUM (SALES)
FROM SALESTABLE
GROUP BY ROLLUP (QUARTER, REGION)
This query generates the union of three groupings {(quarter,
region), (quarter), ()}, where () again represents the full
aggregation. The resulting multiset will thus have 4*2 + 4 + 1 = 13
rows and is displayed in Table 2. You can see that the region
dimension is rolled up first, followed by the quarter dimension.
Note the two rows that have been left out compared to the result
of the CUBE operator in Table 1.

Table 2: Result from SQL query with ROLLUP operator.


QUARTER REGION SALES
Q1 Europe 50
Q1 America 80
Q2 Europe 40
Q2 America 60
Q3 Europe NULL
Q3 America 40
Q4 Europe 20
Q4 America 80
Q1 NULL 130
Q2 NULL 100
Q3 NULL 40
Q4 NULL 90
NULL NULL 360


Whereas the previous example used the GROUP BY
ROLLUP construct on two entirely separate dimensions, it can
also be used on attribute types that represent different
aggregation levels (and hence different levels of detail) along the
same dimension. Assume the SALESTABLE tuples contained
more detailed sales data at the city level, with three location-
related columns: City, Country, and Region. We could then write
the following ROLLUP query, which returns sales totals per city,
per country, per region, and the overall total:

SELECT REGION, COUNTRY, CITY, SUM(SALES)


FROM SALESTABLE
GROUP BY ROLLUP (REGION, COUNTRY, CITY)

In that case, the SALESTABLE contains the attribute
types City, Country, and Region in a single table. Since these
three attribute types represent different levels of detail in the
same dimension, they are transitively dependent on one
another, illustrating that the data warehouse data is
denormalized.
The GROUPING SETS operator produces a result set equal
to that produced by a UNION ALL of several simple GROUP
BY clauses. Consider the following scenario:
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER), (REGION))
This query is equivalent to:
SELECT QUARTER, NULL, SUM(SALES)
FROM SALESTABLE
GROUP BY QUARTER

UNION ALL
SELECT NULL, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY REGION
The result is given in Table 3.
Table 3: Result from SQL query with GROUPING SETS
operator
QUARTER REGION SALES
Q1 NULL 130
Q2 NULL 100
Q3 NULL 40
Q4 NULL 90
NULL Europe 110
NULL America 250
A single SQL query can contain several CUBE, ROLLUP,
and GROUPING SETS clauses. Various combinations of CUBE,
ROLLUP, and GROUPING SETS can generate equivalent result
sets. Consider the following query:
SELECT QUARTER, REGION, SUM (SALES)
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION)
This query is equivalent to:
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER, REGION),
(QUARTER), (REGION), ())
Likewise, the following query:
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY ROLLUP (QUARTER, REGION)

is identical to:
SELECT QUARTER, REGION, SUM(SALES)
FROM SALESTABLE
GROUP BY GROUPING SETS ((QUARTER, REGION),
(QUARTER),())
Given the volume of data to be aggregated and retrieved,
OLAP SQL queries can become extremely time-consuming.
Turning some of these OLAP queries into materialized views is
one way to improve performance. For example, a SQL query
with a CUBE operator can be used to precompute aggregations
on a set of dimensions, which can then be saved as a materialized
view. A downside of view materialization is that additional
work is required to update these materialized views
periodically. However, it should be remembered that most
businesses are happy with a near-current version of the data, so
synchronization can be done overnight or at set time
intervals.
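For example, the CUBE query above could be precomputed and stored as a materialized view along the following lines; the view name and the BUILD/REFRESH options shown are only one possible Oracle configuration and are given for illustration:

CREATE MATERIALIZED VIEW sales_cube_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT QUARTER, REGION, SUM(SALES) AS TOTAL_SALES
FROM SALESTABLE
GROUP BY CUBE (QUARTER, REGION);

Queries against the cube can then read the precomputed aggregates from sales_cube_mv, and the view is refreshed overnight or at set time intervals.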

6.8 Introduction to Recursive SQL


An Overview of Recursive SQL
Learning recursive SQL techniques will improve your
productivity as a SQL programmer. A recursive query refers to
itself. The easiest way to understand recursion is to imagine a
mirror reflected in another mirror: when you look into it, you see
many reflections of yourself. This is an example of recursion in
real life.
Recursive SQL is implemented differently in various DBMS
products. Recursion is supported in SQL-99 by using common
table expressions (CTEs). Recursive queries using CTEs are
supported by DB2, Microsoft SQL Server, Oracle, and

PostgreSQL. It should be noted that Oracle also has an
alternative syntax based on the CONNECT BY construct, which
we will not go through here.
A CTE is a temporary table inside a SQL statement held for
the length of that statement. A single SQL statement may
contain several CTEs, but each must have a specific name. The
WITH clause is used at the beginning of a query to describe a
CTE.
Before we get to recursion, let us look at some data that will
benefit from being read recursively. A hierarchical organization
structure is shown in Figure 6.3.

Figure 6.3. A sample hierarchy

A table holding this data could be set up as follows:


CREATE TABLE ORG_CHART
(MGR_ID SMALLINT,
EMP_ID SMALLINT,

EMP_NAME CHAR(20))
;
Of course, this is a simplified implementation, and a real-world
hierarchy will almost certainly require several more columns.
However, this simple table is sufficient for our purposes of
studying recursion. To make the data in this table match the data
in our diagram, we will load it as follows:
MGR_ID EMP_ID EMP_NAME
-1 1 BIG BOSS
1 2 LACKEY
1 3 LIL BOSS
1 4 BOOTLICKER
2 5 GRUNT
3 6 TEAM LEAD
6 7 LOW MAN
6 8 SCRUB
The MGR_ID for the top-most node is set to a value
showing that this row has no parent, in this case –1. Now that
we have loaded the data, we can write a query to traverse the
hierarchy using recursive SQL. If we need to report on the
whole organizational structure under LIL BOSS, the following
recursive SQL using a CTE would suffice:
WITH EXPL (MGR_ID, EMP_ID, EMP_NAME) AS
(
SELECT ROOT.MGR_ID, ROOT.EMP_ID,
ROOT.EMP_NAME
FROM ORG_CHART ROOT
WHERE ROOT.EMP_ID = 3


UNION ALL

SELECT CHILD.MGR_ID, CHILD.EMP_ID,


CHILD.EMP_NAME
FROM EXPL PARENT, ORG_CHART CHILD
WHERE PARENT.EMP_ID = CHILD.MGR_ID
)

SELECT DISTINCT MGR_ID, EMP_ID, EMP_NAME


FROM EXPL
ORDER BY MGR_ID, EMP_ID;

The results of running this query would be:


MGR_ID EMP_ID EMP_NAME
1 3 LIL BOSS
3 6 TEAM LEAD
6 7 LOW MAN
6 8 SCRUB

Let us break down this very confusing query into its


constituent parts to explain better what is going on. First and
foremost, the WITH clause is used to execute a recursive query
(using a CTE). EXPL is the name of the CTE. The first SELECT
primes the pump, establishing the 'root' of the search; in our
case, we begin with EMP_ID 3, which is LIL BOSS.
The second SELECT is an inner join that joins the CTE to
the table on which the CTE is based. This is where recursion
comes in: part of the CTE definition refers to the CTE itself.
Finally, we SELECT from the CTE. Similar queries


may be written to explode the hierarchy to extract all


descendants of any given node.
Recursive SQL has the potential to be both elegant and
efficient. However, because of the difficulties developers may
have in learning recursion, it is often considered "too inefficient
to use frequently." However, if you have a business need to walk
or explode hierarchies in your database, recursive SQL will most
likely be the most efficient choice. What is the alternative? You
can build pre-exploded tables, but this requires denormalization
and a significant amount of pre-processing, which is inefficient.
You could write your own program code to traverse the
hierarchy. This, too, can be problematic: you would almost
certainly read more data than you need, resulting in sluggish
I/O. Moreover, how can you ensure that your code outperforms
the DBMS?
If most of the rows processed by the query are needed in the
result set ("find all employees who work for LIL BOSS"),
recursion is likely to be very efficient. A recursive query can be
very expensive if only a few of the rows processed by the query
are required ("find all flights from Houston to Pittsburgh, but
display only the three fastest"). The bottom line is that recursive
SQL should be considered where business requirements call for
it. However, ensure that appropriate indexes are available and
that the access paths are examined.
6.9 Triggers and Active Databases
A trigger is a procedure that is automatically invoked by the
DBMS in response to specified changes to the database, and it is
typically defined by the DBA. A database with an associated set
of triggers is called an active database. A trigger description
contains three parts:

An event: a change to the database that activates the trigger.
A condition: a query or test that is run when the trigger is
activated.
An action: a procedure that is executed when the trigger is
activated and its condition is true.
A trigger can be thought of as a 'daemon' that monitors a
database and executes when the database is modified in a way
that matches the event specification. An insert, delete, or update
statement can activate a trigger, regardless of which user or
application invoked the triggering statement; users may not even
be aware that a trigger was executed as a side effect of their
program.
A trigger condition can be a true/false statement (for
example, all employee salaries are less than $100,000) or a query.
A query is interpreted as true if its answer set is nonempty and
false otherwise. If the condition part evaluates to true, the action
associated with the trigger is carried out.
A trigger action can examine the answers to the query in the
condition part of the trigger, refer to the old and new values of
tuples modified by the statement that activated the trigger,
execute new queries, and make changes to the database.
An action can even execute a series of data-definition commands
(e.g., create new tables, change authorizations) and
transaction-oriented commands (e.g., commit) or call host-
language procedures.
An important issue is when the action part of a trigger executes
in relation to the statement that activated it. For example, a
statement that inserts records into the Students table may
activate a trigger that keeps track of how many students under
the age of 18 are inserted at a time by a typical insert statement.
Depending on exactly what the trigger does, we may want its
action to be executed before or after the changes are made to the
Students table: a trigger that initializes a variable used to count
the number of qualifying insertions should be executed before
the records are inserted, and a trigger that executes once per
qualifying inserted record and increments the variable should be
executed after each record is inserted (because we may want to
examine the values in the new record to determine the action).
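As a sketch of the Students example in Oracle trigger syntax (the Students and InsertStats tables and their columns are assumptions introduced only for this illustration), a row-level trigger that increments a counter for each qualifying insertion could look like this:

CREATE OR REPLACE TRIGGER count_young_students
AFTER INSERT ON Students
FOR EACH ROW                        -- row-level: fires once per inserted record
WHEN (NEW.age < 18)                 -- condition part of the trigger
BEGIN
UPDATE InsertStats                  -- action part of the trigger
SET young_count = young_count + 1;
END;
/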

6.10 Designing Active Databases


An active database is a database with an associated set of
triggers. Such databases are difficult to maintain because of the
complexity that arises in understanding the combined effect of
the triggers. Before executing a statement that modifies the
database, the DBMS first checks whether any trigger specified
on the affected data is activated. If a trigger is activated, the
DBMS evaluates its condition part and executes the action part
only if the condition evaluates to true. A single statement can
activate more than one trigger.
In such a situation, the DBMS processes the triggers in some
arbitrary order. The action part of a trigger may in turn activate
other triggers, or even the same trigger that initiated the action;
triggers that activate themselves are called recursive triggers.
The DBMS executes such chains of triggers in some predefined
manner, but this makes the overall behaviour harder to
understand.

Features of Active Database:


1. It possesses all the concepts of a conventional database, i.e., data
modelling facilities, query language, etc.
2. It supports all the functions of a traditional database, such as data
definition, data manipulation, storage management, etc.
3. It supports the definition and management of ECA
(event-condition-action) rules.
4. It detects event occurrences.
5. It must be able to evaluate conditions and execute actions.
6. This means that it has to implement rule execution.
Advantages:
1. Enhances traditional database functionality with powerful
rule-processing capabilities.
2. Enables a uniform and centralized description of the business
rules relevant to the information system.
3. Avoids redundancy of checking and repair operations.
4. Provides a suitable platform for building large and efficient
knowledge bases and expert systems.


CHAPTER 7
QUERY PROCESSING

7.1 Introduction
The primary goal of creating a database is to store related
data in one location, allowing the user to access and manipulate
it as needed. Data access and manipulation should be done
efficiently, that is, it should be easy and quick to access.
However, a database is a system, and the users can be
another system, an application, or a person. The data can be
requested in a language that the user understands. On the other
hand, DBMS has its vocabulary (SQL) that it knows. As a result,
users must query the database in its native language, SQL. SQL
is a high-level language designed to bridge the gap between the
user and the database management system. However, the
underlying systems in the DBMS will not comprehend SQL.
There must be some kind of low-level language that these
systems can understand. Typically, any SQL query is converted
into a low-level language that the system can understand using
relational algebra. However, no user will be able to write
relational algebra queries directly. It necessitates a thorough
understanding of it.
As a result, the DBMS asks its users to write queries in SQL. It
validates the user's query before converting it into a low-level
language. It then chooses the best execution path, runs the
query, and retrieves the data from storage. All of these steps
together are referred to as query processing.


7.2 Query Processing


Query processing includes the translation of high-level
queries into low-level expressions that can be used at the
physical level of the file system, query optimization, and the
actual execution of the query to get the result.

Fig 7.1 Query Processing

The diagram above illustrates how a query is interpreted in


the database to produce a response. When a query is sent to the
database, the query compiler receives it. The query is then
scanned and divided into individual tokens. The parser checks
the tokens for correctness after they are produced. The
tokenized queries are translated into various relational
expressions, relational trees, and relational graphs (Query
Plans). The query optimizer then selects the best query plan to
process. It searches the device catalog for restrictions and
indexes before determining the best query plan. It produces
various execution plans based on the query plan. The query
execution plan then determines the correct and most efficient
execution plan for execution. The command processor then uses
this execution plan to extract data from the database and return

the output. This is a high-level description of query handling.


Let us take a closer look below.
A traditional query processing cycle consists of four stages:
 Parsing and translation
 Query optimization
 Evaluation or query-code generation
 Execution in the runtime database processor
The detailed diagram is drawn as:

Fig 7.2 Execution in the runtime processor

It is done in the following steps:


 Step-1:
Parser: During the parse call, the database performs the following
checks – syntax check, semantic check, and shared pool check –
after converting the query into relational algebra.


The parser performs the following checks (refer to the detailed
diagram):
1. Syntax check – verifies the syntactic validity of the SQL. Example:
SELECT * FORM employee
Here the error of the wrong spelling of FROM is caught by this check.
2. Semantic check – determines whether the statement is
meaningful or not. For example, a query that refers to a table
that does not exist is caught by this check.
3. Shared pool check – Every query is given a hash code during
its execution. This check determines whether the hash code
already exists in the shared pool; if it does, the database does
not take the additional steps of optimization and code
generation.
Hard Parse and Soft Parse –
If the query is new and its hash code does not exist in the shared
pool, the query has to pass through the additional steps, known
as hard parsing; otherwise, if the hash code exists, the query
skips the additional steps and passes directly to the execution
engine (refer to the detailed diagram). This is known as soft
parsing.
A hard parse includes the following steps – optimization and
row source generation.
 Step-2:
Optimizer: During the optimization stage, the database must
perform a hard parse at least once for every unique DML
statement and performs optimization during this parse. The
database never optimizes DDL unless it includes a DML
component, such as a subquery, that requires optimization.
Optimization is the process in which multiple execution plans
for satisfying a query are examined and the most efficient plan
is selected for execution. The database catalog stores the
execution plans, and the lowest-cost plan is then chosen for
execution.
Row Source Generation –
The row source generator receives the optimal execution plan
from the optimizer and produces an iterative execution plan
usable by the rest of the database. The iterative plan is a binary
program that produces the result set when executed by the SQL
engine.
 Step-3:
Execution Engine: Finally, it runs the query and displays the
required result.
Parsing and Translation
This is the first step of any query processing. The user
usually submits requests in SQL, and the DBMS must translate
them into a low-level, machine-readable form before it can
process and execute them. The query processor first processes
any query sent to the database: it scans and parses the query into
individual tokens and checks the query for correctness. It
validates the tables and views that are being used as well as the
query syntax. Before the query is passed on, each token is
transformed into relational expressions, trees, and graphs, which
the other components of the DBMS can process easily.
Let us look at an example to help us appreciate these
measures. Assume the user wishes to view the student
information for the DESIGN 01 class. The DBMS would not
understand if users say, "Retrieve Student Information who are
in DESIGN 01 class." As a result, DBMS offers a vocabulary –
SQL – that both users and DBMS can understand and interact in.


This SQL is written in plain English that all parties can read. As
a result, the user will write his request in SQL as follows:
SELECT STD_ID, STD_NAME, ADDRESS, DOB
FROM STUDENT s, CLASS c
WHERE s.CLASS_ID = c.CLASS_ID
AND c.CLASS_NAME = ‘DESIGN_01’;
As he issues this query, the DBMS reads it and transforms it
into a format that the DBMS can use to further process and
synthesize it. This is the parsing and translation step of query
processing. The query processor scans the submitted SQL query
and partitions it into individual meaningful tokens. The various
tokens in our example are ‘SELECT * FROM', ‘STUDENT s',
‘CLASS c', ‘WHERE', ‘s.CLASS ID = c.CLASS ID', ‘AND', and
‘c.CLASS NAME = ‘DESIGN 01'. The processor can simply use
these tokenized query types to continue processing. It executes a
query on the data dictionary tables to determine if the tables and
columns in these tokens exist or not. If they are not in the data
dictionary, the submitted query would fail at this stage.
Otherwise, it checks to see if the syntax in the query is valid.
Please keep in mind that it does not check whether or not
DESIGN 01 resides in the table; rather, it verifies whether or not
'SELECT * FROM', 'WHERE','s.CLASS ID = c.CLASS ID', 'AND',
and other SQL-defined syntaxes are included. It transforms the
syntaxes into relational algebra, relational tree, and graph
representations after validating them. These are simple to
comprehend and are managed by the optimizer for further
processing. The query above can be translated into one of the
two relation algebra forms below. The first query recognizes the
students in the DESIGN 01 class and extracts only the requested
columns. Another query first extracts the desired columns from

the STUDENT table before filtering for DESIGN 01. Both


provide the same outcome.
∏ STD_ID, STD_NAME, ADDRESS, DOB (σ CLASS_NAME
= 'DESIGN_01' (STUDENT ⋈ CLASS))
or
σ CLASS_NAME = 'DESIGN_01' (∏ STD_ID, STD_NAME,
ADDRESS, DOB (STUDENT ⋈ CLASS))
This can also be represented in relational structures like tree
and graphs as below:

Fig 7.3 Relational Structures

Fig 7.4 Tree and graphs



The query processor then applies rules and algorithms to
these relational constructs to derive more efficient and effective
structures for the DBMS. These structures depend on the
mappings between tables, the joins used, and the cost of
executing the corresponding algorithms. The optimizer decides
which processing structure – selecting and then projecting, or
projecting and then selecting – is the most effective, when to
apply filters, and so on. The best structure and strategy chosen
by the optimizer are then executed in the third stage of query
processing, which accesses the database storage to retrieve the
records according to the plan. The DBMS may also compile the
query and store the plan in the database for reuse by the runtime
database processor. The user is then given the result. These are
the steps the DBMS takes whenever a query, simple or complex,
is issued. All of these steps take only a fraction of a second, but
good optimization and path selection can make the query run
much faster.

7.3 Measures of Query Cost


The cost of a query is the time taken by the query to hit the
database and return the result. It includes the query processing
time, i.e., the time taken to parse and translate the query,
optimize it, evaluate it, execute it, and return the result to the
user. Although it is measured in seconds, it is made up of
multiple sub-tasks and the time taken by each. Executing the
optimized query involves accessing primary and secondary
memory based on the file organization method; depending on
the file organization and the indexes used, the time taken to
retrieve the data may vary.


The query spends the majority of its time accessing data from
memory. Several factors determine the access time – disk I/O
time, CPU time, network access time, etc. Disk access time is the
time the processor takes to search for and find a record in
secondary memory and return the result; it takes the majority of
the time while processing a query, so the other times can be
ignored compared to disk I/O time.
While calculating the disk I/O time, only two factors are usually
considered – seek time and transfer time. The seek time is the
time the processor takes to locate a single record in disk memory
and is represented by tS. For example, to find the student ID of a
student 'John', the processor accesses the memory based on the
index and the file organization method; the time taken to reach
the disk block and search for his ID is the seek time. The time
taken by the disk to return the fetched result to the processor /
user is called the transfer time and is represented by tT.
Suppose a query needs to seek S times to fetch a record and B
blocks must be returned to the user. Then the disk I/O cost is
calculated as below
(S* tS)+ (B* tT)
It is the sum of the total time taken for S seeks and the total time
taken to transfer B blocks. Other costs such as CPU and RAM are
ignored here, as they are comparatively small, and disk I/O
alone is considered the cost of the query. However, we also have
to consider the worst-case cost – the maximum time the query
can take, for instance when the buffer is full or no buffers are
available. The memory space available as buffers depends on the
number of queries executing in parallel; since all queries share
the buffers, the number of buffer blocks available to our query is
unpredictable, and the processor might have to wait until it gets
all the memory blocks it needs.
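As a small worked example with assumed values: if the seek time tS is 4 ms, the block transfer time tT is 0.1 ms, and a query needs S = 5 seeks and must transfer B = 200 blocks, the estimated disk I/O cost is
(5 * 4 ms) + (200 * 0.1 ms) = 20 ms + 20 ms = 40 ms
These numbers are purely illustrative; actual values depend on the disk hardware and the file organization.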

7.4 Selection Operation


We learned in the previous section that the cost of a query
strategy can be estimated by evaluating its overall resource
utilization.
This part looks at how the query execution plan handles the
selection operation.
In most cases, the selection operation is performed by a file scan.
File scans are search algorithms used to locate and retrieve data;
the file scan is the most basic operator used in query processing.
Let us look at how a file scan is used to perform a selection.
File scans and indices are used to perform selections.
In relational database systems, a file scan reads an entire relation,
and it can be used when the whole relation is stored in a single
file. When performing a selection operation on a relation whose
tuples are stored in a single file, the following algorithm is used:
Linear Search: In a linear search, the system scans each record to
check whether it satisfies the given selection condition. An initial
seek is needed to access the first block of the file; if the blocks of
the file are not stored contiguously, extra seeks are needed.
Linear search is the slowest search algorithm, but it is applicable
in all situations: it does not depend on the nature of the selection,
the availability of indices, or the ordering of the file. Other
algorithms, on the other hand, are not applicable in all situations.


7.4.1 Indexed Selection Operation


Index scans are search algorithms that use an index; such
index structures are referred to as access paths, because they
provide a path through which the data in the file can be located
and accessed. The following algorithms use an index in query
evaluation:
 Primary index, equality on a key: We use the index to retrieve a
single record that satisfies the equality condition. The equality
comparison is performed on the key attribute that carries the
primary index.
 Primary index, equality on a non-key: The difference from
equality on a key is that multiple records may be retrieved.
When the selection condition specifies an equality condition on
a non-key attribute, we can retrieve multiple records using the
primary index.
 Secondary index, equality on key or non-key: A selection that
specifies an equality condition can use a secondary index.
Using this strategy, we retrieve a single record when the
equality condition is on a key, or several records when it is on a
non-key. When a single record is retrieved, the time cost is the
same as for a primary index. When multiple records are
retrieved, they may reside on different blocks, which results in
one I/O operation per fetched record, and each I/O operation
requires a seek and a block transfer.

7.4.2 Comparisons in Selection Operations


To perform a selection based on a comparison in a relation,
we can use either a linear search or indices in the following
ways:
 Primary index, comparison: When the selection condition
involves a comparison, we can use a primary ordered index,
such as a primary B+-tree index. For example, when an
attribute A of a relation R is compared to a given value v, as in
A > v, we use the primary index on A to locate the first tuple
satisfying the condition directly; the file scan then continues
from that tuple to the end of the file and returns all tuples that
fulfil the selection condition.
 Secondary ordered index, comparison: A secondary ordered
index is used to satisfy selection operations involving <, <=, >
or >=. In this case the scan reads the blocks of the lowest-level
index.
(<, <=): the index is scanned from the smallest value up to v.
(>, >=): the index is scanned from v up to the largest value.
However, the secondary index should only be used to select a
few records. Such an index contains pointers to each record,
allowing the user to retrieve the records using those pointers.
Since the records may be stored on different disk blocks,
retrieving each of them may require a separate I/O operation.
As a result, the secondary index becomes costly if the number
of fetched records is high.

7.4.3 Complex Selection Operations


Implementing more complex selections involves three
selection predicates: conjunction, disjunction, and negation.
Conjunction: A conjunctive selection has the following
form:
σθ1 ∧ θ2 ∧ … ∧ θn (r)
A conjunction is the intersection of all records that satisfy the
individual selection conditions θi.
Disjunction: A disjunctive selection has the following form:
σθ1 ∨ θ2 ∨ … ∨ θn (r)
A disjunction is the union of all records that satisfy at least one
of the individual selection conditions θi.
Negation: The result of a selection σ¬θ (r) is the set of tuples of
the given relation r for which the selection condition θ evaluates
to false. Provided there are no nulls, this set is simply the set of
tuples of r that are not in σθ (r).
We can execute the selection operations using the following
algorithms built on the selection predicates above:
 Conjunctive selection using one index: In this implementation,
we first check whether an access path is available for one of the
attributes. If one is found, an index-based algorithm can be
used for that condition. The selection is then completed by
verifying that each retrieved record satisfies the remaining
simple conditions. The cost of the chosen index algorithm gives
the cost of this algorithm.
 Conjunctive selection using a composite index: A composite
index is built on several attributes. Such an index may exist for
certain conjunctive selections. If the selection specifies an
equality condition on two or more attributes and a composite
index exists on these combined attribute fields, the index can be
searched directly. The type of index determines which of the
index algorithms is appropriate.
 Conjunctive selection by intersection of identifiers: This
implementation uses record pointers or record identifiers. It
requires indexes with record pointers on the attributes involved
in the individual conditions. Each index is scanned for pointers
to tuples that satisfy the corresponding individual condition.
The intersection of all the retrieved pointers is the set of
pointers to the tuples that satisfy the conjunctive condition. The
algorithm then uses these pointers to retrieve the actual records.
If indexes are not available for some conditions, the retrieved
records are tested against those remaining conditions.
 Disjunctive selection by union of identifiers: This algorithm
scans each available index for pointers to tuples that satisfy the
individual condition. However, it applies only if access paths
are available for all of the conditions in the disjunction. The
union of all the retrieved pointer sets then gives the set of
pointers to all tuples that satisfy the disjunctive condition, and
these pointers are used to retrieve the actual records. If an
access path is not available for even one condition, we have to
use a linear search to find the tuples that satisfy that condition;
in that case, it is preferable to use a single linear search to
evaluate all of the conditions.
7.5 Cost Estimation
The total cost of an algorithm is obtained by adding the
costs of the individual index scans and of fetching the records in
the intersection of the retrieved lists of pointers. We can reduce
this cost by sorting the list of pointers and retrieving the records
in sorted order. This gives two useful observations for cost
estimation:
 Since the pointers in a block occur together, all the selected
records in a block can be retrieved with a single I/O operation.
 When blocks are read in sorted order, the disk-arm movement
is reduced.
Chart of cost estimation for different selection algorithms:
br denotes the number of blocks in the file.
hi denotes the height of the index.
b denotes the number of blocks containing records matching the
given search key.
n denotes the number of records retrieved.

Selection Algorithm – Cost – Why So?
1. Linear Search: tS + br * tT – It needs one initial seek plus br
block transfers.
2. Linear Search, Equality on Key: tS + (br/2) * tT – This is the
average case, where only one record satisfies the condition, so
the scan terminates as soon as it is found.
3. Primary B+-tree index, Equality on Key: (hi + 1) * (tT + tS) –
Each I/O operation needs one seek and one block transfer to
fetch the record by traversing the height of the tree.
4. Primary B+-tree index, Equality on a Nonkey: hi * (tT + tS) +
b * tT – It needs one seek and one block transfer for each level of
the tree, plus b block transfers for the blocks containing the
matching records.
5. Secondary B+-tree index, Equality on Key: (hi + 1) * (tT + tS) –
Each I/O operation needs one seek and one block transfer to
fetch the record by traversing the height of the tree.
6. Secondary B+-tree index, Equality on Nonkey: (hi + n) * (tT +
tS) – It requires one seek and one block transfer per retrieved
record, because each record may be on a different block.
7. Primary B+-tree index, Comparison: hi * (tT + tS) + b * tT – It
needs one seek and one block transfer for each level of the tree,
plus b block transfers for the matching blocks.
8. Secondary B+-tree index, Comparison: (hi + n) * (tT + tS) – It
requires one seek and one block transfer per retrieved record,
because each record may be on a different block.
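As a worked comparison with assumed values (tS = 4 ms, tT = 0.1 ms, index height hi = 3, file size br = 10,000 blocks): a primary B+-tree lookup with equality on the key costs (hi + 1) * (tT + tS) = 4 * 4.1 ms = 16.4 ms, whereas a linear search over the whole file costs tS + br * tT = 4 ms + 1,000 ms = 1,004 ms. The figures are illustrative only, but they show why an index-based algorithm is normally preferred when it is applicable and only a few records qualify.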

7.6 Hash Join Algorithm


The Hash Join algorithm performs the natural join or equi
join operations. The concept behind the Hash join algorithm is to
partition the tuples of each given relation into sets. The partition
is based on the same hash value on the join attributes. The hash
function provides the hash value. The main goal of using the
hash function in the algorithm is to reduce the number of
comparisons and increase the efficiency to complete the join
operation on the relations.
For example, suppose a and b are two tuples of r and s,
respectively, that satisfy the join condition; this means they have
the same value for the join attributes. If that common value has
hash value i, then tuple a must be in partition ri and tuple b in
partition si. Therefore, we only need to compare the r tuples in ri
with the s tuples in si; we do not need to compare them with the
tuples of any other partition. This is how the hash join operation
works.
Algorithm of Hash Join

Hash Join Algorithm


//Partition s//
for each tuple ts in s do begin
i = h(ts [JoinAttrs]);
Hsi = Hsi U {ts};
end
//Partition r//
for each tuple tr in r do begin
i = h(tr[JoinAttrs]);
Hri = Hri U {tr};
end
//Perform the join operation on each partition//

for i= 0 to nh do begin
read Hsi and build an in-memory hash index on it;
for each tuple tr in Hri do begin
probe the hash index on Hsi to locate all tuples
such that ts[JoinAttrs] = tr[JoinAttrs];
for each matching tuple ts in Hsi do begin
add tr ⋈ ts to the result;
end
end
end
The hash join algorithm above computes the natural join of two
given relations r and s. Various terms are used in the algorithm:
tr ⋈ ts: the concatenation of the attributes of tuples tr and ts,
followed by projecting out the repeated attributes.
tr and ts: tuples of relations r and s, respectively.
Let us understand the hash join algorithm with the following
steps:
Step 1: In the algorithm, firstly, we have partitioned both
relations r and s.
Step 2: After partitioning, we perform a separate indexed
nested-loop join on each partition pair i using for loop as i = 0 to
nh.
Step 3: For performing the nested-loop join, it initially creates a
hash index on each si and then probes with tuples from ri. In the
algorithm, relation r is the probe input, and relation s is
the build input.
A benefit of the hash join algorithm is that the hash index on si is
built in memory, so we do not need to access the disk to fetch the
tuples. It is good to use the smaller input relation as the build
relation.
7.7 Recursive Partitioning in Hash Join
Recursive partitioning means that the system repeats the
partitioning of the input until each partition of the build input
fits into memory. Recursive partitioning is needed when the
value of nh is greater than or equal to the number of available
memory blocks: the relation cannot be split in one pass, since
there are not enough buffer blocks, so it is split over repeated
passes. In each pass, the input is split into as many partitions as
there are blocks available for use as output buffers. Each bucket
produced by one pass is read in separately and partitioned again
in the next pass to create smaller partitions, and the hash
functions used in different passes are different. Recursive
partitioning is therefore used to handle such cases.

7.8 Overflows in Hash Join


The overflow condition in hash-table occurs in any
partition i of the build relation s due to the following cases:
Case 1: The overflow condition occurs when the hash index on
si is greater than the main memory.
Case 2: There are multiple tuples in the build relation with the
same values for the join attributes.
Case 3: The hash function does not hold randomness and
uniformity characteristics.
Case 4: When some partitions have more tuples than the
average and others have fewer tuples, such partitioning is
skewed.
7.8.1 Handling Overflows


We can handle such cases of hash-table overflows using various


methods.
o Using a Fudge Factor
We can handle a small amount of skew by increasing the number
of partitions using a fudge factor. The fudge factor is a small
value that increases the number of partitions, which helps to
keep the expected size of each partition, including its hash index,
below the memory size. Even with a fudge factor, which makes
us conservative about the size of the partitions, overflows are
still possible. Thus, a fudge factor is suitable for handling small
overflows, but it is not sufficient for handling large overflows in
the hash table.
As a result, we have two more methods for handling overflows.

Fig 7.5 Handling Overflows

1. Overflow Resolution
The overflow resolution method is applied when a hash-index
overflow is detected during the build phase. It works in the
following way:
If any build partition si is found to be larger than the available
memory, it is partitioned again into smaller partitions using a
different hash function. The corresponding probe partition ri is
partitioned with the same new hash function, and only tuples in
matching partitions are joined.
This is a less careful approach because it waits for an overflow to
occur and only then takes the necessary actions to resolve it.
2. Overflow Avoidance
The overflow avoidance method uses a careful approach while
partitioning in order to avoid the occurrence of overflow in the
build phase. The overflow avoidance works in the following
way:
It initially partitions the build relation s into several small
partitions and then combines some of the partitions. These
partitions are combined so that each combined partition fits in
the memory. Similarly, it partitions the probe relation r as the
combined partitions on s. However, the size of ri does not matter
in this method.
Both overflow resolution and overflow avoidance methods may
fail on some partitions if many tuples in s have the same value
for the join attributes. In such a case, it is better to use block
nested-loop join rather than applying the hash join technique for
completing the join operation on those partitions.
Cost Analysis of Hash Join
For analyzing the cost of a hash join, we consider that no
overflow occurs in the hash join. We will consider only two
cases where:


1. Recursive partitioning is not needed
We need to read and write relations r and s completely in order
to partition them, which requires 2(br + bs) block transfers, where
br and bs are the number of blocks holding records of relations r
and s, respectively. The build and probe phases read each of the
partitions once, requiring a further br + bs block transfers. The
partitions might occupy slightly more than br + bs blocks because
of partially filled blocks; accessing such partially filled blocks can
add an overhead of approximately 2nh for each relation. Thus,
the hash join cost estimate needs:
Number of block transfers = 3(br + bs) + 4nh
Here, the overhead 4nh can be neglected, since it is much smaller
than br + bs.
Number of disk seeks = 2(⌈br/bb⌉ + ⌈bs/bb⌉) + 2nh
Here, we assume that bb blocks are allocated to each input
buffer. The build and probe phases need only one seek for each
of the nh partitions of each relation, as each partition can be read
sequentially.
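As an illustration with assumed values br = 2,000 blocks, bs = 400 blocks, nh = 25 partitions, and bb = 20 buffer blocks per input (purely hypothetical figures), the estimates become:
Number of block transfers = 3(2,000 + 400) + 4 * 25 = 7,200 + 100 = 7,300
Number of disk seeks = 2(⌈2,000/20⌉ + ⌈400/20⌉) + 2 * 25 = 2(100 + 20) + 50 = 290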
2. Recursive partitioning is required
In this case, each pass reduces the size of each partition by an expected factor of M - 1, and passes are repeated until each partition is at most M blocks in size. Therefore, partitioning the relation s needs:
Number of passes = ⌈logM-1(bs) - 1⌉
The number of passes required to partition the probe relation is the same. Since in each pass every block of s is read and written out, partitioning s needs a total of 2bs⌈logM-1(bs) - 1⌉ block transfers. Thus, the hash join is estimated to need:
Number of block transfers = 2(br + bs)⌈logM-1(bs) - 1⌉ + br + bs

Number of disk seeks = 2(⌈br/bb⌉ + ⌈bs/bb⌉)⌈logM-1(bs) - 1⌉
Here, we assume that bb blocks are allocated for buffering each partition. We have also neglected the relatively small number of seeks during the build and probe phases.
As a result, the hash join algorithm can be further improved if
the size of the main memory increases or is large.
Hybrid Hash Join
It is a type of hash join useful for performing the join operations
in which the memory size is relatively large. However, the build
relation still does not completely fit in the memory. So, the
hybrid hash join algorithm resolves the drawback of the hash
join algorithm.

7.9 Evaluation of Expressions


In our previous sections, we understood various
concepts in query processing. We learned about the query
processing steps, selection operations, and several types of
algorithms used to perform the join operation with their cost
estimations.
We are already aware of computing and representing the
individual relational operations for the given user query or
expression. Here, we will learn how to compute and evaluate an
expression with multiple operations.
For evaluating an expression that carries multiple operations,
we can perform the computation of each operation one by one.
However, we use two methods for evaluating an expression
carrying multiple operations in the query processing system.
These methods are:
1. Materialization
2. Pipelining

Let's take a brief discussion of these methods.


Materialization
In this method, the relational operations in the given expression are evaluated one at a time, in an appropriate order. The output of each operation is materialized in a temporary relation for subsequent use. This is the disadvantage of the materialization method: it must construct temporary relations to hold the results of the evaluated operations, and these temporary relations are written to disk unless they are small.
Pipelining
Pipelining is an alternate method or approach to the
materialization method. In pipelining, it enables us to
simultaneously evaluate each relational operation of the
expression in a pipeline. After evaluating one operation, its
output is passed on to the next operation in this approach. The
chain continues until all the relational operations are evaluated
thoroughly. Thus, there is no requirement of storing a
temporary relation in pipelining. This advantage makes pipelining a better approach than materialization in many cases. However, the costs of the two approaches can differ, and each performs best in different situations, so both methods are feasible in their place.
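As a simple illustration (the table and column names are assumed), consider a query whose evaluation involves a selection, a join, and a projection:
select s.name, c.class_name
from student s join class c on c.class_id = s.class_id
where s.age = 18;
Under materialization, the result of the selection on student would be stored in a temporary relation before the join; under pipelining, each selected tuple is passed directly to the join operator as it is produced.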

7.10 Cost Estimation of Materialized Evaluation


The process of estimating the cost of the materialized
evaluation is different from the process of estimating the cost of
an algorithm. In analyzing the cost of an algorithm, we do not


include the cost of writing the results to disk. But in evaluating an expression, we compute the cost of all operations, including the cost of writing the result of the currently evaluated operation to disk.
To estimate the cost of materialized evaluation, we assume that the results are collected in a buffer and that, when the buffer fills, they are written to disk.
Let br be the total number of blocks written. We can estimate br as:
br = nr / fr
Here, nr is the estimated number of tuples in the result relation r, and fr is the number of records of r that fit in a block; that is, fr is the blocking factor of the result relation r.
We also need to account for disk seeks when estimating the cost, because the disk head may move between successive writes of blocks. Thus, we can estimate:
Number of seeks = ⌈br / bb⌉
Here, bb is the size of the output buffer, measured in blocks.
We can optimize the cost estimation of the materialization
process by using the concept of double buffering. Double
buffering uses two buffers, where one buffer executes the
algorithm continuously, and the other is written out. It makes
the algorithm execute more quickly by performing CPU
activities parallel with I/O activities. We can also reduce the
number of seeks by allocating the extra blocks to the output
buffer and writing multiple blocks.
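As a small worked example (the values are assumed), suppose the result relation has nr = 10000 tuples, fr = 50 records fit in a block, and the output buffer holds bb = 4 blocks:
br = nr / fr = 10000 / 50 = 200 blocks written
Number of seeks = ⌈br / bb⌉ = ⌈200 / 4⌉ = 50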


CHAPTER 8
QUERY OPTIMIZATION

8.1 Introduction
We have seen how a query can be processed based on
indexes and joins, and how they can be transformed into
relational expressions. The query optimizer uses these two
techniques to determine which process or expression to consider
for evaluating the query.
8.2 Types of Query Optimization
There are two methods of query optimization
8.2.1 Cost-based Optimization (Physical)
This is based on the cost of the query. The query can use
different paths based on indexes, constraints, sorting methods
etc. This method mainly uses the statistics like record size,
number of records, number of records per block, number of
blocks, table size, whether whole table fits in a block,
organization of tables, uniqueness of column values, size of
columns etc.
Suppose we have a series of tables joined in a query:
T1 ⋈ T2 ⋈ T3 ⋈ T4 ⋈ T5 ⋈ T6
For the above query we can use any order of evaluation; we can start with any two tables, in any order, and evaluate the query from there. In general, the joins can be combined in (2(n-1))! / (n-1)! ways. For example, if 5 tables are involved in the join, there are 8! / 4! = 1680 combinations. However, when the query optimizer runs, it does not evaluate all of these orders. It uses dynamic programming to generate the costs of the join orders for any combination of tables; they are calculated and generated only once. This least cost for all the table

combinations is stored in the database and used for future use.


i.e.; say we have a set of tables, T = { T1 , T2 , T3 .. Tn}, then it
generates least cost combination for all the tables and stores it.
8.2.2 Dynamic Programming
As we learnt above, the least cost for joining any table
combination is generated here. These values are stored in the
database. When those tables are used in the query, this
combination is selected to evaluate the query.
While generating the costs, it follows the steps below:
Suppose we have a set of tables, T = {T1, T2, T3, ..., Tn}, in a database. The optimizer picks the first table and computes the cost of joining it with each of the remaining tables in T; it calculates the cost for each combination, keeps the best one, and continues in the same way with the rest of the tables in T. In all it generates 2^n - 1 cases, selects the lowest cost for each, and stores it. When a query uses those tables, the stored costs are checked and the best combination is used to evaluate the query. This is called dynamic programming.
In this method, the time required to find the optimized join order is on the order of 3^n, where n is the number of tables. Suppose we have 5 tables; then the time required, 3^5 = 243, is less than enumerating all combinations of tables and then deciding the best one (1680). Also, the space required for computing and storing the costs is on the order of 2^n; in the above example it is 2^5 = 32.
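The recurrence behind this dynamic programming can be sketched as follows (a standard formulation; the names bestcost, scancost, and joincost are ours, not from the text):
bestcost(S) = scancost(Ti), if S = {Ti}
bestcost(S) = min over all non-empty proper subsets S1 of S of
              [ bestcost(S1) + bestcost(S - S1) + joincost(S1, S - S1) ], otherwise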
Left Deep Trees
This is another method of determining the cost of the joins. Here, the tables and joins are represented in the form of a tree. A join always forms the root of the tree, with a base table as its right child; the left child of each join is either a base table or the next join. Hence the tree grows deeper and deeper on the left-hand side, and it is called


as left deep tree.

Fig 8.1 Left Deep Trees

Instead of calculating the best join cost for every set of tables, the best cost of joining one more table to an existing left-deep plan is calculated. In this method, the time required to find the optimized join order is on the order of n x 2^n, where n is the number of tables. Suppose we have 5 tables; then the time required, 5 x 2^5 = 160, is less than for full dynamic programming. The space required for computing and storing the costs is on the order of 2^n; in the above example it is 2^5 = 32, the same as for dynamic programming.
 Interesting Sort Orders
This method is an enhancement to dynamic programming.
Here, while calculating the best join order costs, it also considers


the sorted order of tables. It assumes that calculating join orders on sorted inputs would be efficient. For example, suppose we have unsorted tables T1, T2, T3, ..., Tn and a join on these tables:
(T1 ⋈ T2) ⋈ T3 ⋈ … ⋈ Tn
This method uses either the hash join or the merge join method to calculate the cost. A hash join simply joins the tables, whereas the merge join produces sorted output but is costlier than the hash join. Even though the merge join is costlier at this stage, when its result moves on to the join with the third table, that join needs less effort to sort its inputs, because its first input is already the sorted result of the first two tables. Hence it can reduce the total cost of the query. However, when the number of tables involved in the join is relatively small, this cost/space difference is hardly noticeable.
All these cost-based optimizations are expensive to compute and are suitable when large amounts of data are involved. There is another method of optimization, called heuristic optimization, which is much cheaper to apply than cost-based optimization.

8.2.3 Heuristic Optimization (Logical)


This method is also known as rule-based optimization. It is based on equivalence rules over relational expressions; applying them reduces the number of query combinations to consider and hence the cost of the query.
The method creates a relational-algebra tree for the given query and rewrites it using the equivalence rules. By providing alternative ways of writing and evaluating the query, these rules suggest a better path for evaluation. A given rewrite is not guaranteed to be cheaper in every case, so the result still needs to be examined after the rules are applied.


The most important set of rules followed in this method is listed


below:
 Perform all the selection operations as early as possible in the query. This should be the first and foremost set of actions on the tables in the query. By performing the selection operations first, we reduce the number of records involved in the query rather than working with whole tables.
Suppose we have a query to retrieve the students aged 18 who are studying in class DESIGN_01. We can get the student details from the STUDENT table and the class details from the CLASS table. We can write this query in two different ways, as sketched below.
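A minimal SQL sketch of the two forms (column names such as age and class_name are assumed for illustration):
-- Form 1: join first, then filter
select s.*, c.*
from student s join class c on c.class_id = s.class_id
where s.age = 18 and c.class_name = 'DESIGN_01';
-- Form 2: filter each table first, then join
select s.*, c.*
from (select * from student where age = 18) s
join (select * from class where class_name = 'DESIGN_01') c
  on c.class_id = s.class_id;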

Both queries return the same result. But when we observe them closely, we can see that the first query joins the two tables and then applies the filters; it traverses the whole tables to join, so the number of records involved is larger. The second query applies the filters on each table first, which reduces the number of records in each table (in the CLASS table, the number of records reduces to one in this case!), and only then joins these intermediate tables. Hence the cost in this case is comparatively less.
Rather than rewriting the query text, the optimizer creates the corresponding relational algebra expressions and trees for the above case.


Fig 8.2 Relational algebra trees for the query (the second tree has lower cost)

 Perform all the projections as early as possible in the query. This is similar to the selection rule, but it reduces the number of columns involved in the query.
For example, suppose we have to select only the student name, address, and class name of students aged 18 from the STUDENT and CLASS tables; the two forms are sketched below.
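A minimal SQL sketch of the two forms (column names assumed for illustration):
-- Form 1: join the full rows, project at the end
select s.std_name, s.address, c.class_name
from student s join class c on c.class_id = s.class_id
where s.age = 18;
-- Form 2: project (and filter) early, so the join carries only the needed columns
select s.std_name, s.address, c.class_name
from (select std_name, address, class_id from student where age = 18) s
join (select class_id, class_name from class) c on c.class_id = s.class_id;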
Here again, both queries look alike and return the same results. But when we compare the number of records and attributes involved at each stage, the second query handles fewer records and attributes and, hence, is more efficient.
 The next step is to perform the most restrictive joins and selection operations first. By most restrictive we mean those joins and selections over tables and views that result in comparatively fewer records. Any query performs better when tables with fewer records are joined. Hence, throughout the heuristic optimization method, the rules are designed to produce fewer records at each stage, making query performance better. That is the case here too.


Suppose we have STUDENT, CLASS and TEACHER tables. Any


student can attend only one class in an academic year and only
one teacher takes a class. But a class can have more than 50
students. We have to retrieve STUDENT_NAME, ADDRESS,
AGE, CLASS_NAME and TEACHER_NAME of each student in
a school.
∏ STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME ((STUDENT ⋈ CLASS_ID CLASS) ⋈ TECH_ID TEACHER)
(not so efficient)
∏ STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME (STUDENT ⋈ CLASS_ID (CLASS ⋈ TECH_ID TEACHER))
(efficient)
In the first expression, the student records are joined with the classes first. This results in a very large intermediate table, which is then joined with the small TEACHER table, so the number of records traversed is large. In the second expression, CLASS and TEACHER are joined first, which is a one-to-one relationship here, so the intermediate result has only a few records; joining it with the STUDENT table then gives the final result. Hence the second method is more efficient.
 Sometimes we can combine the above heuristic steps with a cost-based optimization technique to get better results.
These rules do not always hold; the outcome also depends on the table sizes, column sizes, the types of selection, projection, join, and sort, and on constraints, indexes, statistics, and so on. The optimizations above describe generally good ways of optimizing queries.


8.3 Transforming Relational Expressions


The first step of the optimizer says to implement such
expressions that are logically equivalent to the given expression.
We use the equivalence rule that describes the method to
transform the generated expression into a logically equivalent
one for implementing such a step.
There are different ways to express a query, and they can have different costs. To express a query efficiently, we therefore learn to create alternative, equivalent expressions of the given expression instead of working only with the expression as given.
Two relational-algebra expressions are equivalent if both expressions produce the same set of tuples on every legal database instance. A legal database instance is one that satisfies all the integrity constraints specified in the database schema. The sequence of the generated tuples may vary between the two expressions; they are still considered equivalent as long as they produce the same set of tuples.
8.4 Equivalence Rules
The equivalence rule says that expressions of two forms
are the same or equivalent because both expressions produce
the same outputs on any legal database instance. It means that
we can replace the expression of the first form with that of the
second form and replace the expression of the second form with
an expression of the first form. Thus, the optimizer of the query-
evaluation plan uses such an equivalence rule or method for
transforming expressions into the logically equivalent one.
The optimizer uses various equivalence rules on relational-
algebra expressions for transforming the relational expressions.
For describing each rule, we will use the following symbols:
θ, θ1, θ2 … : Used for denoting the predicates.

L1, L2, L3 … : Used for denoting the list of attributes.


E, E1, E2 …. : Represents the relational-algebra expressions.
Let us discuss several equivalence rules:
Rule 1: Cascade of σ
This rule states the deconstruction of the conjunctive selection
operations into a sequence of individual selections. Such a
transformation is known as a cascade of σ.
σ θ1 ᴧ θ 2 (E) = σ θ1 (σ θ2 (E))
Rule 2: Commutative Rule
a) states that selections operations are commutative.
σ θ1 (σ θ2 (E)) = σ θ2 (σ θ1 (E))
b) Theta Join (θ) is commutative.
E 1 ⋈ θ E 2 = E 2 ⋈ θ E 1 (θ is in subscript with the join symbol)
However, in the case of theta join, the equivalence rule does not
work if the order of attributes is considered. Natural join is a
special case of Theta join, and natural join is also commutative.
Rule 3: Cascade of ∏
This rule states that we only need the final operations in the
sequence of the projection operations, and other operations are
omitted. Such a transformation is referred to as a cascade of ∏.
∏L1 (∏L2 (. . . (∏Ln (E)) . . . )) = ∏L1 (E)
Rule 4: We can combine the selections with Cartesian products
as well as theta joins
1. σ θ (E1 × E2) = E1 ⋈ θ E2
2. σ θ1 (E1 ⋈ θ2 E2) = E1 ⋈ θ1∧θ2 E2

Rule 5: Associative Rule


a) This rule states that natural join operations are associative.
(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
b) Theta joins are associative for the following expression:
(E 1 ⋈ θ1 E 2 ) ⋈ θ2ᴧθ3 E 3 = E 1 ⋈ θ1ᴧθ3 (E 2 ⋈ θ2 E 3 )
In the theta associativity, θ2 involves the E2 and E3 only. There
may be chances of empty conditions, thereby concluding that
Cartesian Product is also associative.
Rule 6: Distribution of the Selection operation over the Theta
join.
Under two following conditions, the selection operation gets
distributed over the theta-join operation:
a) When all attributes in the selection condition θ0 include only
attributes of one of the expressions which are being joined.
σ θ0 (E 1 ⋈ θ E 2 ) = (σ θ0 (E 1 )) ⋈ θ E 2
b) When the selection condition θ1 involves the attributes of
E1 only, and θ2 includes the attributes of E2 only.
σ θ1∧θ2 (E1 ⋈ θ E2) = (σ θ1 (E1)) ⋈ θ (σ θ2 (E2))
Rule 7: Distribution of the projection operation over the theta
join.
Under two following conditions, the selection operation gets
distributed over the theta-join operation:
a) Assume that the join condition θ includes only L1 ∪ L2 attributes of E1 and E2. Then, we get the following expression:
∏ L1∪L2 (E1 ⋈ θ E2) = (∏ L1 (E1)) ⋈ θ (∏ L2 (E2))
b) Assume a join E1 ⋈ E2. Both expressions E1 and E2 have sets of attributes L1 and L2. Assume two attributes, L3 and L4, where L3 is an attribute of the expression E1, involved in the θ join condition but not in L1 ∪ L2. Similarly, L4 is an attribute of


the expression E2 involved only in the θ join condition and not in


L1 ∪ L2 attributes. Thus, we get the following expression:
∏ L1∪L2 (E1 ⋈ θ E2) = ∏ L1∪L2 ((∏ L1∪L3 (E1)) ⋈ θ (∏ L2∪L4 (E2)))
Rule 8: The union and intersection set operations are
commutative.
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1
However, set difference operations are not commutative.
Rule 9: The union and intersection set operations are
associative.
(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
(E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
Rule 10: Distribution of selection operation on the intersection,
union, and set difference operations.
The below expression shows the distribution performed over
the set difference operation.
σ p (E1 − E2) = σ p (E1) − σ p (E2)
We can similarly distribute the selection operation over ∪ and ∩ by replacing − with those operators. Further, for the set difference we also get:
σ p (E1 − E2) = σ p (E1) − E2
Rule 11: Distribution of the projection operation over the union
operation.
This rule states that we can distribute the projection operation
on the union operation for the given expressions.
∏ L (E1 ∪ E2) = (∏ L (E1)) ∪ (∏ L (E2))
Apart from these discussed equivalence rules, there are various
other equivalence rules also.
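As a small illustration (the relations and predicates here are assumed for this sketch), Rule 6(b) lets us push selections below a join when each predicate refers to only one side: if salary is an attribute of EMP only and budget an attribute of DEPT only, then
σ salary>5000 ∧ budget>100000 (EMP ⋈ DEPT) = (σ salary>5000 (EMP)) ⋈ (σ budget>100000 (DEPT))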
8.5 Estimating Statistics of Expression results in DBMS
In order to determine the ideal plan for evaluating the
query, it checks various details about the tables stored in the


data dictionary. This information about tables is collected when


a table is created and when various DDL / DML operations are
performed. The optimizer checks the data dictionary for :
 The total number of records in a table, nr. This will help to
determine which table needs to be accessed first. Usually,
smaller tables are executed first to reduce the size of the
intermediary tables. Hence it is one of the important factors to
be checked.
 The total number of records in each block, fr. This will be useful
in determining the blocking factor and is required to determine
if the table fits in the memory.
 A total number of blocks assigned to a table, br. This is also an
important factor in calculating the number of records assigned
to each block. Suppose we have 100 records in a table and the
total number of blocks is 20, then fr can be calculated as nr/b r =
100/20 = 5.
 The total length of the records in the table, l r. This is an
important factor when the size of the records varies significantly
between any two tables in the query. If the record length is
fixed, there is no significant effect. But when variable-length
records are involved in the query, average length or actual
length needs to be used depending upon the type of operations.
 The number of distinct values of a column, dAr. This is useful when a query uses an aggregation operation or a projection: it helps estimate the number of distinct values selected during a projection, and the number of groups of records produced when an aggregation operation such as SUM, MAX, MIN, or COUNT is used in the query.
 Levels of the index, x. This data provides whether the single
index level like primary key index, secondary key indexes are

used, or multi-level indexes like B+ tree index, merge-sort index,


etc. These index levels will provide details about number of
block access required to retrieve the data.
 Selection cardinality of a column, s A. This is the number of
records present with the same column value. This is calculated
as nr/d Ar. i.e., the total number of records with a distinct value
of A. For example, suppose EMP table has 500 records, and
DEPT_ID has 5 distinct values. Then the selection cardinality of
DEPT_ID in EMP table is 500/ 5 = 100. That means, on average,
100 employees are distributed among each department. This
helps determine an average number of records that would
satisfy selection criteria.
 Many other factors, too, like index type, data file type, sorting
order, type of sorting, etc.
8.6 Choice of Evaluation Plans in DBMS
So far, we have seen how a query is parsed and traversed,
how they are evaluated using different methods, and the
different costs when different methods are used. The important
phase while evaluating a query is deciding which evaluation
plan must be selected to be traversed efficiently. It collects all
the statistics, costs, access/ evaluation paths, relational trees, etc.
It then analyses them and chooses the best evaluation path.
As we saw earlier in this chapter, the same query can be written in different forms of relational algebra. Corresponding
trees for them, too, are drawn by DBMS. Statistics for them
based on cost-based evaluation and heuristic methods are
collected. It checks the costs based on the different techniques
we have seen so far. It checks the operator, joining type, indexes,
number of records, selectivity of records, distinct values, etc.,


from the data dictionary. Once all this information is collected, it


picks the best evaluation plan.

Have a look at the relational algebra expressions and trees below for EMP and DEPT.
∏ EMP_ID, DEPT_NAME (σ DEPT_ID = 10 AND EMP_LAST_NAME = 'Joseph' (EMP) ⋈ DEPT)
Or
∏ EMP_ID, DEPT_NAME (σ DEPT_ID = 10 AND EMP_LAST_NAME = 'Joseph' (EMP ⋈ DEPT))
Or
σ DEPT_ID = 10 AND EMP_LAST_NAME = 'Joseph' (∏ EMP_ID, DEPT_NAME, DEPT_ID (EMP ⋈ DEPT))
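For reference, a SQL sketch of the same query (assuming the join is on DEPT_ID):
select e.emp_id, d.dept_name
from emp e join dept d on d.dept_id = e.dept_id
where e.dept_id = 10 and e.emp_last_name = 'Joseph';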

What can be observed here? The first tree reduces the number of records before joining and seems efficient. But what happens if we have an index on DEPT_ID? Then the join between EMP and DEPT can also be made more efficient. Looking at the filter condition on the EMP table, we have DEPT_ID = 10, which is the indexed column; hence applying the selection condition first and then joining reduces the number of records and makes the join more efficient than it would be without the index. Next are the projected columns, EMP_ID and DEPT_NAME. EMP_ID values are distinct, and since we are selecting only rows with DEPT_ID = 10, DEPT_NAME has only one value; hence the selectivity is the same as the number of employees working in DEPT_ID = 10. But we are selecting only those employees whose last name is

'Joseph'. Hence the selectivity is min(distinct(employee(DEPT_10)), distinct(employee(DEPT_10, JOSEPH))), and obviously distinct(employee(DEPT_10, JOSEPH)) has the smaller value. The optimizer weighs all these factors for the three trees above and decides that the first tree is the most efficient; hence it evaluates the query using the first tree.
This is how any query submitted to the database is traversed and evaluated.
8.7 Advanced Query Optimization
Several topics contribute to the advanced stage of query
optimization.
In this part, we'll go over a couple of them.
8.7.1 Top-K Optimisation
A database system is used to fetch data. Some user queries require the results to be sorted on certain attributes and need only the top K results for some K. Some queries specify the bound K directly, using a limit K clause, which retrieves only the top K results; others do not specify the bound at all. For such queries, the optimizer may be given a hint indicating that only the top K results of the query are needed; it does not matter if the query could generate more results beyond those. When the value of K is small, a query evaluation plan that produces the entire set of results, sorts it, and only then generates the top K results is inefficient, since it is likely to discard most of the computed intermediate results. Therefore, we use several methods to optimize such top-K queries.
Two such methods are:
o Using pipelined query evaluation plans for producing the
results in sorted order.

o Estimating the largest value of the sorted attributes that can appear in the top-K result and introducing selection predicates that eliminate larger values.
If extra tuples beyond the top-K results are generated, they are discarded; if too few tuples are generated and the top-K results are not reached, the selection condition needs to be changed and the query executed again.
Join Minimization
There are different join operations used for processing the given
user query. When queries are generated through views,
computing the query requires joining more relations than the
actual requirement. We need to drop such relations from a joint
to resolve such cases. Such type of solution or method is known
as Join Minimization. We have discussed only one such case.
There are also more numbers of similar cases, and we can apply
the join minimization there also.
Optimization of Updates
An update query makes changes to already persisted data, and it often involves subqueries in its SET and WHERE clauses, which have to be taken into account while optimizing the update. For example, if a user wants to update the score to 97 for the student in a student table whose roll_no is 102, the following update query will be used:
update student set score = 97 where roll_no = 102
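An update of the kind referred to above, with subqueries in both the SET and WHERE clauses, might look like the following sketch (the attendance table and its columns are assumed for illustration):
update student
set score = (select avg(score) from student)
where roll_no in (select roll_no from attendance where days_present < 10);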
However, if the update involves a selection on the column being updated, we must handle the update carefully. If the update is performed while the selection is being evaluated by an index scan, an updated tuple may be re-inserted into the index ahead of the scan and encountered again. Problems can also arise when the update affects the result of subqueries within the update statement.

8.7.2 The Halloween Problem


The problem was named so because it was first identified on Halloween Day at IBM. The Halloween problem is an update that affects the execution of the very query associated with that update. We can avoid this problem by breaking up the execution plan and executing the following steps:
o Executing the queries that define the update first
o Creating a list of affected tuples
o At last, updating the tuples and indices.
Thus, following these steps increases the execution cost of the
query evaluation plan.
We can optimize the update plans by checking if the Halloween
problem can occur. If it cannot occur, perform the update during
the query processing. It, however, reduces the update
overheads. We can understand this with an example: the Halloween problem cannot occur if the update does not affect the attributes on which the index is built. Even if it does, if the update decreases the attribute value while the index is scanned in increasing order, the scan will not encounter the updated tuples again. In such cases the index can be updated even while the query is being executed, which reduces the overall cost and leads to an optimized update.
Another method of optimizing update queries that result in many updates is to collect all the updates as a batch and then apply the batch separately to each affected index. Before an update batch is applied to an index, it is sorted in that index's order; such batch sorting greatly reduces the amount of random I/O needed to update the indices.


Therefore, we can perform such optimization of updates in most


of the database systems.
8.7.3 Multi-Query Optimization and Shared Scans
Multi-query optimization applies when the user submits a batch of queries. The query optimizer exploits common subexpressions between the different queries, evaluating them once and reusing the result wherever it is required. For complex queries, we can likewise exploit common subexpressions within a single query, which reduces the cost of the query evaluation plan. So, we need to optimize the subexpressions shared by different queries. One way of
optimization is the elimination of the common subexpression,
known as Common subexpression elimination. The common
subexpression elimination method optimizes the subexpressions
by computing and storing the result. Further, reusing the result
whenever the subexpressions occur. Only a few databases
exploit common subexpressions among the evaluation plans
selected for each of the batches of queries.
In some database systems, another form of multi-query
optimization is implemented. Such a form of implementation is
known as Sharing of relation scans between the queries.
Understand the following steps to know the working of the
Shared-scan:
o It does not read the relation in a repeated manner from the disk.
o It reads data only once from the disk for every query that needs
to scan a relation.
o Finally, it pipelines to each of the queries.
Such a method of shared-scan optimization is useful when
multiple queries perform a scan on a fact table or a single large
relation.
Parametric Query Optimization

In the parametric query optimization method, query


optimization is performed without specifying its parameter
values. The optimizer outputs several optimal plans for different
parametric values. It outputs the plan only if it is optimal for
some possible parameter values. After this, the optimizer stores
the output set of alternative plans. Then the cheapest plan is
found and selected. Such selection takes very less time than the
re-optimization process. In this way, the optimizer optimizes the
parameters and leads to an optimized and cost-effective output.


CHAPTER 9
SCHEMA REFINEMENT

9.1 Problems Caused by Redundancy:


Storing the same information redundantly, that is, in more than one place within a database, can lead to several problems.
Redundant Storage: Some information is stored repeatedly.
Update Anomalies: If one copy of such repeated data is updated, an inconsistency is created unless all copies are similarly updated.
Insertion Anomalies: It may not be possible to store certain information unless some other, unrelated information is stored.
Deletion Anomalies: It may not be possible to delete certain information without losing some other, unrelated information.
In the Hourly_Emps relation discussed below, for example, we notice such redundancy between rating and hourly_wages: the wage for a given rating is repeated for every employee with that rating.
9.2 Decompositions
Intuitively, redundancy arises when a relational schema forces an association between attributes that is not natural. Functional dependencies can be used to identify such situations and suggest refinements to the schema. The essential idea is that

many problems arising from redundancy can be addressed by replacing a larger relation with a collection of smaller relations, called a decomposition.
We can decompose Hourly_Emps into two relations:
Hourly_Emps2 (ssn, name, lot, rating, hours_worked)
Wages (rating, hourly_wages)

9.3 Problems Related to Decomposition:


Unless we are careful when decomposing, a relation schema can create more problems than it solves. Two important questions must be asked repeatedly:
1. Do we need to decompose a relation?
When the data in a relation is redundant, normalization is needed. There are several normal forms, and in the course of normalization the relation should be decomposed to avoid redundancy.

2. What problems (if any) does a given decomposition cause?

The two properties of decompositions are of particular


interest. The lossless-join property enables us to recover any
instance of the decomposed relation from corresponding
instances of the smaller relations.
The dependency-preservation property enables us to enforce
any constraint on the original relation by simply enforcing

constraints on smaller relations. We need not perform joins of


the smaller relations to check whether a constraint on the
original relation is violated.
From a performance standpoint, queries over the original
relation may require us to join the decomposed relations. In
some situations, decomposition could improve performance.
A good database designer should have a firm grasp of
normal forms and what problems they alleviate, the technique
of decomposition, and potential problems with decompositions.
9.4 Functional Dependencies
Functional dependency is a relationship that exists when
one attribute uniquely determines another attribute. If R is a
relation with attributes X and Y, a functional
dependency between the attributes is represented as X->Y,
which specifies Y is functionally dependent on X.
Determining functional dependencies is important for designing
databases in the relational model and database
normalization and de-normalization. A classic example of
functional dependency is the employee department model. The
following table
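A minimal sketch of such a table, with assumed values, is the following:
create table employee_dept (
    employee_id int primary key,
    employee_name varchar(50),
    department_id int,
    department_name varchar(50)
);
insert into employee_dept values (1, 'Alice', 10, 'Sales');
insert into employee_dept values (2, 'Bob', 10, 'Sales');    -- 'Sales' repeated for department 10
insert into employee_dept values (3, 'Carol', 20, 'Finance');
Here Employee ID → Employee Name and Employee ID → Department ID hold because employee_id is the key, while Department ID → Department Name is the non-key dependency discussed below.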

This case represents an example where multiple functional


dependencies are embedded in a single data representation.
Note that because an employee can only be a member of one


department, the unique ID of that employee determines the


department.
Employee ID → Employee Name
Employee ID → Department ID
In addition to this relationship, the table also has a functional
dependency through a non-key attribute Department ID →
Department Name
This example demonstrates that even though there is an FD
Employee ID → Department ID, the employee ID would not be
a logical key for determining the department ID. The
normalization process would recognize all FD's and allow the
designer to construct tables and relationships that are more
logical based on the data.

Figure the meaning of the FD AB  C by showing an


instance that satisfies this dependency. The first two tuples
show that an FD is not the same as a key constraint: AB is not a
key for the relation, although the FD is not violated. The third
and fourth tuples illustrate that if two tuples differ in either the
A field or the B field, they can differ in the C field without
violating the FD. On the other hand, if we add a tuple (a1, b1, c2,
d1) to the instance shown in this figure, the resulting instance


would violate the FD; to see this violation, compare the first
tuple in the figure with the new tuple.

9.5 Reasoning About FDS


Closure of a Set of FDs: The set of all FDs implied by a
given set F of FDs is called the closure of F, denoted as F+. An
important question is how to infer or compute the closure of a
given set of FDs. Armstrong's Axioms' following three rules can
be applied repeatedly to infer all FDs implied by a set of FDs.
We use X, Y, and Z to denote sets of attributes over a relation

schema R:

When talking about F+, it is easy to apply additional rules:

Trivial Functional dependency:


Consider a relation schema ABC with FDs A B and B C.
In a trivial FD, the right side contains only attributes on the left; such
dependencies always hold due to reflexivity. Using reflexivity,
we can generate all trivial dependencies, which are of the form:

From transitivity, we get an A  C. From augmentation, we get


the nontrivial dependencies:

9.6 Attribute Closure


After finding a set of functional dependencies held on a


relation, the next step is to find the Super key for that relation
(table). The set of identified functional dependencies plays a
vital role in finding the key to the relation. We can decide
whether an attribute (or set of attributes) of any table is a key for
that table or not by identifying the attribute or set of attributes’
closure. If A is an attribute (or set of attributes), its attribute
closure is denoted as A+.
Algorithm:
The following algorithm will help us in finding the closure of an
attribute;

result := A;
while (changes to result) do
    for each functional dependency B → C in F do
    begin
        if B ⊆ result then result := result ∪ C;
    end

Let us discuss this algorithm with an example;


Assume a relation schema R = (A, B, C) with the set of
functional dependencies F = {A → B, B → C}. Now, we can find
the attribute closure of attribute A as follows;
Step 1: We start with the attribute in question as to the initial
result. Hence, result = A.
Step 2: Take the first FD A → B. Its left-hand side (i.e., A) is in
the result; hence the right-hand side can be included. This led to
the result = AB.


Step 3: Take the second FD B → C. Its left-hand side (i.e., B) is in


the result (or subset of result); hence the right-hand side can be
included. Now, result = ABC.
No further attributes can be added, so the algorithm exits. As a result, A+ includes all the attributes of relation R; that is, A+ = ABC. Moreover, A is one of the keys of the relation R.

Identifying (ABF)+
Then, what is the key for R? As said previously, we can try any of the left-hand-side attributes (because they are the determinants) or any combination of them. From the preceding example, we might get the idea of using F as one of the key attributes. So, let us see if we can find (ABF)+, the closure of the attribute set ABF.
 We start with result = ABF.
 Using the preceding example, we have (AB)+ = ABCDE.
 Knowing C and F, and using CF → B, we can deduce the result ABCDEF, which contains all of R's attributes.
As a result, ABF is one of the keys for R, since (ABF)+ contains all of R's attributes.

9.7 Normal Forms


Given a relation schema, we need to decide whether it is a
good design or decompose it into smaller relations. Such a
decision must be guided by understanding what problems arise
from the current schema.
To provide such guidance, several normal forms have been
proposed. If a relation schema is in one of these normal forms,
we know that certain problems cannot arise. The normal forms
based on FDs are the first normal form (1NF), second normal

form (2NF), third normal form (3NF), and Boyce-Codd normal form (BCNF). These forms have increasingly restrictive requirements: every relation in BCNF is also in 3NF, every relation in 3NF is also in 2NF, and every relation in 2NF is also in 1NF.
Normalization of Database: Database Normalisation is a
technique for organizing the data in the database.
Normalization is a systematic approach to decomposing tables
to eliminate data redundancy and undesirable characteristics
like Insertion, Update, and Deletion Anomalies. It is a multi-step
process that puts data into tabular form by removing duplicated
data from the relation tables.
Normalization is used for mainly two purposes,
 Eliminating redundant(useless) data.
 Ensuring data dependencies make sense, i.e., data is logically
stored.
Without Normalization: It becomes difficult to handle and
update the database without facing data loss. Insertion,
Updation, and Deletion Anomalies are very frequent if the
Database is not Normalized. To understand these anomalies, let
us take an example of the Student table.

 Updation Anomaly: To update the address of a student who occurs twice or more in the table, we will have to update the S_Address column in all those rows; otherwise, the data will become inconsistent.


 Insertion Anomaly: Suppose that for a new admission we have the Student id (S_id), name, and address of a student, but the student has not opted for any subjects yet; we then have to insert NULL there, leading to an insertion anomaly.
 Deletion Anomaly: If (S_id) 401 has only one subject and temporarily drops it, the entire student record will be deleted when we delete that row.

Normalization Rules: Normalization rules are divided into the following normal forms:
1. First Normal Form
2. Second Normal Form
3. Third Normal Form
4. BCNF
5. Multivalued Dependency (4NF, 5NF)

9.7.1 First Normal Form (1NF): As per the rule of the first
normal form, an attribute (column) of a table cannot hold
multiple values. It should hold only atomic values.
Example: Suppose a company wants to store its employees'
names and contact details. It creates a table that looks like this:
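The idea can be sketched as follows (the values are assumed): a column such as phone must hold one atomic value per row, so an employee with two numbers gets two rows rather than a comma-separated list:
create table employee_contact (
    emp_id int,
    emp_name varchar(50),
    phone varchar(15)        -- one phone number per row (atomic value)
);
insert into employee_contact values (101, 'Jon', '9812345670');
insert into employee_contact values (101, 'Jon', '9812345671');
insert into employee_contact values (102, 'Ria', '9900012345');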

In First Normal Form, no row may contain a column in which more than one value is saved, for example values separated by commas.

Rather, we must divide such data into several rows, so that the value at each row-and-column intersection is atomic. Data redundancy increases under First Normal Form, since the same data values appear in several rows; however, each row as a whole remains unique.

9.7.2 Second Normal Form (2NF):


As per the Second Normal Form, there must not be any
partial dependency of any column on the primary key. It means
that for a table with a concatenated primary key, each column in
the table that is not part of the primary key must depend upon
the entire concatenated key. If any column depends only on one
part of the concatenated key, the table fails the Second normal
form.

Fig 9.1 Partial Dependencies

In First Normal Form, Adam has two rows to include several


subjects chosen. Although this is searchable and follows First
Normal Form, it is a waste of space. Also, in the above Table in
First Normal Form, while the candidate key is Student, Subject,

the Age of a student is determined solely by the Student column, which violates Second Normal Form. To achieve second normal form, we separate the subjects into their own table and match them to students using the student name as a foreign key.

The candidate key in the Student Table would be the Student


column since all other columns, such as Age, are based on it.
The candidate key in the Subject Table will be the Student,
Subject column. Both preceding tables are now in Second
Normal Form and will never suffer from Update Anomalies.
Although there are a few complex situations in which a table in
Second Normal Form suffers from Update Anomalies, Third
Normal Form is available to manage such scenarios.

9.7.3 Third Normal Form (3NF):


Let R be a relation schema, F be the set of FDs given to hold over R, X be a subset of the attributes of R, and A be an attribute of R. R is in third normal form if, for every FD X → A in F, one of the following statements is true:
• A ∈ X; that is, it is a trivial FD, or
• X is a superkey, or
• A is part of some key for R.


Fig 9.2 Transitive Dependencies

The Third Normal Form states that every non-prime attribute of a table must depend on the primary key; a non-prime attribute must not be determined by another non-prime attribute. In other words, transitive functional dependencies should be eliminated from the table. The table should also be in Second Normal Form. Consider the following table.

The primary key of this table is Student id, but Zip determines the street, area, and state. The dependency between Zip and the other fields is referred to as a transitive dependency. As a result, to apply 3NF, we must move the street, area, and state into a new table with Zip as the primary key.
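A sketch of the resulting tables (the names and types are assumed):
create table student_address (
    student_id int primary key,
    student_name varchar(50),
    zip varchar(10)
);
create table zip_detail (
    zip varchar(10) primary key,
    street varchar(50),
    area varchar(50),
    state varchar(50)
);
The transitive dependency is removed: street, area, and state now depend only on zip.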

The advantage of removing transitive dependency is,


 The amount of data duplication is reduced.
 Data integrity is achieved.

9.7.4 Boyce and Codd Normal Form (BCNF): Let R be a relation schema, F be the set of FDs given to hold over R, X be a subset of the attributes of R, and A be an attribute of R. R is in Boyce-Codd normal form if, for every FD X → A in F, one of the following statements is true:
• A ∈ X; that is, it is a trivial FD, or
• X is a superkey.

Boyce and Codd Normal Form is a stricter version of the Third Normal Form. It deals with certain anomalies that 3NF does not handle. A 3NF table that does not have multiple overlapping candidate keys is already in BCNF. For a table to be in BCNF, the following conditions must be satisfied:
 R must be in the 3rd Normal Form, and
 for each functional dependency (X → Y), X should be a super key.

A relation that is in third normal form is said to need BCNF normalization if the following situation is observed:
 The relation has multiple composite candidate keys,
 the composite keys have a common attribute, and
 an attribute of one composite key is functionally dependent on an attribute of another composite key.


For example

In the above example, the two composite keys are


1. Stu_name and Major
2. Stu_name and Staff

Furthermore, we can see that Stu_name is a common attribute in both keys.
It is also worth noting that there is a functional dependency between Major and Staff: when the major subject is Physics, Prof. John handles it, and Mr. David handles Computers. So, if many students choose Physics as their major subject, Prof. John will appear repeatedly in the corresponding column, and the same is true for Computers. The BCNF normalization method can be used to remove this redundancy.


The relation is decomposed into two sub relations Student


and Major tables with columns
Student{stu_name, Major, marks} and Major{Major, staff}

Multivalued Dependency: A multivalued dependency X →→ Y occurs in a relation R if there exists a well-defined set of values of Y for each value of X. Whenever two tuples of R agree on all the attributes of X, we can swap their Y components and the two resulting tuples are also in R.

A formal definition of Multivalued dependency:

Consider the following relationship between university


courses, course books, and lecturers who will be teaching the
course:


Since the lecturers and the books associated with a course are independent of each other, this database design has a multivalued dependency: if we added a new book to the AHA course, we would have to add one record for each lecturer, and vice versa. There are two multivalued dependencies in this relation: course →→ book and, equivalently, course →→ lecturer. Databases with multivalued dependencies therefore exhibit redundancy. In database normalization, the fourth normal form requires that either every multivalued dependency X →→ Y is trivial or X is a superkey for every nontrivial multivalued dependency X →→ Y. A multivalued dependency X →→ Y is trivial if Y is a subset of X or if X ∪ Y is the entire set of the relation's attributes.

9.7.5 Properties of Decompositions


1 Lossless-Join Decomposition: Let R be a relation schema and let F be a set of FDs over R. A decomposition of R into two schemas with attribute sets X and Y is said to be a lossless-join decomposition with respect to F if, for every instance r of R that satisfies the dependencies in F, π X(r) ⋈ π Y(r) = r. In other

words, we can recover the original relation from the


decomposed relations.

2 Dependency Preservation: A decomposition of a relation R


into R1, R2, R3, …, Rn is dependency preserving decomposition
concerning the set of Functional Dependencies F that hold on R
only if the following is a hold;

(F1 U F2 U F3 U … U Fn)+ = F+
where,
F1, F2, F3, …, Fn – Sets of Functional dependencies of relations
R1, R2, R3, …, Rn.
(F1 U F2 U F3 U … U Fn)+ - Closure of Union of all sets of
functional dependencies.

F+ - Closure of set of functional dependency F of R.


If the closure of the set of functional dependencies of individual
relations R1, R2, R3, …, Rn is equal to the set of functional
dependencies of the main relation R (before decomposition),
then we would say the D is lossless dependency preserving
decomposition.
Example:
Assume R(A, B, C, D) with FDs A→B, B→C, C→D.
Let us decompose R into R1 and R2 as follows;
R1(A, B, C)
R2(C, D)


The FDs A→B and B→C hold in R1.
The FD C→D holds in R2.
All the given functional dependencies are preserved in some Ri. Hence, this decomposition is dependency preserving.
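For contrast, here is a small example (ours, not from the text) of a decomposition that is not dependency preserving. Take R(A, B, C) with F = {A→B, B→C} and decompose it into R1(A, B) and R2(A, C). Then F1 = {A→B} and F2 = {A→C}, and (F1 U F2)+ does not contain B→C, so the decomposition does not preserve dependencies even though it is lossless (A is a key of R1).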

9.8 Schema Refinement in Database Design


Database designers typically use a conceptual design methodology, such as ER design, to arrive at an initial database design. Given this initial design, redundancy can be eliminated and normalization attained by refining the schema through decomposition of relations.

9.8.1 Constraints on an Entity Set: Consider the Hourly_Emps relation again. The constraint that attribute ssn is a key can be expressed as an FD:
{ssn} → {ssn, name, lot, rating, hourly_wages, hours_worked}
For clarity, we write this FD as S → SNLRWH, using a single letter to denote each attribute. In addition, the constraint that the rating attribute determines the hourly_wages attribute is an FD:
R → W

This leads to redundant storage of rating-wage associations. It cannot be expressed in terms of the ER model: only FDs whose left side determines all attributes of a relation (key constraints) can be expressed there. Therefore, we could not detect it when considering Hourly_Emps as an entity set during ER modeling.


To avoid the problem, we can introduce an entity set called


Wage_Table(with attributes rating and hourly_wage) and a
relationship set Has_Wages associating Hourly_Employees and
Wage_Table.
9.8.2 Relationship Set Constraints: Assume we have entity sets Parts, Suppliers, and Departments, and a relationship set Contracts that connects them all. The contract schema is referred to as CQPSD: a contract with contract id C states that a supplier S will supply a certain quantity Q of a part P to a department D.
We presume that each department purchases at most one part from any given supplier. As a result, if there are multiple contracts between the same supplier and department, we know that the same part is involved. This constraint is an FD, DS → P.
We are back to redundancy and its related issues. We can
solve this problem by splitting Contracts into two connections
with CQSD and SDP attributes. Intuitively, the relation SDP
records the component supplied by a supplier to a department.
In contrast, the relation CQSD records additional details about a
contract. It is doubtful that we would arrive at such a design solely through ER modeling, since it is difficult to formulate an entity or relationship that naturally corresponds to CQSD.
9.8.3 Identifying Attributes of Entities:
Suppose that we have entity sets, Parts, Suppliers, and
Departments, and a relationship set that involves them. We refer
to the schema for contracts as CQPSD. A contract with contract
id C specifies that a supplier S will supply some quantity Q of a part P to a department D.


We assume a policy that a department purchases at most one


part from any given supplier. Therefore, if there are several
contracts between the same supplier and department, we know
that the same part must be involved. This constraint is an FD, DS → P.

Again we have redundancy and its associated problems. We


can address this situation by decomposing Contracts into two
relations with attributes CQSD and SDP. Intuitively, the
relation SDP records the part supplied to a department by a
supplier, and the relation CQSD records additional information
about a contract. It is unlikely that we would arrive at such a
design solely through ER modeling since it is hard to formulate
an entity or relationship that corresponds naturally to CQSD.
9.8.4 Identifying Entity Sets:
Consider a variation on the Reserves schema used previously.
Let Reserves have the same attributes as before: S, B, and D, meaning that sailor S has a reservation for boat B on day D. In addition, let there be an attribute C denoting the credit card to which the reservation is charged. This example shows how FD information can improve an ER design; in particular, it shows how FD information can determine whether a concept should be modeled as an entity or as an attribute.
Assume that each sailor uses a single credit card to pay for reservations. The FD S → C expresses this constraint. It means that, in Reserves, we store a sailor's credit card number as many times as there are reservations for that sailor, with redundancy and potential update anomalies as a result. One solution is to decompose Reserves into two relations with attributes SBD and SC.

solution. One intuitively stores reservation information, while


the other store's credit card information.
It is instructive to consider an ER design that would result in
these relations. One solution would be to create an entity set called Credit Cards, with the sole attribute Cardno, and a relationship set Has Card that connects Sailors and Credit
Cards. We can map Has Card and Credit Cards to a single
connection with attributes SC by noting that each credit card
belongs to a single sailor. If our main interest in credit card
numbers is to show how a reservation is to be paid for, we will
probably not model them as entities; instead, we would use an
attribute to model card numbers in this case.
A second solution is to make Cardno a Sailor attribute. This
method, however, is not very natural—sailors may have many
cards, and we are not interested in all of them. Our focus is on
the single card used to pay for reservations, best modeled as an
attribute of the relationship Reserves. In this example, making
Cardno an attribute of Reserves and refining the resulting tables
with the FD information helps think about the design issue.
9.9 4NF (Fourth Normal Form) Rules
A table is in Fourth Normal Form if no instance of the table contains two or more independent, multivalued facts describing the relevant entity.
9.10 5NF (Fifth Normal Form) Rules
A table is in Fifth Normal Form only if it is in 4NF and it cannot be decomposed into any number of smaller tables without loss of data.


9.11 6NF (Sixth Normal Form) Proposed


The 6th Normal Form is not yet standardized, although database experts have been discussing it for some time. Hopefully, we will soon have a clear and standardized definition for the 6th Normal Form.
That is all about SQL normalization.


CHAPTER 10
TRANSACTION MANAGEMENT

10.1 Introduction
Often, a collection of several operations on the database
appears to be a single unit from the point of view of the
database user. For example, transferring funds from a checking
account to a savings account is a single operation from the
customer’s standpoint; however, it consists of several operations
within the database system.
Collections of operations that form a single logical unit of work are
called transactions. A database system must ensure proper
execution of transactions despite failures—either the entire
transaction executes, or none of it does. Furthermore, it must
manage concurrent execution of transactions to avoid the
introduction of inconsistency.

10.2 Transaction Concept


A transaction is a program execution unit that accesses
and possibly updates various data items. Usually, a transaction
is initiated by a user program written in a high-level data-
manipulation language (typically SQL) or programming
language (for example, C++ or Java), with embedded database
accesses in JDBC or ODBC. A transaction is delimited by statements (or function calls) of the form begin transaction and end transaction. The transaction consists of all operations executed between the begin transaction and the end transaction.


Atomicity: The collection of steps must appear to the user as a


single, indivisible unit. Since a transaction is indivisible, it either
executes in its entirety or not at all. This “all-or-none” property is
referred to as atomicity. Thus, if a transaction begins to execute
but fails for whatever reason, any changes to the database that
the transaction may have made must be undone.
Consistency: Execution of a transaction in isolation (that is, with
no other transaction executing concurrently) preserves the
consistency of the database.
Isolation: Furthermore, since a transaction is a single unit, its
actions should not appear to be interleaved with other database operations that are not part of the transaction. Therefore, the database
system must take certain actions to ensure that transactions
operate properly without interference from concurrently
executing database statements.
Durability: After a transaction completes successfully, the
changes it has made to the database persist, even if there are
system failures.
These properties are often called the ACID properties: atomicity, consistency, isolation, and durability.

10.3 A Simple Transaction Model


The data items in our simplified model contain a single
data value. Each data item is identified by a name (typically a
single letter in our examples, A, B, C, etc.).
Transactions access data using two operations:
• read(X) transfers the data item X from the database to a
variable called X in a buffer in the main memory belonging to
the transaction that executed the read operation.


• write(X), which transfers the value in the variable X in the


main-memory buffer of the transaction that executed the write
to the data item X in the database.
Let Ti be a transaction that transfers $50 from account A to
account B. This transaction can be defined as:
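The definition itself is not reproduced here. Under the read/write model just introduced, the transfer is the usual sequence of read and write steps; the following Python fragment is a minimal sketch (the dictionary db and the helper functions are illustrative assumptions, not part of the book's model):

# A minimal sketch: a dictionary 'db' stands in for the database,
# local variables play the role of the transaction's buffer.
db = {"A": 1000, "B": 2000}   # hypothetical starting balances

def read(name):
    # read(X): copy data item X from the database into a local variable
    return db[name]

def write(name, value):
    # write(X): copy the local value back to data item X in the database
    db[name] = value

def Ti():
    # Transfer $50 from account A to account B.
    A = read("A")
    A = A - 50
    write("A", A)
    B = read("B")
    B = B + 50
    write("B", B)

Ti()
print(db)   # {'A': 950, 'B': 2050}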

10.4 Storage Structure


To ensure the atomicity and durability properties, we
must understand how the various data items in the database
may be stored and accessed.
• Volatile storage. Information residing in volatile storage does
not usually survive system crashes. Examples of such storage
are main memory and cache memory. Access to volatile storage
is extremely fast, both because of the speed of the memory
access itself and because it is possible to directly access any data
item in volatile storage.
• Nonvolatile storage. Information residing in nonvolatile
storage survives system crashes. Nonvolatile storage includes
secondary storage devices such as magnetic disk and flash
storage, used for online storage, and tertiary storage devices
such as optical media and magnetic tapes used for archival
storage. At the current state of technology, nonvolatile storage is
slower than volatile storage.
• Stable storage. Information residing in stable storage is never
lost. Although stable storage is theoretically impossible to
obtain, it can be closely approximated by techniques that make
data loss extremely unlikely. We replicate the information in

several nonvolatile storage media (usually disk) with


independent failure modes to implement stable storage.
Updates must be done with care to ensure that a failure to
update to stable storage does not cause a loss of information.
For a transaction to be durable, its changes need to be written to
stable storage. Similarly, log records need to be written to stable
storage before any changes are made to the database on disk for
a transaction to be atomic. The degree to which a system ensures
durability and atomicity depends on how stable its
implementation of stable storage is. In some cases, a single copy
on a disk is considered sufficient. However, applications whose
data are highly valuable and whose transactions are highly
important require multiple copies or, in other words, a closer
approximation of the idealized concept of stable storage.

10.5 Transaction Atomicity and Durability


A transaction may not always complete its execution
successfully. Such a transaction is termed aborted. If we are to
ensure the atomicity property, an aborted transaction must not
affect the state of the database. Thus, any changes that the
aborted transaction made to the database must be undone. Once
the changes caused by an aborted transaction have been
undone, we say that the transaction has been rolled back.

It is part of the responsibility of the recovery scheme to manage


transaction aborts. This is typically done by maintaining a log.
Each database modification made by a transaction is first
recorded in the log. We record the identifier of the transaction performing the modification,


the identifier of the modified data item, and both the old value
(before modification) and the new value (after modification) of
the data item. Only then is the database itself modified.
Maintaining a log allows redoing a modification to ensure
atomicity and durability and the possibility of undoing a
modification to ensure atomicity in case of a failure during
transaction execution.

A transaction that completes its execution successfully is said to


be committed. A committed transaction that has performed
updates transforms the database into a new consistent state,
which must persist even if there is a system failure. Once a transaction has been committed, we cannot undo its effects by aborting it. The
only way to undo the effects of a committed transaction is to
execute a compensating transaction.

We need to be more precise about what we mean by completing a


transaction. We, therefore, establish a simple abstract
transaction model. A transaction must be in one of the following
states:
• Active, the initial state; the transaction stays in this state while
it is executing.
• Partially committed after the final statement has been
executed.
• Failed after the discovery that normal execution can no longer
proceed.
• Aborted after the transaction has been rolled back and the
database restored to its state before the transaction started.
• Committed after successful completion.


Fig. 10.1 State Diagram of Transaction


A transaction enters the failed state when the system determines that it can no longer continue with its normal execution. Such a transaction must be rolled back. It then enters the aborted state. At this point, the system has two options:
 It can restart the transaction, but only if the transaction was aborted as a result of a hardware or software failure that was not caused by the transaction's internal logic. A restarted transaction is considered a new transaction.
 It can kill the transaction. It usually does so because of an internal logical error that can be corrected only by rewriting the application program, or because the input was bad, or because the desired data were not found in the database.
We must exercise caution when dealing with observable external writes, such as writing to a user's screen or sending a message. Such a write cannot be undone once it has happened,
because it could have been seen outside the database system.


10.6. Transaction Isolation


Transaction-processing systems usually allow multiple
transactions to run concurrently.
Allowing multiple transactions to update data concurrently
causes several complications with the consistency of the data, as
we saw earlier. Ensuring consistency despite concurrent
execution of transactions requires extra work; it is far easier to
insist that transactions run serially—that is, one at a time, each
starting only after the previous one has been completed. However,
there are two good reasons for allowing concurrency:
• Improved throughput and resource utilization: A transaction
consists of many steps. Some involve I/O activity; others
involve CPU activity. The CPU and the disks in a computer
system can operate in parallel. Therefore, I/O activity can be
done in parallel with processing at the CPU. Therefore, the
parallelism of the CPU and the I/O system can be exploited to
run multiple transactions in parallel. While a read or write on
behalf of one transaction is in progress on one disk, another
transaction can be running in the CPU, while another disk may
be executing a read or write on behalf of a third transaction. All
of this increases the throughput of the system—that is, the
number of transactions executed in a given amount of time.
Correspondingly, the processor and disk utilization also increase; in other words, the processor and the disk spend less time idle, or not performing useful work.

• Reduced waiting time: There may be a mix of transactions


running on a system, some short and some long. If transactions run serially, a short transaction may have to wait for a preceding long

transaction to complete, leading to unpredictable delays in


running a transaction. If the transactions operate on different
parts of the database, it is better to let them run
concurrently, sharing the CPU cycles and disk accesses among
them. Concurrent execution reduces the unpredictable delays in
running transactions. Moreover, it also reduces the average
response time: the average time for a transaction to be completed after
submitting it.

The motivation for using concurrent execution in a database is


the same as the motivation for using multiprogramming in an
operating system with scheduling ensuring consistency and
isolation.

The database system must control the interaction among the


concurrent transactions to prevent them from destroying the
consistency of the database. It does so through a variety of
mechanisms called concurrency-control schemes.

Consider the simplified banking system, which has several


accounts, and a set of transactions that access and update those
accounts. Let T1 and T2 be two transactions that transfer funds
from one account to another. Transaction T1 transfers $50 from
account A to account B. It is defined as:


Transaction T2 moves 10% of account A's balance to account


B. It is defined as follows:
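As a rough illustration (an assumption built on the same read/write model, not the book's own figure), T1 and T2 can be written as small Python functions, and running them serially reproduces the account values quoted below:

# A minimal sketch, assuming the same read/write model as for T1.
db = {"A": 1000, "B": 2000}

def T1():
    # transfer $50 from A to B
    a = db["A"] - 50; db["A"] = a
    b = db["B"] + 50; db["B"] = b

def T2():
    # move 10% of A's balance to B
    temp = db["A"] * 0.1
    db["A"] -= temp
    db["B"] += temp

T1(); T2()
print(db)   # {'A': 855.0, 'B': 2145.0}, matching the serial order T1 then T2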

Assume accounts A and B have present values of $1000 and


$2000, respectively. Assume that the two transactions are carried
out in the order T1 followed by T2. Following this execution sequence, the final values of accounts A and B are $855 and $2145, respectively. As
a result, the total amount of money in accounts A and B—that is,
the balance A + B—is preserved until all transactions are
completed.

Similarly, if the transactions are executed in the order T2


followed by T1, the corresponding execution sequence is shown


in Figure. As predicted, the sum A + B is retained, and the final


values of accounts A and B are $850 and $2150, respectively.

The execution sequences just described are called schedules. Each serial schedule consists of a sequence of instructions from various transactions, where the instructions belonging to a single transaction appear together in that schedule. Recalling a well-known combinatorial formula, we observe that for a set of n transactions, there exist n factorial (n!) different valid serial schedules.
As in our previous example, assume that the two
transactions are carried out simultaneously. The figure depicts
one potential schedule. Following this execution, we arrive at the same state as when the transactions are executed serially in the order T1 followed by T2. The sum A + B is preserved.


Not all concurrent executions produce the desired result.


Consider Figure 5's schedule as an example. Following the
completion of this schedule, the final values of accounts A and B
are $950 and $2100, respectively. This is an inconsistent final state, since we have gained $50 during the concurrent execution. The execution of the two transactions does not preserve the sum A + B.
Suppose the operating system has complete control over
concurrent execution. In that case, several schedules are
possible, including those that leave the database in an
inconsistent state, such as the one just mentioned. The database
system's responsibility is to ensure that every schedule executed
leaves the database in a consistent state. The database system's
concurrency-control component does this.

We may maintain database consistency under concurrent


execution by ensuring that any schedule executed has the same


effect as a schedule that might have occurred without


concurrent execution. The schedule should be similar to a serial
schedule in certain ways. Such schedules are referred to as
serializable schedules.

10.7 Serializability
Before we can consider how the concurrency-control
component of the database system can ensure serializability, we consider how to determine when a schedule is serializable. Certainly,
serial schedules are serializable, but it is harder to determine


whether a schedule is serializable if multiple transactions are


interleaved.
Since transactions are programs, it is difficult to determine
exactly what operations a transaction performs and how various
transactions interact. For this reason, we shall not consider the
various types of operations that a transaction can perform on a
data item but instead consider only two operations: read and
write.
We assume that, between a read(Q) instruction and a write(Q)
instruction on a data item Q, a transaction may perform an
arbitrary sequence of operations on the copy of Q residing in the
transaction's local buffer. We, therefore, may show only read
and write instructions in schedules, as we do for schedule 3 in
Figure 4.6.

Conflict serializability: Let us consider a schedule S in which


there are two consecutive instructions, I and J, of transactions Ti
and Tj, respectively (i ≠ j). Suppose I and J refer to different data
items. In that case, we can swap I and J without affecting the
results of any instruction in the schedule. However, if I and J
refer to the same data item Q, then the order of the two steps
may matter. Since we are dealing with only read and write
instructions, there are four cases that we need to consider:

1. I = read(Q), J = read(Q). The order of I and J does not matter


since the same value of Q is read by Ti and Tj, regardless of the
order.
2. I = read(Q), J = write(Q). If I comes before J, then Ti does not
read the value of Q that Tj writes. Thus, the order of I and J
matters.

3. I = write (Q), and J = read (Q). The order of I and J is


important for the same reasons as in the previous case.
4. I = write(Q), J = write(Q). Since both instructions are write operations, the order of these instructions does not affect either Ti or Tj. The value obtained by the next read(Q) instruction of S,
on the other hand, is affected since only the output of the latter
of the two write instructions is saved in the database. If there are
no other write(Q) instructions after I and J in S, the order of I
and J directly affects the final value of Q in the database state
generated by schedule S.
If I and J are operations by different transactions on the same
data object, and at least one of these instructions is a write
operation, we claim they conflict.


The concept of conflict equivalence leads to the concept of conflict serializability. If a schedule S is conflict equivalent to a serial schedule, we say that S is conflict serializable. Thus, schedule 3 is conflict serializable, since it is conflict equivalent to serial schedule 1.
Precedence graph: We now present a simple and efficient
method for determining the conflict serializability of a schedule.
Consider a schedule S. We construct a directed graph called a
precedence graph. This graph consists of a pair G = (V, E),
where V is a set of vertices and E is a set of edges. The set of
vertices consists of all the transactions participating in the
schedule. The set of edges consists of all edges Ti →Tj for which
one of three conditions holds:
1. Ti executes write(Q) before Tj executes read(Q).
2. Ti executes read(Q) before Tj executes write(Q).
3. Ti executes write(Q) before Tj executes write(Q).
If an edge Ti → Tj exists in the precedence graph, then in any serial schedule S′ equivalent to S, Ti must appear before Tj.

Figure 11 depicts the precedence graph for schedule 4. It contains the edge T1 → T2, since T1 performs read(A) before T2 performs write(A). It also contains the edge T2 → T1, since T2 performs read(B) before T1 performs write(B).

Topological sorting: A serializability order of the transactions can be obtained by finding a linear order consistent with the partial order of the precedence graph. This process is known as topological sorting.
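To make the precedence-graph test concrete, the following Python sketch (an illustrative assumption, not code from the book) builds the graph from a schedule given as a list of (transaction, operation, data item) triples, and attempts a topological sort; a result of None indicates a cycle, and hence a schedule that is not conflict serializable:

# A minimal sketch: conflict-serializability test via the precedence graph.
from collections import defaultdict

def precedence_graph(schedule):
    # schedule: list of (txn, op, item) with op in {"r", "w"}, in execution order
    edges = defaultdict(set)
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and ("w" in (op_i, op_j)):
                edges[ti].add(tj)          # Ti -> Tj for each conflicting pair
    return edges

def topological_order(edges, nodes):
    # Kahn's algorithm; returns None if the graph has a cycle
    indeg = {n: 0 for n in nodes}
    for u in edges:
        for v in edges[u]:
            indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in edges[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order if len(order) == len(nodes) else None

# Schedule 4 as described in the text: T1 reads A, T2 writes A, T2 reads B, T1 writes B.
schedule4 = [("T1", "r", "A"), ("T2", "w", "A"), ("T2", "r", "B"), ("T1", "w", "B")]
g = precedence_graph(schedule4)
print(topological_order(g, {"T1", "T2"}))   # None: cycle T1 -> T2 -> T1, not serializable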

10.7.1 Transaction Isolation and Atomicity


Now we will look at the impact of transaction failures
during concurrent execution.


If a transaction Ti fails, for whatever reason, we need to undo the effect of this transaction to ensure the atomicity property. In a system that allows concurrent execution, the atomicity property also requires that any transaction Tj that is dependent on Ti (that is, Tj has read data written by Ti) be aborted. To achieve this, we need to place restrictions on the types of schedules permitted in the system. In the following subsections, we discuss which schedules are acceptable from the viewpoint of recovery from transaction failure.

10.7.2 Recoverable Schedules:


Consider the partial schedule 9 in Figure 4.14, in which T7 is a
transaction that performs only one instruction: read(A). We call
this a partial schedule because we have not included a commit
or abort operation for T6. Notice that T7 commits immediately
after executing the read(A) instruction. Thus, T7 commits while
T6 is still in the active state. Now suppose that T6 fails before it
commits. T7 has read the value of data item A written by T6.
Therefore, we say that T7 is dependent on T6. Because of this,
we must abort T7 to ensure atomicity. However, T7 has already
committed and cannot be aborted. Thus, we have a situation
where it is impossible to recover correctly from the failure of T6.
Schedule 9 is an example of a nonrecoverable schedule. A

recoverable schedule is one where, for each pair of transactions,


Ti and Tj such that Tj reads a data item previously written by Ti,
the commit operation of Ti appears before the commit operation
of Tj. For the example of schedule 9 to be recoverable, T7 would
have to delay committing until after T6 commits.

10.7.3 Cascadeless Schedules:


Even if a schedule is recoverable, we may have to roll back
several transactions to recover correctly from the failure of a
transaction Ti. Such situations occur if transactions have read
data written by Ti. As an illustration, consider the partial
schedule in Figure 4.15. Transaction T8 writes a value of A that
is read by transaction T9. Transaction T9 writes a value of A that
is read by transaction T10. Suppose that, at this point, T8 fails.
T8 must be rolled back. Since T9 is dependent on T8, T9 must be
rolled back. Since T10 is dependent on T9, T10 must be rolled
back.


This phenomenon, in which a single transaction failure leads to a series of transaction rollbacks, is called cascading rollback.
Cascading rollback is undesirable because it leads to the undoing of a significant amount of work. It is preferable to restrict the schedules to those in which cascading rollbacks cannot occur; such schedules are called cascadeless schedules. Formally, a cascadeless schedule is one in which, for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj. It is easy to verify that every cascadeless schedule is also recoverable.
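As an illustration of these two definitions, the sketch below (hypothetical helper code, not from the book) checks a schedule, given as a list of (transaction, action, item) events covering reads, writes, and commits, for recoverability and cascadelessness:

# A minimal sketch, assuming a schedule is a list of events such as
# ("T6", "w", "A"), ("T7", "r", "A"), ("T7", "commit", None).
def check(schedule):
    recoverable, cascadeless = True, True
    committed = set()
    last_writer = {}                 # item -> latest writer seen so far
    read_from = set()                # (reader, writer) pairs with writer uncommitted at read time
    for txn, action, item in schedule:
        if action == "w":
            last_writer[item] = txn
        elif action == "r":
            w = last_writer.get(item)
            if w and w != txn and w not in committed:
                cascadeless = False              # dirty read
                read_from.add((txn, w))
        elif action == "commit":
            committed.add(txn)
            # committing before a transaction we read from violates recoverability
            if any(r == txn and w not in committed for r, w in read_from):
                recoverable = False
    return recoverable, cascadeless

# Schedule 9 from the text: T7 reads A written by T6 and commits while T6 is still active.
s9 = [("T6", "w", "A"), ("T7", "r", "A"), ("T7", "commit", None)]
print(check(s9))   # (False, False): neither recoverable nor cascadeless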

10.8 Transaction Isolation Levels


To maintain data consistency, transaction isolation levels are
used. The isolation levels specified by the SQL standard are as
follows:
• Serializable usually ensures serializable execution. However,
some database systems implement this isolation level in a
manner that may, in certain cases, allow non-serializable
executions.
• Repeatable read allows only committed data to be read. Further,
it requires that no other transaction can update between two
reads of a data item by a transaction.
• Read committed allows only committed data to be read but does
not require repeatable reads. For instance, another transaction
may have updated the data item and committed between two
reads of a data item by the transaction.
• Read uncommitted allows uncommitted data to be read. It is
the lowest isolation level allowed by SQL.

All the isolation levels above additionally disallow dirty writes;


that is, they disallow writes to a data item that has already been written by another transaction that has not yet committed or aborted.

10.9 Implementation of Isolation Levels


There are various concurrency-control policies that we can use
to ensure that, even when multiple transactions are executed
concurrently, only acceptable schedules are generated. Under the simplest such policy, a transaction acquires a lock on the entire database before it starts and releases the lock after it has committed. While a transaction
holds a lock, no other transaction is allowed to acquire the lock, and all
must therefore wait for the lock to be released.

As a result of the locking policy, only one transaction can execute at a


time. Therefore, only serial schedules are generated. These are
trivially serializable, and it is easy to verify that they are
recoverable and cascadeless as well.

10.9.1 Locking
Instead of locking the entire database, a transaction could lock
only those data items it accesses. The two-phase locking
protocol is a simple, widely used technique that ensures
serializability. Stated simply, two-phase locking requires a
transaction to have two phases, one where it acquires locks but does
not release any, and a second phase where the transaction releases locks
but does not acquire any. (In practice, locks are usually released
only when the transaction completes its execution and has been
either committed or aborted.)


Further improvements to locking result if we have two kinds of


locks: shared and exclusive. Shared locks are used for data that
the transaction reads, and exclusive locks are used for those it
writes. Many transactions can hold shared locks on the same
data item simultaneously. However, a transaction is allowed an
exclusive lock on a data item only if no other transaction holds
any lock on that data item.

10.9.2 Timestamps
Another category of techniques for implementing isolation
assigns each transaction a timestamp, typically when it begins.
For each data item, the system keeps two timestamps. The read
timestamp of a data item holds the largest (that is, the most
recent) timestamp of those transactions that read the data item.
The write timestamp of a data item holds the transaction's
timestamp that wrote the current value of the data item.
Timestamps ensure that transactions access each data item in
order; otherwise, transactions are aborted and restarted with a
new timestamp.


CHAPTER 11
CONCURRENCY CONTROL

11.1 Lock-Based Protocols:


One way to ensure isolation is to require that data items be
accessed in a mutually exclusive manner; that is, while one transaction is accessing a data
item, no other transaction can modify that data item. The most
common method to implement this requirement is to allow a
transaction to access a data item only if it is currently holding a
lock on that item.

11.2 Locks:
The two modes of locks are:
1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S) on item Q, then Ti can read, but cannot write, Q.
2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted by X) on item Q, then Ti can both read and write Q.

When a transaction requests a lock on a data item, the concurrency-control manager grants the lock based on the compatibility of the requested lock mode with the locks already held on the item. Several transactions can hold shared locks on the same data item at the same time, and all of them can read the item. While an exclusive lock is held on an item, no other lock can be granted on that item. A transaction Ti requesting an incompatible lock is forced to wait until all incompatible locks held by other transactions have been released.
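The compatibility relationship between the two lock modes can be captured in a tiny table; the sketch below is an illustrative assumption of how a concurrency-control manager might consult it:

# A minimal sketch of shared/exclusive lock compatibility.
# True means the requested mode can be granted alongside the held mode.
COMPATIBLE = {
    ("S", "S"): True,    # many readers may share an item
    ("S", "X"): False,   # a writer excludes readers
    ("X", "S"): False,
    ("X", "X"): False,   # writers exclude each other
}

def can_grant(requested_mode, held_modes):
    # held_modes: lock modes currently held on the item by other transactions
    return all(COMPATIBLE[(held, requested_mode)] for held in held_modes)

print(can_grant("S", ["S", "S"]))   # True
print(can_grant("X", ["S"]))        # False: the requester must wait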

Improper sharing of resources, such as data objects, results


in deadlock conditions. When there is a deadlock, the system must roll back one of the two transactions. When a transaction
is rolled back, the data items locked by that transaction are
released. These data items are then made available to the other
transaction, which can resume execution.


We will require that each transaction in the system adhere to


a set of rules known as a locking protocol, which specifies when
a transaction can lock and unlock each data object.

11.2.1 Granting of Locks


When a transaction requests a lock on a data item in a
particular mode, and no other transaction has a lock on the same
data item in a conflicting mode, the lock can be granted.
However, care must be taken to avoid the following scenario.

11.3 Starvation
Suppose a transaction T2 has a shared-mode lock on a data item,
and another transaction T1 requests an exclusive-mode lock on
the data item. T1 has to wait for T2 to release the shared-mode


lock. Meanwhile, a transaction T3 may request a shared-mode


lock on the same data item. The lock request is compatible with
the lock granted to T2 so that T3 may be granted the shared-
mode lock. T2 may release the lock, but T1 has to wait for T3 to
finish. However, again, a new transaction T4 may request a
shared-mode lock on the same data item and is granted the lock
before T3 releases it. There may be a sequence of transactions
that each requests a shared-mode lock on the data item. Each
transaction releases the lock a short while after it is granted, but
T1 never gets the exclusive-mode lock on the data item. The
transaction T1 may never make progress and is said to be
starved.

Avoiding Starvation: When a transaction Ti requests a lock on a


data item Q in a particular mode M, the concurrency-control
manager grants the lock provided that:
1. There is no other transaction holding a lock on Q in a mode
that conflicts with M.
2. There is no other transaction waiting for a lock on Q, which
made its lock request before Ti.
Thus, a lock request will never get blocked by a lock request that
is made later.
11.4 The Two-Phase Locking Protocol
One protocol that ensures serializability is the two-phase
locking protocol. This
protocol requires that each transaction issue lock and unlock
requests in two phases:
1. Growing phase. A transaction may obtain locks but may not
release any lock.


2. Shrinking phase. A transaction may release locks but may


not obtain any new locks.

Initially, a transaction is in a growing phase. The transaction


acquires locks as needed. Once the transaction releases a lock, it
enters the shrinking phase and can issue no more lock requests.
The point in the schedule where the transaction has obtained its
final lock (the end of its growing phase) is called the lock point
of the transaction. Two-phase locking does not ensure freedom
from deadlock. Cascading rollbacks can be avoided by modifying
two-phase locking, called the strict two-phase locking protocol.
Another variant of two-phase locking is the rigorous two-phase
locking protocol, which requires holding all locks until the
transaction commits.
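A compact way to see the growing and shrinking phases is to track, per transaction, whether any lock has been released yet; the sketch below (an illustrative assumption, not a full lock manager) raises an error if a transaction tries to acquire a lock after entering its shrinking phase:

# A minimal sketch of the two-phase rule for a single transaction.
class TwoPhaseTransaction:
    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False       # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot acquire locks in the shrinking phase")
        self.locks.add(item)

    def unlock(self, item):
        self.locks.remove(item)
        self.shrinking = True        # the lock point has passed

t = TwoPhaseTransaction("T1")
t.lock("A"); t.lock("B")             # growing phase
t.unlock("A")                        # shrinking phase begins
# t.lock("C") would now raise RuntimeError, since it violates two-phase locking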


Lock Conversions: This observation leads us to refine the basic


two-phase locking protocol, in which lock conversions are
allowed. We shall provide a mechanism for upgrading a shared
lock to an exclusive lock and downgrading an exclusive lock to a
shared lock. Lock conversion cannot be allowed arbitrarily.
Rather, upgrading can only occur in the growing phase, whereas
downgrading can only occur in the shrinking phase.

When a transaction Ti issues a read(Q) operation, the system


issues a lock-S(Q) instruction followed by the read(Q)
instruction.


• When Ti issues a write(Q) operation, the system checks to see


whether Ti already holds a shared lock on Q. If it does, the
system issues an upgrade(Q) instruction, followed by the
write(Q) instruction. Otherwise, the system issues a lock-X(Q)
instruction, followed by the write(Q) instruction.
• All locks obtained by a transaction are unlocked after that
transaction commits or aborts.
11.5 Implementation of Locking: A lock manager can receive
messages from transactions and send messages in reply. The
lock-manager process replies to lock-request messages with
lock-grant messages or messages requesting rollback of the
transaction. Unlock messages require only an acknowledgment
in response but may result in a grant message to another
waiting transaction.

The lock manager uses this data structure: For each data item
currently locked, it maintains a linked list of records, one for
each request, in the order in which the requests arrived. It uses a
hash table, indexed on the name of a data item, to find the
linked list (if any) for a data item; this table is called the lock
table. Each record of the linked list for a data item notes which
transaction made the request and what lock mode it requested.
The record also notes if the request has currently been granted.

The lock manager processes requests this way:

• When a lock request message arrives, it adds a record to the


end of the linked list for the data item if the linked list is present.
Otherwise, it creates a new linked list containing only the record
for the request.

• It always grants a lock request on a data item that is not


currently locked. However, suppose the transaction requests a
lock on an item on which a lock is currently held. The lock
manager grants the request only if it is compatible with the currently held locks and all earlier requests have already been granted. Otherwise, the request has to wait.
• When the lock manager receives an unlock message from a
transaction, it deletes the record for that data item in the linked
list corresponding to that transaction. It tests the record that
follows, if any, as described in the previous paragraph, to see if
that request can now be granted. If it can, the lock manager
grants that request, processes the record following it, if any,
similarly, and so on.
• If a transaction aborts, the lock manager deletes any waiting
request made by the transaction. Once the database system has
taken appropriate actions to undo the transaction, it releases all
locks held by the aborted transaction.
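The data structure just described can be sketched in a few lines: a hash table (a Python dict here) maps each data item to its list of requests, and a grant is given only when the request is compatible with every earlier granted request and no earlier request is still waiting. This is an illustrative assumption of one possible layout, not the book's own code:

# A minimal sketch of a lock table: item -> list of [txn, mode, granted] records.
from collections import defaultdict

lock_table = defaultdict(list)

def compatible(m1, m2):
    return m1 == "S" and m2 == "S"     # only shared locks are mutually compatible

def request(txn, item, mode):
    queue = lock_table[item]
    # grant only if compatible with all earlier granted requests
    granted = all(compatible(mode, rec[1]) for rec in queue if rec[2])
    # and only if no earlier request is still waiting (no overtaking)
    granted = granted and all(rec[2] for rec in queue)
    queue.append([txn, mode, granted])
    return granted

def release(txn, item):
    queue = lock_table[item]
    queue[:] = [rec for rec in queue if rec[0] != txn]
    # try to grant waiting requests in arrival order
    for rec in queue:
        if not rec[2]:
            if all(compatible(rec[1], other[1]) for other in queue if other[2]):
                rec[2] = True
            else:
                break                  # keep FIFO order; stop at the first blocked request

print(request("T1", "Q", "S"))   # True
print(request("T2", "Q", "X"))   # False: must wait for T1
release("T1", "Q")               # T2's request is now granted
print(lock_table["Q"])           # [['T2', 'X', True]]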


11.6 Graph-Based Protocols


If we have prior knowledge about the order in which the database items will be accessed, it is possible to construct locking protocols that are not two-phase but that still ensure conflict serializability.

To acquire such prior knowledge, we impose a partial ordering → on the set D = {d1, d2, . . . , dh} of all data items. If di → dj, then any transaction accessing both di and dj must access di before accessing dj. This partial ordering may be the result of either the logical or the physical organization of the data, or it may be imposed solely for the purpose of concurrency control.

Tree Protocol: The only lock instruction allowed is lock-X in the


tree protocol. Each transaction
Ti can lock a data item at most once and must observe the
following rules:
1. The first lock by Ti may be on any data item.
2. Subsequently, a data item Q can be locked by Ti only if the parent of
Q is
currently locked by Ti.
3. Data items may be unlocked at any time.
4. A data item locked and unlocked by Ti cannot subsequently
be relocked by Ti.


Fig. 11.1 Tree Structure Database Graph

T10: lock-X(B); lock-X(E); lock-X(D); unlock(B); unlock(E);


lock-X(G);unlock(D); unlock(G).
T11: lock-X(D); lock-X(H); unlock(D); unlock(H).
T12: lock-X(B); lock-X(E); unlock(E); unlock(B).
T13: lock-X(D); lock-X(H); unlock(D); unlock(H).

The tree-locking protocol has an advantage over the two-


phase locking protocol. Unlike two-phase locking, it is deadlock-

free, eliminating the need for rollbacks. Another advantage of


the tree-locking protocol over the two-phase locking protocol is
that unlocking will happen sooner. Early unlocking can result in
shorter wait times and increased concurrency.
However, the protocol has the drawback of requiring a
transaction to lock data objects that it does not access in certain
situations.

11.7 Deadlock Handling


Deadlock: A system is in a deadlock state if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set. More precisely, there exists a set
of waiting transactions {T0, T1, . . . , Tn} such that T0 is waiting
for a data item that T1 holds, and T1 is waiting for a data item
that T2 holds, and . . . , and Tn−1 is waiting for a data item that
Tn holds, and Tn is waiting for a data item that T0 holds. None of
the transactions can make progress in such a situation.

There are two principal methods for dealing with the deadlock
problem. We can use a deadlock prevention protocol to ensure
that the system never enters a deadlock state. Alternatively, we
can allow the system to enter a deadlock state and recover by
using a deadlock detection and deadlock recovery scheme.
11.7.1 Deadlock Prevention:
Various locking protocols do not guard against deadlocks.
One way to prevent deadlock is to use an ordering of data items
and request locks in a sequence consistent with the ordering.

Another way to prevent deadlock is to use preemption and


transaction rollbacks. To control the preemption, we assign a

unique timestamp to each transaction. The system uses these


timestamps to decide whether a transaction should wait or roll
back. If a transaction is rolled back, it retains its old timestamp
when restarted.

Two different deadlock-prevention schemes using timestamps


have been proposed:
1. The wait–die scheme is a non-preemptive technique. When Ti
requests a data item currently held by Tj, Ti can only wait if it
has a timestamp smaller than Tj (Ti is older than Tj ). Otherwise,
Ti is rolled back (dies).
2. The wound–wait scheme is a preemptive technique. It is a
counterpart to the wait–die scheme. When Ti requests a data
item currently held by Tj, Ti can only wait if it has a timestamp
larger than Tj (Ti is younger than Tj ). Otherwise, Tj is rolled
back (Tj is wounded by Ti).
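The two schemes differ only in which transaction waits and which is rolled back; the sketch below (an illustrative assumption, with a smaller timestamp meaning an older transaction) returns the action taken when Ti requests an item held by Tj:

# A minimal sketch of the two timestamp-based deadlock-prevention schemes.
# Smaller timestamp = older transaction.

def wait_die(ts_i, ts_j):
    # Non-preemptive: an older requester waits, a younger requester dies.
    return "Ti waits" if ts_i < ts_j else "Ti is rolled back (dies)"

def wound_wait(ts_i, ts_j):
    # Preemptive: an older requester wounds (rolls back) the holder,
    # a younger requester waits.
    return "Tj is rolled back (wounded)" if ts_i < ts_j else "Ti waits"

print(wait_die(5, 10))    # Ti older   -> Ti waits
print(wait_die(10, 5))    # Ti younger -> Ti is rolled back (dies)
print(wound_wait(5, 10))  # Ti older   -> Tj is rolled back (wounded)
print(wound_wait(10, 5))  # Ti younger -> Ti waits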

11.7.2 Deadlock Detection


Deadlocks can be described precisely in a directed graph
called a wait-for graph. This graph consists of a pair G = (V,
E), where V is a set of vertices and
E is a set of edges. The set of vertices consists of all the
transactions in the system.
Each element in the set E of edges is an ordered pair Ti → Tj. If
Ti → Tj is in E, there is a directed edge from transaction Ti to Tj,
implying that transaction Ti is waiting for transaction Tj to
release a data item that it needs.
When transaction Ti requests a data item held by transaction Tj,
the edge Ti → Tj is inserted in the wait-for graph. This edge is


removed only when transaction Tj no longer holds a data item


needed by transaction Ti.
A deadlock exists in the system if the wait-for graph contains a
cycle. Each transaction involved in the cycle is said to be
deadlocked. To detect deadlocks, the system needs to maintain
the wait-for graph and periodically invoke an algorithm that
searches for a cycle in the graph.
To illustrate these concepts, consider the wait-for graph in
Figure 4.9, which depicts the following situation:
• Transaction T17 is waiting for transactions T18 and T19.
• Transaction T19 is waiting for transaction T18.
• Transaction T18 is waiting for transaction T20.
Since the graph has no cycle, the system is not in a deadlock
state.

Fig. 11.2 Wait-for Graph with and without a Cycle

Suppose now that transaction T20 requests an item held by T19. The edge T20 → T19 is added to the wait-for graph, resulting in the new system state shown in the figure. This time the graph contains the cycle T18 → T20 → T19 → T18, implying that transactions T18, T19, and T20 are all deadlocked.
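Both situations can be checked mechanically; the sketch below (hypothetical code, not part of the book) runs a depth-first search for a cycle over the wait-for edges described above, first without and then with the edge T20 → T19:

# A minimal sketch: detect a cycle in the wait-for graph by depth-first search.
def has_cycle(edges):
    # edges: dict mapping a transaction to the set of transactions it waits for
    visiting, done = set(), set()

    def dfs(node):
        visiting.add(node)
        for nxt in edges.get(node, ()):
            if nxt in visiting:
                return True                   # back edge -> cycle -> deadlock
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(n) for n in edges if n not in done)

# Situation from the text: T17 waits for T18 and T19; T19 waits for T18; T18 waits for T20.
wait_for = {"T17": {"T18", "T19"}, "T19": {"T18"}, "T18": {"T20"}}
print(has_cycle(wait_for))                    # False: no deadlock

wait_for["T20"] = {"T19"}                     # T20 now requests an item held by T19
print(has_cycle(wait_for))                    # True: cycle T18 -> T20 -> T19 -> T18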


11.7.3 Recovery from Deadlock:


When a detection algorithm determines that a deadlock
exists, the system must recover from the deadlock. The most
common solution is to roll back one or more transactions to
break the deadlock. Three actions need to be taken:

1. Selection of a victim. Given a set of deadlocked transactions,


we must determine which transaction (or transactions) to roll
back to break the deadlock. We should roll back those
transactions that will incur the minimum cost. Many factors may
determine the cost of a rollback, including:
a. How long the transaction has computed, and how much
longer the transaction will compute before completing its
designated task.
b. How many data items the transaction has used.
c. How many more data items the transaction needs in order to complete.
d. How many transactions will be involved in the rollback.

2. Rollback. The simplest solution is a total rollback: Abort the


transaction and then restart it. However, it is more effective to
roll back the transaction only as far as necessary to break the
deadlock. Such partial rollback requires the system to maintain
additional information about the state of all the running
transactions. Specifically, the sequence of lock requests/grants
and updates performed by the transaction must be recorded.
The selected transaction must be rolled back to where it
obtained the first of these locks, undoing all its actions after that
point.


3. Starvation. In a system where the selection of victims is based


primarily on cost factors, the same transaction may always be
picked as a victim. As a result, this transaction never completes
its designated task. Thus there is starvation. We must ensure
that a transaction can be picked as a victim only a (small) finite
number of times. The most common solution
is to include the number of rollbacks in the cost factor.
11.8 Multiple Granularity
There are circumstances where it would be advantageous
to group several data items and treat them as one aggregate
data item for working purposes, resulting in multiple levels of
granularity. We allow data items of various sizes and define a
hierarchy of data items. The small items are nested within larger
ones. Such a hierarchy can be represented graphically as a tree.
Locks are acquired in root-to-leaf order; they are released in leaf-
to-root order. The protocol ensures serializability but not
freedom from deadlock.

As an illustration, consider the tree of Figure 11.3, which


consists of four levels of nodes. The highest level represents the
entire database. Below it is nodes of type area; the database
consists of exactly these areas. Each area, in turn, has nodes of
type files as its children. Each area contains exactly those files
that are its child nodes. No file is in more than one area. Finally,
each file has nodes of type records. The file consists of exactly
those records that are its child nodes, and no record can be
present in more than one file.
Each node in the tree can be locked individually. We shall use
shared and exclusive lock modes as we did in the two-phase
locking protocol. When a transaction locks a node in either


shared or exclusive mode, the transaction has implicitly locked


all the descendants of that node in the same lock mode.
For example, suppose transaction Ti gets an explicit lock on file
Fc of Figure 11.3, in exclusive mode. In that case, it has an
implicit lock in exclusive mode on all the records belonging to
that file. It does not need to lock the individual records of Fc
explicitly.

Fig. 11.3 Granularity Hierarchy

Intention Locks: If a node is locked in an intention mode, explicit locking is done at a lower level of the tree (that is, at a finer granularity). Intention locks are placed on all the ancestors of a node before that node is locked explicitly. There is an intention mode associated with shared mode, and there is one associated with exclusive mode. When a node is locked in intention-shared
(IS) mode, explicit locking occurs at a lower tree level but only
with shared-mode locks. Similarly, suppose a node is locked in
intention-exclusive (IX) mode. In that case, explicit locking
occurs at a lower level, using exclusive-mode or shared-mode
locks. Finally, suppose a node is locked in shared and intention-
exclusive (SIX) mode. In that case, the subtree rooted at that node is locked explicitly in shared mode, and explicit locking is done at a lower level with exclusive-mode locks.


The multiple-granularity locking protocol uses these lock


modes to ensure serializability. It demands that a transaction Ti
that tries to lock a node Q obey the following rules:

Fig. 11.4 Compatibility Matrix


1. Transaction Ti must observe the lock-compatibility function of Figure 11.4.
2. Transaction Ti must lock the root of the tree first and can lock it in any mode.
3. Transaction Ti can lock a node Q in S or IS mode only if Ti currently has the parent of Q locked in either IX or IS mode.
4. Transaction Ti can lock a node Q in X, SIX, or IX mode only if Ti currently has the parent of Q locked in either IX or SIX mode.
5. Transaction Ti can lock a node only if it has not previously unlocked any node (that is, Ti is two-phase).
6. Transaction Ti can unlock a node Q only if Ti currently has none of the children of Q locked.
Remember that the multiple-granularity protocol allows
locks to be obtained top-down (root-to-leaf) and released
bottom-up (leaf-to-root).
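The lock-compatibility function referred to in rule 1 is the standard one for these five modes; the sketch below is an assumption about the content of Figure 11.4, written as a lookup table:

# A minimal sketch of the standard multiple-granularity compatibility matrix
# (assumed to be what Figure 11.4 shows). True = the two modes are compatible.
COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def compatible(held, requested):
    return COMPAT[held][requested]

print(compatible("IS", "IX"))   # True: intention modes coexist
print(compatible("SIX", "IS"))  # True: SIX allows intention-shared locking below it
print(compatible("S", "IX"))    # False: a shared subtree blocks exclusive locking below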
This protocol improves concurrency while decreasing lock
overhead. It is particularly useful in applications that involve a
combination of:
 Brief transactions that only access a few data objects.


 Long transactions that generate reports from a single file or


collection of files.

11.9 Timestamp-Based Protocols


Another method for determining the serializability order is to
select an ordering among transactions in advance. The most
common method for doing so is to use a timestamp-ordering
scheme.


Timestamps: We associate a unique fixed timestamp with each


transaction Ti in the system, denoted by TS(Ti ). The database
system assigns this timestamp before the transaction Ti starts
execution. If a transaction Ti has been assigned timestamp TS(Ti
), and a new transaction Tj enters the system, then TS(Ti ) <
TS(Tj ). There are two simple methods for implementing this
scheme:

1. Use the value of the system clock as the timestamp; that is, a
transaction’s timestamp is equal to the value of the clock when
the transaction enters the system.
2. Use a logical counter that is incremented after a new
timestamp has been assigned; a transaction’s timestamp is equal
to the counter's value when the transaction enters the system.
The timestamps of the transactions determine the serializability
order. Thus, if TS(Ti ) < TS(Tj ), then the system must ensure that

the produced schedule is equivalent to a serial schedule in


which transaction Ti appears before transaction Tj. To
implement this scheme, we associate with each data item Q two
timestamp values:
• W-timestamp(Q) denotes the largest timestamp of any
transaction that executed write(Q).
• R-timestamp(Q) denotes the largest timestamp of any
transaction that executed read(Q) successfully.
These timestamps are updated whenever a new read(Q) or
write(Q) instruction is executed.

The Timestamp-Ordering Protocol


The timestamp-ordering protocol ensures that any
conflicting read and write operations are executed in timestamp
order. This protocol operates as follows:
1. Suppose that transaction Ti issues read(Q).
a. If TS(Ti ) < W-timestamp(Q), then Ti needs to read a value of
Q that was already overwritten. Hence, the read operation is
rejected, and Ti is rolled back.
b. If TS(Ti ) ≥ W-timestamp(Q), then the read operation is
executed, and R-timestamp(Q) is set to the maximum of R-
timestamp(Q) and TS(Ti ).
2. Suppose that transaction Ti issues write(Q).
a. If TS(Ti ) < R-timestamp(Q), then the value of Q that Ti is
producing was needed previously, and the system assumed that
that value would never be produced. Hence, the system rejects
the write operation and rolls Ti back.
b. If TS(Ti ) < W-timestamp(Q), then Ti is attempting to write an
obsolete value of Q. Hence, the system rejects this write
operation and rolls Ti back.

c. Otherwise, the system executes the write operation and sets


W-timestamp(Q) to TS(Ti ).
Suppose a transaction Ti is rolled back by the concurrency-
control scheme due to either a read or write operation. In that
case, the system assigns it a new timestamp and restarts it.
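The two rules above translate almost directly into code; the following Python sketch (illustrative, with hypothetical data structures) keeps R- and W-timestamps per item and either performs an operation or signals that the issuing transaction must be rolled back:

# A minimal sketch of the timestamp-ordering checks.
R_TS, W_TS = {}, {}        # per-item read and write timestamps

class Rollback(Exception):
    pass

def read(ts_i, q):
    if ts_i < W_TS.get(q, 0):
        raise Rollback(f"read({q}) rejected: the item was already overwritten")
    R_TS[q] = max(R_TS.get(q, 0), ts_i)      # record the most recent reader

def write(ts_i, q):
    if ts_i < R_TS.get(q, 0):
        raise Rollback(f"write({q}) rejected: a younger transaction already read {q}")
    if ts_i < W_TS.get(q, 0):
        raise Rollback(f"write({q}) rejected: the value of {q} is already newer")
    W_TS[q] = ts_i                           # the write is performed

read(10, "Q")            # ok, R-timestamp(Q) becomes 10
try:
    write(5, "Q")        # TS(Ti) = 5 < R-timestamp(Q) = 10
except Rollback as e:
    print(e)             # the transaction would be rolled back and restarted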

The timestamp-ordering protocol ensures conflict serializability.


This is because conflicting operations are processed in
timestamp order.
11.10 Validation-Based Protocols
A validation scheme is an appropriate concurrency-control
method in cases where most transactions are read-only
transactions. Thus the rate of conflicts among these transactions
is low. A unique fixed timestamp is associated with each
transaction in the system. The timestamp of the transaction
determines the serializability order. A transaction in this scheme
is never delayed. It must, however, pass a validation test to
complete. If it does not pass the validation test, the system rolls
it back to its initial state.

The validation protocol requires that each transaction Ti


executes in two or three different phases in its lifetime,
depending on whether it is a read-only or an update transaction.
The phases are, in order:

1. Read phase. During this phase, the system executes


transaction Ti. It reads the values of the various data items and
stores them in variables local to Ti. It performs all write
operations on temporary local variables without updating the
database.

2. Validation phase. The validation test (described below) is


applied to transaction Ti. This determines whether Ti is allowed
to proceed to the writing phase without causing a violation of
serializability. If a transaction fails the validation test, the system
aborts the transaction.
3. Write phase. If the validation test succeeds for transaction Ti,
the temporary local variables that hold the results of any write
operations performed by Ti are copied to the database. Read-
only transactions omit this phase.
Each transaction must go through the phases in the order
shown. However, phases of concurrently executing transactions
can be interleaved. To perform the validation test, we need to
know when the various phases of transactions occurred. We
shall, therefore, associate three different timestamps with each
transaction Ti :

1. Start(Ti), the time when Ti started its execution.


2. Validation(Ti ), when Ti finished its read phase and started
its validation phase.
3. Finish(Ti), the time when Ti finished its write phase.

The validation test for transaction Ti requires that for all


transactions Tk with TS(Tk ) < TS(Ti ), one of the following two
conditions must hold:
1. Finish(Tk ) < Start(Ti ). Since Tk completes its execution before
Ti starts, the serializability order is maintained.
2. The set of data items written by Tk does not intersect with the
set of data items read by Ti. Tk completes its write phase before
Ti starts its validation phase (Start(Ti ) < Finish(Tk ) <
Validation(Ti )). This condition ensures that the writes of Tk

and Ti do not overlap. Since the writes of Tk do not affect the


read of Ti, and since Ti cannot affect the read of Tk, the
serializability order is indeed maintained.
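The validation test can be written down almost verbatim; the sketch below makes illustrative assumptions about how the read/write sets and phase times of each transaction are recorded (the transaction names and values are hypothetical):

# A minimal sketch of the validation test for transaction Ti.
# Each transaction record carries its phase times and read/write sets.
def validates(ti, earlier):
    # earlier: transactions Tk with TS(Tk) < TS(Ti)
    for tk in earlier:
        if tk["finish"] < ti["start"]:
            continue                          # condition 1: Tk finished before Ti started
        no_overlap = not (tk["write_set"] & ti["read_set"])
        serial_phases = ti["start"] < tk["finish"] < ti["validation"]
        if no_overlap and serial_phases:
            continue                          # condition 2 holds
        return False                          # Ti fails validation and is aborted
    return True

Ta = {"start": 1, "validation": 4, "finish": 5,
      "read_set": {"B"}, "write_set": {"B"}}
Tb = {"start": 2, "validation": 7, "finish": 9,
      "read_set": {"A"}, "write_set": {"A"}}
print(validates(Tb, [Ta]))   # True: Ta's writes do not intersect Tb's reads,
                             # and Ta finished writing before Tb validated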

Fig. 11.5 Schedule Produced By Using Validation

Since transactions execute optimistically, assuming they will be able to finish execution and validate at the end, this


validation scheme is known as the optimistic concurrency


control scheme.
11.11 Multiversion Schemes
A multiversion concurrency-control scheme is based on
creating a new version of a data item for each transaction that
writes that item. When a read operation is issued, the system
selects one of the versions to be read. The concurrency-control
scheme ensures that the version to be read is selected to ensure
serializability by using timestamps. A read operation always
succeeds.
 In multiversion timestamp ordering, a write operation may
result in the rollback of the transaction.
 In multi-version two-phase locking, write operations may result
in a lock wait or, possibly, in deadlock.
With each data item Q, a sequence of versions <Q1, Q2, . . . , Qm> is associated. Each version Qk contains three data fields:
• Content is the value of version Qk.
• W-timestamp(Qk ) is the transaction's timestamp that created
version Qk.
• R-timestamp(Qk ) is the largest timestamp of any transaction
that successfully reads version Qk.
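Under multiversion timestamp ordering, a read by Ti is directed to the version of Q whose write timestamp is the largest one not exceeding TS(Ti); a sketch with hypothetical version records follows:

# A minimal sketch of version selection for a read under multiversion
# timestamp ordering. Each version carries (content, w_ts, r_ts).
versions_of_Q = [
    {"content": 100, "w_ts": 2,  "r_ts": 4},
    {"content": 150, "w_ts": 7,  "r_ts": 7},
    {"content": 180, "w_ts": 12, "r_ts": 12},
]

def read(ts_i, versions):
    # choose the version with the largest W-timestamp not exceeding TS(Ti)
    candidates = [v for v in versions if v["w_ts"] <= ts_i]
    chosen = max(candidates, key=lambda v: v["w_ts"])
    chosen["r_ts"] = max(chosen["r_ts"], ts_i)   # a read never fails
    return chosen["content"]

print(read(9, versions_of_Q))    # 150: the version written at timestamp 7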

11.12 Snapshot Isolation
Snapshot isolation is a multi-version concurrency-control
protocol based on validation. Unlike multi-version two-phase
locking, it does not require transactions to be declared read-only
or update transactions. Snapshot isolation does not guarantee serializability
but is supported by many database systems.


CHAPTER 12
RECOVERY SYSTEM &
DATA ON EXTERNAL STORAGE

12.1 Failure Classification: Failures are of the following types:


• Transaction failure. There are two types of errors that may
cause a transaction to fail:
• Logical error. The transaction can no longer continue with its
normal execution because of internal conditions, such as bad
input, data not found, overflow, or exceeded resource limit.
• System error. The system has entered an undesirable state (for
example, deadlock).
• System crash. There is a hardware malfunction or a bug in the
database software or the operating system that causes the loss of
the content of volatile storage and brings transaction processing
to a halt.
• Disk failure. A disk block loses its content due to either a
head crash or failure during a data-transfer operation. Copies of
the data on other disks, or archival backups on tertiary media,
such as DVDs or tapes, are used to recover from the failure.
12.2 Storage: Three categories of storage media are:
• Volatile storage
• Nonvolatile storage
• Stable storage
Stable-Storage Implementation: To implement stable storage,
we need to replicate the needed information in several
nonvolatile storage media (usually disk)with independent
failure modes, and to update the information in a controlled manner to ensure
that failure during data transfer does not damage the needed


information. RAID systems, however, cannot guard against data


loss due to disasters such as fires or flooding. Remote backup, in which the data are also replicated at a physically separate site, protects against such disasters.

Block transfer between memory and disk storage can result in:
• Successful completion. The transferred information arrived
safely at its destination.
• Partial failure. A failure occurred amid transfer, and the
destination block has incorrect information.
• Total failure. The failure occurred sufficiently early during
the transfer that the destination block remains intact.

Data Access: Block movements between disk and main memory


are initiated through the following two operations:
1. input(B) transfers the physical block B to main memory.
2. output(B) transfers the buffer block B to the disk, and replaces
the appropriate physical block.

Fig. 12.1 Block Storage Operations


We pass data using the following two operations:


1. read(X) assigns the value of data item X to the local variable xi. This operation is executed as follows:
a. If block BX on which X resides is not in main memory, it issues input(BX).
b. It assigns to xi the value of X from the buffer block.
2. write(X) assigns the value of local variable xi to data item X in the buffer block. This operation is executed as follows:
a. If block BX on which X resides is not in main memory, it issues input(BX).
b. It assigns the value of xi to X in buffer BX.
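A toy model of these four operations can make the separation between disk, buffer, and local variables concrete. In the sketch below (an illustrative assumption), the "disk" and the "buffer" are dictionaries of blocks, and a small map records which block each data item lives in:

# A minimal sketch of input/output and read/write on buffered blocks.
disk = {"BX": {"X": 500}, "BY": {"Y": 300}}   # blocks on nonvolatile storage
buffer = {}                                    # blocks currently in main memory
block_of = {"X": "BX", "Y": "BY"}              # which block each item resides in

def input_block(b):
    buffer[b] = dict(disk[b])                  # copy the physical block into memory

def output_block(b):
    disk[b] = dict(buffer[b])                  # write the buffer block back to disk

def read(x):
    b = block_of[x]
    if b not in buffer:
        input_block(b)                         # bring the block in if necessary
    return buffer[b][x]                        # assign the value to a local variable

def write(x, xi):
    b = block_of[x]
    if b not in buffer:
        input_block(b)
    buffer[b][x] = xi                          # update only the buffer copy

xi = read("X") - 50
write("X", xi)
output_block("BX")                             # the change reaches the disk only now
print(disk["BX"])                              # {'X': 450}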

12.3. Recovery and Atomicity


In case of failure, the state of the database system may no
longer be consistent; that is, it may not reflect a state of the
world that the database is supposed to capture. To preserve
consistency, we require that each transaction be atomic. It is the
responsibility of the recovery scheme to ensure the atomicity
and durability property.

12.3.1 Log Records


All updates are recorded on a log in log-based schemes,
which must be kept in stable storage. A transaction is
considered committed when its last log record, the commit log
record for the transaction, has been output to stable storage.

There are several types of log records. An update log record


describes a single database write. It has these fields:
• Transaction identifier is the unique identifier of the
transaction that performed the write operation.

• Data-item identifier is the unique identifier of the written data


item. Typically, it is the location on disk of the data item,
consisting of the block identifier on which the data item resides,
and an offset.
• Old value, which is the data item's value before the write.
• New value, which is the data item's value after the write.
We represent an update log record as <Ti, Xj, V1, V2>,
indicating that Ti has performed a write on data item Xj, which
had value V1 before the write and value V2 after the write.
Among the types of log records are the following (a toy representation is sketched after the list):
• <Ti start>. Transaction Ti has started.
• <Ti commit>. Transaction Ti has committed.
• <Ti abort>. Transaction Ti has aborted.
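For illustration only, the log record formats above can be represented as simple Python tuples; the helper names below are assumptions of this sketch, not a standard interface.

# Illustrative representations of the log record types described above.
def start_record(ti):
    return ("start", ti)                      # <Ti start>

def update_record(ti, xj, old, new):
    return ("update", ti, xj, old, new)       # <Ti, Xj, V1, V2>

def commit_record(ti):
    return ("commit", ti)                     # <Ti commit>

def abort_record(ti):
    return ("abort", ti)                      # <Ti abort>

log = [
    start_record("T1"),
    update_record("T1", "X", 100, 50),        # T1 wrote X: old value 100, new value 50
    commit_record("T1"),
]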

12.3.2 Database Modification


As we noted earlier, a transaction creates a log record
before modifying the database. The log records allow the system
to undo changes made by a transaction if the transaction must
be aborted; they allow the system to
redo changes made by a transaction if the transaction has
committed but the system crashed before those changes could
be stored in the database on disk. In order for us to understand
the role of these log records in recovery, we need to consider the
steps a transaction takes in modifying a data item:
1. The transaction performs some computations in its private
part of main memory.
2. The transaction modifies the data block in the disk buffer in
main memory holding the data item.
3. The database system executes the output operation that writes
the data block to disk.

If a transaction does not modify the database until it has


committed, it uses the deferred-modification technique. If
database modifications occur while the transaction is still active,
the transaction is said to use the immediate-modification
technique.

A recovery algorithm must take into account a variety of factors,


including:
• The possibility that a transaction may have committed
although some of its database modifications exist only in the
disk buffer in main memory and not in the database on disk.
• The possibility that a transaction may have modified the
database while in the active state and, due to a subsequent
failure, may need to abort.

• Undo, using a log record, sets the data item specified in the
record back to its old value.
• Redo, using a log record, sets the data item specified in the
record to its new value.

12.3.3 Concurrency Control and Recovery


If the concurrency control scheme allows a data item X
that has been modified by a transaction T1 to be further
modified by another transaction T2 before T1 commits, then
undoing the effects of T1 by restoring the old value of X (from
before T1 updated X) would also undo the effects of T2. To
avoid such situations, recovery algorithms usually require that,
if a transaction has modified a data item, no other transaction
may modify it until the first transaction commits or aborts. This
requirement can be ensured by acquiring an exclusive lock on
the modified data item and holding the lock until the transaction
commits or aborts.

12.3.4 Transaction Commit


We say that a transaction has committed when its commit
log record, the last log record of the transaction, has been output
to stable storage; all earlier log records have already been output
to stable storage. Thus, there is enough information in the log to
ensure that the transaction's updates can be redone even if there
is a system crash. If a system crash occurs before a log record <
Ti commit> is output to stable storage, transaction Ti will be
rolled back.
12.3.5 Using the Log to Redo and Undo Transactions
Using the log, the system can handle any failure that does
not result in the loss of information in nonvolatile storage. The
recovery scheme uses two recovery procedures. These
procedures use the log to find the set of data items updated by
each transaction Ti , and their respective old and new values.
• redo(Ti ) sets the value of all data items updated by
transaction Ti to the new values.
• undo(Ti ) restores all data items updated by transaction Ti to
the old values.

After a system crash, the system consults the log to determine
which transactions need to be redone and which must be
undone in order to ensure atomicity.
• Transaction Ti needs to be undone if the log contains the
record <Ti start>, but does not contain either the record <Ti
commit>or the record <Ti abort>.


• Transaction Ti needs to be redone if the log contains the
record <Ti start> and either the record <Ti commit> or the
record <Ti abort>. It may seem strange to redo Ti if the record
<Ti abort> is in the log; note, however, that in this case the log
also contains the compensating records written while Ti was
rolled back, so replaying them leaves Ti's effects undone. This
approach of repeating history simplifies the recovery scheme. A
sketch of this classification rule appears below.
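As a rough sketch of the rule above (ignoring checkpoints, and reusing the tuple log representation introduced earlier; the function name classify is an assumption of this sketch), the sets of transactions to redo and undo can be computed from a single scan of the log.

def classify(log):
    """Return (redo_set, undo_set) according to the rules above."""
    started, finished = set(), set()
    for rec in log:
        kind, ti = rec[0], rec[1]
        if kind == "start":
            started.add(ti)
        elif kind in ("commit", "abort"):
            finished.add(ti)
    redo = started & finished                 # has <Ti start> and <Ti commit>/<Ti abort>
    undo = started - finished                 # has <Ti start> but no commit/abort record
    return redo, undo

log = [("start", "T1"), ("update", "T1", "X", 100, 50), ("commit", "T1"),
       ("start", "T2"), ("update", "T2", "Y", 20, 30)]
print(classify(log))   # ({'T1'}, {'T2'})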

Checkpoints: When a system crash occurs, we must consult the


log to determine those transactions that need to be redone and
those that need to be undone. We need to search the entire log to
determine this information. There are two major difficulties with
this approach:
1. The search process is time-consuming.
2. Most of the transactions that, according to our algorithm,
need to be redone have already written their updates into the
database. Although redoing them will cause no harm, it will
nevertheless cause recovery to take longer.
To reduce these types of overhead, we introduce checkpoints.
A checkpoint is performed as follows:
1. Output all log records currently residing in main memory
onto stable storage.
2. Output to the disk all modified buffer blocks.
3. Output onto stable storage a log record of the form
<checkpoint L>, where L is a list of transactions active at the
checkpoint.
A <checkpoint L> record in the log allows the system to
streamline its recovery procedure: the redo or undo operations
need to be applied only to transactions in L and to transactions
that started execution after the <checkpoint L> record was
written to the log. Let us denote this set of transactions as T.


• For all transactions Tk in T that have no <Tk commit> record


or <Tk abort> record in the log, execute undo(Tk ).
• For all transactions Tk in T such that either the record <Tk
commit> or the record <Tk abort> appears in the log, execute
redo(Tk ).

A fuzzy checkpoint is a checkpoint where transactions can


perform updates even while buffer blocks are written out.

12.4 Recovery Algorithm


The recovery algorithm requires that a data item that an
uncommitted transaction has updated cannot be modified by
any other transaction, until the first transaction has either
committed or aborted.
Transaction Rollback: First, consider transaction rollback
during normal operation (not during recovery from a system
crash). Rollback of a transaction Ti is performed as follows:
1. The log is scanned backward, and for each log record of Ti of
the form <Ti, Xj, V1, V2> that is found:
a. The value V1 is written to data item Xj, and
b. A special redo-only log record <Ti, Xj, V1> is written to the
log, where V1 is the value restored to data item Xj during the
rollback. These log records are sometimes called compensation
log records. Such records do not need undo information, since
we never need to undo such an undo operation. We shall explain
later how they are used.
2. Once the log record <Ti start> is found, the backward scan is
stopped, and a log record <Ti abort> is written to the log. A
sketch of this procedure is given below.
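The following is a minimal Python sketch of this rollback procedure, using the toy log representation from the earlier sketches; the in-memory database dictionary stands in for the buffer blocks, and the function name rollback is an assumption of this sketch.

def rollback(ti, log, database):
    """Roll back transaction ti during normal operation, as described above."""
    # Scan the log backward, undoing ti's updates and writing a redo-only
    # (compensation) log record for each restored value.
    for rec in reversed(list(log)):
        if rec[0] == "update" and rec[1] == ti:
            _, _, xj, v1, _v2 = rec
            database[xj] = v1                       # restore the old value
            log.append(("redo-only", ti, xj, v1))   # compensation record <Ti, Xj, V1>
        elif rec[0] == "start" and rec[1] == ti:
            break                                   # stop at <Ti start>
    log.append(("abort", ti))                       # finally write <Ti abort>

database = {"X": 50, "Y": 30}
log = [("start", "T2"), ("update", "T2", "Y", 20, 30)]
rollback("T2", log, database)
print(database["Y"], log[-2:])   # 20  [('redo-only', 'T2', 'Y', 20), ('abort', 'T2')]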


Recovery after a System Crash: Recovery actions, when the


database system is restarted after a crash, take place in two
phases:
1. In the redo phase, the system replays the updates of all
transactions by scanning the log forward from the last
checkpoint. The log records that are replayed include log
records for transactions that were rolled back before the system
crash and for those that had not committed when the crash
occurred. This phase also determines which transactions were
incomplete at the time of the crash and must therefore be rolled
back. Such incomplete transactions would either have been
active at the time of the checkpoint, and thus would appear in
the transaction list in the checkpoint record, or would have
started later; further, such incomplete transactions would have
neither a <Ti abort> nor a <Ti commit> record in the log.
The specific steps taken while scanning the log are as follows:
a. The list of transactions to be rolled back, undo-list, is initially
set to the list L in the <checkpoint L> log record.
b. Whenever a normal log record of the form <Ti , Xj , V1, V2>,
or a redo-only log record of the form <Ti , Xj , V2> is
encountered, the operation is redone; that is, the value V2 is
written to data item Xj .
c. Whenever a log record of the form <Ti start> is found, Ti is
added to undo-list.
d. Whenever a log record of the form <Ti abort> or <Ti commit>
is found, Ti is removed from undo-list. At the end of the redo
phase, undo-list contains the list of all incomplete transactions,
that is, those that neither committed nor completed rollback
before the crash.


2. In the undo phase, the system rolls back all transactions in the
undo-list. It performs rollback by scanning the log backward
from the end.
a. Whenever it finds a log record belonging to a transaction in
the undo-list, it performs undo actions just as if the log record
had been found during the rollback of a failed transaction.
b. When the system finds a <Ti start> log record for a
transaction Ti in undo-list, it writes a <Ti abort> log record to
the log, and removes Ti from undo-list.
c. The undo phase terminates once undo-list becomes empty,
that is, once the system has found <Ti start> log records for all
transactions that were initially in undo-list. After the undo
phase of recovery terminates, normal transaction processing can
resume. A combined sketch of both phases is given below.
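Putting the two phases together, a simplified Python sketch (starting from a given <checkpoint L> list, using the toy log format of the earlier sketches, and omitting details such as scanning only from the checkpoint) might look like this; the name recover and the argument names are assumptions of this sketch.

def recover(log, database, checkpoint_active):
    """Redo phase followed by undo phase, as described above (simplified)."""
    undo_list = set(checkpoint_active)          # list L from <checkpoint L>

    # Redo phase: repeat history by scanning the log forward.
    for rec in log:
        kind = rec[0]
        if kind == "update":                    # <Ti, Xj, V1, V2>
            _, _, xj, _v1, v2 = rec
            database[xj] = v2
        elif kind == "redo-only":               # <Ti, Xj, V>
            _, _, xj, v = rec
            database[xj] = v
        elif kind == "start":
            undo_list.add(rec[1])
        elif kind in ("commit", "abort"):
            undo_list.discard(rec[1])

    # Undo phase: roll back incomplete transactions by scanning backward.
    for rec in reversed(list(log)):
        if not undo_list:
            break
        if rec[0] == "update" and rec[1] in undo_list:
            _, ti, xj, v1, _v2 = rec
            database[xj] = v1
            log.append(("redo-only", ti, xj, v1))
        elif rec[0] == "start" and rec[1] in undo_list:
            log.append(("abort", rec[1]))
            undo_list.discard(rec[1])
    return database

db = {"X": 0, "Y": 0}
crash_log = [("start", "T1"), ("update", "T1", "X", 0, 5), ("commit", "T1"),
             ("start", "T2"), ("update", "T2", "Y", 0, 7)]
recover(crash_log, db, checkpoint_active=[])
print(db)   # {'X': 5, 'Y': 0} -- T1 redone, T2 undone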

Fig. 12.2 Example of Logged Actions, and Actions During Recovery


12.5. Buffer Management


Transaction processing is based on a storage model. Main
memory holds a log buffer, a database buffer, and a system buffer.
The system buffer holds pages of system object code and local
work areas of transactions.

The cost of outputting a block to stable storage is so high that it


is desirable to output multiple log records simultaneously. We
write log records to a log buffer in main memory, where they
stay temporarily until they are output to stable storage. Multiple
log records can be gathered in the log buffer and output to
stable storage in a single output operation. The order of log
records in the stable storage must be the same as the order in
which they were written to the log buffer.
As a result of log buffering, a log record may reside only in main
memory (volatile storage) for a considerable time before it is
output to stable storage.
system crashes, we must impose additional requirements on the
recovery techniques to ensure transaction atomicity:
• Transaction Ti enters the commit state after the <Ti commit>
log record has been output to stable storage.
• Before the <Ti commit> log record can be output to stable
storage, all log records about transaction Ti must have been
output to stable storage.
• Before a block of data in main memory can be output to the
database (in nonvolatile storage), all log records about data in
that block must have been output to stable storage. This rule is
called the write-ahead logging (WAL) rule. Writing the buffered
log to disk is sometimes referred to as a log force. A sketch of
how the WAL rule can be enforced appears below.
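The following is an illustrative Python sketch of enforcing the write-ahead logging rule before a data block is written out; the helpers flush_log_up_to and output_block_with_wal, and the idea of tracking the last relevant log index per block, are assumptions of this sketch rather than any real system's API.

# Illustrative enforcement of the WAL rule: force log records before data blocks.
stable_log = []          # log records already on stable storage
log_buffer = []          # log records still only in main memory

def flush_log_up_to(n):
    """Force the first n buffered log records to stable storage (a 'log force')."""
    global log_buffer
    stable_log.extend(log_buffer[:n])
    log_buffer = log_buffer[n:]

def output_block_with_wal(block_id, buffer_pool, disk, last_log_index_for_block):
    """Write a buffer block to disk only after its log records are stable (WAL)."""
    flush_log_up_to(last_log_index_for_block)     # WAL: log first ...
    disk[block_id] = dict(buffer_pool[block_id])  # ... then the data block

log_buffer.extend([("start", "T1"), ("update", "T1", "X", 100, 50)])
buffer_pool = {"B1": {"X": 50}}
disk = {}
output_block_with_wal("B1", buffer_pool, disk, last_log_index_for_block=2)
print(stable_log, disk)   # both log records are stable before B1 reaches disk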


Database Buffering:
One might expect transactions to force-output all modified
blocks to disk when they commit. Such a policy is called the
force policy. The no-force policy alternative allows a transaction
to commit even if it has modified some blocks that have not yet
been written back to disk.
Similarly, one might expect that blocks modified by a still active
transaction should not be written to disk. This policy is called
the no-steal policy. The alternative, the steal policy, allows the
system to write modified blocks to disk even if the transactions
that made those modifications have not all committed. As long
as the write-ahead logging rule is followed, all the recovery
algorithms work correctly even with the steal policy.
When a block B1 is to be output to disk, all log records about
data in B1 must be output to stable storage before B1 is output.
No write to block B1 may be in progress while the block is
being output, since such a write could violate the write-ahead
logging rule. We can ensure that there are no writes in progress
by using a special means of locking:
• Before a transaction performs a write on a data item, it
acquires an exclusive lock on the block in which the data item
resides. The lock is released immediately after the update has
been performed.
• The following sequence of actions is taken when a block is to
be output:
a) Obtain an exclusive lock on the block, to ensure that no
transaction performs a write on the block.
b) Output log records to stable storage until all log records about
block B1 have been output.
c) Output block B1 to disk.

d) Release the lock once the block output has completed.

Locks held for a short duration are often referred to as latches.
As a result of continuously outputting modified blocks, the
number of dirty blocks in the buffer (blocks that have been
modified in the buffer but have not yet been output) is
minimized. Thus, the number of blocks that must be output
during a checkpoint is minimized.
Fuzzy Checkpointing: The checkpointing strategy
necessitates that all database changes be temporarily halted
while the checkpoint is performed. Suppose the number of
pages in the buffer is high. In that case, a checkpoint can take a
long time to complete, resulting in an unnecessary interruption
in transaction processing. To avoid such interruptions, the
checkpointing technique can be changed to allow updates to
begin after the checkpoint record has been written but before the
modified buffer blocks are written to disk. The resulting
checkpoint is called a fuzzy checkpoint.

12.6 Failure with Loss of Nonvolatile Storage


Although failures in which the content of nonvolatile
storage is lost are rare, we nevertheless need to be prepared to
deal with this type of failure. The basic scheme is to dump the
entire database contents to stable storage periodically, once per
day. One approach to database dumping requires that no
transaction may be active during the dump procedure, and uses
a procedure similar to checkpointing:
1. Output all log records in the main memory onto stable
storage.
2. Output all buffer blocks onto the disk.

3. Copy the contents of the database to stable storage.


4. Output a log record <dump> onto the stable storage.

A dump of the database contents is also referred to as an


archival dump, since we can archive the dumps and use them
later to examine old states of the database. Dumps of a database
and checkpointing of buffers are similar.

The simple dump procedure is costly for the following two


reasons. First, the entire database must be copied to stable
storage, resulting in considerable data transfer. Second, since
transaction processing is halted during the dump procedure,
CPU cycles are wasted. Fuzzy dump schemes have been
developed that allow transactions to be active while the dump is
in progress.

12.7 ARIES
Introduction: The ARIES recovery scheme is a state-of-
the-art scheme that supports several features to provide greater
concurrency, reduce logging overheads, and minimize recovery time. It
is also based on repeating history, and allows logical undo
operations. The scheme flushes pages continuously and does not need
to flush all pages at the time of a checkpoint. It uses log sequence
numbers (LSNs) to implement various optimizations that reduce
the time taken for recovery.

12.7.1 ARIES:
ARIES uses many techniques to reduce the time taken for
recovery and reduce checkpointing overhead. In particular,
ARIES can avoid redoing many logged operations that have
already been applied and reduce the amount of information


logged. The price paid is greater complexity; the benefits are
worth the price. The major differences between ARIES and the
recovery algorithm presented earlier are that ARIES:
1. Uses a log sequence number (LSN) to identify log records, and
stores LSNs in database pages to identify which operations
have been applied to a database page.
2. Supports physiological redo operations, which are physical in
that the affected page is physically identified, but can be logical
within the page.
3. Uses a dirty page table to minimize unnecessary redos during
recovery. Dirty pages are pages that have been updated in
memory and whose disk version is not up to date (a sketch of
the redo test appears after this list).
4. Uses a fuzzy-checkpointing scheme that records only
information about dirty pages and associated information and
does not even require writing dirty pages to disk. It
continuously flushes dirty pages in the background instead of
writing them during checkpoints.
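As a rough illustration of how LSNs and the dirty page table let ARIES skip unnecessary redos, the redo step might test conditions like the ones below; the field names page_id, lsn, and rec_lsn and the function should_redo are assumptions of this sketch, and real ARIES implementations differ in many details.

# Sketch of the ARIES redo test: reapply a logged update to a page only if needed.
def should_redo(log_rec, dirty_page_table, page_lsn_on_disk):
    """Return True if the logged update must be reapplied to its page."""
    page_id, lsn = log_rec["page_id"], log_rec["lsn"]
    entry = dirty_page_table.get(page_id)
    if entry is None:
        return False                 # page was flushed after the update; nothing to redo
    if lsn < entry["rec_lsn"]:
        return False                 # update is older than the earliest dirtying update
    if lsn <= page_lsn_on_disk:
        return False                 # page already reflects this update (its PageLSN is newer)
    return True

dpt = {"P5": {"rec_lsn": 40}}
print(should_redo({"page_id": "P5", "lsn": 42}, dpt, page_lsn_on_disk=38))  # True
print(should_redo({"page_id": "P9", "lsn": 42}, dpt, page_lsn_on_disk=38))  # False: P9 is not dirty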
12.7.2 Recovery Algorithm:
ARIES recovers from a system crash in three passes.
• Analysis pass: This pass determines which transactions to
undo, which pages were dirty at the crash, and the LSN from
which the redo pass should start.
• Redo pass: This pass starts from a position determined during
analysis, and performs a redo, repeating history, to bring the
database to a state it was in before the crash.
• Undo pass: This pass rolls back all incomplete transactions at
the time of crash.


12.7.3 Other characteristics:


Among other key features that ARIES provides are:
• Nested top actions: ARIES allows the logging of operations
that should not be undone even if the transaction is later rolled
back; such operations are called nested top actions.
• Recovery independence: Some pages can be recovered
independently from others to be used even while other pages
are being recovered. If some disk pages fail, they can be
recovered without stopping transaction processing on other
pages.
• Savepoints: Transactions can record savepoints and partially
roll back up to a savepoint. This can be useful for deadlock
handling, since transactions can be rolled back up to a point that
permits the release of required locks and then restarted from
that point. Programmers can also use savepoints to undo a
transaction partially, and then continue execution.
• Fine-grained locking: The ARIES recovery algorithm can be
used with index concurrency-control algorithms that permit
tuple-level locking on indices, instead of page-level locking,
which improves concurrency significantly.
• Recovery optimizations: The DirtyPageTable can be used to
prefetch pages during redo, instead of fetching a page only
when the system finds a log record to be applied to it. Out-of-
order redo is also possible: redo can be postponed on a page that
has to be fetched from disk and performed once the page has
been fetched; meanwhile, other log records can continue to be
processed.


Fig. 12.3 Architecture of Remote Backup System

12.8 Remote Backup Methods


Remote backup systems provide a high degree of
availability, allowing transaction processing to continue even if
the primary site is destroyed by a fire, flood, or earthquake.
Data and log records from a primary site are continually backed
up to a remote backup site. If the primary site fails, the remote
backup site takes over transaction processing, after executing
certain recovery actions.
Several issues must be addressed in designing a remote backup
system:
• Detection of failure. The remote backup system needs to
detect when the primary has failed. We maintain several
communication links with independent failure modes between
the primary and the remote backup.
• Transfer of control. When the primary fails, the backup site
takes over processing and becomes the new primary. When the
original primary site recovers, it can either play the role of
remote backup, or take over the primary site again.
• Time to recover. If the log at the remote backup grows large,
recovery will take a long time. The remote backup site can
periodically process the redo log records received and perform a


checkpoint to delete earlier log parts. The delay before the


remote backup takes over can be significantly reduced. A hot-
spare configuration can make takeover by the backup site
almost instantaneous.
• Time to commit. To ensure that the updates of a committed
transaction are durable, a transaction must not be declared
committed until its log records have reached the backup site.
This delay can result in a longer wait to commit a transaction.
Some systems therefore permit lower degrees of durability,
which can be classified as follows (a schematic sketch of the
commit decision appears after the list):
 One-safe. A transaction commits as soon as its commit log
record is written to stable storage at the primary site.
 Two-very-safe. A transaction commits as soon as its commit log
record is written to stable storage at the primary and the backup
site.
 Two-safe. This scheme is the same as two-very-safe if both
primary and backup sites are active. If only the primary is
active, the transaction is allowed to commit as soon as its
commit log record is written to stable storage at the primary site.
This scheme provides better availability than does two-very-safe,
while avoiding the problem of lost transactions faced by the
one-safe scheme.
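A schematic Python sketch of the commit decision under these durability levels follows; the helpers write_commit_log_at_primary, backup_is_active, and wait_for_backup_ack are placeholders for whatever logging and communication machinery a real system would use, and the whole function is illustrative rather than a definitive protocol.

# Illustrative commit rules for one-safe, two-very-safe, and two-safe durability.
def commit(txn, mode, write_commit_log_at_primary, backup_is_active, wait_for_backup_ack):
    write_commit_log_at_primary(txn)          # commit record reaches primary stable storage
    if mode == "one-safe":
        return "committed"                    # do not wait for the backup at all
    if mode == "two-very-safe":
        wait_for_backup_ack(txn)              # must also be stable at the backup site
        return "committed"
    if mode == "two-safe":
        if backup_is_active():
            wait_for_backup_ack(txn)          # behaves like two-very-safe when the backup is up
        return "committed"                    # otherwise commit at the primary alone
    raise ValueError("unknown durability mode")

status = commit("T1", "two-safe",
                write_commit_log_at_primary=lambda t: None,
                backup_is_active=lambda: False,
                wait_for_backup_ack=lambda t: None)
print(status)   # committed (backup is down, so the primary alone suffices under two-safe)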

An alternative way of achieving high availability is to use a


distributed database, with data replicated at more than one site.
Transactions are then required to update all replicas of any data
item they update.
12.9 File Organizations and Indexing


The file of records is an important abstraction in a DBMS, and is
implemented by the file layer. A file can be created and
destroyed, and records can be inserted into and deleted from it.
A relation is typically stored in a file of records. The file layer
stores the records in a file in a collection of disk pages. It keeps
track of pages allocated to each file. As records are inserted into
and deleted from the file, it also tracks available space within
pages allocated to the file.

The simplest file structure is an unordered file, or heap file.


Records in a heap file are stored in random order across the
pages of the file. A heap file organization supports retrieval of
all records, or retrieval of a particular record specified by its rid;
the file manager must keep track of the pages allocated for the
file.

An index is a data structure that organizes data records on disk


to optimize certain retrieval operations. An index allows us to
efficiently retrieve all records that satisfy search conditions on the
search key fields of the index. We can also create additional
indexes on a given collection of data records, each with a
different search key, to speed up search operations that are not
efficiently supported by the file organization used to store the
data records.

There are three main alternatives for what to store as a data
entry in an index (they are illustrated schematically after this
list):
1. A data entry k* is an actual data record (with search key
value k).


2. A data entry is a (k, rid) pair, where rid is the record id of a


data record with search key value k.
3. A data entry is a (k, rid-list) pair, where rid-list is a list of
record ids of data records with search key value k.
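For illustration, the three alternatives can be pictured as follows; the record contents and rid values are made up for the example.

# Alternative (1): the data entry is the data record itself.
entry1 = {"key": 30, "record": ("Alice", 30, 52000)}

# Alternative (2): the data entry is a (key, rid) pair.
entry2 = (30, ("page 12", "slot 4"))

# Alternative (3): the data entry is a (key, rid-list) pair.
entry3 = (30, [("page 12", "slot 4"), ("page 17", "slot 1")])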
Clustered Indexes:
When a file is organized so that the ordering of data records
is the same as or close to the ordering of data entries in some
index, we say that the index is clustered; otherwise, it is an
unclustered index. An index can be clustered only if the data
records are sorted on the search key field. Otherwise, the order
of the data records is random, defined purely by their physical
order, and there is no reasonable way to arrange the data entries
in the index in the same order.
The cost of using an index to answer a range search query can
vary tremendously based on whether the index is clustered. If
the index is clustered, i.e., we are using the search key of a
clustered file, the rids in qualifying data entries point to a
contiguous collection of records. We need to retrieve only a few
data pages.
Two data entries are said to be duplicates if they have the same
value for the search key field associated with the index. A
primary index is guaranteed not to contain duplicates, but an
index on other (collections of) fields can contain duplicates. In
general, a secondary index contains duplicates. If we know that
no duplicates exist, that is, we know that the search key contains
some candidate key, we call the index a unique index.


Fig. 12.4 Clustering Index

12.10 Index Data Structures


Hash-Based Indexing: We can organize records using a
technique called hashing to quickly find records with a given
search key value. In this approach, the file's records are grouped
into buckets, where a bucket consists of a primary page and,
possibly, additional pages linked in a chain. The bucket to which
a record belongs is determined by applying a special function,
called a hash function, to the search key. Given a bucket
number, a hash-based index structure allows us to retrieve the
primary page for the bucket in one or two disk I/Os.

On inserts, the record is placed in the appropriate bucket, with
'overflow' pages allocated as necessary. To search for a record
with a given search key value, we apply the hash function to
identify the bucket to which such records belong and look at all
pages in that bucket. If we do not have the search key value for
the record, the hash index is of no use and the entire file must be
scanned.


Hash indexing is illustrated in Figure 12.5. The data is stored in
a file hashed on age; the data entries in this first index file are
the actual data records. Applying the hash function to the age
field identifies the page to which the record belongs. The hash
function h for this example is quite simple: it converts the search
key value to its binary representation and uses the two least
significant bits as the bucket identifier (a small sketch of this
function is given below). Figure 12.5 also shows an index with
search key sal that contains (sal, rid) pairs as data entries.
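The two-bit hash function described above can be written, purely for illustration, as follows; the bucket count of four and the sample ages are assumptions of this sketch.

def h(age):
    """Map a search key value to one of four buckets using its two low-order bits."""
    return age & 0b11        # same as age % 4

for age in (22, 25, 30, 33, 40):
    print(age, bin(age), "-> bucket", h(age))
# e.g., 22 = 0b10110 -> bucket 2, 25 = 0b11001 -> bucket 1, 40 = 0b101000 -> bucket 0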

Fig. 12.5 Index-Organized File Hashed on Age

Tree-Based Indexing: A tree-like data structure can be used


to arrange records as an alternative to hash-based indexing. The
data entries are sorted by search key value. A hierarchical search
data structure is maintained to guide searches to the
appropriate page of data entries. Figure 12.6 depicts the
employee records arranged in a tree-structured index with age
as the search key.


The data entries are located at the lowest level of the tree,
known as the leaf level. If there were additional employee
records with ages less than 22 or greater than 50 (the lowest and
highest age values that appear in Figure 12.6), they would
appear in leaf pages to the left of page L1 or to the right of the
rightmost leaf page, respectively.
The B+ tree is an index structure that ensures that all paths
from the root to a leaf are of equal length; that is, the structure is
always balanced in height. Since each non-leaf node can hold a
large number of node pointers, the height of the tree is low, and
finding the correct leaf page is faster than a binary search of the
pages in a sorted file.

Fig. 12.6 Tree-Structured Index


The height of a balanced tree is the length of a path from the
root to a leaf; in practice it is rarely more than three or four. The
fan-out of the tree is the average number of children of a
non-leaf node. If every non-leaf node has n children, a tree of
height h has n^h leaf pages. For example, with a fan-out of 100
and a height of 3, the tree has 100^3 = 1,000,000 leaf pages.


12.11 Comparison of File Organizations


We now compare the costs of some simple operations for
several basic file organizations on a collection of employee
records. We assume that the files and indexes are organized
according to the composite search key (age, sal), and that all
selection operations are specified on these fields.
Our goal is to emphasize the importance of choosing an
appropriate file organization. The alternatives considered here
are the main ones that arise in practice: we can keep the records
unsorted, sort them, or build an index on the data file. Note that
even if the data file is sorted, an index whose search key differs
from the sort order behaves like an index on a heap file.

The operations we consider are these:


• Scan: Fetch all records in the file. The pages in the file must be
fetched from disk into the buffer pool.
• Search with Equality Selection: Fetch all records that satisfy
an equality selection.
• Search with Range Selection: Fetch all records that satisfy a
range selection
• Insert a Record: Insert a given record into the file. We must
identify the page in the file into which the new record must be
inserted, fetch that page from disk, modify it to include the new
record, and then write back the modified page.
• Delete a Record: Delete a specified record using its rid. We
must identify the page that contains the record, fetch it from
disk, modify it, and write it back.


12.11.1 Cost Model:


In comparing file organizations, we use
a simple cost model to estimate the cost (in terms of execution
time) of different database operations. We use B to denote the
number of data pages when records are packed onto pages with no
wasted space, and R to denote the number of records per page. The
average time to read or write a disk page is D, and the average
time to process a record (e.g., to compare a field value to a
selection constant) is C.
In the hashed file organization, we use a function, called a hash
function, to map a record into a range of numbers; the time
required to apply the hash function to a record is H. We will use F
to denote the fan-out for tree indexes, which typically is at least
100 as mentioned. Today's typical values are D = 15
milliseconds, C and H = 100 nanoseconds; we therefore expect
the cost of I/O to dominate. I/O is often (even typically) the
dominant component of the cost of database operations.
Considering I/O costs gives us a good first approximation to the
true costs. Further, CPU speeds are steadily rising, whereas disk
speeds are not increasing similarly.
We have chosen to concentrate on the I/O component of the
cost model, and we assume the simple constant C for in-
memory per-record processing cost. Bear the following
observations in mind:
 Real systems must consider other costs, such as CPU costs (and
network transmission costs in a distributed database).
 Even with our decision to focus on I/O costs, an accurate model
would be too complex to convey the essential ideas simply.
Therefore, we use a simplistic model in which we count the
number of pages read from or written to disk as a measure of
I/O. This ignores blocked access: disk systems typically let us
read a block of contiguous pages with a single I/O request, at a
cost equal to the time required to seek the first page in the block
plus the time to transfer all pages in the block. Such blocked
access can be much cheaper than issuing one I/O request per
page in the block, which would incur an additional seek cost for
each page. A small numeric example follows.
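For a concrete feel for these parameters, the following back-of-the-envelope Python calculation estimates the cost of scanning a heap file and of an average equality search on it, using the commonly cited heap-file formulas B(D + RC) for a full scan and 0.5 B(D + RC) for an average equality search; the particular B and R values are made up for illustration.

# Rough I/O-dominated cost estimates (in seconds) using the parameters defined above.
D = 15e-3        # average time to read/write a disk page (15 ms)
C = 100e-9       # time to process one record (100 ns)
B = 1000         # number of data pages (illustrative)
R = 100          # records per page (illustrative)

scan_cost = B * (D + R * C)             # fetch every page and look at every record
eq_search_heap = 0.5 * B * (D + R * C)  # on average, scan half of a heap file

print(f"scan ~ {scan_cost:.1f} s, equality search on heap ~ {eq_search_heap:.1f} s")
# scan ~ 15.0 s, equality search on heap ~ 7.5 s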

Fig. 12.7 A Comparison of I/O Costs

12.12 Indexing Based on Tree Structure


12.12.1 Introduction: We now consider two index data
structures based on tree organizations, called ISAM and B+
trees. These structures provide efficient support for range
searches, including sorted file scans as a special case.
Indexed Sequential Access Method (ISAM)
The ISAM data structure is illustrated in Figure 12.8. The
data entries of the ISAM index are in the leaf pages of the tree
and in additional overflow pages chained to some leaf page.
Database systems carefully organize the layout of pages so that
page boundaries correspond closely to the physical
characteristics of the underlying storage device. The ISAM
structure is completely static and facilitates such low-level
optimizations. Each tree node is a disk page, and all the data
resides in the leaf pages.

Fig. 12.8 ISAM


12.12.2 Considerations for Overflow Pages and Locking: The
fact that only leaf pages are modified also has an important
advantage with respect to concurrent access. When a page is
accessed, it is typically 'locked' by the requestor to ensure that
other users do not concurrently modify it. To modify a page, it
must be locked in 'exclusive' mode, which is permitted only
when no one else holds a lock on the page. Locking can lead to
queues of users waiting to access a page.

B+ Tree: A static structure such as the ISAM index suffers from


the problem that long overflow chains can develop as the file
grows, leading to poor performance. This problem motivated the
development of more flexible, dynamic structures that adjust
gracefully to inserts and deletes.

The B+ tree search structure, which is widely used, is a balanced
tree in which the internal nodes direct the search and the leaf
nodes contain the data entries. Since the tree structure grows
and shrinks dynamically, the leaf pages cannot simply be
allocated sequentially; to retrieve all leaf pages efficiently, we
must link them using page pointers. By organizing them into a
doubly linked list, we can easily traverse the sequence of leaf
pages (sometimes called the sequence set) in either direction.
Internal nodes
 Except for the root node, an internal node of the B+ tree
contains at least n/2 pointers.
 An internal node may hold at most n pointers.
Leaf nodes
 A leaf node of the B+ tree contains at least n/2 record pointers
and n/2 key values.
 A leaf node may hold at most n record pointers and n key values.
 Each leaf node of the B+ tree has one block pointer P that points
to the next leaf node.

Searching for a record in the B+ Tree


Suppose we need to find 55 in the B+ tree structure shown
below. We first search the intermediate node, which directs us
to the leaf node that may contain a record for 55. In the
intermediate node, 55 falls on the branch between 50 and 75, so
we are routed to the third leaf node. There, the DBMS performs
a sequential scan within the leaf to locate 55, as in the sketch
that follows.
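The search procedure just described can be sketched in Python as follows; the tree is hand-built to roughly mirror the running example (the exact keys in the other leaves are made up), and the node layout is an assumption of this sketch rather than a real B+ tree implementation.

class Node:
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys            # search key values in this node
        self.children = children    # child pointers (internal nodes only)
        self.records = records      # data entries (leaf nodes only)
        self.next_leaf = next_leaf  # pointer to the right sibling (leaf nodes only)

    @property
    def is_leaf(self):
        return self.children is None

def search(node, key):
    """Descend from the root to the leaf that may contain `key`."""
    while not node.is_leaf:
        # Find the first separator key greater than the search key and
        # follow the child pointer to its left.
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    # Sequential scan within the leaf.
    for k, rec in zip(node.keys, node.records):
        if k == key:
            return rec
    return None

# Leaves roughly mirroring the running example: values below 50, 50 to 74, 75 and above.
leaf1 = Node(keys=[30, 40], records=["r30", "r40"])
leaf2 = Node(keys=[50, 55, 65, 70], records=["r50", "r55", "r65", "r70"])
leaf3 = Node(keys=[75, 80], records=["r75", "r80"])
leaf1.next_leaf, leaf2.next_leaf = leaf2, leaf3
root = Node(keys=[50, 75], children=[leaf1, leaf2, leaf3])

print(search(root, 55))   # r55
print(search(root, 60))   # None (60 is not present yet)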
Insertion into a B+ Tree
Assume we want to insert record 60 into the structure shown
below. It belongs after 55 in the third leaf node. Because the tree
is balanced and that leaf node is already full, we cannot simply
insert 60 there.
In this case, we must split the leaf node so that the entry can be
inserted without affecting the fill factor, balance, or order of the
tree. The values of the third leaf node would be (50, 55, 60, 65,
70), and its key in the parent (intermediate) node is currently 50.
To keep the tree balanced, we split the leaf node at the middle,
producing two leaf nodes containing (50, 55) and (60, 65, 70).
If these two are to be separate leaf nodes, the intermediate node
cannot branch on 50 alone; 60 must be added to it, along with a
pointer to the new leaf node.
This is how an entry is inserted when a leaf overflows. In the
normal case, we simply find the leaf node where the entry fits
and place it there.

Deletion from a B+ Tree
Assume we want to delete 60 from the preceding example. In
this case, we must remove 60 from the intermediate node as
well as from the fourth leaf node. If we simply remove it from
the intermediate node, the tree no longer satisfies the B+ tree
rules, so we must rearrange the entries to keep the tree balanced.




DBMS Abbreviations and Acronyms


ACID – Atomicity, Consistency, Isolation, Durability
AIO – Asynchronous I/O
BI - Business Intelligence
BLOB – Binary Large OBject
CAP – Consistency, Availability, Partition tolerance... The three
requirements of a distributed system (as they apply to a
database) according to Eric Brewer's CAP Theorem.
CDM – Copy Data Management
CI – Clustered Index
CK – Candidate Key
CLOB – Character Large OBject
CRUD – Create, Read, Update, and Delete
CS – Cursor Stability - an isolation level supported by different
database management systems.
CTE – Common Table Expression
DB – Database
DBA – Database Administrator
DBMS – Database Management System
DCL – Data Control Language
DDL – Data Definition Language
DML – Data Manipulation Language
DMV - Dynamic Management Views
DR - Disaster Recovery
DRBD - Distributed Replicated Block Device
DRDA – Distributed Relational Database Architecture
DRI - Declarative Referential Integrity
DSS - Decision Support Systems
DTD – Document Type Definition
DW or DWH – Data Warehouse

EAV – Entity-Attribute-Value (aka. the archenemy)


ERD - Entity Relationship Diagram
ETL – Extract, Transform, Load
FDW - Foreign Data Wrapper (PostgreSQL)
FK – Foreign Key
FLWOR – For, Let, Where, Order, Return - an expression form
used within XQuery to query XML within a database (not sure if
DB2 only)
FS – Filesystem
FTS – Fulltext Search
GBP – Group Buffer Pool
HA – High Availability
HADR – High Availability Disaster Recovery
HDD – Hard Disk Drive
ICP – Index Condition Pushdown (MySQL)
IOPS – IO Per Second
IOT – Index Organized Table (Oracle)
ISAM - Indexed Sequential Access Method
I/O – Input/Output
JDBC – Java Database Connectivity
KV – Key/Value
LAMP - Linux, Apache, MySQL and PHP
LBAC - Label Based Access Control
LOB – Large OBject
LPAR – Logical Partition
LRU – Least Recently Used (algorithm)
LUN – Logical Unit Number
MDC – Multidimensional Clustering Table
MDM – Master Data Management
MDX – Multidimensional Expressions

MED – Management of External Data


MQT – Materialized Query Table (IBM DB2)
MV – Materialized View
MVCC – Multiversion Concurrency Control
NAS - Network Attached Storage
NCI – Non-clustered Index
NF - Normal Form (ie: 1NF, first normal form)
ODBC – Open Database Connectivity
ODS - Operational Data Store
OLAP – Online Analytical Processing
OLTP – Online Transaction Processing
OODBMS – Object-Oriented Database Management System
OOM – Out Of Memory
ORM – Object-Relational Mapping
OS – Operating System
PK – Primary Key
PL/pgSQL – Procedural Language/SQL (PostgreSQL) used for
writing stored procedures. Similar to PL/SQL.
PL/SQL – Procedural Language/SQL (Oracle) used for writing
stored procedures. Also see SQL PL.
QPS – Queries Per Second
RAC – Real Application Clusters (Oracle)
RAID – Redundant Array of Independent Disks
RBAR – Row By Agonizing Row
RDBMS – Relational Database Management System
RBR – Row-Based Replication (MySQL)
RPO - Recovery Point Objective - how much data you can afford
to lose. If your server went down, this is when you'd be able to
recover the data.


RR – Repeatable Read - an isolation level supported by different


database management systems.
RS– Read Stability - an isolation level supported by different
database management systems.
DQL – Data Query Language
RDSMS – Relational Data Stream Management System
4D QL – 4D Query Language
DMS – Database Migration Service
RTO - Recovery Time Objective - how much time it would take
you to recover the data to the RPO
SAN – Storage Area Network
SBR – Statement-Based Replication (MySQL)
SCD – Slowly Changing Dimension
SE – Storage Engine (MySQL and forks)
SEQUEL – Structured English QUEry Language, IBM's
precursor to SQL, is why SQL is sometimes (often?) pronounced
SEQUEL and not S.Q.L.
SP – Stored Procedure
SQL – Structured Query Language
SQL PL – SQL Procedure Language used for writing stored
procedures. Also see PL/SQL.
SQL/XML – an extension of the SQL language used for querying
XML.
SSD – Solid State Drive

TPS* - Transactions Per Second, a measurement of database


performance.
UAT - User Acceptance Testing
UDF – User Defined Function
UDT – User Defined Type
UR – Uncommitted Read - an isolation level supported by
different database management systems.
URLT - Update Resume; Leave Town - For those DBAs that
don't bother putting together a proper recovery strategy
XML – eXtensible Markup Language
XSD – XML Schema Definition
XSLT – XML Stylesheet Transformation
