RDBMS Notes

Note: these notes are for reference only; please consult your subject book first.

RDBMS (Relational Database Management System)

• DBMS:-
✓ A well-organized collection of data, managed so that operations can be carried out on it.

• RDBMS:-
✓ Stands for "Relational Database Management System." An
RDBMS is a DBMS designed specifically for relational
databases. Therefore, RDBMSes are a subset of DBMSes.
✓ A relational database refers to a database that stores data
in a structured format, using rows and columns. This
makes it easy to locate and access specific values within
the database. It is "relational" because the values within
each table are related to each other. Tables may also be
related to other tables. The relational structure makes it
possible to run queries across multiple tables at once.

Distributed Database System:-


A distributed database is a collection of multiple interconnected
databases, which are spread physically across various locations
that communicate via a computer network.
A distributed database is basically a database that is not limited to
one system; it is spread over different sites, i.e., on multiple
computers or over a network of computers. A distributed database
system is located on various sites that don't share physical
components. This may be required when a particular database
needs to be accessed by various users globally. It needs to be
managed such that, to the users, it looks like one single database.

Features of distributed database system:-

• Databases in the collection are logically interrelated with
each other. Often they represent a single logical database.
• Data is physically stored across multiple sites. Data in each
site can be managed by a DBMS independent of the other
sites.
• The processors in the sites are connected via a network.
They do not have any multiprocessor configuration.
• A distributed database is not a loosely connected file system.
• A distributed database incorporates transaction processing,
but it is not synonymous with a transaction processing
system.

Distributed Database Management System:-

A distributed database management system (DDBMS) is a
centralized software system that manages a distributed database
in a manner as if it were all stored in a single location.
Features of DDBMS:-
• It is used to create, retrieve, update and delete distributed
databases.
• It synchronizes the database periodically and provides
access mechanisms by the virtue of which the distribution
becomes transparent to the users.
• It ensures that the data modified at any site is universally
updated.
• It is used in application areas where large volumes of data
are processed and accessed by numerous users
simultaneously.
• It is designed for heterogeneous database platforms.
• It maintains confidentiality and data integrity of the
databases.

Types of DDBMS:-
1. Homogeneous Database
2. Heterogeneous Database

1. Homogeneous Database:-
In a homogeneous distributed database, all sites store the database
identically. The operating system, database management
system and the data structures used are all the same at every site.
Hence, they're easy to manage.
2. Heterogeneous Database:-
In a heterogeneous distributed database, different sites can use
different schema and software that can lead to problems in
query processing and transactions. Also, a particular site might
be completely unaware of the other sites. Different computers
may use a different operating system, different database
application. They may even use different data models for the
database. Hence, translations are required for different sites to
communicate.
Distributed Data Storage:-
There are 2 ways in which data can be stored on different sites.
These are:

1. Replication
In this approach, the entire relation is stored redundantly at 2 or
more sites. If the entire database is available at all sites, it is a fully
redundant database. Hence, in replication, systems maintain
copies of data.
This is advantageous as it increases the availability of data at
different sites. Also, now query requests can be processed in
parallel.
However, it has certain disadvantages as well. Data needs to be
constantly updated. Any change made at one site needs to be
recorded at every site where that relation is stored, or else it may
lead to inconsistency. This is a lot of overhead. Also, concurrency
control becomes far more complex, as concurrent access now needs
to be checked across a number of sites.
2. Fragmentation
In this approach, the relations are fragmented (i.e., they’re
divided into smaller parts) and each of the fragments is stored in
different sites where they’re required. It must be made sure that
the fragments are such that they can be used to reconstruct the
original relation (i.e, there isn’t any loss of data).
Fragmentation is advantageous as it doesn’t create copies of data,
consistency is not a problem.
Fragmentation of relations can be done in two ways:
• Horizontal fragmentation – Splitting by rows – The relation is
fragmented into groups of tuples so that each tuple is assigned
to at least one fragment.
• Vertical fragmentation – Splitting by columns – The schema
of the relation is divided into smaller schemas. Each fragment
must contain a common candidate key so as to ensure a lossless
join (a small sketch of both kinds follows this list).
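As a rough illustration (not from the notes), the two kinds of fragmentation can be sketched in Python on a toy relation; the relation, column names, and the split predicate are invented for the example:

# Horizontal vs. vertical fragmentation of a relation, illustrated on plain
# Python data structures. The relation, its columns, and the split predicate
# are made up for illustration.

employee = [
    {"emp_id": 1, "name": "Asha",  "dept": "Sales", "city": "Delhi"},
    {"emp_id": 2, "name": "Ravi",  "dept": "HR",    "city": "Pune"},
    {"emp_id": 3, "name": "Meena", "dept": "Sales", "city": "Pune"},
]

# Horizontal fragmentation: split by rows; every tuple goes to at least one fragment.
site_delhi = [row for row in employee if row["city"] == "Delhi"]
site_pune  = [row for row in employee if row["city"] == "Pune"]
assert len(site_delhi) + len(site_pune) == len(employee)   # lossless: the union rebuilds the relation

# Vertical fragmentation: split by columns; each fragment keeps the key emp_id
# so the original relation can be rebuilt by a join on emp_id (lossless join).
frag_personal = [{"emp_id": r["emp_id"], "name": r["name"]} for r in employee]
frag_work     = [{"emp_id": r["emp_id"], "dept": r["dept"], "city": r["city"]} for r in employee]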

In certain cases, a hybrid approach combining fragmentation and
replication is used.
Functions of Distributed Database System:-

Distribution leads to increased complexity in the system design
and implementation. This complexity is accepted in order to achieve
potential advantages such as:
1. Network Transparencies
2. Increased Reliability
3. Improved Performance
4. Easier Expansion

Function of Centralized DBMS:-


1. The basic function of a centralized DBMS is that it provides a
complete view of our data.
For example, we can query the number of customers worldwide
who are willing to buy.
2. The second basic function of a centralized DBMS is that it is
easier to manage than a distributed system.
The distributed database must be able to provide the following
functions in addition to those of a centralized DBMS.

Functions of Distributed database system:-


1. Keeping track of data –
The basic function of DDBMS is to keep track of the data
distribution, fragmentation and replication by expanding the
DDBMS catalog.
2. Distributed Query Processing –
The basic function of DDBMS is its ability to access remote
sites and to transmit queries and data among the various sites
via a communication network.
3. Replicated Data Management –
The basic function of DDBMS is basically to decide which copy
of a replicated data item to access and to maintain the
consistency of copies of replicated data items.
4. Distributed Database Recovery –
The ability to recover from the individual site crashes and
from new types of failures such as failure of communication
links.
5. Security –
The basic function of DDBMS is to execute Distributed
Transaction with proper management of the security of the
data and the authorization/access privilege of users.
6. Distributed Directory Management –
A directory basically contains information about data in the
database. The directory may be global for the entire DDB, or
local for each site. The placement and distribution of the
directory may have design and policy issues.
7. Distributed Transaction Management –
The basic function of DDBMS is its ability to devise execution
strategies for queries and transactions that access data from
more than one site, to synchronize the access to distributed
data, and to maintain the integrity of the complete database.
But these functions basically increase the complexity of a DDBMS
over centralized DBMS.

Advantages of distributed database:-


1. Management of data with different levels of transparency –
Ideally, a database should be distribution transparent in the sense
of hiding the details of where each file is physically stored within
the system. The following types of transparencies are basically
possible in the distributed database system:
• Network transparency:
This basically refers to the freedom for the user from the
operational details of the network. These are of two types
Location and naming transparency.
• Replication transparency:
It makes the user unaware of the existence of copies; copies of
data may be stored at multiple sites for better availability,
performance and reliability.
• Fragmentation transparency:
It makes the user unaware of the existence of fragments,
whether the fragmentation is vertical or horizontal.
2. Increased Reliability and availability –
Reliability is defined as the probability that a system is running at
a certain time, whereas availability is defined as the probability
that the system is continuously available during a time interval.
When the data and DBMS software are distributed over several
sites, one site may fail while other sites continue to operate; only
the data at the failed site becomes inaccessible. This leads to an
improvement in reliability and availability.
3. Easier Expansion –
In a distributed environment, expansion of the system in terms of
adding more data, increasing database sizes, or adding more
processors is much easier.
4. Improved Performance –
We can achieve interquery and intraquery parallelism by
executing multiple queries at different sites and by breaking up a
query into a number of subqueries that execute in parallel, which
leads to an improvement in performance.

Distributed DBMS Architectures:-

DDBMS architectures are generally developed depending on


three parameters −
• Distribution − It states the physical distribution of data
across the different sites.
• Autonomy − It indicates the distribution of control of the
database system and the degree to which each constituent
DBMS can operate independently.
• Heterogeneity − It refers to the uniformity or dissimilarity of
the data models, system components and databases.

Architectural Models:-

Some of the common architectural models are −

• Client - Server Architecture for DDBMS


• Peer - to - Peer Architecture for DDBMS
• Multi - DBMS Architecture

1. Client - Server Architecture for DDBMS:-


This is a two-level architecture where the functionality is divided
into servers and clients. The server functions primarily
encompass data management, query processing, optimization
and transaction management. Client functions mainly include the
user interface; however, clients also have some functions like
consistency checking and transaction management.
The two different client - server architecture are −

• Single Server Multiple Client


• Multiple Server Multiple Client
2. Peer-to-Peer Architecture for DDBMS:-
In these systems, each peer acts both as a client and a server for
imparting database services. The peers share their resource with
other peers and co-ordinate their activities.
This architecture generally has four levels of schemas −
• Global Conceptual Schema − Depicts the global logical view
of data.
• Local Conceptual Schema − Depicts logical data
organization at each site.
• Local Internal Schema − Depicts physical data organization
at each site.
• External Schema − Depicts user view of data.
3. Multi-DBMS Architecture:-
This is an integrated database system formed by a collection of
two or more autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas −
• Multi-database View Level − Depicts multiple user views
comprising of subsets of the integrated distributed database.
• Multi-database Conceptual Level − Depicts integrated multi-
database that comprises of global logical multi-database
structure definitions.
• Multi-database Internal Level − Depicts the data
distribution across different sites and multi-database to
local data mapping.
• Local database View Level − Depicts public view of local
data.
• Local database Conceptual Level − Depicts local data
organization at each site.
• Local database Internal Level − Depicts physical data
organization at each site.
There are two design alternatives for multi-DBMS −

• Model with multi-database conceptual level.


• Model without multi-database conceptual level.
Data Communication concept:-
Data communication is defined as the exchange of data between two
devices via some form of transmission medium such as a cable or
wire; it can also be air or vacuum. For data communication to occur,
the communicating devices must be part of a communication system
made up of a combination of hardware and software.

The effectiveness of a data communication system depends on the


following fundamental characteristics:
1. Delivery:- The system must deliver data to the correct
destination. Data must be received by the intended device or user,
and only by that device or user.

2. Accuracy:- The system must deliver data accurately. Data that
have been altered in transmission and left uncorrected are useless.

3. Timeliness:- The system must deliver data in a timely manner.
Data delivered late are useless. In the case of video, audio, and
voice data, timely delivery means delivering data as they are
produced, in the same order that they are produced, and without
significant delay. This kind of delivery is called real-time
transmission.

4. Jitter:- Jitter refers to the variation in packet arrival time.

Data Communication system components:-


1.Message: A message in its most general meaning is an object of
communication. It is a vessel which provides information. Yet, it
can also be this information. Therefore, its meaning is dependent
upon the context in which it is used; the term may apply to both
the information and its form.
2. Sender: The sender will have some kind of meaning she wishes
to convey to the receiver. It might not be conscious knowledge, it
might be a sub-conscious wish for communication. What is
desired to be communicated would be some kind of idea,
perception, feeling, or datum. It will be a part of her reality that
she wishes to send to somebody else.

3. Receiver: These messages are delivered to another party. No


doubt, you have in mind a desired action or reaction you hope
your message prompts from the opposite party. Keep in mind, the
other party also enters into the communication process with ideas
and feelings that will undoubtedly influence their understanding
of your message and their response. To be a successful
communicator, you should consider these before delivering your
message, then acting appropriately.

4. Medium: Medium is a means used to exchange / transmit the


message. The sender must choose an appropriate medium for
transmitting the message else the message might not be conveyed
to the desired recipients. The choice of appropriate medium of
communication is essential for making the message effective and
correctly interpreted by the recipient. This choice of
communication medium varies depending upon the features of
communication. For instance – Written medium is chosen when a
message has to be conveyed to a small group of people, while an
oral medium is chosen when spontaneous feedback is required
from the recipient as misunderstandings are cleared then and
there.

5. Protocol: A protocol is a formal description of digital message


formats and the rules for exchanging those messages in or
between computing systems and in telecommunications.
Protocols may include signaling, authentication and error
detection and correction syntax, semantics, and synchronization
of communication and may be implemented in hardware or
software, or both.

6. Feedback: Feedback is the main component of communication


process as it permits the sender to analyze the efficacy of the
message. It helps the sender in confirming the correct
interpretation of message by the decoder. Feedback may be verbal
(through words) or non-verbal (in form of smiles, sighs, etc.). It
may take written form also in form of memos, reports, etc.
Concurrency:-

It means that many users can access data at the same time.

Concurrency Control:-

When more than one transaction is running simultaneously,
there are chances of conflicts occurring, which can leave the
database in an inconsistent state. To handle these conflicts we need
concurrency control in DBMS, which allows transactions to run
simultaneously but handles them in such a way that the
integrity of the data remains intact.

o Under concurrency control, multiple transactions can be
executed simultaneously.
o This may affect the transaction results, so it is highly important
to maintain the order of execution of those transactions.

Problems of concurrency control:-

Several problems can occur when concurrent transactions are


executed in an uncontrolled manner. Following are the three
problems in concurrency control.

1. Lost updates
2. Dirty read
3. Unrepeatable read

1. Lost update problem:-


o When two transactions that access the same database items
contain their operations in a way that makes the value of
some database item incorrect, then the lost update problem
occurs.
o If two transactions T1 and T2 read a record and then update
it, then the effect of updating of the first record will be
overwritten by the second update.
Example: consider two transactions, Transaction-X and Transaction-Y, both operating on data item A. Here,
o At time t2, transaction-X reads A's value.
o At time t3, Transaction-Y reads A's value.
o At time t4, Transactions-X writes A's value on the basis of
the value seen at time t2.
o At time t5, Transactions-Y writes A's value on the basis of the
value seen at time t3.
o So at time t5, the update of Transaction-X is lost because
Transaction-Y overwrites it without looking at its current
value.
o This type of problem is known as the Lost Update Problem, as
the update made by one transaction is lost here (a small sketch
follows this list).
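A small sketch of the interleaving described above, assuming a shared data item A and two hypothetical updates (+10 and +20) applied without any locking:

# Lost update, simulated step by step (no locking). The variable names and
# the +10 / +20 updates are made up for illustration.

A = 100                     # shared data item

x_read = A                  # t2: Transaction-X reads A        -> 100
y_read = A                  # t3: Transaction-Y reads A        -> 100
A = x_read + 10             # t4: Transaction-X writes A       -> 110
A = y_read + 20             # t5: Transaction-Y writes A       -> 120 (X's update is lost)

print(A)                    # 120, although applying both updates should give 130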

2. Dirty Read:-
o The dirty read occurs in the case when one transaction
updates an item of the database, and then the transaction
fails for some reason. The updated database item is accessed
by another transaction before it is changed back to the
original value.
o A transaction T1 updates a record which is read by T2. If T1
aborts then T2 now has values which have never formed part
of the stable database.

Example:-
o At time t2, Transaction-Y writes A's value.
o At time t3, Transaction-X reads A's value.
o At time t4, Transaction-Y rolls back, so A's value is changed
back to what it was prior to t1.
o So, Transaction-X now contains a value which has never
become part of the stable database.
o Such type of problem is known as Dirty Read Problem, as
one transaction reads a dirty value which has not been
committed.

3. Inconsistent Retrievals Problem:-


o Inconsistent Retrievals Problem is also known as
unrepeatable read. When a transaction calculates some
summary function over a set of data while the other
transactions are updating the data, then the Inconsistent
Retrievals Problem occurs.
o A transaction T1 reads a record and then does some other
processing, during which transaction T2 updates the record.
Now when transaction T1 reads the record again, the new
value will be inconsistent with the previous value.

Example:-

Suppose two transactions operate on three accounts.


o Transaction-X is doing the sum of all balance while
transaction-Y is transferring an amount 50 from Account-1
to Account-3.
o Here, transaction-X produces a result of 550, which is
incorrect. If we write this result to the database, the database
will be in an inconsistent state, because the actual sum is 600.
o Here, transaction-X has seen an inconsistent state of the
database.

Concurrency Control Protocol

Concurrency control protocols ensure atomicity, isolation, and


serializability of concurrent transactions. The concurrency control
protocol can be divided into three categories:

1. Lock based protocol


2. Time-stamp protocol
3. Validation based protocol
1.Lock-Based Protocol:-

In this type of protocol, any transaction cannot read or write data


until it acquires an appropriate lock on it. There are two types of
lock:

1. Shared lock:

o It is also known as a read-only lock. Under a shared lock, the
data item can only be read by the transaction.
o It can be shared between transactions because a transaction
holding a shared lock cannot update the data item.

2. Exclusive lock:

o Under an exclusive lock, the data item can be both read and
written by the transaction.
o This lock is exclusive: multiple transactions cannot modify the
same data item simultaneously (a small sketch of the
compatibility rules follows).
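A minimal sketch of shared/exclusive compatibility, using an invented lock-table structure and function names (not a real DBMS API):

# Shared (S) / Exclusive (X) lock compatibility: many transactions may hold S
# on the same item, but X is granted only when no other transaction holds any lock.

locks = {}  # item -> list of (txn, mode)

def can_grant(item, txn, mode):
    held = [(t, m) for (t, m) in locks.get(item, []) if t != txn]
    if mode == "S":
        return all(m == "S" for (_, m) in held)   # S is compatible only with S
    return len(held) == 0                          # X is compatible with nothing

def lock(item, txn, mode):
    if can_grant(item, txn, mode):
        locks.setdefault(item, []).append((txn, mode))
        return True
    return False            # in a real DBMS the transaction would wait here

print(lock("A", "T1", "S"))   # True
print(lock("A", "T2", "S"))   # True  - shared locks coexist
print(lock("A", "T3", "X"))   # False - exclusive conflicts with the shared locks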

There are four types of lock protocols available:-

1. Simplistic lock protocol

It is the simplest way of locking the data during a transaction.
Simplistic lock-based protocols require every transaction to obtain
a lock on the data before performing an insert, delete or update on
it, and to unlock the data item after completing the transaction.

2. Pre-claiming Lock Protocol


o Pre-claiming lock protocols evaluate the transaction to list
all the data items on which it needs locks.
o Before initiating execution, the transaction requests the
DBMS for locks on all of those data items.
o If all the locks are granted, this protocol allows the
transaction to begin. When the transaction is completed, it
releases all the locks.
o If all the locks are not granted, the transaction rolls back and
waits until all the locks are granted.

3. Two-phase locking (2PL)


o The two-phase locking protocol divides the execution phase
of the transaction into three parts.
o In the first part, when the execution of the transaction starts,
it seeks permission for the lock it requires.
o In the second part, the transaction acquires all the locks. The
third phase is started as soon as the transaction releases its
first lock.
o In the third phase, the transaction cannot demand any new
locks. It only releases the acquired locks.
There are two phases of 2PL:

Growing phase: In the growing phase, new locks on data items
may be acquired by the transaction, but none can be released.

Shrinking phase: In the shrinking phase, existing locks held by the
transaction may be released, but no new locks can be acquired.
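A small sketch of this growing/shrinking discipline under the assumption of a simple, invented Python class; real lock managers are far more involved:

# Two-phase locking: once a transaction releases its first lock (shrinking
# phase begins), it must never acquire another lock.

class TwoPhaseTxn:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False        # becomes True once the lock point is passed

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot lock after first unlock")
        self.held.add(item)           # growing phase

    def release(self, item):
        self.shrinking = True         # shrinking phase starts here
        self.held.discard(item)

t1 = TwoPhaseTxn("T1")
t1.acquire("A")
t1.acquire("B")       # growing phase
t1.release("A")       # shrinking phase begins (lock point passed)
# t1.acquire("C")     # would raise: new locks are not allowed any more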

In the example below, if lock conversion is allowed, then the
following conversions can happen:

1. Upgrading of a lock (from S(a) to X(a)) is allowed in the growing
phase.
2. Downgrading of a lock (from X(a) to S(a)) must be done in the
shrinking phase.

Example:
The following shows how locking and unlocking work with 2PL.

Transaction T1:

o Growing phase: from step 1-3


o Shrinking phase: from step 5-7
o Lock point: at 3

Transaction T2:

o Growing phase: from step 2-6


o Shrinking phase: from step 8-9
o Lock point: at 6
4. Strict Two-phase locking (Strict-2PL)
o The first phase of Strict-2PL is similar to 2PL. In the first
phase, after acquiring all the locks, the transaction continues
to execute normally.
o The only difference between 2PL and Strict-2PL is that Strict-
2PL does not release a lock immediately after using it.
o Strict-2PL waits until the whole transaction commits, and
then it releases all the locks at once.
o Strict-2PL therefore does not have a shrinking phase of lock
release.

Unlike 2PL, it does not suffer from cascading aborts.

2.Timestamp Ordering Protocol:-


o The Timestamp Ordering Protocol is used to order the
transactions based on their timestamps. The order of the
transactions is simply the ascending order of their creation.
o The older transaction has the higher priority, so it executes
first. To determine the timestamp of a transaction, this
protocol uses system time or a logical counter.
o The lock-based protocol is used to manage the order between
conflicting pairs among transactions at the execution time.
But Timestamp based protocols start working as soon as a
transaction is created.
o Let's assume there are two transactions T1 and T2. Suppose
transaction T1 entered the system at time 007 and transaction
T2 entered the system at time 009. T1 has the higher priority,
so it executes first, as it entered the system first.
o The timestamp ordering protocol also maintains the
timestamp of last 'read' and 'write' operation on a data.

Basic Timestamp ordering protocol works as follows:

1. Check the following condition whenever a transaction Ti issues
a Read(X) operation:

o If W_TS(X) > TS(Ti), then the operation is rejected and Ti is
rolled back.
o If W_TS(X) <= TS(Ti), then the operation is executed and
R_TS(X) is set to max(R_TS(X), TS(Ti)).

2. Check the following condition whenever a transaction Ti issues
a Write(X) operation:

o If TS(Ti) < R_TS(X), then the operation is rejected and Ti is
rolled back.
o If TS(Ti) < W_TS(X), then the operation is rejected and Ti is
rolled back; otherwise, the operation is executed and W_TS(X)
is set to TS(Ti).

Where,

TS(Ti) denotes the timestamp of the transaction Ti.

R_TS(X) denotes the Read time-stamp of data-item X.

W_TS(X) denotes the Write time-stamp of data-item X.
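A minimal sketch of these read/write checks, assuming data items are kept in a dictionary with their R_TS and W_TS and transactions are identified only by their timestamps (all names are illustrative):

# Basic timestamp ordering checks. item_ts maps a data item to its
# read/write timestamps; ts is TS(Ti) of the issuing transaction.

item_ts = {"X": {"R_TS": 0, "W_TS": 0}}

def read_item(ts, item):
    t = item_ts[item]
    if t["W_TS"] > ts:                      # a younger transaction already wrote X
        return "reject (roll back Ti)"
    t["R_TS"] = max(t["R_TS"], ts)          # record the read
    return "execute"

def write_item(ts, item):
    t = item_ts[item]
    if ts < t["R_TS"] or ts < t["W_TS"]:    # a younger transaction already read/wrote X
        return "reject (roll back Ti)"
    t["W_TS"] = ts
    return "execute"

print(write_item(9, "X"))   # execute, W_TS(X) becomes 9
print(read_item(7, "X"))    # reject: TS(Ti) = 7 < W_TS(X) = 9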

Advantages and Disadvantages of TO protocol:


o The TO protocol ensures serializability, since the precedence
graph has edges only from older to younger transactions and is
therefore acyclic.
o The TO protocol ensures freedom from deadlock, since no
transaction ever waits.
o But the schedule may not be recoverable and may not even
be cascade-free.

3.Validation Based Protocol:-

The validation-based protocol is also known as the optimistic
concurrency control technique. In the validation-based protocol,
the transaction is executed in the following three phases:

1. Read phase: In this phase, transaction T is read and executed.
It reads the values of the various data items and stores them
in temporary local variables. It can perform all the write
operations on temporary variables without updating the
actual database.
2. Validation phase: In this phase, the temporary variable values
are validated against the actual data to see if they violate
serializability.
3. Write phase: If the transaction passes validation, the
temporary results are written to the database; otherwise, the
transaction is rolled back.
Here each phase has the following different timestamps:

Start(Ti): It contains the time when Ti started its execution.

Validation (Ti): It contains the time when Ti finishes its read


phase and starts its validation phase.

Finish(Ti): It contains the time when Ti finishes its write phase.

o This protocol determines the timestamp of the transaction
for serialization using the timestamp of the validation phase,
as it is the phase that actually determines whether the
transaction will commit or roll back.
o Hence TS(T) = Validation(T).
o The serializability is determined during the validation
process. It can't be decided in advance.
o While executing the transactions, this ensures a greater degree
of concurrency and also fewer conflicts.
o Thus it results in transactions that have fewer rollbacks.

Thomas write Rule:-

Thomas Write Rule provides the guarantee of serializability order


for the protocol. It improves the Basic Timestamp Ordering
Algorithm.

The basic Thomas write rules are as follows (a short sketch follows the list):

o If TS(T) < R_TS(X) then transaction T is aborted and rolled


back, and operation is rejected.
o If TS(T) < W_TS(X), then don't execute the Write(X) operation
of the transaction; simply continue processing (the outdated
write is ignored).
o If neither condition 1 nor condition 2 holds, then the Write
operation of transaction T is executed and W_TS(X) is set to
TS(T).
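Building on the assumptions of the previous sketch, the write check can be relaxed along the lines of the Thomas write rule; only the W_TS comparison changes, the read rule stays the same:

# Thomas write rule: an obsolete write (older than the current W_TS) is simply
# ignored instead of rolling the transaction back.

def write_item_thomas(ts, item):
    t = item_ts[item]                       # item_ts as in the previous sketch
    if ts < t["R_TS"]:
        return "reject (roll back T)"       # rule 1: a younger transaction already read X
    if ts < t["W_TS"]:
        return "ignore write, continue"     # rule 2: outdated write, skip it
    t["W_TS"] = ts                          # rule 3: perform the write
    return "execute"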
If we use the Thomas write rule, then some serializable schedules
can be permitted that are not conflict serializable, as illustrated by
the schedule in the figure below:

Figure: A Serializable Schedule that is not Conflict Serializable

In the above figure, T1's read precedes T1's write of the same
data item. This schedule is not conflict serializable.

The Thomas write rule ensures that T2's write is never seen by any
transaction. If we delete the write operation in transaction T2,
then a conflict serializable schedule is obtained, as shown in the
figure below.

Figure: A Conflict Serializable Schedule

Multiple Granularity

Let's start by understanding the meaning of granularity.

Granularity: It is the size of the data item that is allowed to be locked.


Multiple Granularity:
o It can be defined as hierarchically breaking up the database
into blocks which can be locked.
o The Multiple Granularity protocol enhances concurrency and
reduces lock overhead.
o It keeps track of what to lock and how to lock.

o It makes it easy to decide whether to lock a data item or to
unlock a data item. This type of hierarchy can be graphically
represented as a tree.

For example: Consider a tree which has four levels of nodes.

o The first (highest) level shows the entire database.
o The second level represents nodes of type area. The database
consists of exactly these areas.
o Each area consists of child nodes known as files. No file can be
present in more than one area.
o Finally, each file contains child nodes known as records. The
file has exactly those records that are its child nodes. No
record is present in more than one file.
o Hence, the levels of the tree starting from the top level are as
follows:
1. Database
2. Area
3. File
4. Record
In this example, the highest level shows the entire database, and
the levels below it are area, file, and record.

There are three additional lock modes with multiple granularity:

Intention Mode Lock

Intention-shared (IS): It indicates explicit locking at a lower level
of the tree, but only with shared locks.

Intention-exclusive (IX): It indicates explicit locking at a lower
level with exclusive or shared locks.
Shared & intention-exclusive (SIX): Under this lock, the node is
locked in shared mode, and some lower-level node is locked in
exclusive mode by the same transaction.

Compatibility Matrix with Intention Lock Modes: The below table


describes the compatibility matrix for these lock modes:
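As a sketch, the standard compatibility matrix for these five modes can be written as a lookup table (True means the two modes are compatible); this reproduces the usual textbook matrix rather than anything specific to these notes:

# Compatibility matrix for IS, IX, S, SIX and X lock modes.
# compatible[held][requested] is True if the requested mode can be granted
# while another transaction holds the 'held' mode on the same node.

compatible = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

print(compatible["IX"]["IS"])   # True  - intention modes coexist
print(compatible["S"]["IX"])    # False - IX under a node held in S is not allowed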

It uses the intention lock modes to ensure serializability. It


requires that if a transaction attempts to lock a node, then that
node must follow these protocols:

o Transaction T1 should follow the lock-compatibility matrix.
o Transaction T1 must first lock the root of the tree. It can lock
it in any mode.
o If T1 currently has the parent of a node locked in IX or IS
mode, then T1 may lock that node in S or IS mode only.
o If T1 currently has the parent of a node locked in IX or SIX
mode, then T1 may lock that node in X, SIX, or IX mode only.
o T1 can lock a node only if it has not previously unlocked any
node.
o T1 can unlock a node only if none of that node's children are
currently locked by T1.
Observe that in multiple-granularity, the locks are acquired in
top-down order, and locks must be released in bottom-up order.

o If transaction T1 reads record Ra9 in file Fa, then T1 needs to
lock the database, area A1 and file Fa in IS mode. Finally, it
needs to lock Ra9 in S mode.
o If transaction T2 modifies record Ra9 in file Fa, then it can do
so after locking the database, area A1 and file Fa in IX mode.
Finally, it needs to lock Ra9 in X mode.
o If transaction T3 reads all the records in file Fa, then T3 needs
to lock the database and area A1 in IS mode. At last, it needs
to lock Fa in S mode.
o If transaction T4 reads the entire database, then T4 needs to
lock the database in S mode.

Recovery with Concurrent Transaction:-


o Whenever more than one transaction is being executed, the
logs of the transactions become interleaved. At the time of
recovery, it would become difficult for the recovery system to
backtrack through all the logs and then start recovering.
o To ease this situation, the 'checkpoint' concept is used by most
DBMSs.

Database Transaction:-

A database transaction is a logical unit of processing in a DBMS
which entails one or more database access operations. In a
nutshell, database transactions represent real-world events of any
enterprise.

All database access operations that are held between the beginning
and end transaction statements are considered a single logical
transaction in DBMS. During the transaction, the database may be
inconsistent. Only once the transaction is committed does the
database change from one consistent state to another.
Facts about Database Transactions

• A transaction is a program unit whose execution may or may


not change the contents of a database.
• The transaction concept in DBMS is executed as a single
unit.
• If the database operations do not update the database but
only retrieve data, this type of transaction is called a read-
only transaction.
• A successful transaction can change the database from one
CONSISTENT STATE to another
• DBMS transactions must be atomic, consistent, isolated and
durable
• If the database were in an inconsistent state before a
transaction, it would remain in the inconsistent state after
the transaction.

Why do you need concurrency in Transactions?

• A database is a shared resource accessed by many users and
processes concurrently. Examples include banking systems,
railway and air reservation systems, stock market monitoring,
supermarket inventory and checkouts, etc.

Not managing concurrent access may create issues like:

• Hardware failure and system crashes


• Concurrent execution of the same transaction, deadlock, or
slow performance

States of Transactions

• The various states of a transaction concept in DBMS are


listed below:

Active State: A transaction enters the active state when the
execution process begins. During this state, read or write
operations can be performed.

Partially Committed: A transaction goes into the partially
committed state after the end of the transaction.

Committed State: When a transaction is in the committed state, it
has already completed its execution successfully. Moreover, all of
its changes are recorded in the database permanently.

Failed State: A transaction is considered failed when any one of the
checks fails, or if the transaction is aborted while it is in the
active state.

Terminated State: A transaction reaches the terminated state when
it leaves the system and cannot be restarted.

The following steps describe how a transaction moves between
these various states.

1. Once a transaction starts execution, it becomes active. It can
issue READ or WRITE operations.
2. Once the READ and WRITE operations are complete, the
transaction enters the partially committed state.
3. Next, some recovery protocols need to ensure that a system
failure will not result in an inability to record changes in the
transaction permanently. If this check is a success, the
transaction commits and enters into the committed state.
4. If the check is a fail, the transaction goes to the Failed state.
5. If the transaction is aborted while it's in the active state, it
goes to the failed state. The transaction should be rolled back
to undo the effect of its write operations on the database.
6. The terminated state refers to the transaction leaving the
system.

What are ACID Properties?

ACID Properties are used for maintaining the integrity of


database during transaction processing. ACID in DBMS stands
for Atomicity, Consistency, Isolation, and Durability.
• Atomicity: A transaction is a single unit of operation. You
either execute it entirely or do not execute it at all. There
cannot be partial execution.
• Consistency: Once the transaction is executed, it should
move from one consistent state to another.
• Isolation: A transaction should be executed in isolation from
other transactions, as if it were the only transaction in the
system. During concurrent transaction execution, intermediate
results from simultaneously executed transactions should not
be made available to each other. (Isolation levels 0, 1, 2, 3.)
• Durability: After successful completion of a transaction, the
changes in the database should persist, even in the case of
system failures. (A small sketch of atomicity follows this list.)
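A small, self-contained sketch of atomicity using Python's built-in sqlite3 module; the account table and the transfer amounts are invented for illustration:

import sqlite3

# Atomicity: the debit and credit below form one transaction; a failure before
# commit() rolls both of them back, so partial execution is never visible.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 500), (2, 500)")
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
    raise RuntimeError("crash before the matching credit")   # simulated failure
    conn.execute("UPDATE account SET balance = balance + 100 WHERE id = 2")
    conn.commit()
except RuntimeError:
    conn.rollback()                 # undo the debit: the database stays consistent

print(conn.execute("SELECT balance FROM account ORDER BY id").fetchall())
# [(500,), (500,)] - neither half of the transfer was applied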

Schedule:-

A schedule is a process of creating a single group from multiple
parallel transactions and executing them one by one. It should
preserve the order in which the instructions appear in each
transaction. If two transactions are executed at the same time, the
result of one transaction may affect the output of the other.

In other words, a series of operations from one transaction to
another transaction is known as a schedule. It is used to preserve
the order of the operations in each of the individual transactions.
1. Serial Schedule:-

The serial schedule is a type of schedule where one transaction is


executed completely before starting another transaction. In the
serial schedule, when the first transaction completes its cycle,
then the next transaction is executed.

For example: Suppose there are two transactions T1 and T2 which


have some operations. If it has no interleaving of operations, then
there are the following two possible outcomes:

1. Execute all the operations of T1 followed by all the
operations of T2.
2. Execute all the operations of T2 followed by all the
operations of T1.
o In the given (a) figure, Schedule A shows the serial schedule
where T1 followed by T2.
o In the given (b) figure, Schedule B shows the serial schedule
where T2 followed by T1.

2. Non-serial Schedule:-

o If interleaving of operations is allowed, then there will be


non-serial schedule.
o It contains many possible orders in which the system can
execute the individual operations of the transactions.
o In the given figure (c) and (d), Schedule C and Schedule D
are the non-serial schedules. It has interleaving of
operations.

3. Serializable schedule:-

o The serializability of schedules is used to find non-serial


schedules that allow the transaction to execute concurrently
without interfering with one another.
o It identifies which schedules are correct when executions of
the transaction have interleaving of their operations.
o A non-serial schedule will be serializable if its result is equal
to the result of its transactions executed serially.
Here,

o Schedule A and Schedule B are serial schedule.


o Schedule C and Schedule D are Non-serial schedule.
Testing of Serializability: -
Serialization Graph is used to test the Serializability of a schedule.

Assume a schedule S. For S, we construct a graph known as a precedence
graph. This graph is a pair G = (V, E), where V is a set of vertices and E is a
set of edges. The set of vertices contains all the transactions participating in
the schedule. The set of edges contains all edges Ti → Tj for which one of the
following three conditions holds:

1. Create an edge Ti → Tj if Ti executes write(Q) before Tj executes read(Q).
2. Create an edge Ti → Tj if Ti executes read(Q) before Tj executes write(Q).
3. Create an edge Ti → Tj if Ti executes write(Q) before Tj executes write(Q).

o If a precedence graph contains a single edge Ti → Tj, then all the instructions of
Ti are executed before the first instruction of Tj is executed.

o If the precedence graph for schedule S contains a cycle, then S is non-serializable.
If the precedence graph has no cycle, then S is serializable (a small sketch follows).
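A compact sketch that builds a precedence graph from a schedule given as (transaction, operation, item) triples and tests it for a cycle; the representation and function names are illustrative, not from the notes:

# Build the precedence graph of a schedule and check it for cycles.
# A schedule is a list of (txn, op, item) triples with op in {"R", "W"}.

def precedence_graph(schedule):
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # conflicting: different transactions, same item, at least one write
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges.add((ti, tj))          # Ti's operation precedes Tj's conflicting one
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    def reachable(start, target, seen):
        for nxt in graph.get(start, ()):
            if nxt == target or (nxt not in seen and reachable(nxt, target, seen | {nxt})):
                return True
        return False
    return any(reachable(u, u, {u}) for u in graph)

s = [("T2", "R", "A"), ("T1", "W", "A"), ("T2", "W", "A")]
edges = precedence_graph(s)      # contains both T2 -> T1 and T1 -> T2
print(has_cycle(edges))          # True -> this schedule is not serializable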

For example:
Explanation:

Read(A): In T1, no subsequent writes to A, so no new edges


Read(B): In T2, no subsequent writes to B, so no new edges
Read(C): In T3, no subsequent writes to C, so no new edges
Write(B): B is subsequently read by T3, so add edge T2 → T3
Write(C): C is subsequently read by T1, so add edge T3 → T1
Write(A): A is subsequently read by T2, so add edge T1 → T2
Write(A): In T2, no subsequent reads to A, so no new edges
Write(C): In T1, no subsequent reads to C, so no new edges
Write(B): In T3, no subsequent reads to B, so no new edges
Precedence graph for schedule S1:

The precedence graph for schedule S1 contains a cycle that's why Schedule
S1 is non-serializable.
Explanation:

Read(A): In T4, no subsequent writes to A, so no new edges
Read(C): In T4, no subsequent writes to C, so no new edges
Write(A): A is subsequently read by T5, so add edge T4 → T5
Read(B): In T5, no subsequent writes to B, so no new edges
Write(C): C is subsequently read by T6, so add edge T4 → T6
Write(B): B is subsequently read by T6, so add edge T5 → T6
Write(C): In T6, no subsequent reads to C, so no new edges
Write(A): In T5, no subsequent reads to A, so no new edges
Write(B): In T6, no subsequent reads to B, so no new edges

Precedence graph for schedule S2:

The precedence graph for schedule S2 contains no cycle; that's why
schedule S2 is serializable.

Conflict Serializable schedule: -


A schedule is said to be conflict serializable if it can transform into a serial
schedule after swapping of non-conflicting operations. It is a type of
serializability that can be used to check whether the non-serial schedule is
conflict serializable or not.

Conflict operations: -
Two operations are called conflicting operations if all the following
three conditions are satisfied:
• Both operations belong to separate transactions.
• Both operations work on the same data item.
• At least one of them is a write operation.
Note: Conflict pairs for the same data item are:
Read-Write
Write-Write
Write-Read

Conflict Equivalent Schedule: -

Two schedules are called as a conflict equivalent schedule if one schedule


can be transformed into another schedule by swapping non-conflicting
operations.

Example of conflict serializability:


Schedule S2 (Non-Serial Schedule):

Time   Transaction T1   Transaction T2   Transaction T3
t1     Read(X)
t2                                       Read(Y)
t3                                       Read(X)
t4                      Read(Y)
t5                      Read(Z)
t6                                       Write(Y)
t7                      Write(Z)
t8     Read(Z)
t9     Write(X)
t10    Write(Z)

Precedence graph for schedule S2:


In the above schedule, there are three transactions: T1, T2, and T3. So, the
precedence graph contains three vertices.
To draw the edges between these nodes or vertices, follow the below steps:
Step1: At time t1, there is no conflicting operation for read(X) of
Transaction T1.
Step2: At time t2, there is no conflicting operation for read(Y) of
Transaction T3.
Step3: At time t3, there exists a conflicting operation Write(X) in
transaction T1 for read(X) of Transaction T3. So, draw an edge from
T3→T1.

Step4: At time t4, there exists a conflicting operation Write(Y) in


transaction T3 for read(Y) of Transaction T2. So, draw an edge from
T2→T3.
Step5: At time t5, there exists a conflicting operation Write (Z) in
transaction T1 for read (Z) of Transaction T2. So, draw an edge from
T2→T1.

Step6: At time t6, there is no conflicting operation for Write(Y) of


Transaction T3.
Step7: At time t7, there exists a conflicting operation Write (Z) in
transaction T1 for Write (Z) of Transaction T2. So, draw an edge from
T2→T1, but it is already drawn.
After all the steps, the precedence graph will be ready, and it does not
contain any cycle or loop, so the above schedule S2 is conflict serializable.
And it is equivalent to a serial schedule. Above schedule S2 is transformed
into the serial schedule by using the following steps:

Step1: Check the vertex in the precedence graph where indegree=0. So,
take the vertex T2 from the graph and remove it from the graph.

Step 2: Again check for a vertex in the remaining precedence graph
where indegree = 0. Take the vertex T3 from the graph and remove it,
and draw the edge from T2 → T3.

Step 3: At last, take the vertex T1 and connect it after T3.
Precedence graph equivalent to schedule S2:

Schedule S2 (Serial Schedule): -

Time   Transaction T1   Transaction T2   Transaction T3
t1                      Read(Y)
t2                      Read(Z)
t3                      Write(Z)
t4                                       Read(Y)
t5                                       Read(X)
t6                                       Write(Y)
t7     Read(X)
t8     Read(Z)
t9     Write(X)
t10    Write(Z)

Schedule S3 (Non-Serial Schedule): -

Time   Transaction T1   Transaction T2
t1     Read(X)
t2                      Read(X)
t3     Read(Y)
t4     Write(Y)
t5                      Read(Y)
t6                      Write(X)
To convert this schedule into a serial schedule, swap the non-conflicting
operations of T1 and T2.

Time   Transaction T1   Transaction T2
t1     Read(X)
t2     Read(Y)
t3     Write(Y)
t4                      Read(X)
t5                      Read(Y)
t6                      Write(X)
Then, finally get a serial schedule after swapping all the non-conflicting
operations, so this schedule is conflict serializable.

View Serializability: -
View serializability is a process to find out whether a given schedule is view
serializable or not.

To check whether a given schedule is view serializable, we need to check


whether the given schedule is View Equivalent to its serial schedule.

If a given schedule is found to be view equivalent to some serial schedule,


then it is called as a view serializable schedule.

View Equivalent Schedules-

Consider two schedules S1 and S2 each consisting of two transactions T1


and T2.
Schedules S1 and S2 are called view equivalent if the following three
conditions hold true for them-

Condition-01:

For each data item X, if transaction Ti reads X from the database initially in
schedule S1, then in schedule S2 also, Ti must perform the initial read of X
from the database.

Thumb Rule
“Initial readers must be same for all the data items”.

Condition-02:
If transaction Ti reads a data item that has been updated by the transaction
Tj in schedule S1, then in schedule S2 also, transaction Ti must read the
same data item that has been updated by the transaction Tj.

Thumb Rule
“Write-read sequence must be same.”.

Condition-03:

For each data item X, if X has been updated at last by transaction Ti in


schedule S1, then in schedule S2 also, X must be updated at last by
transaction Ti.

Thumb Rule
“Final writers must be same for all the data items”.
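A small sketch of these three conditions, reusing the (transaction, operation, item) representation from the precedence-graph sketch and assuming, for simplicity, that each transaction reads a given item at most once; the names are illustrative:

# View equivalence: same initial reads, same reads-from pairs, same final writers.

def reads_from_and_final_writes(schedule):
    last_writer = {}      # item -> txn that wrote it most recently (None = initial value)
    reads = set()         # (reader, item, writer_or_None): covers conditions 1 and 2
    for txn, op, item in schedule:
        if op == "R":
            reads.add((txn, item, last_writer.get(item)))
        else:
            last_writer[item] = txn
    return reads, last_writer                 # last_writer now holds the final writers

def view_equivalent(s1, s2):
    return reads_from_and_final_writes(s1) == reads_from_and_final_writes(s2)

s_a = [("T1", "R", "X"), ("T2", "W", "X"), ("T1", "W", "X")]   # interleaved
s_b = [("T2", "W", "X"), ("T1", "R", "X"), ("T1", "W", "X")]   # serial T2 then T1
print(view_equivalent(s_a, s_b))   # False: in s_a, T1 reads the initial value of X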

Checking Whether a Schedule is View Serializable Or Not-

Method-01:

Check whether the given schedule is conflict serializable or not.


• If the given schedule is conflict serializable, then it is surely view
serializable. Stop and report your answer.
• If the given schedule is not conflict serializable, then it may or may
not be view serializable. Go and check using other methods.
Thumb Rules
• All conflict serializable schedules are view serializable.
• All view serializable schedules may or may not be conflict
serializable.

Method-02:

Check if there exists any blind write operation.


(Writing without reading is called as a blind write).
• If there does not exist any blind write, then the schedule is surely not
view serializable (given that it has already been found not conflict
serializable). Stop and report your answer.
• If there exists any blind write, then the schedule may or may not be
view serializable. Go and check using other methods.

Thumb Rule
If the schedule is not conflict serializable and has no blind write, it is not view serializable.

Method-03:

In this method, try finding a view equivalent serial schedule.


• By using the above three conditions, write all the dependencies.
• Then, draw a graph using those dependencies.
• If there exists no cycle in the graph, then the schedule is view
serializable otherwise not.

Example-01:

Check whether the given schedule S is view serializable or not-


Solution-

• We know, if a schedule is conflict serializable, then it is surely view


serializable.
• So, let us check whether the given schedule is conflict serializable or
not.

Checking Whether S is Conflict Serializable Or Not-

Step-01:

List all the conflicting operations and determine the dependency between
the transactions-
• W1(B) , W2(B) (T1 → T2)
• W1(B) , W3(B) (T1 → T3)
• W1(B) , W4(B) (T1 → T4)
• W2(B) , W3(B) (T2 → T3)
• W2(B) , W4(B) (T2 → T4)
• W3(B) , W4(B) (T3 → T4)

Step-02:

Draw the precedence graph-

• Clearly,
there exists no cycle in the precedence graph.
• Therefore, the given schedule S is conflict serializable.
• Thus, we conclude that the given schedule is also view serializable.

Example-02:

Check whether the given schedule S is view serializable or not-


Solution-

• We know, if a schedule is conflict serializable, then it is surely view


serializable.
• So, let us check whether the given schedule is conflict serializable or
not.

Checking Whether S is Conflict Serializable Or Not-

Step-01:

List all the conflicting operations and determine the dependency between
the transactions-
• R1(A) , W3(A) (T1 → T3)
• R2(A) , W3(A) (T2 → T3)
• R2(A) , W1(A) (T2 → T1)
• W3(A) , W1(A) (T3 → T1)

Step-02:
Draw the precedence graph-

• Clearly,
there exists a cycle in the precedence graph.
• Therefore, the given schedule S is not conflict serializable.

Now,
• Since, the given schedule S is not conflict serializable, so, it may or
may not be view serializable.
• To check whether S is view serializable or not, let us use another
method.
• Let us check for blind writes.

Checking for Blind Writes-

• There exists a blind write W3 (A) in the given schedule S.


• Therefore, the given schedule S may or may not be view serializable.

Now,
• To check whether S is view serializable or not, let us use another
method.
• Let us derive the dependencies and then draw a dependency graph.

Drawing a Dependency Graph-


• T1 firstly reads A and T3 firstly updates A.
• So, T1 must execute before T3.
• Thus, we get the dependency T1 → T3.
• Final updation on A is made by the transaction T1.
• So, T1 must execute after all other transactions.
• Thus, we get the dependency (T2, T3) → T1.
• There exists no write-read sequence.

Now, let us draw a dependency graph using these dependencies-

• Clearly,
there exists a cycle in the dependency graph.
• Thus, we conclude that the given schedule S is not view serializable.

Example-03:

Check whether the given schedule S is view serializable or not-


Solution-

• We know, if a schedule is conflict serializable, then it is surely view


serializable.
• So, let us check whether the given schedule is conflict serializable or
not.

Checking Whether S is Conflict Serializable Or Not-


Step-01:

List all the conflicting operations and determine the dependency between
the transactions-
• R1(A) , W2(A) (T1 → T2)
• R2(A) , W1(A) (T2 → T1)
• W1(A) , W2(A) (T1 → T2)
• R1(B) , W2(B) (T1 → T2)
• R2(B) , W1(B) (T2 → T1)

Step-02:

Draw the precedence graph-

• Clearly,
there exists a cycle in the precedence graph.
• Therefore, the given schedule S is not conflict serializable.

Now,
• Since, the given schedule S is not conflict serializable, so, it may or
may not be view serializable.
• To check whether S is view serializable or not, let us use another
method.
• Let us check for blind writes.

Checking for Blind Writes-


• There exists no blind write in the given schedule S.
• Therefore, it is surely not view serializable.

Alternatively,
• Once the schedule is found not conflict serializable, you could directly
declare that it is not view serializable.
• This is because there exists no blind write in the schedule.
• In that case, the detailed dependency analysis is not needed.

Recoverability of schedule: -
Sometimes a transaction may not execute completely due to a software
issue, system crash or hardware failure. In that case, the failed transaction
has to be rolled back. But some other transaction may also have used a value
produced by the failed transaction, so we also have to roll back those
transactions.

The above table 1 shows a schedule which has two transactions. T1 reads
and writes the value of A, and that value is read and written by T2. T2
commits, but later on T1 fails. Due to the failure, we have to roll back T1. T2
should also be rolled back because it read the value written by T1, but T2
can't be rolled back because it has already committed. This type of schedule
is known as an irrecoverable schedule.
Irrecoverable schedule: The schedule is irrecoverable if Tj reads the
updated value of Ti and Tj commits before Ti commits.

The above table 2 shows a schedule with two transactions. Transaction T1
reads and writes A, and that value is read and written by transaction T2.
But later on, T1 fails. Due to this, we have to roll back T1. T2 should be
rolled back because T2 has read the value written by T1. As T2 has not
committed before T1 commits, we can roll back transaction T2 as well. So
it is recoverable with cascading rollback.

Recoverable with cascading rollback: The schedule is recoverable
with cascading rollback if Tj reads the updated value of Ti and the commit
of Tj is delayed till the commit of Ti.

The above table 3 shows a schedule with two transactions. Transaction T1
reads and writes A and commits, and only then is that value read and
written by T2. So this is a cascadeless recoverable schedule.
Failure Classification

To find that where the problem has occurred, we generalize a failure into
the following categories:

1. Transaction failure
2. System crash
3. Disk failure

1. Transaction failure

The transaction failure occurs when it fails to execute or when it


reaches a point from where it can't go any further. If a few transaction
or process is hurt, then this is called as transaction failure.

Reasons for a transaction failure could be -

1. Logical errors: If a transaction cannot complete due to some


code error or an internal error condition, then the logical error
occurs.
2. Syntax error: It occurs when the DBMS itself terminates an
active transaction because the database system is not able to
execute it. For example, the system aborts an active
transaction in case of deadlock or resource unavailability.

2. System Crash
System failure can occur due to power failure or other hardware
or software failure. Example: operating system error.

Fail-stop assumption: In the system crash, non-volatile storage


is assumed not to be corrupted.

3. Disk Failure
It occurs when hard-disk drives or storage drives fail. It was a
common problem in the early days of technology evolution.
Disk failure occurs due to the formation of bad sectors, a disk
head crash, unreachability of the disk, or any other failure
which destroys all or part of disk storage.

Log-Based Recovery: -
o The log is a sequence of records. Log of each transaction is
maintained in some stable storage so that if any failure occurs, then it
can be recovered from there.
o If any operation is performed on the database, then it will be recorded
in the log.
o But the process of storing the logs should be done before the actual
transaction is applied in the database.

Let's assume there is a transaction to modify the City of a student. The


following logs are written for this transaction.

o When the transaction is initiated, it writes a 'start' log record:
<Tn, Start>
o When the transaction modifies the City from 'Noida' to 'Bangalore',
another log record is written to the file:
<Tn, City, 'Noida', 'Bangalore'>
o When the transaction is finished, it writes another log record to indicate
the end of the transaction:
<Tn, Commit>

There are two approaches to modify the database:

1. Deferred database modification:


o The deferred modification technique occurs if the transaction does
not modify the database until it has committed.
o In this method, all the logs are created and stored in the stable
storage, and the database is updated when a transaction commits.
2. Immediate database modification:
o The Immediate modification technique occurs if database
modification occurs while the transaction is still active.
o In this technique, the database is modified immediately after every
operation. It follows an actual database modification.

Recovery using Log records

When the system is crashed, then the system consults the log to find which
transactions need to be undone and which need to be redone.

1. If the log contains both the records <Ti, Start> and <Ti, Commit>, then the transaction Ti needs to be redone.
2. If the log contains the record <Ti, Start> but contains neither <Ti, Commit> nor <Ti, Abort>, then the transaction Ti needs to be undone.
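
The two rules above can be illustrated with a small Python sketch. This is only an illustration of the decision logic, not a real DBMS recovery module; the tuple-based log-record format used here is an assumption made for the example.

# A minimal sketch of the redo/undo rules above. Log records are assumed to be
# tuples such as ("T1", "Start"), ("T1", "City", "Noida", "Bangalore"),
# ("T1", "Commit") or ("T1", "Abort").
def classify_transactions(log):
    started, committed, aborted = set(), set(), set()
    for record in log:
        txn, action = record[0], record[1]
        if action == "Start":
            started.add(txn)
        elif action == "Commit":
            committed.add(txn)
        elif action == "Abort":
            aborted.add(txn)
    redo = started & committed              # rule 1: Start and Commit present
    undo = started - committed - aborted    # rule 2: Start but no Commit/Abort
    return redo, undo

log = [("T1", "Start"), ("T1", "City", "Noida", "Bangalore"), ("T1", "Commit"),
       ("T2", "Start"), ("T2", "City", "Pune", "Delhi")]
print(classify_transactions(log))           # ({'T1'}, {'T2'}): redo T1, undo T2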

Checkpoint: -
o The checkpoint is a mechanism by which all the previous logs are removed from the system and stored permanently on the storage disk.
o The checkpoint is like a bookmark. During the execution of transactions, such checkpoints are marked, and as a transaction executes, log records are created for its steps.
o When a checkpoint is reached, the updates logged so far are applied to the database, and the log file up to that point is cleared. The log file is then filled with the steps of the transactions that follow, until the next checkpoint, and so on.
o The checkpoint is used to declare a point before which the DBMS was in a consistent state and all transactions were committed.

Recovery using Checkpoint

In the following manner, a recovery system recovers the database from a failure:

o The recovery system reads the log file from the end towards the start, i.e., from T4 back to T1.
o The recovery system maintains two lists: a redo-list and an undo-list.
o A transaction is put into the redo-list if the recovery system sees a log with <Tn, Start> and <Tn, Commit>, or just <Tn, Commit>. All the transactions in the redo-list are redone by re-applying their log records.
o For example: In the log file, transactions T2 and T3 will have both <Tn, Start> and <Tn, Commit>. Transaction T1 will have only <Tn, Commit> in the portion of the log after the checkpoint, because it started before the checkpoint and committed after the checkpoint was crossed. Hence T1, T2 and T3 are put into the redo-list.
o A transaction is put into the undo-list if the recovery system sees a log with <Tn, Start> but no commit or abort record. All the transactions in the undo-list are undone, and their logs are removed.
o For example: Transaction T4 will have only <Tn, Start>. So T4 is put into the undo-list, since this transaction is not yet complete and failed midway.
Blocking: -
Blocking occurs when one or more sessions request a lock on the same resource, such as a row, page, or table. When one connection holds a lock and a second connection requires a conflicting lock, the second connection is blocked until the first connection completes.

In other words, database blocking occurs when a connection to the SQL server locks one or more records, and a second connection to the SQL server requires a conflicting lock type on the record or records locked by the first connection. This results in the second connection waiting until the first connection releases its locks. A connection waits, by default, an unlimited amount of time for the blocking lock to cease. One connection can block another connection, regardless of whether they emanate from the same application or from separate applications on different client computers.

A certain amount of blocking is normal and unavoidable, but too much blocking can cause connections (representing applications and users) to wait for extensive periods of time.

Deadlock in DBMS: -

A deadlock is a condition where two or more transactions are waiting


indefinitely for one another to give up locks. Deadlock is said to be one of
the most feared complications in DBMS as no task ever gets finished and
is in waiting state forever.

For example: In the student table, transaction T1 holds a lock on some


rows and needs to update some rows in the grade table. Simultaneously,
transaction T2 holds locks on some rows in the grade table and needs to
update the rows in the Student table held by Transaction T1.

Now, the main problem arises. Now Transaction T1 is waiting for T2 to


release its lock and similarly, transaction T2 is waiting for T1 to release
its lock. All activities come to a halt state and remain at a standstill. It
will remain in a standstill until the DBMS detects the deadlock and
aborts one of the transactions.

Deadlock Avoidance

o When a database is stuck in a deadlock state, it is better to avoid the deadlock rather than abort or restart the transactions, as that is a waste of time and resources.
o Deadlock avoidance mechanism is used to detect any deadlock
situation in advance. A method like "wait for graph" is used for
detecting the deadlock situation but this method is suitable only for
the smaller database. For the larger database, deadlock prevention
method can be used.

Deadlock Detection

In a database, when a transaction waits indefinitely to obtain a lock, the DBMS should detect whether the transaction is involved in a deadlock or not. The lock manager maintains a wait-for graph to detect a deadlock cycle in the database.
Wait-for Graph
o This is a suitable method for deadlock detection. In this method, a graph is created based on the transactions and their locks. If the created graph has a cycle or closed loop, then there is a deadlock.
o The wait-for graph is maintained by the system for every transaction which is waiting for some data held by the others. The system keeps checking whether there is any cycle in the graph.

In the wait-for graph for the above scenario, T1 waits for T2 and T2 waits for T1, forming a cycle and hence a deadlock.
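
A wait-for graph can be checked for cycles with a simple depth-first search. The sketch below is only an illustration of the idea (the dictionary-based graph representation is an assumption for the example), not the lock manager of any particular DBMS.

# A minimal sketch of deadlock detection on a wait-for graph. The graph maps
# each transaction to the set of transactions it is waiting for; any cycle
# means a deadlock.
def has_deadlock(wait_for):
    visiting, done = set(), set()

    def visit(txn):
        if txn in visiting:              # back edge found -> cycle -> deadlock
            return True
        if txn in done:
            return False
        visiting.add(txn)
        for waited_on in wait_for.get(txn, ()):
            if visit(waited_on):
                return True
        visiting.remove(txn)
        done.add(txn)
        return False

    return any(visit(t) for t in wait_for)

# T1 waits for T2 and T2 waits for T1, as in the scenario above.
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))   # True
print(has_deadlock({"T1": {"T2"}, "T2": set()}))    # False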

Deadlock Prevention

o Deadlock prevention method is suitable for a large database. If the


resources are allocated in such a way that deadlock never occurs,
then the deadlock can be prevented.
o The database management system analyzes the operations of a transaction to check whether they can create a deadlock situation or not. If they can, then the DBMS never allows that transaction to be executed.
Wait-Die scheme

In this scheme, if a transaction requests a resource which is already held with a conflicting lock by another transaction, then the DBMS simply checks the timestamps of both transactions and allows only the older transaction to wait until the resource is available for execution.

Let's assume there are two transactions Ti and Tj, and let TS(T) be the timestamp of any transaction T. Suppose Tj holds a lock on some resource and Ti requests that resource. The DBMS then performs the following checks:

1. Check if TS(Ti) < TS(Tj) - If Ti is the older transaction and Tj has held some resource, then Ti is allowed to wait until the data item is available for execution. That means if the older transaction is waiting for a resource which is locked by the younger transaction, then the older transaction is allowed to wait for the resource until it is available.
2. Check if TS(Ti) > TS(Tj) - If Ti is the younger transaction and Tj has held some resource, then Ti is killed (rolled back) and restarted later with a random delay but with the same timestamp.

Wound wait scheme


o In the wound-wait scheme, if the older transaction requests a resource which is held by the younger transaction, then the older transaction forces the younger one to be killed (rolled back) and release the resource. After a small delay, the younger transaction is restarted but with the same timestamp.
o If the older transaction has held a resource which is requested by the younger transaction, then the younger transaction is asked to wait until the older one releases it.
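
The decision made by each scheme can be summarized in a small Python sketch. Timestamps are modelled simply as integers (smaller means older); this is only an illustration of the rules above, not production locking code.

# Wait-die and wound-wait decisions when 'requester' asks for a resource
# currently held by 'holder'. TS is modelled as an integer timestamp.
def wait_die(ts_requester, ts_holder):
    # Older requester waits; younger requester dies (rolls back, restarts with same timestamp).
    return "WAIT" if ts_requester < ts_holder else "DIE (rollback, restart later)"

def wound_wait(ts_requester, ts_holder):
    # Older requester wounds (rolls back) the younger holder; younger requester waits.
    return "WOUND the holder (holder rolls back)" if ts_requester < ts_holder else "WAIT"

print(wait_die(5, 9))     # older requests from younger  -> WAIT
print(wait_die(9, 5))     # younger requests from older  -> DIE ...
print(wound_wait(5, 9))   # older requests from younger  -> WOUND the holder ...
print(wound_wait(9, 5))   # younger requests from older  -> WAIT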

Relational Algebra: -
Every database management system must define a query language to allow
users to access the data stored in the database. Relational Algebra is a
procedural query language used to query the database tables to access data
in different ways.

Types of relational operation:-

1. Select Operation:-
o The select operation selects tuples that satisfy a given predicate.
o It is denoted by sigma (σ).

Notation: σ p(r)

Where:

σ is used for the selection operation,
r is used for the relation, and
p is a propositional logic formula (the selection predicate) which may use connectors like AND, OR and NOT, and relational operators such as =, ≠, ≥, <, >, ≤.

For example: LOAN Relation

BRANCH_NAME LOAN_NO AMOUNT


Downtown L-17 1000
Redwood L-23 2000
Perryride L-15 1500
Downtown L-14 1500
Mianus L-13 500
Roundhill L-11 900
Perryride L-16 1300

Input:

σ BRANCH_NAME="Perryride" (LOAN)

Output:

BRANCH_NAME LOAN_NO AMOUNT


Perryride L-15 1500
Perryride L-16 1300

2. Project Operation:
o This operation shows the list of those attributes that we wish to
appear in the result. Rest of the attributes are eliminated from the
table.
o It is denoted by ∏.

Notation: ∏ A1, A2, ..., An (r)

Where

A1, A2, ..., An are attribute names of relation r.

Example: CUSTOMER RELATION

NAME STREET CITY


Jones Main Harrison
Smith North Rye
Hays Main Harrison
Curry North Rye
Johnson Alma Brooklyn
Brooks Senator Brooklyn
Input:

∏ NAME, CITY (CUSTOMER)

Output:

NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
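
The select and project operations can be imitated in a few lines of Python over relations stored as lists of dictionaries (one dictionary per tuple). This is only a sketch to show what σ and ∏ compute, not how an RDBMS implements them.

# σ as a filter and ∏ as a column cut with duplicate elimination.
LOAN = [
    {"BRANCH_NAME": "Downtown",  "LOAN_NO": "L-17", "AMOUNT": 1000},
    {"BRANCH_NAME": "Perryride", "LOAN_NO": "L-15", "AMOUNT": 1500},
    {"BRANCH_NAME": "Perryride", "LOAN_NO": "L-16", "AMOUNT": 1300},
]

def select(relation, predicate):
    # σ predicate (relation): keep only the tuples satisfying the predicate.
    return [t for t in relation if predicate(t)]

def project(relation, attributes):
    # ∏ attributes (relation): keep only the listed attributes, dropping duplicates.
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attributes, row)))
    return result

print(select(LOAN, lambda t: t["BRANCH_NAME"] == "Perryride"))   # the two Perryride loans
print(project(LOAN, ["BRANCH_NAME"]))                            # duplicates eliminated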

3. Union Operation:
o Suppose there are two relations R and S. The union operation contains all the tuples that are either in R or in S or in both R and S.
o It eliminates duplicate tuples. It is denoted by ∪.

Notation: R ∪ S

A union operation must hold the following conditions:

o R and S must have the same number of attributes (with compatible domains).
o Duplicate tuples are eliminated automatically.

Example:

DEPOSITOR RELATION

CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284

BORROW RELATION

CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17

Input:

∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes

4. Set Intersection:
o Suppose there are two relations R and S. The set intersection operation contains all tuples that are in both R and S.
o It is denoted by the intersection symbol ∩.

Notation: R ∩ S

Example: Using the above DEPOSITOR table and BORROW table

Input:

∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Smith
Jones

5. Set Difference:
o Suppose there are two relations R and S. The set difference operation contains all tuples that are in R but not in S.
o It is denoted by the minus symbol (−).

Notation: R - S
Example: Using the above DEPOSITOR table and BORROW table

Input:

∏ CUSTOMER_NAME (BORROW) − ∏ CUSTOMER_NAME (DEPOSITOR)

Output:

CUSTOMER_NAME
Jackson
Hayes
Williams
Curry
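
Because these three operations are ordinary set operations, they can be demonstrated directly with Python sets. The sketch below uses the customer names from the DEPOSITOR and BORROW relations above; it is only an illustration, not DBMS code.

depositor = {"Johnson", "Smith", "Mayes", "Turner", "Jones", "Lindsay"}
borrow    = {"Jones", "Smith", "Hayes", "Jackson", "Curry", "Williams"}

print(borrow | depositor)   # union R ∪ S: names appearing in either relation
print(borrow & depositor)   # intersection R ∩ S: {'Smith', 'Jones'}
print(borrow - depositor)   # difference R − S: {'Hayes', 'Jackson', 'Curry', 'Williams'}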

6. Cartesian product
o The Cartesian product is used to combine each row in one table with
each row in the other table. It is also known as a cross product.
o It is denoted by X.

Notation: E X D
Example:

EMPLOYEE

EMP_ID EMP_NAME EMP_DEPT


1 Smith A
2 Harry C
3 John B

DEPARTMENT

DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal

Input:
EMPLOYEE X DEPARTMENT

Output:

EMP_ID EMP_NAME EMP_DEPT DEPT_NO DEPT_NAME


1 Smith A A Marketing
1 Smith A B Sales
1 Smith A C Legal
2 Harry C A Marketing
2 Harry C B Sales
2 Harry C C Legal
3 John B A Marketing
3 John B B Sales
3 John B C Legal

7. Rename Operation:

The rename operation is used to rename the output relation. It is denoted


by rho (ρ).

Example: We can use the rename operator to rename STUDENT relation


to STUDENT1.

ρ(STUDENT1, STUDENT)

Join Operations:

A Join operation combines related tuples from different relations, if and


only if a given join condition is satisfied. It is denoted by ⋈.

Example:

EMPLOYEE

EMP_CODE EMP_NAME
101 Stephan
102 Jack
103 Harry

SALARY

EMP_CODE SALARY
101 50000
102 30000
103 25000

Operation: (EMPLOYEE ⋈ SALARY)

Result:

EMP_CODE EMP_NAME SALARY


101 Stephan 50000
102 Jack 30000
103 Harry 25000

Types of Join operations:


1. Natural Join:
o A natural join is the set of tuples of all combinations in R and S that
are equal on their common attribute names.
o It is denoted by ⋈.

Example: Let's use the above EMPLOYEE table and SALARY table:

Input:

∏EMP_NAME, SALARY (EMPLOYEE ⋈ SALARY)


Output:

EMP_NAME SALARY
Stephan 50000
Jack 30000
Harry 25000
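
A natural join can be sketched in Python by combining every pair of tuples that agree on all common attribute names. The nested-loop approach below is only an illustration of the definition; a real DBMS would normally use hash- or sort-based join algorithms.

EMPLOYEE = [{"EMP_CODE": 101, "EMP_NAME": "Stephan"},
            {"EMP_CODE": 102, "EMP_NAME": "Jack"},
            {"EMP_CODE": 103, "EMP_NAME": "Harry"}]
SALARY   = [{"EMP_CODE": 101, "SALARY": 50000},
            {"EMP_CODE": 102, "SALARY": 30000},
            {"EMP_CODE": 103, "SALARY": 25000}]

def natural_join(r, s):
    common = set(r[0]) & set(s[0])          # the common attribute names
    result = []
    for t1 in r:
        for t2 in s:
            if all(t1[a] == t2[a] for a in common):
                result.append({**t1, **t2})
    return result

for row in natural_join(EMPLOYEE, SALARY):
    print(row)      # same three rows as the EMPLOYEE ⋈ SALARY result above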
2. Outer Join:

The outer join operation is an extension of the join operation. It is used to


deal with missing information.

Example:

EMPLOYEE

EMP_NAME STREET CITY
Ram Civil line Mumbai
Shyam Park street Kolkata
Ravi M.G. Street Delhi
Hari Nehru nagar Hyderabad

FACT_WORKERS

EMP_NAME BRANCH SALARY


Ram Infosys 10000
Shyam Wipro 20000
Kuber HCL 30000
Hari TCS 50000

Input:

(EMPLOYEE ⋈ FACT_WORKERS)

Output:

EMP_NAME STREET CITY BRANCH SALARY


Ram Civil line Mumbai Infosys 10000
Shyam Park street Kolkata Wipro 20000
Hari Nehru nagar Hyderabad TCS 50000

An outer join is basically of three types:

a. Left outer join


b. Right outer join
c. Full outer join
a. Left outer join:
o Left outer join contains the set of tuples of all combinations in R and S that are equal on their common attribute names.
o In addition, tuples in R that have no matching tuples in S are also included, padded with NULL values.
o It is denoted by ⟕.

Example: Using the above EMPLOYEE table and FACT_WORKERS table

Input:

EMPLOYEE ⟕ FACT_WORKERS

Output:

EMP_NAME STREET CITY BRANCH SALARY


Ram Civil line Mumbai Infosys 10000
Shyam Park street Kolkata Wipro 20000
Hari Nehru street Hyderabad TCS 50000
Ravi M.G. Street Delhi NULL NULL

b. Right outer join:


o Right outer join contains the set of tuples of all combinations in R and S that are equal on their common attribute names.
o In addition, tuples in S that have no matching tuples in R are also included, padded with NULL values.
o It is denoted by ⟖.

Example: Using the above EMPLOYEE table and FACT_WORKERS


Relation

Input:

1. EMPLOYEE ⟖ FACT_WORKERS

Output:

EMP_NAME BRANCH SALARY STREET CITY


Ram Infosys 10000 Civil line Mumbai
Shyam Wipro 20000 Park street Kolkata
Hari TCS 50000 Nehru street Hyderabad
Kuber HCL 30000 NULL NULL

c. Full outer join:


o Full outer join is like a left or right outer join except that it contains all rows from both tables.
o In a full outer join, tuples in R that have no matching tuples in S, and tuples in S that have no matching tuples in R, are both included in the result, padded with NULL values on the missing attributes.
o It is denoted by ⟗.

Example: Using the above EMPLOYEE table and FACT_WORKERS table

Input:

EMPLOYEE ⟗ FACT_WORKERS

Output:

EMP_NAME STREET CITY BRANCH SALARY


Ram Civil line Mumbai Infosys 10000
Shyam Park street Kolkata Wipro 20000
Hari Nehru street Hyderabad TCS 50000
Ravi M.G. Street Delhi NULL NULL
Kuber NULL NULL HCL 30000
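
The padding behaviour of an outer join can be sketched as follows. The function keeps every tuple of the left relation and fills the missing right-hand attributes with None (standing in for NULL); a full outer join would additionally keep the unmatched right-hand tuples. This is an illustration only, and the single join key is an assumption made for the example.

EMPLOYEE = [{"EMP_NAME": "Ram",  "CITY": "Mumbai"},
            {"EMP_NAME": "Ravi", "CITY": "Delhi"}]
FACT_WORKERS = [{"EMP_NAME": "Ram", "BRANCH": "Infosys", "SALARY": 10000}]

def left_outer_join(r, s, key):
    s_attrs = [a for a in s[0] if a != key]
    result = []
    for t1 in r:
        matches = [t2 for t2 in s if t2[key] == t1[key]]
        if matches:
            result.extend({**t1, **t2} for t2 in matches)
        else:                                   # no match: pad with NULLs
            result.append({**t1, **{a: None for a in s_attrs}})
    return result

for row in left_outer_join(EMPLOYEE, FACT_WORKERS, "EMP_NAME"):
    print(row)      # Ravi appears with BRANCH=None and SALARY=None, as in ⟕ above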
3. Equi join:

It is also known as an inner join. It is the most common join. It is based on


matched data as per the equality condition. The equi join uses the
comparison operator(=).

Example:

CUSTOMER RELATION

CLASS_ID NAME
1 John
2 Harry
3 Jackson

PRODUCT

PRODUCT_ID CITY
1 Delhi
2 Mumbai
3 Noida

Input:

CUSTOMER ⋈ PRODUCT

Output:

CLASS_ID NAME PRODUCT_ID CITY


1 John 1 Delhi
2 Harry 2 Mumbai
3 Jackson 3 Noida

Introduction to Query Optimization: -


Query: A query is a request for information from a database.
Query Plans: A query plan (or query execution plan) is an ordered set of
steps used to access data in a SQL relational database management
system.
Query Optimization: A single query can be executed through different
algorithms or re-written in different forms and structures. Hence, the
question of query optimization comes into the picture – Which of these
forms or pathways is the most optimal? The query optimizer attempts to
determine the most efficient way to execute a given query by considering
the possible query plans.
Importance: The goal of query optimization is to reduce the system
resources required to fulfill a query, and ultimately provide the user with
the correct result set faster.
• First, it provides the user with faster results, which makes the
application seem faster to the user.
• Secondly, it allows the system to service more queries in the same
amount of time, because each request takes less time than
unoptimized queries.
• Thirdly, query optimization ultimately reduces the amount of
wear on the hardware (e.g. disk drives), and allows the server to
run more efficiently (e.g. lower power consumption, less memory
usage).
There are broadly two ways a query can be optimized:
1. Analyze and transform equivalent relational expressions: Try to
minimize the tuple and column counts of the intermediate and
final query processes (discussed here).
2. Using different algorithms for each operation: These underlying
algorithms determine how tuples are accessed from the data
structures they are stored in, indexing, hashing, data retrieval and
hence influence the number of disk and block accesses (discussed
in query processing).

Query Processing in DBMS: -


Query Processing is the activity performed in extracting data from the
database. In query processing, it takes various steps for fetching the data
from the database. The steps involved are:

1. Parsing and translation


2. Optimization
3. Evaluation

The query processing works in the following way:

Parsing and Translation

Query processing includes certain activities for data retrieval. Initially, the user query, given in a high-level database language such as SQL, gets translated into expressions that can be used at the physical level of the file system. After this, the actual evaluation of the query, along with a variety of query-optimizing transformations, takes place.

Before processing a query, the system needs to translate it from the form convenient for humans into a form suitable for internal use. SQL, or Structured Query Language, is the best choice for humans, but it is not perfectly suitable for the internal representation of the query; relational algebra is well suited for that internal representation.

The translation process in query processing is similar to the parsing of a query. When a user executes a query, the parser in the system checks the syntax of the query, verifies the names of the relations in the database, the tuples, and finally the required attribute values, in order to generate the internal form of the query. The parser creates a tree of the query, known as the 'parse tree', and then translates it into relational algebra. In doing so, it also replaces all uses of views appearing in the query.

Thus, the working of query processing proceeds through parsing and translation, then optimization, and finally evaluation.

Suppose a user executes a query. As we have learned that there are various
methods of extracting the data from the database. In SQL, a user wants to
fetch the records of the employees whose salary is greater than or equal to
10000. For doing this, the following query is undertaken:
select emp_name from Employee where salary>10000;

Thus, to make the system understand the user query, it needs to be


translated in the form of relational algebra. We can bring this query in the
relational algebra form as:

o π emp_name (σ salary>10000 (Employee))
o π emp_name (σ salary>10000 (π emp_name, salary (Employee)))

After translating the given query, we can execute each relational algebra
operation by using different algorithms. So, in this way, a query processing
begins its working.

Evaluation

For this, with addition to the relational algebra translation, it is required to


annotate the translated relational algebra expression with the instructions
used for specifying and evaluating each operation. Thus, after translating
the user query, the system executes a query evaluation plan.

Query Evaluation Plan


o In order to fully evaluate a query, the system needs to construct a
query evaluation plan.
o The annotations in the evaluation plan may refer to the algorithms to
be used for the particular index or the specific operations.
o Such relational algebra with annotations is referred to as Evaluation
Primitives. The evaluation primitives carry the instructions needed
for the evaluation of the operation.
o Thus, a query evaluation plan defines a sequence of primitive
operations used for evaluating a query. The query evaluation plan is
also referred to as the query execution plan.
o A query execution engine is responsible for generating the output
of the given query. It takes the query execution plan, executes it, and
finally makes the output for the user query.

Optimization
o The cost of query evaluation can vary for different types of queries. Although the system is responsible for constructing the evaluation plan, the user does not need to write the query efficiently.
o Usually, a database system generates an efficient query evaluation plan, which minimizes its cost. This task is performed by the database system and is known as Query Optimization.
o For optimizing a query, the query optimizer should have an estimated
cost analysis of each operation. It is because the overall operation cost
depends on the memory allocations to several operations, execution
costs, and so on.

Finally, after selecting an evaluation plan, the system evaluates the query
and produces the output of the query.

Steps for Query Optimization: -


Query optimization involves three steps, namely query tree generation,
plan generation, and query plan code generation.
Step 1 − Query Tree Generation
A query tree is a tree data structure representing a relational algebra
expression. The tables of the query are represented as leaf nodes. The
relational algebra operations are represented as the internal nodes. The
root represents the query as a whole.
During execution, an internal node is executed whenever its operand
tables are available. The node is then replaced by the result table. This
process continues for all internal nodes until the root node is executed and
replaced by the result table.
For example, let us consider the following schemas −
EMPLOYEE

EmpID EName Salary DeptNo DateOfJoining

DEPARTMENT
DNo DName Location

Example 1
Let us consider the query as the following.
∏ EmpID (σ EName = "ArunKumar" (EMPLOYEE))
In the corresponding query tree, EMPLOYEE is the leaf node, the selection σ EName = "ArunKumar" is an internal node above it, and the projection ∏ EmpID is the root.

Example 2
Let us consider another query involving a join.
∏ EName, Salary ((σ DName = "Marketing" (DEPARTMENT)) ⋈ DNo=DeptNo (EMPLOYEE))
In the corresponding query tree, DEPARTMENT and EMPLOYEE are the leaf nodes, the selection on DEPARTMENT and the join are internal nodes, and the projection is the root.
Step 2 − Query Plan Generation
After the query tree is generated, a query plan is made. A query plan is an
extended query tree that includes access paths for all operations in the
query tree. Access paths specify how the relational operations in the tree
should be performed. For example, a selection operation can have an
access path that gives details about the use of B+ tree index for selection.
Besides, a query plan also states how the intermediate tables should be
passed from one operator to the next, how temporary tables should be
used and how operations should be pipelined/combined.
Step 3− Code Generation
Code generation is the final step in query optimization. It is the executable
form of the query, whose form depends upon the type of the underlying
operating system. Once the query code is generated, the Execution
Manager runs it and produces the results.

Approaches to Query Optimization: -

Among the approaches for query optimization, exhaustive search and


heuristics-based algorithms are mostly used.
Exhaustive Search Optimization
In these techniques, for a query, all possible query plans are initially
generated and then the best plan is selected. Though these techniques
provide the best solution, it has an exponential time and space complexity
owing to the large solution space. For example, dynamic programming
technique.
Heuristic Based Optimization
Heuristic based optimization uses rule-based optimization approaches for
query optimization. These algorithms have polynomial time and space
complexity, which is lower than the exponential complexity of exhaustive
search-based algorithms. However, these algorithms do not necessarily
produce the best query plan.
Some of the common heuristic rules are −
• Perform select and project operations before join operations. This is
done by moving the select and project operations down the query
tree. This reduces the number of tuples available for join.
• Perform the most restrictive select/project operations at first before
the other operations.
• Avoid cross-product operation since they result in very large-sized
intermediate tables.

Methods of Query optimization: -


The query optimizer uses these two techniques to determine which process
or expression to consider for evaluating the query.

1. Cost based Optimization (Physical): -

This is based on the cost of the query. The query can use different paths
based on indexes, constraints, sorting methods etc. This method mainly
uses the statistics like record size, number of records, number of records
per block, number of blocks, table size, whether whole table fits in a block,
organization of tables, uniqueness of column values, size of columns etc.
Suppose, we have series of table joined in a query.

T1 ⋈ T2 ⋈ T3 ⋈ T4 ⋈ T5 ⋈ T6
For above query we can have any order of evaluation. We can start taking
any two tables in any order and start evaluating the query. Ideally, we can
have join combinations in (2(n-1))! / (n-1)! ways. For example, suppose we
have 5 tables involved in join, then we can have 8! / 4! = 1680
combinations. But when query optimizer runs, it does not evaluate in all
these ways always. It uses Dynamic Programming where it generates the
costs for join orders of any combination of tables. It is calculated and
generated only once. This least cost for all the table combination is then
stored in the database and is used for future use. i.e.; say we have a set of
tables, T = { T1 , T2 , T3 .. Tn}, then it generates least cost combination for
all the tables and stores it.

• Dynamic Programming
As we learnt above, the least cost for the joins of any combination of table is
generated here. These values are stored in the database and when those
tables are used in the query, this combination is selected for evaluating the
query.
While generating the cost, it follows below steps :
Suppose we have set of tables, T = {T1 , T2 , T3 .. Tn}, in a DB. It picks the
first table, and computes cost for joining with rest of the tables in set T. It
calculates cost for each of the tables and then chooses the best cost. It
continues doing the same with the rest of the tables in set T. It will generate 2^n − 1 cases, and it selects the lowest cost and stores it. When a query uses those tables, it checks for the costs here, and that combination is used to evaluate the query. This is called dynamic programming.
In this method, the time required to find the optimized query is in the order of 3^n, where n is the number of tables. Suppose we have 5 tables, then the time required is 3^5 = 243, which is less than finding all the combinations of tables and then deciding the best combination (1680). Also, the space required for computing and storing the cost is less, in the order of 2^n. In the above example, it is 2^5 = 32.
• Left Deep Trees
This is another method of determining the cost of the joins. Here, the tables
and joins are represented in the form of trees. The joins always form the
root of the tree and table is kept at the right side of the root. LHS of the root
always point to the next join. Hence it gets deeper and deeper on LHS.
Hence it is called as left deep tree.

Here, instead of calculating the best join cost for a set of tables, the best join cost for joining with each table is calculated. In this method, the time required to find the optimized query is in the order of n·2^n, where n is the number of tables. Suppose we have 5 tables, then the time required is 5·2^5 = 160, which is less than dynamic programming. Also, the space required for computing and storing the cost is less, in the order of 2^n. In the above example, it is 2^5 = 32, the same as dynamic programming.
• Interesting Sort Orders
This method is an enhancement to dynamic programming. Here, while
calculating the best join order costs, it also considers the sorted tables. It
assumes, calculating the join orders on sorted tables would be efficient. i.e.;
suppose we have unsorted tables T1 , T2 , T3 .. Tn and we have join on these
tables.
(T1 ⋈ T2) ⋈ T3 ⋈ … ⋈ Tn
This method uses hash join or merge join method to calculate the cost.
Hash Join will simply join the tables. We get sorted output in merge join
method, but it is costlier than hash join. Even though merge join is costlier
at this stage, when it moves to join with third table, the join will have less
effort to sort the tables. This is because first table is the sorted result of first
two tables. Hence it will reduce the total cost of the query.
But the number of tables involved in the join would be relatively less and
this cost/space difference will be hardly noticeable.
All these cost-based optimizations are expensive and are suitable when the amount of data is large. There is another method of optimization called heuristic optimization, which is less expensive compared to cost-based optimization. A small sketch verifying the join-order counts used above follows.
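
The counts quoted above can be checked with a few lines of Python. This sketch simply evaluates the formula (2(n-1))!/(n-1)! for the number of possible join orders, and the rough 3^n and n·2^n work estimates for dynamic programming and left deep trees; it is a back-of-the-envelope check, not an optimizer.

from math import factorial

def join_orders(n):
    # (2(n-1))! / (n-1)! possible join orders for n tables
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (3, 5, 6):
    print(n, "tables:", join_orders(n), "join orders,",
          3 ** n, "~ dynamic programming,",
          n * 2 ** n, "~ left deep trees")
# For n = 5 this prints 1680, 243 and 160, matching the figures above.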
2. Heuristic Optimization (Logical): -
This method is also known as rule-based optimization. It is based on the equivalence rules on relational expressions; hence the number of query combinations to consider is reduced here, and the cost of the query is reduced as well.
This method creates a relational tree for the given query based on the equivalence rules. These equivalence rules, by providing alternative ways of writing and evaluating the query, give a better path to evaluate the query. A rule need not be beneficial in all cases; it needs to be examined after being applied. The most important set of rules followed in this method is listed below:
• Perform all the selection operation as early as possible in the
query. This should be first and foremost set of actions on the
tables in the query. By performing the selection operation, we
can reduce the number of records involved in the query, rather
than using the whole tables throughout the query.
Suppose we have a query to retrieve the students with age 18 and studying
in class DESIGN_01. We can get all the student details from STUDENT
table, and class details from CLASS table. We can write this query in two
different ways.

Here both queries return the same result. But when we observe them closely, we can see that the first query joins the two tables first and then applies the filters. That means it traverses the whole tables to join, so the number of records involved is larger. The second query applies the filters on each table first. This reduces the number of records in each table (in the CLASS table, the number of records reduces to one in this case!). It then joins these intermediate tables. Hence the cost in this case is comparatively less; a small sketch of this effect appears after the list of heuristic rules below.

Instead of working on the written query, the optimizer creates the relational algebra expression and query tree for the above case.
• Perform all the projection as early as possible in the query. This
is similar to selection but will reduce the number of columns in
the query.
Suppose for example, we have to select only student name, address and
class name of students with age 18 from STUDENT and CLASS tables.

Here again, both queries look alike and give the same result. But when we compare the number of records and attributes involved at each stage, the second query uses fewer records and hence is more efficient.

• Next step is to perform most restrictive joins and selection


operations. When we say most restrictive joins and selection
means, select those set of tables and views which will result in
comparatively less number of records. Any query will have
better performance when tables with few records are joined.
Hence throughout heuristic method of optimization, the rules
are formed to get less number of records at each stage, so that
query performance is better. So is the case here too.
Suppose we have STUDENT, CLASS and TEACHER tables. Any student can
attend only one class in an academic year and only one teacher takes a
class. But a class can have more than 50 students. Now we have to retrieve
STUDENT_NAME, ADDRESS, AGE, CLASS_NAME and
TEACHER_NAME of each student in a school.
∏ STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME ((STUDENT ⋈ CLASS_ID CLASS) ⋈ TECH_ID TEACHER)
Not so efficient
∏ STD_NAME, ADDRESS, AGE, CLASS_NAME, TEACHER_NAME (STUDENT ⋈ CLASS_ID (CLASS ⋈ TECH_ID TEACHER))
Efficient
In the first query, it first joins the records of the students with each class. This results in a very large intermediate table, which is then joined with another small table, so the number of records traversed is larger. In the second query, CLASS and TEACHER are joined first, which is a one-to-one relation here. Hence the number of resulting records is small, and joining this intermediate result with the STUDENT table gives the final result. Hence the second method is more efficient.
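
The effect of the first heuristic rule (pushing selections below the join) can be seen on a small, purely hypothetical STUDENT/CLASS data set. The sketch compares how many tuples flow into the join when the filters are applied late versus early; both orders give the same answer, but the second one joins far fewer tuples.

# Hypothetical data: 1000 students spread over 4 classes.
STUDENT = [{"S_ID": i, "AGE": 17 + i % 3, "CLASS_ID": i % 4} for i in range(1000)]
CLASS   = [{"CLASS_ID": c, "CLASS_NAME": "DESIGN_0%d" % c} for c in range(4)]

def join(r, s, key):
    return [{**t1, **t2} for t1 in r for t2 in s if t1[key] == t2[key]]

# Join first, filter later: all 1000 student tuples take part in the join.
late = [t for t in join(STUDENT, CLASS, "CLASS_ID")
        if t["AGE"] == 18 and t["CLASS_NAME"] == "DESIGN_01"]

# Filter first, join later: only the age-18 students and one class reach the join.
students_18 = [t for t in STUDENT if t["AGE"] == 18]
design_01   = [c for c in CLASS if c["CLASS_NAME"] == "DESIGN_01"]
early = join(students_18, design_01, "CLASS_ID")

print(len(late) == len(early))                 # True: same result either way
print(len(STUDENT), "vs", len(students_18), "tuples fed into the join")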
Sometimes we can combine above heuristic steps with cost based
optimization technique to get better results.

All these methods need not be always true. It also depends on the table size,
column size, type of selection, projection, join sort, constraints, indexes,
statistics etc. Above optimization describes the best way of optimizing the
queries.

External Sort- Merge Algorithm: -


Till now, we saw that sorting is an important term in any database system.
It means arranging the data either in ascending or descending order. We
use sorting not only for generating a sequenced output but also for
satisfying conditions of various database algorithms. In query processing,
the sorting method is used for performing various relational operations
such as joins, etc. efficiently. But the need is to provide a sorted input value
to the system. For sorting any relation, we have to build an index on the
sort key and use that index for reading the relation in sorted order.
However, using an index, we sort the relation logically, not physically.
Thus, sorting is performed for cases that include:

Case 1: Relations that are having either small or medium size than main
memory.

Case 2: Relations having a size larger than the memory size.

In Case 1, the small or medium size relations do not exceed the size of the
main memory. So, we can fit them in memory. So, we can use standard
sorting methods such as quicksort, merge sort, etc., to do so.

For Case 2, the standard algorithms do not work properly. Thus, for such
relations whose size exceeds the memory size, we use the External Sort-
Merge algorithm.

Sorting of relations that do not fit in memory, because their size is larger than the memory size, is known as External Sorting. The external sort-merge is the most suitable method used for external sorting.

External Sort-Merge Algorithm

Here, we will discuss the external-sort merge algorithm stages in detail:

In the algorithm, M signifies the number of disk blocks available in the


main memory buffer for sorting.

Stage 1: Initially, we create a number of sorted runs and sort each of them. These runs contain only a few records of the relation.

i = 0;
repeat
    read M blocks of the relation (or the rest of the relation, if it is smaller);
    sort the in-memory part of the relation;
    write the sorted data to run file Ri;
    i = i + 1;
until the end of the relation

In Stage 1, we can see that we are performing the sorting operation on the
disk blocks. After completing the steps of Stage 1, proceed to Stage 2.

Stage 2: In Stage 2, we merge the runs. Consider that total number of


runs, i.e., N is less than M. So, we can allocate one block to each run and
still have some space left to hold one block of output. We perform the
operation as follows:

read one block of each of the N run files Ri into a buffer block in memory;
repeat
    choose the first tuple (in sort order) among all buffer blocks;
    write the tuple to the output, and delete it from its buffer block;
    if the buffer block of any run Ri becomes empty and not EOF(Ri)
        then read the next block of Ri into that buffer block;
until all input buffer blocks are empty

After completing Stage 2, we will get a sorted relation as an output. The


output file is then buffered for minimizing the disk-write operations. As this
algorithm merges N runs, that's why it is known as an N-way merge.

However, if the relation is much larger than the memory, then M or more runs will be generated in Stage 1, and it is not possible to allocate a single buffer block to each run while processing Stage 2. In such a case, the merge operation proceeds in multiple passes. Since memory holds M-1 input buffer blocks (plus one output block), each merge can take M-1 runs as its input. So the initial pass works in the following way:

o It merges the first M-1 runs to get a single run for the next pass.
o Similarly, it merges the next M-1 runs. This step continues until it has processed all the initial runs. Here, the number of runs is reduced by a factor of M-1. If this reduced number of runs is still greater than or equal to M, another pass is needed; for this new pass, the input is the runs created by the previous pass.
o Each pass thus reduces the number of runs by a factor of M-1. This repeats as many times as needed until the number of runs is less than M.
o A final pass then produces the sorted output. A minimal sketch of this N-way merge is given below.
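
The two stages can be sketched in Python using temporary files for the runs. Here M models the number of records that fit in memory at once (in a real system it is a number of disk blocks), and heapq.merge performs the N-way merge; this is only an illustration of the algorithm, and a real implementation would read one block per run at a time instead of whole runs.

import heapq, os, tempfile

def create_runs(values, M):
    # Stage 1: cut the input into sorted runs, each written to its own temp file.
    run_files = []
    for start in range(0, len(values), M):
        run = sorted(values[start:start + M])
        f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
        f.write("\n".join(map(str, run)))
        f.close()
        run_files.append(f.name)
    return run_files

def merge_runs(run_files):
    # Stage 2: N-way merge of the sorted runs (here each run is re-read in full;
    # a real DBMS keeps only one buffer block per run in memory).
    runs = []
    for name in run_files:
        with open(name) as f:
            runs.append([int(line) for line in f if line.strip()])
        os.unlink(name)
    return list(heapq.merge(*runs))

data = [27, 3, 14, 99, 1, 56, 8, 42, 7, 63]
runs = create_runs(data, M=3)       # 4 sorted runs of at most 3 records each
print(merge_runs(runs))             # the fully sorted relation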

Estimating cost for External Sort-Merge Method

The cost analysis of external sort-merge is performed using the above-


discussed stages in the algorithm:

o Assume br denotes the number of blocks containing records of relation r.
o In the first stage, it reads each block of the relation and writes them back, which takes a total of 2br block transfers.
o Initially, the number of runs is ⌈br/M⌉. Since the number of runs decreases by a factor of M-1 in each merge pass, the total number of merge passes needed is ⌈log M-1 (br/M)⌉.

Every pass read and write each block of the relation only once. But with two
exceptions:

o The final pass can give a sorted output without writing its result to the
disk
o There might be chances that some runs may not be read or written
during the pass.

Neglecting such small exceptions, the total number of block transfers for
external sorting comes out:

b r (2 ⌈log M-1 (b r /M)⌉ + 1)

We need to add the disk seek cost, because seeks are needed while reading and writing data for each run. If in Stage 2, i.e., the merge phase, each run is allocated bb buffer blocks, i.e., each run reads bb data at a time, then each merge needs ⌈br/bb⌉ seeks for reading the data. The output is written sequentially, so if it is on the same disk as the input runs, the head will need to move between the writes of consecutive blocks. Therefore, we add a total of 2⌈br/bb⌉ seeks for each merge pass, and the total number of seeks comes out to:

2 ⌈b r /M⌉ + ⌈b r /b b ⌉ (2 ⌈log M-1 (b r /M)⌉ − 1)

Thus, we need to calculate the total number of disk seeks for analyzing the
cost of the External merge-sort algorithm.
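
These formulas can be evaluated with a small helper for concrete numbers. The example values of br, M and bb below are arbitrary and only chosen for illustration; the function simply plugs them into the two expressions above.

from math import ceil, log

def external_sort_cost(br, M, bb):
    # number of merge passes: ceil( log_(M-1) ( ceil(br/M) ) )
    passes = ceil(log(ceil(br / M), M - 1))
    transfers = br * (2 * passes + 1)                             # block transfers
    seeks = 2 * ceil(br / M) + ceil(br / bb) * (2 * passes - 1)   # disk seeks
    return transfers, seeks

# Example: a relation of 10,000 blocks, 100 memory blocks, 2 buffer blocks per run.
print(external_sort_cost(br=10_000, M=100, bb=2))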

Example of External Merge-sort Algorithm

Let's understand the working of the external merge-sort algorithm and also
analyze the cost of the external sorting with the help of an example.

Suppose that for a relation R, we are performing the external sort-merge. In


this, assume that only one block can hold one tuple, and the memory can
hold at most three blocks. So, while processing Stage 2, i.e., the merge
stage, it will use two blocks as input and one block for output.
Temporal Database concept: -
Temporal database is a database with built-in support for handling data
involving time.

A temporal database is generally understood as a database capable of


supporting storage and reasoning of time-based data. For example, medical
applications may be able to benefit from temporal database support — a
record of a patient's medical history has little value unless the test results,
e.g. the temperatures, are associated to the times at which they are valid,
since we may wish to reason about the periods in time in which the patient's temperature changed.

Different Forms of Temporal Databases: -

The two different notions of time - valid time and transaction time - allow
the distinction of different forms of temporal databases. A historical
database stores data with respect to valid time, a rollback database stores
data with respect to transaction time. A bitemporal database stores data
with respect to both valid time and transaction time.

Commercial DBMS are said to store only a single state of the real world,
usually the most recent state. Such databases usually are called snapshot
databases. A snapshot database in the context of valid time and transaction
time is depicted in the following picture:
On the other hand, a bitemporal DBMS such as TimeDB stores the history
of data with respect to both valid time and transaction time. Note that the
history of when data was stored in the database (transaction time) is
limited to past and present database states, since it is managed by the
system directly which does not know anything about future states.

A table in the bitemporal relational DBMS TimeDB may either be a


snapshot table (storing only current data), a valid-time table (storing when
the data is valid wrt. the real world), a transaction-time table (storing when
the data was recorded in the database) or a bitemporal table (storing both
valid time and transaction time). An extended version of SQL allows specifying which kind of table is needed when the table is created. Existing tables may also be altered (schema versioning). Additionally, it supports temporal queries, temporal modification statements and temporal constraints.

The states stored in a bitemporal database are sketched in the picture


below. Of course, a temporal DBMS such as TimeDB does not store each
database state separately as depicted in the picture below. It stores valid
time and/or transaction time for each tuple, as described above.
Three types of time are available: -

1. Valid Time
2. Transaction Time
3. Bitemporal Time

Valid Time: - Valid time is a time period during which a fact is true in the
real world.

Use-

• Financial Application-eg. History of stock market share price


• Reservation system-eg. When was flight booked
• Medical system-eg. Patient record
• Computer application-eg. History of file backup

Transaction Time: - It is the time period during which a fact stored in the database was known.

Use-

• To rollback database.

Name Department Salary Transaction Time


James PHP 10000 2016
Ben SEO 20000 2016
Eric HR 15000 2016

Bitemporal Time: - Bitemporal data combines both valid time and


transaction time. It stores data with respect of both valid time and
transaction time.

More preferable.

Name Department Salary Transaction Time Valid start time Valid end time
James PHP 10000 2016 2005 2006
Ben SEO 20000 2016 2000 2002
Eric HR 15000 2016 2003 2007
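
The idea of keeping both time dimensions per fact can be sketched with plain Python records. The field names below are assumptions made for the illustration and do not correspond to TimeDB's actual SQL syntax.

# Each fact carries a transaction time and a valid-time interval.
bitemporal = [
    {"name": "James", "department": "PHP", "salary": 10000,
     "tx_time": 2016, "valid_from": 2005, "valid_to": 2006},
    {"name": "Ben", "department": "SEO", "salary": 20000,
     "tx_time": 2016, "valid_from": 2000, "valid_to": 2002},
]

def valid_at(rows, year):
    # Return the facts that were true in the real world during the given year.
    return [r for r in rows if r["valid_from"] <= year <= r["valid_to"]]

print(valid_at(bitemporal, 2005))   # only James's PHP record is valid in 2005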
Multimedia Database: -
Multimedia database is the collection of interrelated multimedia data that
includes text, graphics (sketches, drawings), images, animations, video,
audio etc and have vast amounts of multisource multimedia data. The
framework that manages different types of multimedia data which can be
stored, delivered and utilized in different ways is known as multimedia
database management system. There are three classes of the multimedia
database which includes static media, dynamic media and dimensional
media.

Contents of the Multimedia Database:-


The multimedia database stored the multimedia data and information
related to it. This is given in detail as follows −
Media data-
This is the multimedia data that is stored in the database such as images,
videos, audios, animation etc.
Media format data-
The Media format data contains the formatting information related to the
media data such as sampling rate, frame rate, encoding scheme etc.
Media keyword data-
This contains the keyword data related to the media in the database. For an
image the keyword data can be date and time of the image, description of
the image etc.
Media feature data-
The media feature data describes the features of the media data. For an image, the feature data can be the colors of the image, the textures in the image, etc.
There are still many challenges to multimedia databases, some of which
are :
1. Modelling –
Working in this area can improve database versus information
retrieval techniques thus, documents constitute a specialized area
and deserve special consideration.
2. Design –
The conceptual, logical and physical design of multimedia
databases has not yet been addressed fully as performance and
tuning issues at each level are far more complex as they consist of
a variety of formats like JPEG, GIF, PNG, MPEG which is not
easy to convert from one form to another.
3. Storage –
Storage of multimedia database on any standard disk presents the
problem of representation, compression, mapping to device
hierarchies, archiving and buffering during input-output
operation. In DBMS, a ”BLOB”(Binary Large Object) facility
allows untyped bitmaps to be stored and retrieved.
4. Performance –
For an application involving video playback or audio-video
synchronization, physical limitations dominate. The use of
parallel processing may alleviate some problems but such
techniques are not yet fully developed. Apart from this
multimedia database consume a lot of processing time as well as
bandwidth.
5. Queries and retrieval –
For multimedia data like images, video, audio accessing data
through query opens up many issues like efficient query
formulation, query execution and optimization which need to be
worked upon.

Areas where multimedia database is applied are: -

• Documents and record management: -


Industries and businesses that keep detailed records and variety
of documents. Example: Insurance claim record.

• Knowledge dissemination: -
Multimedia database is a very effective tool for knowledge
dissemination in terms of providing several resources. Example:
Electronic books.

• Education and training: -


Computer-aided learning materials can be designed using
multimedia sources which are nowadays very popular sources of
learning. Example: Digital libraries.

• Marketing, advertising, retailing, entertainment and travel.


Example: a virtual tour of cities.

• Real-time control and monitoring: -


Coupled with active database technology, multimedia
presentation of information can be very effective means for
monitoring and controlling complex tasks Example:
Manufacturing operation control.

Data Mining: -

• Data mining is the process of extracting the useful information stored


in the large database.
• It is the extraction of hidden predictive information.
• Data Mining is the practice of automatically searching the large
stores of data to discover patterns.
• Data mining is a powerful new technology with great potential that helps organizations focus on the most important information in their data warehouses.
• It uses mathematical algorithms to segment the data and evaluates
the probability of future events.
• Data mining is a powerful tool used to retrieve the useful information
from available data warehouses.
• Data mining can be applied to relational databases, object-oriented
databases, data warehouses, structured-unstructured databases etc.
• Data mining is also known as Knowledge Discovery in
Databases (KDD).

Different steps of KDD as per the above diagram are:


1. Data cleaning: removes irrelevant data from the database.
2. Data integration: The heterogeneous data sources are merged into a
single data source.
3. Data selection: retrieves the relevant data to the analysis process from
the database.
4. Data transformation: The selected data is transformed in forms which
are suitable for data mining.
5. Data mining: The various techniques are applied to extract the data
patterns.
6. Pattern evaluation: evaluates different data patterns.
7. Knowledge representation: This is the final step of KDD which represents
the knowledge.

Types of Data Mining: -

Data mining can be performed on the following types of data:


Relational Database:

A relational database is a collection of multiple data sets formally organized


by tables, records, and columns from which data can be accessed in various
ways without having to recognize the database tables. Tables convey and
share information, which facilitates data searchability, reporting, and
organization.

Data warehouses:

A Data Warehouse is the technology that collects the data from various
sources within the organization to provide meaningful business insights.
The huge amount of data comes from multiple places such as Marketing
and Finance. The extracted data is utilized for analytical purposes and helps
in decision- making for a business organization. The data warehouse is
designed for the analysis of data rather than transaction processing.

Data Repositories:

The Data Repository generally refers to a destination for data storage.


However, many IT professionals utilize the term more clearly to refer to a
specific kind of setup within an IT structure. For example, a group of
databases, where an organization has kept various kinds of information.

Object-Relational Database:

A combination of an object-oriented database model and relational


database model is called an object-relational model. It supports Classes,
Objects, Inheritance, etc.

One of the primary objectives of the Object-relational data model is to close


the gap between the Relational database and the object-oriented model
practices frequently utilized in many programming languages, for example,
C++, Java, C#, and so on.

Transactional Database:

A transactional database refers to a database management system (DBMS)


that has the potential to undo a database transaction if it is not performed
appropriately. Even though this was a unique capability a very long while
back, today, most of the relational database systems support transactional
database activities.

Advantages of Data Mining: -

o The Data Mining technique enables organizations to obtain


knowledge-based data.
o Data mining enables organizations to make lucrative modifications in
operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It Facilitates the automated discovery of hidden patterns as well as
the prediction of trends and behaviors.
o It can be induced in the new system as well as the existing platforms.
o It is a quick process that makes it easy for new users to analyze
enormous amounts of data in a short time.

Disadvantages of Data Mining: -

o There is a probability that the organizations may sell useful data of


customers to other organizations for money. As per the report,
American Express has sold credit card purchases of their customers
to other organizations.
o Many data mining analytics tools are difficult to operate and need advanced training to work with.
o Different data mining instruments operate in distinct ways due to the
different algorithms used in their design. Therefore, the selection of
the right data mining tools is a very challenging task.
o The data mining techniques are not 100% accurate, which may lead to serious consequences in certain conditions.

Data Mining Applications: -


Data Mining is primarily used by organizations with intense consumer
demands- Retail, Communication, Financial, marketing company,
determine price, consumer preferences, product positioning, and impact on
sales, customer satisfaction, and corporate profits. Data mining enables a
retailer to use point-of-sale records of customer purchases to develop
products and promotions that help the organization to attract the customer.

These are the following areas where data mining is widely used:

Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health


system. It uses data and analytics for better insights and to identify best
practices that will enhance health care services and reduce costs. Analysts
use data mining approaches such as Machine learning, Multi-dimensional
database, Data visualization, Soft computing, and statistics. Data Mining
can be used to forecast patients in each category. The procedures ensure
that the patients get intensive care at the right place and at the right time.
Data mining also enables healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis:


Market basket analysis is a modeling method based on a hypothesis. If you
buy a specific group of products, then you are more likely to buy another
group of products. This technique may enable the retailer to understand the
purchase behavior of a buyer. This data may assist the retailer in
understanding the requirements of the buyer and altering the store's layout
accordingly. Using a different analytical comparison of results between
various stores, between customers in different demographic groups can be
done.

Data mining in Education:

Education data mining is a newly emerging field, concerned with


developing techniques that explore knowledge from the data generated
from educational Environments. EDM objectives are recognized as
affirming student's future learning behavior, studying the impact of
educational support, and promoting learning science. An organization can
use data mining to make precise decisions and also to predict the results of
the student. With the results, the institution can concentrate on what to
teach and how to teach.

Data Mining in Manufacturing Engineering:

Knowledge is the best asset possessed by a manufacturing company. Data


mining tools can be beneficial to find patterns in a complex manufacturing
process. Data mining can be used in system-level designing to obtain the
relationships between product architecture, product portfolio, and data
needs of the customers. It can also be used to forecast the product
development period, cost, and expectations among the other tasks.

Data Mining in CRM (Customer Relationship Management):

Customer Relationship Management (CRM) is all about obtaining and


holding Customers, also enhancing customer loyalty and implementing
customer-oriented strategies. To get a decent relationship with the
customer, a business organization needs to collect data and analyze the
data. With data mining technologies, the collected data can be used for
analytics.

Data Mining in Fraud detection:


Billions of dollars are lost to the action of frauds. Traditional methods of
fraud detection are a little bit time consuming and sophisticated. Data
mining provides meaningful patterns and turning data into information. An
ideal fraud detection system should protect the data of all the users.
Supervised methods consist of a collection of sample records, and these
records are classified as fraudulent or non-fraudulent. A model is
constructed using this data, and the technique is made to identify whether
the document is fraudulent or not.

Data Mining in Lie Detection:

Apprehending a criminal is not a big deal, but bringing out the truth from
him is a very challenging task. Law enforcement may use data mining
techniques to investigate offenses, monitor suspected terrorist
communications, etc. This technique includes text mining also, and it seeks
meaningful patterns in data, which is usually unstructured text. The
information collected from the previous investigations is compared, and a
model for lie detection is constructed.

Data Mining Financial Banking:

The Digitalization of the banking system is supposed to generate an


enormous amount of data with every new transaction. The data mining
technique can help bankers by solving business-related problems in
banking and finance by identifying trends, casualties, and correlations in
business information and market costs that are not instantly evident to
managers or executives because the data volume is too large or are
produced too rapidly on the screen by experts. The manager may find these
data for better targeting, acquiring, retaining, segmenting, and maintaining profitable customers.

Challenges of Implementation in Data mining: -

Although data mining is very powerful, it faces many challenges during its
execution. Various challenges could be related to performance, data,
methods, and techniques, etc. The process of data mining becomes effective
when the challenges or problems are correctly recognized and adequately
resolved.
Incomplete and noisy data:

Data mining is the process of extracting useful data from large volumes of data. Data in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will often be inaccurate or unreliable. These problems may occur due to errors of the measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees put the information into their system. A person may make a digit mistake when entering a phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could even get changed due to human or system errors. All these consequences (noisy and incomplete data) make data mining challenging.
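
As a rough illustration, a minimal Python sketch (the 10-digit rule and the sample values are assumptions made up for this example) can flag incomplete or noisy phone-number entries before they ever reach the mining stage:

```python
import re

def clean_phone(raw):
    """Return a normalized 10-digit phone number, or None if the value
    is missing or malformed (incomplete / noisy data)."""
    if raw is None or str(raw).strip() == "":
        return None                       # incomplete record
    digits = re.sub(r"\D", "", str(raw))  # keep digits only
    return digits if len(digits) == 10 else None  # noisy record otherwise

records = ["(555) 123-4567", "555-12-4567", "", None, "5551234567"]
cleaned = [clean_phone(r) for r in records]
print(cleaned)   # ['5551234567', None, None, None, '5551234567']
```
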
Data Distribution:

Real-world data is usually stored on various platforms in a distributed computing environment. It might be in a database, in individual systems, or even on the internet. Practically, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it may not be feasible to store all the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.

Complex Data:

Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images, complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful information from them is a tough task. Most of the time, new technologies, new tools, and new methodologies have to be developed or refined to obtain specific information.

Performance:

The data mining system's performance relies primarily on the efficiency of the algorithms and techniques used. If the designed algorithms and techniques are not up to the mark, then the efficiency of the data mining process will be adversely affected.

Data Privacy and Security:

Data mining can lead to serious issues in terms of data security, governance, and privacy. For example, if a retailer analyzes the details of purchased items, it reveals data about the buying habits and preferences of the customers without their permission.

Data Visualization:

In data mining, data visualization is a very important process because it is the primary method that presents the output to the user in an understandable way. The extracted data should convey the exact meaning of what it intends to express. But many times, representing the information to the end user in a precise and easy way is difficult. Since both the input data and the output information can be complicated, very efficient and successful data visualization processes need to be implemented to make data mining successful.

Data Mining Techniques: -

Data mining includes the utilization of refined data analysis tools to find
previously unknown, valid patterns and relationships in huge data sets.
These tools can incorporate statistical models, machine learning
techniques, and mathematical algorithms, such as neural networks or
decision trees. Thus, data mining incorporates analysis and prediction.

Depending on various methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process and draw conclusions from huge amounts of data. But what are the methods they use to make it happen?

In recent data mining projects, various major data mining techniques have
been developed and used, including association, classification, clustering,
prediction, sequential patterns, and regression.
1. Classification:

This technique is used to obtain important and relevant information about data and metadata. It helps to classify data into different classes.
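
As a rough, hedged illustration of the idea (plain Python with invented customer records; real projects would normally use a data mining library), a minimal nearest-neighbour classifier that assigns a class label to a new record based on labelled examples might look like this:

```python
# Minimal 1-nearest-neighbour classification sketch (illustrative data).
# Each record is (annual_spend, visits_per_month) with a known class label.
training_data = [
    ((1200.0, 10), "loyal"),
    ((150.0,   1), "occasional"),
    ((900.0,   7), "loyal"),
    ((80.0,    2), "occasional"),
]

def distance(a, b):
    """Euclidean distance between two numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(record):
    """Return the class label of the closest training record."""
    _, label = min(training_data, key=lambda item: distance(item[0], record))
    return label

print(classify((1000.0, 8)))  # -> 'loyal'
print(classify((100.0, 1)))   # -> 'occasional'
```
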

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled. For example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved. For example, object-oriented databases, transactional databases, relational databases, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be extensive frameworks offering several data mining functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented approaches, etc.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
2. Clustering:

Clustering is the division of information into groups of connected objects. Describing the data by a few clusters loses certain fine details but achieves simplification: it models data by its clusters. Data modeling puts clustering in a historical perspective rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.

In other words, we can say that clustering analysis is a data mining technique to identify similar data. This technique helps to recognize the differences and similarities between the data. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
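
A small, self-contained sketch of the idea (plain Python with made-up spending values; production work would typically use a library implementation of k-means) that groups customers into two segments:

```python
# Minimal k-means clustering sketch on 1-D data (illustrative values).
def kmeans(values, k=2, iterations=10):
    # Start with the first k values as initial cluster centres.
    centres = values[:k]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centre.
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Recompute each centre as the mean of its cluster.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

spend = [120, 150, 130, 900, 950, 880, 140, 910]
centres, clusters = kmeans(spend, k=2)
print(centres)   # roughly [135.0, 910.0] -> two customer segments
print(clusters)  # [[120, 150, 130, 140], [900, 950, 880, 910]]
```
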

3. Regression:

Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the value of a particular variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
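
A minimal least-squares sketch (plain Python; the advertising-spend and sales numbers are invented for illustration) shows how the relationship between two variables can be estimated and then used for projection:

```python
# Minimal least-squares linear regression sketch (illustrative data).
# x: advertising spend, y: observed sales (made-up numbers).
x = [10, 20, 30, 40, 50]
y = [25, 45, 65, 85, 105]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope and intercept of the best-fit line y = a + b*x.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

print(f"sales = {a:.1f} + {b:.1f} * spend")  # sales = 5.0 + 2.0 * spend
print(a + b * 60)                            # predicted sales for spend = 60 -> 125.0
```
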

4. Association Rules:

This data mining technique helps to discover a link between two or more
items. It finds a hidden pattern in the data set.

Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to help discover sales correlations in transactional data or in medical data sets.

The way the algorithm works is that you have various data, for example, a list of grocery items that you have been buying for the last six months. It calculates the percentage of items being purchased together.

These are the three major measurement techniques (a small computational sketch follows the list):

o Support:
This measure shows how often items A and B are purchased together, compared with the entire data set.
Support(A → B) = (transactions containing both A and B) / (entire dataset)
o Confidence:
This measure shows how often item B is purchased when item A is purchased as well.
Confidence(A → B) = (transactions containing both A and B) / (transactions containing A)
o Lift:
This measure compares the confidence with how often item B is purchased overall, i.e., the confidence divided by the overall frequency of item B.
Lift(A → B) = Confidence(A → B) / ((transactions containing B) / (entire dataset))
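
The following plain-Python sketch (using an invented basket data set) shows how the three measures are computed for a single rule, say bread → milk:

```python
# Support / confidence / lift sketch for a single rule A -> B (toy basket data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

A, B = "bread", "milk"
n = len(transactions)
count_A  = sum(1 for t in transactions if A in t)
count_B  = sum(1 for t in transactions if B in t)
count_AB = sum(1 for t in transactions if A in t and B in t)

support    = count_AB / n                # (A and B) / entire dataset
confidence = count_AB / count_A          # (A and B) / A
lift       = confidence / (count_B / n)  # confidence / frequency of B

print(support, confidence, lift)   # 0.6 0.75 0.9375
```
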

5. Outlier Detection:

This data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior. This technique can be used in various domains, such as intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world datasets contain outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas like network intrusion identification, credit or debit card fraud detection, detecting outliers in wireless sensor network data, etc.
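
A simple z-score-based sketch (plain Python with invented transaction amounts; real systems use far more sophisticated methods) that flags a value diverging too much from the rest of the data:

```python
# Minimal z-score outlier detection sketch (illustrative transaction amounts).
amounts = [52, 48, 50, 47, 53, 49, 51, 500]   # 500 is the suspicious value

n = len(amounts)
mean = sum(amounts) / n
std = (sum((a - mean) ** 2 for a in amounts) / n) ** 0.5  # population std dev

# Flag values that lie more than 2 standard deviations from the mean.
outliers = [a for a in amounts if abs(a - mean) / std > 2]
print(outliers)   # [500]
```
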

6. Sequential Patterns:

The sequential pattern is a data mining technique specialized for evaluating sequential data in order to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns in transaction data over a period of time.
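
As a rough sketch of the underlying idea (plain Python with toy purchase sequences), one can count how often one item is followed by another across customers' ordered baskets:

```python
# Minimal sketch: count how often item X is followed by item Y in customer
# purchase sequences (toy data), to surface frequent sequential patterns.
from collections import Counter
from itertools import combinations

sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "mouse"],
    ["phone", "case"],
]

pair_counts = Counter()
for seq in sequences:
    # combinations() preserves order of appearance, so (x, y) means
    # "x was bought before y" within one customer's sequence.
    pair_counts.update(combinations(seq, 2))

print(pair_counts.most_common(3))
# [(('phone', 'case'), 2), (('phone', 'charger'), 2), (('case', 'charger'), 1)]
```
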

7. Prediction:

Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.

Data Warehouse: -

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing. It includes historical data derived from transaction data from single and multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.

It is not used for daily operations and transaction processing but used for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store
of information in support of management's decisions."


A Data Warehouse is built on relational database management system (RDBMS) technology but, unlike an operational database, it is constructed to meet analytical rather than transaction processing requirements. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented to satisfy decision-making requests. It is a group of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. So, data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their information to make strategic decisions.

A Data Warehouse environment contains an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that handle the process of gathering information and delivering it to business users.


Characteristics of Data Warehouse

Subject-Oriented: -

A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the global organization's ongoing operations. This is done by excluding data that are not useful concerning the subject and including all data needed by the users to understand the subject.
Integrated: -

A data warehouse integrates various heterogeneous data sources like RDBMSs, flat files, and online transaction records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming conventions, attribute types, etc., among different data sources.
Time-Variant: -

Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older periods from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.

Non-Volatile: -

The data warehouse is a physically separate data store, which is transformed from the source operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed. It usually requires only two procedures in data accessing: the initial loading of data and access to data. Therefore, the DW does not require transaction processing, recovery, and concurrency capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.
History of Data Warehouse: -

The idea of data warehousing dates to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy established the "Business Data Warehouse."

In essence, the data warehousing idea was planned to support an architectural model for the flow of information from the operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it.

In the absence of a data warehousing architecture, a vast amount of space was required to support multiple decision support environments. In large corporations, it was common for various decision support environments to operate independently.

Goals of Data Warehousing: -

o To help reporting as well as analysis


o Maintain the organization's historical information
o Be the foundation for decision making.
Need for Data Warehouse: -

Data Warehouse is needed for the following reasons:

1) Business User: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.

2) Store historical data: A data warehouse is required to store time-variable data from the past. This data is then used for various purposes.

3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse. So, the data warehouse contributes to making strategic decisions.

4) For data consistency and quality: By bringing data from different sources to a common place, the user can effectively bring uniformity and consistency to the data.
5) High response time: Data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant
degree of flexibility and quick response time.

Benefits of Data Warehouse: -

1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze a large amount of historical data.

Data Warehouse Architecture: -

A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data warehouse is different, but all are characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.

Data Warehouse applications are designed to support the user's ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.

Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are de-normalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the needs of the user for sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the elements of an organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic


o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts

Data Warehouse Architecture: Basic


Operational System

An operational system is a term used in data warehousing to refer to a system that is used to process the day-to-day transactions of an organization.

Flat Files

A flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.

Lightly and highly summarized data

This area of the data warehouse saves all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new information is loaded into the warehouse.

End-User access Tools

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These customers interact with the warehouse using end-client access tools.

The examples of some of the end-user access tools can be:

o Reporting and Query Tools


o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area

We must clean and process the operational data before putting it into the warehouse.

We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation for operational data coming from multiple source systems, especially for enterprise data warehouses where all relevant data of an enterprise is consolidated. The Data Warehouse staging area is a temporary location where records from the source systems are copied.

Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups within our organization.

We can do this by adding data marts. A data mart is a segment of a data warehouse that provides information for reporting and analysis on a section, unit, department, or operation in the company, e.g., sales, payroll, production, etc.

The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical
data for purchases and sales or mine historical information to make
predictions about customer behavior.
Properties of Data Warehouse Architectures

The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.

2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume, which has to be managed and processed, and the number of users' requirements, which have to be met, progressively increase.

3. Extensibility: The architecture should be able to accommodate new operations and technologies without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.

5. Administerability: Data warehouse management should not be complicated.

Types of Data Warehouse Architectures: -

Single-Tier Architecture: -

Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The figure shows that the only layer physically available is the source layer. In this method, data warehouses are virtual. This means that the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.

The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are issued against operational data after the middleware interprets them. In this way, queries affect transactional workloads.

Two-Tier Architecture: -

The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse system, as shown in the figure.
Although it is typically called a two-layer architecture to highlight the separation between physically available sources and data warehouses, it in fact consists of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from an information system outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
3. Data warehouse layer: Information is saved to one logically centralized individual repository: a data warehouse. The data warehouse can be directly accessed, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-data repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.

Three-Tier Architecture: -

The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also directly used to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so as to benefit from cleaning and integration.

This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used by the extra redundant reconciled layer. It also makes the analytical tools a little further away from being real-time.
Different Layers of Data Warehouse Architecture

There are four different layers which will always be present in a Data Warehouse Architecture.

1. Data Source Layer

• The Data Source Layer is the layer where the data from the source is encountered and subsequently sent to the other layers for the desired operations.
• The data can be of any type.
• The source data can be a database, a spreadsheet, or any other kind of text file.
• The source data can be of any format. We cannot expect to get data with the same format, considering the sources are vastly different.
• In real life, some examples of source data can be:
• Log files of each specific application or job, or entries of employees in a company.
• Survey data, stock exchange data, etc.
• Web browser data and many more.

2. Data Staging Layer


The following steps take place in Data Staging Layer.

Step #1: Data Extraction

The data received by the Source Layer is fed into the Staging Layer, where the first process that takes place with the acquired data is extraction.

Step #2: Landing Database

• The extracted data is temporarily stored in a landing database.
• The landing database holds the data once it has been extracted from the source.

Step #3: Staging Area

• The data in the landing database is taken, and several quality checks and staging operations are performed on it in the staging area.
• The structure and schema are also identified, and adjustments are made to data that is unordered, thus trying to bring about a commonality among the data that has been acquired.
• Having a place or set-up for the data just before transformation and changes is an added advantage that makes the staging process very important.
• It makes data processing easier.

Step #4: ETL

• ETL stands for Extraction, Transformation, and Load.
• ETL tools are used for the integration and processing of data, where logic is applied to rather raw but somewhat ordered data.
• This data is extracted as per the analytical nature that is required and transformed to data that is deemed fit to be stored in the Data Warehouse.
• After transformation, the data, or rather the information, is finally loaded into the data warehouse.
• Some examples of ETL tools are Informatica, SSIS, etc. (a minimal sketch of the extract-transform-load steps follows below).
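
A minimal ETL sketch (Python; the column names, table name, and sample rows are assumptions made up for illustration) that extracts records from CSV data, transforms them, and loads them into a SQLite table acting as the warehouse:

```python
# Minimal ETL sketch: extract from CSV data, transform it, and load it into a
# SQLite table acting as the warehouse. Names and values are illustrative.
import csv
import io
import sqlite3

# Extract: read raw operational records (inlined here; normally a file export).
raw_csv = """date,region,amount
2024-01-01,north,100.50
2024-01-01,south,
2024-01-02,north,80.25
"""
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: drop incomplete rows and standardize region names and amounts.
cleaned = [
    (r["date"], r["region"].strip().upper(), float(r["amount"]))
    for r in rows
    if r["amount"] not in ("", None)
]

# Load: write the transformed records into the warehouse table.
conn = sqlite3.connect(":memory:")   # a file path would be used in practice
conn.execute(
    "CREATE TABLE IF NOT EXISTS sales_fact (sale_date TEXT, region TEXT, amount REAL)"
)
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", cleaned)
conn.commit()
print(conn.execute("SELECT * FROM sales_fact").fetchall())
# [('2024-01-01', 'NORTH', 100.5), ('2024-01-02', 'NORTH', 80.25)]
conn.close()
```
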

3. Data Storage Layer

• The processed data is stored in the Data Warehouse.
• This data is cleansed, transformed, and prepared with a definite structure, and thus provides opportunities for users across the business to use the data as required.
• Depending upon the approach of the architecture, the data will be stored in the Data Warehouse as well as in Data Marts. Data Marts will be discussed in the later stages.
• Some architectures also include an Operational Data Store.

4. Data Presentation Layer

• This is the layer where the users interact with the data stored in the data warehouse.
• Queries and several tools are employed to get different types of information based on the data.
• The information reaches the user through the graphical representation of data.
• Reporting tools are used to get business data, and business logic is also applied to gather several kinds of information.
• Metadata information and system operations and performance are also maintained and viewed in this layer.
