DBMS Material
UNIT I: Introduction
Introduction
Information storage and retrieval has become very important in our day-to-day life. The old
era of manual record keeping is no longer used in most places. For example, booking an airline
ticket or depositing money in a bank is done through a database system. The database
system makes most of these operations automatic. A very good example of this is the billing
system used for the items purchased in a supermarket, which is driven by a
database application package. Inventory systems used in a drug store or in a manufacturing
industry are further examples of databases, and we can add many similar examples to this list.
Apart from these traditional database systems, more sophisticated database systems are used on
the Internet, where a large amount of information is stored and retrieved with efficient search
engines. For instance, http://www.google.com is a famous web site that enables users to search for
their favourite information on the net. In a database we can store anything from simple text data to very
complex data such as audio and video.
Database Management Systems (DBMS)
A database is a collection of related data stored in a standard format, designed to be shared by
multiple users. A database is defined as “A collection of interrelated data items that can be
processed by one or more application programs”.
A database can also be defined as “A collection of persistent data that is used by the
application systems of some given enterprise”. An enterprise can be a single individual (with a
small personal database), or a complete corporation or similar large body (with a large shared
database), or anything in between.
Example: A Bank, a Hospital, a University, a Manufacturing company
Data
Data is the raw material from which useful information is derived. The word data is the plural of
datum, though data is commonly used in both singular and plural forms. It is defined as raw facts or
observations. It takes a variety of forms, including numeric data, text, voice and images. Data is
a collection of facts, which is unorganized but can be organized into useful information.
The terms data and information occur frequently in our daily life and are often interchanged.
Example: Weights, prices, costs, number of items sold, etc.
Information
Information is data that has been processed in such a way as to increase the knowledge of the person who
uses it. The terms data and information are closely related: data are the raw material resources that are
processed into finished information products.
In practice, the database today may contain either data or information.
Data Processing
The process of converting facts into meaningful information is known as data processing. Data
processing is also known as information processing.
Metadata
Metadata are data that describe the properties or characteristics of other data. Data becomes useful only
when placed in some context, and the primary mechanism for providing context for data is metadata.
Some of these properties include data definitions, data structures and rules or constraints.
Metadata describes the properties of data but does not include the data itself.
Database System Applications
Databases are widely used. Here are some representative applications:
1. Banking: For customer information, accounts, and loans, and banking transactions.
2. Airlines: For reservations and schedule information. Airlines were among the first to use
databases in a geographically distributed manner - terminals situated around the world accessed
the central database system through phone lines and other data networks.
3. Universities: For student information, course registrations, and grades.
4. Credit card transactions: For purchases on credit cards and generation of monthly statements.
5. Telecommunication: For keeping records of calls made, generating monthly bills,
maintaining balances on prepaid calling cards, and storing information about the
communication networks.
6. Finance: For storing information about holdings, sales, and purchases of financial
instruments such as stocks and bonds.
7. Sales: For customer, product, and purchase information.
8. Manufacturing: For management of supply chain and for tracking production of items in
factories, inventories of items in warehouses / stores, and orders for items.
9. Human resources: For information about employees, salaries, payroll taxes and benefits, and
for generation of paychecks.
File Systems Versus A DBMS (Characteristics)
In earlier days, databases were created directly on top of file systems. A file system has
many disadvantages:
1. There is not enough primary memory to process large data sets. If data is maintained on other storage
devices such as disks and tapes, bringing the relevant data into main memory increases the cost of
processing, and addressing very large data sets using 32-bit or 64-bit addressing becomes a problem.
2. Programs must be written to process each user request on the data stored in files; such programs are
complex in nature because of the large volume of data to be searched.
3. Data can become inconsistent, and providing concurrent access is complex.
4. File systems are not sufficiently flexible to enforce security policies in which different users
have permission to access different subsets of the data.
A DBMS is a piece of software designed to make the preceding tasks easier. By storing
data in a DBMS, rather than as a collection of operating system files, we can use the DBMS's
features to manage the data in a robust and efficient manner.
Advantages of DBMS
One of the main advantages of using a database management system is that the organization
can exert, via the DBA, centralized management and control over the data. The database administrator
is the focus of this centralized control.
The following are the major advantages of using a Database Management System (DBMS):
Data independence: Application programs should be as independent as possible from details of data
representation and storage. The DBMS can provide an abstract view of the data to insulate
application code from such details.
Efficient data access: A DBMS utilizes a variety of sophisticated techniques to store and retrieve
data efficiently. This feature is especially important if the data is stored on external storage devices.
Data integrity and security: The DBMS can enforce integrity constraints on the data. The DBMS
can enforce access controls that govern what data is visible to different classes of users.
Data administration: When several users share the data, centralizing the administration of data can
offer significant improvements. It can be used for organizing the data representation to minimize
redundancy and for fine-tuning the storage of the data to make retrieval efficient.
Concurrent access and crash recovery: A DBMS schedules concurrent accesses to the data in such a
manner that users can think of the data as being accessed by only one user at a time. Further, the
DBMS protects users from the effects of system failures.
Reduced application development time: Clearly, the DBMS supports many important functions
that are common to many applications accessing data stored in the DBMS.
Disadvantages of DBMS
The main disadvantage of a DBMS is overhead cost. The processing overhead
introduced by the DBMS to implement security, integrity, and sharing of the data causes a
degradation of response and throughput times. An additional cost is that of migration from a
traditionally separate application environment to an integrated one.
Even though centralization reduces duplication, the lack of duplication requires that the
database be adequately backed up so that in case of failure the data can be recovered.
Backup and recovery operations are complex in a DBMS environment, and even more so in a
concurrent multi-user database system. A database system requires a certain amount of
controlled redundancy and duplication to enable access to related data items.
Data Models
A data model is a collection of high-level data description constructs that hide many low-level
storage details. A DBMS allows a user to define the data to be stored in terms of a data
model. Most database management systems today are based on the relational data model.
A schema is a description of a particular collection of data, using the given data model. The
relational model of data is the most widely used model today.
Main concept: the relation, basically a table with rows and columns. Every relation has a schema,
which describes the columns, or fields.
Relational DBMSs include IBM's DB2, Informix, Oracle, Sybase, Microsoft's Access, FoxBase, Paradox,
Tandem and Teradata.
Categories of data models
Conceptual (high-level, semantic) data models: Provide concepts that are close to the way
many users perceive data (Also called entity-based or object-based data models).
Physical (low-level, internal) data models: Provide concepts that describe details of how data
is stored in the computer.
Implementation (representational) data models: Provide concepts that fall between the above
two.
1. Hierarchical models:
Advantages:
Hierarchical model is simple to construct and operate on.
Corresponds to a number of naturally hierarchical organized domains - e.g., assemblies
in manufacturing, personnel organization in companies.
The language is simple; it uses constructs like GET, GET UNIQUE, GET NEXT, GET
NEXT WITHIN PARENT, etc.
Disadvantages:
Navigational and procedural nature of processing.
Database is visualized as a linear arrangement of records.
Little scope for “query optimization”.
Only one-to-many relationships can be represented directly.
2. Network model:
Advantages:
Every relation has a schema, which describes the columns, or fields.
Student information in a university database may be stored in a relation with the
following schema
Students (sid: string, name: string, login: string, age: integer, gpa: real)
Conceptual schema:
The conceptual schema (also called the logical schema) describes the stored data in terms of the
data model of the DBMS.
In a relational DBMS, the conceptual schema describes all relations that are stored in
the database.
In our sample university database, these relations contain information about entities, such as
students and faculty, and about relationships, such as students’ enrollment in courses.
Students(sid: string, name: string, login: string, age: integer, gpa: real)
Faculty(fid: string, fname: string, salary: real)
Courses(cid: string, cname: string, credits: integer)
Rooms(rno: integer, address: string, capacity: integer)
Enrolled(sid: string, cid: string, grade: string)
Teaches(fid: string, cid: string)
The choice of relations, and the choice of fields for each relation, is not always obvious,
and the process of arriving at a good conceptual schema is called conceptual database
design.
Physical Schema:
It summarizes how the relations described in the conceptual schema are actually stored
on secondary storage devices such as disks and tapes.
It decides what file organizations to use to store the relations, and creates auxiliary data
structures, called indexes, to speed up data retrieval operations.
A sample physical schema for the university database is to store all relations as unsorted
files of records and to create indexes on the first column of the Students, Faculty and Courses
relations, the salary column of Faculty, and the capacity column of Rooms.
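In SQL, such a physical design decision could be expressed with CREATE INDEX statements; the index names below are only assumptions for illustration.
CREATE INDEX idx_students_sid ON Students (sid);
CREATE INDEX idx_faculty_salary ON Faculty (salary);
CREATE INDEX idx_rooms_capacity ON Rooms (capacity);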
External Schema:
This schema allows data access to be customized at the level of individual users or
groups of users.
A database has exactly one conceptual schema and one physical schema, but it may have
several external schemas.
An external schema is a collection of one or more views and relations from the conceptual
schema.
A view is conceptually a relation, but the records in a view are not stored in the DBMS.
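For instance, an external schema for the university database could expose course enrollments through a view; the following is only a sketch built on the sample relations listed above.
CREATE VIEW Courseinfo (cid, fname, enrollment) AS
SELECT C.cid, F.fname, COUNT(*)
FROM Courses C, Teaches T, Faculty F, Enrolled E
WHERE C.cid = T.cid AND T.fid = F.fid AND E.cid = C.cid
GROUP BY C.cid, F.fname;
A user who only needs enrollment figures can query Courseinfo without knowing the underlying relations, which is exactly what the external level provides.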
Data Independence
Application programs are insulated from changes in the way the data is structured and stored. Data
independence is achieved through use of the three levels of data abstraction.
Logical data independence: users can be shielded from changes in the logical structure of the data,
or changes in the choice of relations to be stored. This is the independence to change the conceptual
schema without having to change the external schemas and their application programs.
Physical data independence: the conceptual schema insulates users from changes in physical
storage details. This is the independence to change the internal schema without having to change the
conceptual schema.
Architecture of a DBMS
The functional components of a database system can be broadly divided into query processor
components and storage manager components. The query processor includes:
1. DML Compiler: It translates DML statements in a query language into low-level instructions that
the query evaluation engine understands.
2. Embedded DML Pre-compiler: It converts DML statements embedded in an application program
to normal procedure calls in the host language. The pre-compiler must interact with the DML
compiler to generate the appropriate code.
3. DDL Interpreter: It interprets DDL statements and records them in a set of tables containing
metadata.
4. Transaction Manager: Ensures that the database remains in a consistent (correct) state despite
system failures, and that concurrent transaction executions proceed without conflicting.
5. File Manager: Manages the allocation of space on disk storage and the data structures used to
represent information stored on disk.
6. Buffer Manager: Is responsible for fetching data from disk storage into main memory
and deciding what data to cache in memory.
Also some data structures are required as part of the physical system implementation:
1. Data Files: The data files store the database itself.
2. Data Dictionary: It stores metadata about the structure of the database, and it is used heavily.
3. Indices: They provide fast access to data items that hold particular values.
4. Statistical Data: It stores statistical information about the data in the database. This information
is used by the query processor to select efficient ways to execute a query.
Database Administrator (DBA)
The DBA is responsible for many critical tasks:
Design of the conceptual and physical schemas: The DBA is responsible for interacting with
the users of the system to understand what data is to be stored in the DBMS and how it is likely
to be used. Based on this knowledge, the DBA must design the conceptual schema (decide what
relations to store) and the physical schema (decide how to store them).
Security and authorization: The DBA is responsible for ensuring that unauthorized data
access is not permitted. In general, not everyone should be able to access all the data. In a
relational DBMS, users can be granted permission to access only certain views and relations.
Data availability and recovery from failures: The DBA must take steps to ensure that if the
system fails, users can continue to access as much of the uncorrupted data as possible.
Database tuning: The needs of users are likely to evolve with time. The DBA is responsible
for modifying the database, in particular the conceptual and physical schemas, to ensure adequate
performance as user requirements change.
Database Environment
A database management system (DBMS) is a collection of programs that enables users to create
and maintain a database. The DBMS is hence a general-purpose software system that facilitates the
processes of defining, constructing, manipulating, and sharing databases among various users and
applications.
Defining a database involves specifying the data types, structures, and constraints for the data to be
stored in the database.
Constructing the database is the process of storing the data itself on some storage medium that is
controlled by the DBMS.
Manipulating a database includes such functions as querying the database to retrieve specific data,
updating the database to reflect changes in the miniworld, and generating reports from the data.
Sharing a database allows multiple users and programs to access the database concurrently. Other
important functions provided by the DBMS include protecting the database and maintaining
it over a long period of time.
Protection includes both system protection against hardware or software malfunction (or
crashes), and security protection against unauthorized or malicious access. A typical large
database may have a life cycle of many years, so the DBMS must be able to maintain the
database system by allowing the system to evolve as requirements change over time. We can
call the database and DBMS software together a database system.
Database Architecture
Database architecture uses programming languages to design a particular type of software for
businesses or organizations. Database architecture focuses on the design, development,
implementation and maintenance of computer programs that store and organize information for
businesses, agencies and institutions.
The architecture of a DBMS can be seen as either single tier or multi-tier. The tiers are
classified as follows:
1-tier architecture, 2-tier architecture, 3-tier architecture, ..., n-tier architecture
1- tier architecture:
One-tier architecture involves putting all of the required components for a software application or
technology on a single server or platform.
2- tier architecture:
The two-tier architecture is based on the client-server model. Communication takes place
directly between the client and the server; there is no intermediary between them.
3-tier architecture:
A 3-tier architecture separates its tiers from each other based on the complexity of the users and
how they use the data present in the database. It is the most widely used architecture to design a
DBMS.
Centralized DBMS Architecture
Architectures for DBMSs have followed trends similar to those for general computer system
architectures. Earlier architectures used mainframe computers to provide the main processing for all
functions of the system, including user application programs and user interface programs, as well as
all the DBMS functionality.
As prices of hardware declined, most users replaced their terminals with personal computers
(PCs) and workstations. At first, database systems used these computers in the same way as
they had used display terminals, so that the DBMS itself was still a centralized DBMS in which
all the DBMS functionality, application program execution, and user interface processing were
carried out on one machine.
Gradually, DBMS systems started to exploit the available processing power at the user side, which
led to client/server DBMS architectures.
Client/Server Architecture:
The client/server architecture was developed to deal with computing environments in which a
large number of PCs, workstations, file servers, printers, database servers, Web servers, and
other equipment are connected via a network. The idea is to define specialized servers with
specific functionalities.
The resources provided by specialized servers can be accessed by many client machines. The
client machines provide the user with the appropriate interfaces to utilize these servers, as well as
with local processing power to run local applications. This concept can be carried over to software,
with specialized software - such as a DBMS or a CAD (computer-aided design) package - being stored
on specific server machines and made accessible to multiple clients.
Entity Relationship Model
Introduction
The entity-relationship (ER) data model allows us to describe the data involved in a real-world
enterprise in terms of objects and their relationships and is widely used to develop an initial database
design.
The ER model is important primarily for its role in database design. It provides useful
concepts that allow us to move from an informal description of what users want from their
database to a more detailed and precise description that can be implemented in a DBMS.
Even though the ER model can be mapped down to a physical database design, it is basically useful in
the design and communication of the logical (conceptual) database model.
Overview of Database Design
Our primary focus is the design of the database. The database design process can be divided
into six steps:
Requirements Analysis
The very first step in designing a database application is to understand what data is to be
stored in the database, what applications must be built on the database, and what operations must
be performed on the database. In other words, we must find out what the users want from the
database. This process involves discussions with user groups, a study of the current operating
environment and how it is expected to change, an analysis of any available documentation on
existing applications, and so on.
Schema Refinement
The fourth step in database design is to analyze the collection of relations (tables) in our
relational database schema to identify potential problems, and to refine it.
Physical Database Design
This step may simply involve building indexes on some tables and clustering some tables, or
it may involve a redesign of parts of the database schema obtained from the earlier design steps.
Attribute: An attribute describes a property associated with entities. Attribute will have a name and
a value for each entity.
Domain: A domain defines a set of permitted values for an attribute
Entity Relationship Model: An ERM is a theoretical and conceptual way of showing data
relationships in software development. It is a database modeling technique that generates an abstract
diagram or visual representation of a system's data that can be helpful in designing a relational
database. ER model allows us to describe the data involved in a real-world enterprise in terms of
objects and their relationships and is widely used to develop an initial database design.
Representation of Entities and Attributes
ENTITIES: Entities are represented by using rectangular boxes. These are named with the entity
name that they represent.
ATTRIBUTES: Attributes are the properties of entities. Attributes are represented by means of
ellipses. Every ellipse represents one attribute and is directly connected to its entity.
Types of attributes:
Simple attribute − Simple attributes are atomic values, which cannot be divided further.
Multivalued attribute − An attribute that can have more than one value; for example, a person can have
more than one phone number, email address, etc.
Binary Relationship: A relationship among 2 entity sets. Example: A professor teaches a course
and a course is taught by a professor.
Ternary Relationship: A relationship among 3 entity sets. Example: A professor teaches a course
in a particular semester.
Cardinality:
Cardinality defines the number of entities in one entity set that can be associated with
entities of another set via a relationship set. Cardinality ratios are categorized into four types:
1. One-to-One relationship: When only one instance of entities is associated with the
relationship, then the relationship is one-to-one relationship. Each entity in A is associated
with at most one entity in B and each entity in B is associated with at most one entity in A.
2. One-to-many relationship: When more than one instance of an entity is associated with a
relationship, then the relationship is one-to-many relationship. Each entity in A is associated
with zero or more entities in B and each entity in B is associated with at most one entity in A.
3. Many-to-one relationship: When more than one instance of entity is associated with the
relationship, then the relationship is many-to-one relationship. Each entity in A is associated
with at most one entity in B and each entity in B is associated with 0 (or) more entities in A.
4. Many-to-Many relationship: If more than one instance of an entity on the left and more than
one instance of an entity on the right can be associated with the relationship, then it depicts a
many-to-many relationship. Each entity in A is associated with 0 (or) more entities in B and
each entity in B is associated with 0 (or) more entities in A.
Relationship Set:
A set of relationships of similar type is called a relationship set. Like entities, a relationship too can
have attributes. These attributes are called descriptive attributes.
Participation Constraints:
Total Participation − If each entity in the entity set is involved in the relationship, then the
participation of the entity set is said to be total. Total participation is represented by double lines.
Partial participation − If not all entities of the entity set are involved in the relationship,
then such participation is said to be partial. Partial participation is represented by single
lines.
Example:
Consider a relationship set called Manages between the Employees and Departments entity
sets such that each department has at most one manager, although a single employee is allowed to
manage more than one department. The restriction that each department has at most one manager is
an example of a key constraint, and it implies that each Departments entity appears in at most one
Manages relationship in any allowable instance of Manages. This restriction is indicated in the ER
diagram of below Figure by using an arrow from Departments to Manages. Intuitively, the arrow
states that given a Departments entity, we can uniquely determine the Manages relationship in which
it appears.
Key Constraints for Ternary Relationships
If an entity set E has a key constraint in a relationship set R, each entity in an instance of E
appears in at most one relationship in (a corresponding instance of) R. To indicate a key constraint
on entity set E in relationship set R, we draw an arrow from E to R.
The figure below shows a ternary relationship with key constraints. Each employee works in at
most one department, and at a single location.
Weak Entities
Strong Entity set: If each entity in the entity set is distinguishable or it has a key then such an entity
set is known as strong entity set.
Weak Entity set: If each entity in the entity set is not distinguishable, i.e., it does not have a key,
then such an entity set is known as a weak entity set.
Here eno is a key, so it is represented by a solid underline. dname is a partial key: it cannot distinguish
the tuples in the Dependent entity set on its own, so dname is represented by a dashed underline.
A weak entity set is always in total participation with its identifying relationship. If the entity set is
weak then the relationship is also known as a weak relationship, since the dependent entities are no
longer needed when the owner leaves.
Ex: policy dependent details are not needed when the owner (employee) of that policy left or
fired from the company or expired. The detailed ER Diagram is as follows.
The cardinality between the owner entity set and the weak entity set in the weak (identifying)
relationship is 1 : m. A weak entity is uniquely identified by its partial key together with the key of the
owner entity set; all the tuples of the weak entity set are associated with tuples of the owner entity set.
Dependents is an example of a weak entity set. A weak entity can be identified uniquely only by
considering some of its attributes in conjunction with the primary key of another entity, which is
called the identifying owner.
The following restrictions must hold:
The owner entity set and the weak entity set must participate in a one-to-many relationship set
(one owner entity is associated with one or more weak entities, but each weak entity has a
single owner). This relationship set is called the identifying relationship set of the weak entity
set.
The weak entity set must have total participation in the identifying relationship set.
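When such a design is later mapped to tables, the weak entity set is usually given a primary key that combines its partial key with the owner's key, and its rows are dropped together with the owner. The sketch below uses assumed Employees and Dependents tables purely for illustration.
CREATE TABLE Employees (
  ssn CHAR(11) PRIMARY KEY,
  name VARCHAR(30));
CREATE TABLE Dependents (
  dname VARCHAR(30),      -- partial key of the weak entity
  ssn CHAR(11),           -- key of the owner entity set
  age INTEGER,
  PRIMARY KEY (dname, ssn),
  FOREIGN KEY (ssn) REFERENCES Employees (ssn) ON DELETE CASCADE);
The ON DELETE CASCADE clause reflects the total participation: when an owner employee is deleted, its dependent rows are removed as well.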
E-R Diagrams Implementation
Now we are in a position to write the ER diagram for the Company database which was
introduced in the beginning of this unit. Readers are strongly advised to follow the steps shown
in this unit to design an ER diagram for any chosen problem.
The underlined attributes are the primary keys, and DepName is the partial key of
Dependents. Also, DLocation may be treated as a multivalued attribute.
This step is relatively a simple one: simply apply the business rules and your common sense. So
we write the structural constraints for our example as follows:
Class Hierarchies
Classifying the entities in an entity set into subclass entities is known as a class hierarchy. For example,
we might want to classify the Employees entity set into the subclass entity sets Hourly-Emps and
Contract-Emps to distinguish the basis on which employees are paid. The class hierarchy
is illustrated as follows.
This class hierarchy illustrates the inheritance concept: the subclass ISA (read as
"is a") the superclass, indicating the "is a" relationship (inheritance). Therefore, the
attributes defined for an Hourly-Emps entity are the attributes of Hourly-Emps plus the attributes
of Employees (because a subclass inherits its superclass's properties). Likewise, the attributes defined
for a Contract-Emps entity are the attributes of Contract-Emps plus the attributes of Employees.
Example: Can Akbar be both an Hourly-Emps entity and a Contract-Emps entity? The answer
is no.
Another example: can Akbar be both a Contract-Emps entity and a Senior-Emps entity (among
them)?
The answer is yes. Thus, this is a specialisation hierarchy property. We denote this
by writing "Contract-Emps OVERLAPS Senior-Emps".
Sponsors and Monitors are really two distinct relationships, each with its own attributes.
Conceptual Database Design With The ER Model (ER Design Issues)
The following are the ER design issues:
1. Use of entity sets versus attributes
2. Use of entity sets versus relationship sets
3. Binary versus ternary relationship sets
4. Aggregation versus ternary relationships
1. Use of Entity Sets versus Attributes
Consider the relationship set (called Works_In2) shown in the figure below.
Intuitively, it records the interval during which an employee works for a department. Now
suppose that it is possible for an employee to work in a given department over more than one
period.
This possibility is ruled out by the ER diagram’s semantics. The problem is that we want
to record several values for the descriptive attributes for each instance of the Works_In2
relationship. (This situation is analogous to wanting to record several addresses for each
employee.) We can address this problem by introducing an entity set called, say, Duration, with
attributes from and to, as shown in Figure
There is at most one employee managing a department, but a given employee could manage several
departments; we store the starting date and discretionary budget for each manager- department pair.
This approach is natural if we assume that a manager receives a separate discretionary budget for
each department that he or she manages.
But what if the discretionary budget is a sum that covers all departments managed by that
employee? In this case each Manages2 relationship that involves a given employee will have
the same value in the dbudget field. In general such redundancy could be significant and could
cause a variety of problems. Another problem with this design is that it is misleading.
We can address these problems by associating dbudget with the appointment of the
employee as manager of a group of departments. In this approach, we model the appointment as
an entity set, say Mgr_Appts, and use a ternary relationship, say Manages3, to relate a manager,
an appointment, and a department. The details of an appointment (such as the discretionary
budget) are not repeated for each department that is included in the appointment, although
there is still one Manages3 relationship instance per such department. Further, note that each
department has at most one manager, as before, because of the key constraint. This approach is
illustrated in the figure below.
Dependents is a weak entity set, and each dependent entity is uniquely identified by taking pname
in conjunction with the policyid of a policy entity.
The first requirement suggests that we impose a key constraint on Policies with respect to
Covers, but this constraint has the unintended side effect that a policy can cover only one dependent.
The second requirement suggests that we impose a total participation constraint on Policies. This
solution is acceptable if each policy covers at least one dependent. The third requirement forces us
to introduce an identifying relationship that is binary (in our version of ER diagrams, although there
are versions in which this is not the case).
Even ignoring the third point above, the best way to model this situation is to use two binary
relationships, as shown in the figure below.
Consider the constraint that each sponsorship (of a project by a department) be monitored by at most
one employee. We cannot express this constraint in terms of the Sponsors2 relationship set. However, we
can express the constraint by drawing an arrow from the aggregated relationship Sponsors to the
relationship Monitors. Thus, the presence of such a constraint serves as another reason for using
aggregation rather than a ternary relationship set.
UNIT-II
RELATIONAL MODEL
Introduction
The Relational Model was proposed by E. F. Codd to model data in the form of relations or tables.
After designing the conceptual model of the database using an ER diagram, we need to convert the
conceptual model into the relational model, which can be implemented using any RDBMS
(Relational Database Management System) such as SQL Server, MySQL, etc.
The relational model is very simple and elegant: a database is a collection of one or more relations,
where each relation is a table with rows and columns. This simple tabular representation enables even
new users to understand the contents of a database, and it permits the use of simple, high-level
languages to query the data.
Relational Model
The relational model represents how data is stored in relational databases. A relational database
stores data in the form of relations (tables).
Consider a relation STUDENT with attributes ROLL_NO, NAME, ADDRESS, PHONE and
AGE as shown in table.
ROLL_NO  NAME    ADDRESS    PHONE       AGE
1        Nishma  Hyderabad  9455123451  28
2        Sai     Guntur     9652431843  27
3        Swetha  Nellore    9156253131  26
4        Raji    Ongole     9215635311  25
Attribute: Attributes are the properties that define a relation. Ex: ROLL_NO, NAME
Null values: A value which is not known or unavailable is called a NULL value. It is
represented by a blank space.
Cardinality: The number of tuples present in a relation is called its cardinality.
Ex: The cardinality of the STUDENT table is 4.
Concept Of Domain
The domain of an attribute is the set of all allowable values for that attribute.
Ex: Gender (Male, Female, Others).
Relation
A relation is defined as a set of tuples and attributes.
A relation consists of Relation schema and relation instance.
Relation schema: A relation schema represents the name of the relation with its attributes.
Ex: STUDENT (ROLL_NO, NAME, ADDRESS, PHONE, AGE) is the relation schema
for STUDENT.
Relation instance: The set of tuples of a relation at a particular instant of time is called a
relation instance.
An instance of the 'Employee' relation:
Constraints
While modeling the design of the relational database, we can specify rules (conditions) about
what values are allowed to be inserted in a relation.
Constraints are the rules enforced on the data columns of a table. They are used to limit the
type of data that can go into a table.
This ensures the accuracy and reliability of the data in the database. Constraints can be specified
either at the column level or at the table level.
Domain Constraints In DBMS
In a DBMS, a table is viewed as a combination of rows and columns.
For example, if you have a column called month and you want only (jan, feb,
march, ...) as the values allowed to be entered for that particular column, that set of values is
referred to as the domain of the column.
Definition: A domain constraint ensures that the data value entered for a particular column matches
the data type defined for that column and satisfies the constraints declared on it
(NOT NULL / UNIQUE / PRIMARY KEY / FOREIGN KEY / CHECK / DEFAULT).
1. Not Null:
NULL represents a record where data may be missing or where data for that record may be
optional.
Once NOT NULL is applied to a particular column, you cannot enter null values into that column.
A NOT NULL constraint cannot be applied at the table level.
Example:
Create table EMPLOYEE (id int not null, name varchar not null, age int not null, address
char (25), salary decimal (18,2), primary key (id));
In the above example we have applied NOT NULL on three columns - id, name and age - which
means that whenever a record is entered using an insert statement, all three columns must contain
a value other than null.
The other two columns, address and salary, do not have NOT NULL applied, which means
those columns may be left empty.
2. Unique:
Sometimes we need to maintain only unique data in a column of a database table; this is
possible by using a UNIQUE constraint.
Example:
Create table PERSONS (id int unique, last_name varchar (25) not null, first_name varchar (25),
age int);
In the above example, as we have used a unique constraint on the ID column, we are not supposed to
enter data that is already present; simply put, no two ID values can be the same.
3. Default:
DEFAULT in SQL is used to add a default value to a column.
When a column is specified as default with some value, all rows for which no value is
supplied will use that value, i.e., while entering the data we need not enter that
value every time.
But the default column value can be customised, i.e., it can be overridden when inserting data
for a row, based on the requirement.
Create table EMPLOYEE (id int not null, last_name varchar (25) not null, first_name varchar
(25), age int, city varchar (25) default 'Hyderabad');
As a result, whenever you insert a new row you need not enter a value for this
default column; entering a column value for a default column is optional.
4. Check:
A check constraint ensures that the data entered by the user for a column is within the range of
values or set of possible values specified.
Example: Create table STUDENT (id int, name varchar (25), age int, check (age >= 18));
As we have used a check constraint of (age >= 18), the value entered by the user for the
age column while inserting the data must be greater than or equal to 18.
5. Primary Key:
A primary key is a constraint in a table which uniquely identifies each row record in a
database table by enabling one or more column in the table as primary key.
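A minimal sketch of declaring a primary key (the table and column names are assumptions for illustration):
Create table STUDENT_DETAILS (roll_no int, name varchar (25), branch varchar (10), primary key (roll_no));
Here every row must have a unique, non-null roll_no, so each record can be identified by it.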
6. Foreign Key:
The foreign key constraint is a column or list of columns which points to the primary
key column of another table.
The main purpose of the foreign key is to ensure that only those values which match values in
the primary key column of the other table are allowed in the present table.
From the above two tables, COURSE_ID is a primary key of the table STUDENT_MARKS
and also behaves as a foreign key, as it is the same column in STUDENT_DETAILS and
STUDENT_MARKS.
Example:
(Reference Table)
Create table CUSTOMER1 (id int, name varchar (25), course varchar (10), primary key (id));
(Child table)
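The create statement for the child table is not shown above. A sketch of what it could look like, assuming a hypothetical ORDERS1 table that references CUSTOMER1, is:
Create table ORDERS1 (order_id int, amount int, c_id int, primary key (order_id), foreign key (c_id) references CUSTOMER1 (id));
Here c_id may only hold values that already exist in CUSTOMER1.id (or NULL).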
Entity integrity constraints:
These constraints are used to ensure the uniqueness of each record or row in the data table.
The entity integrity constraint says that no primary key can take a NULL value,
since using the primary key we identify each tuple uniquely in a relation.
Example:
Explanation:
In the above relation, EID is made the primary key, and the primary key cannot take
NULL values; but in the 3rd tuple the primary key is NULL, so it violates the
entity integrity constraint.
Referential integrity constraints:
The referential integrity constraint is specified between two relations or tables and is used to
maintain consistency among the tuples of the two relations.
This constraint is enforced through a foreign key: when an attribute in the foreign key of
relation R1 has the same domain as the primary key of relation R2, the foreign key
of R1 is said to reference or refer to the primary key of relation R2.
The values of the foreign key in a tuple of relation R1 can either take values of the
primary key for some tuple in relation R2, or can take NULL values, but cannot take any other value.
Explanation:
In the above, DNO of the first relation is the foreign key and DNO in the second relation is
the primary key.
DNO = 22 in the foreign key of the first relation is not available in the second relation; since
DNO = 22 is not defined in the primary key of the second relation, the referential integrity
constraint is violated here.
Basic SQL (introduction)
SQL stands for Structured Query Language; it is used for storing and managing data in a
relational database management system.
It is the standard language for relational database systems. It enables a user to create, read,
update and delete relational databases and tables.
All RDBMSs like MySQL, Oracle, MS Access and SQL Server use SQL as their
standard database language.
SQL allows users to query the database in a number of ways, using statements close to common
English.
Rules: SQL follows the following rules:
Characteristics of SQL:
SQL is easy to learn.
SQL is used to access data from relational database management system.
SQL is used to describe the data.
SQL is used to create and drop the database and table.
SQL allows users to set permissions on tables, procedures and views.
Database Schema:
A database schema is the skeleton structure of the database. It includes tables, views, relationships,
primary keys and foreign keys.
In actuality, the data is physically stored in files that may be in unstructured form, but to retrieve
it and use it, we need to keep it in a structured manner. To do this a database schema is used.
It provides knowledge about how the data is organized in a database and how it is associated
with other data.
A database schema object includes the following:
Consistent formatting for all data entries.
Database objects and unique keys for all data entries.
Tables with multiple columns, and each column contains its names and datatypes.
The given diagram is an example of a database schema: it contains three tables and their
data types. It also represents the relationships between the tables, and the primary keys as
well as the foreign keys.
SQL Commands:
SQL commands are categorized into three types:
1. Data Definition Language (DDL): used to create, alter and drop database objects such as tables.
2. Data Manipulation Language (DML): used to update, store and retrieve data from tables.
3. Data Control Language (DCL): used to control access to the database created using DDL
and DML.
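As a quick illustration of the three categories (a sketch only; the table and user names are assumptions):
CREATE TABLE student (roll_no INT, name VARCHAR(25));   -- DDL: defines a table
INSERT INTO student VALUES (1, 'Nishma');                -- DML: stores data
GRANT SELECT ON student TO some_user;                    -- DCL: controls access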
SQL DATATYPES :
An SQL data type is used to define the values that a column can contain.
Every column is required to have a name and a data type in the database table.
DATA TYPES OF SQL:
SQL data types are grouped into: Binary data types, Numeric data types, Exact numeric data types,
String data types, and Date data types.
1. BINARY DATATYPES:
There are three types of binary data types which are given below
DATA TYPE    DESCRIPTION
binary       It has a maximum length of 8000 bytes. It contains fixed-length binary data.
varbinary    It has a maximum length of 8000 bytes. It contains variable-length binary data.
image        It has a maximum length of 2,147,483,647 bytes. It contains variable-length binary data.
2. NUMERIC DATATYPES:
DATA TYPE    DESCRIPTION
int          It is used to specify an integer value.
smallint     It is used to specify a small integer value.
bit          It has the number of bits to store.
decimal      It specifies a numeric value that can have a decimal number.
numeric      It is used to specify a numeric value.
DATE AND TIME DATATYPES:
DATA TYPE    DESCRIPTION
date         It is used to store the year, month, and day values.
time         It is used to store the hour, minute, and second values.
timestamp    It stores the year, month, day, hour, minute, and second values.
5. STRING DATATYPE:
DATA TYPE    DESCRIPTION
char         It has a maximum length of 8000 characters. It contains fixed-length non-Unicode characters.
varchar      It has a maximum length of 8000 characters. It contains variable-length non-Unicode characters.
text         It has a maximum length of 2,147,483,647 characters. It contains variable-length non-Unicode characters.
SQL TABLE: SQL table is a collection of data which is organized in terms of rows and
columns.
In a DBMS, the table is known as a relation and a row as a tuple.
Let's see an example of the "EMPLOYEE" table.
SYNTAX:
create table "table_name" ("column1" "datatype", "column2" "datatype", ..., "columnN" "datatype");
EXAMPLE:
SQL > create table employee (emp_id int, emp_name varchar (25), phone_no int,address char
(30));
If you create the table successfully, you can verify it by looking at the message returned by
the SQL server, or else you can use the DESC command as follows:
SQL > DESC employee;
SYNTAX:
EXAMPLE:
3. DROP TABLE:
The drop table command deletes a table in the database.
The following example SQL deletes the table "EMPLOYEE".
SYNTAX:
drop table "table_name";
EXAMPLE:
SQL > drop table employee;
Insert:
The SQL insert statement is an SQL query. It is used to insert a single record or multiple records into a table.
Syntax:
insert into table_name (column1, column2, ...) values (value1, value2, ...);
NAME     ID   CITY
Alekhya  501  Hyderabad
Deepti   502  Guntur
Ramya    503  Nellore
1. Update:
The SQL update command is used to modify data that is already in the database.
The SQL update statement is used to change the data of records held by tables. Which rows are
to be updated is decided by a condition; to specify the condition, we use the WHERE clause.
The update statement can be written in the following form:
Syntax:
Update table_name set column_name = expression where condition;
Example:
Let's take an example: here we are going to update an entry in the table.
NAME     ID   CITY
Alekhya  501  Hyderabad
Deepti   502  Guntur
Rasi     503  Nellore
2. Delete:
The SQL delete statement is used to delete rows from a table.
Generally, the delete statement removes one or more records from a table.
Syntax:
Delete from table_name where condition;
Example:
NAME    ID   CITY
Deepti  502  Guntur
Rasi    503  Nellore
UNIT-III
SQL
SQL Clause
1. Group by:
SQL group by statement is used to arrange identical data into groups.
The group by statement is used with the SQL select statement.
The group by statement follows the WHERE clause in a SELECT statement and precedes the
ORDER BY clause.
Syntax:
Select column, aggregate_function (column) from table_name where condition group by column order by column;
Example:
Select company, count (*) from product group by company;
Output:
Com1   2
Com2   3
Com3   5
2. Having clause:
The having clause is used to specify a search condition for a group or an aggregate.
The having clause is used with a group by clause; if you are not using a group by clause, you
can use the having clause like a where clause.
Syntax:
Select column1, column2 from table_name where conditions group by column having condition;
Example:
select company, count (*) from product group by company having count (*) >= 2;
Output:
Com3   5
Com2   2
3. Order by clause:
The order by clause sorts the result set in ascending or descending order.
Syntax:
Select column1, column2 from table_name where condition order by column asc | desc;
Sample table:
Example:
Output:
NAME     ID   CITY
Alekhya  501  Hyderabad
Deepti   502  Guntur
Rasi     503  Nellore
Where clause:
Syntax:
Select column1, column2, ..., columnN from table_name where [condition];
The following comparison operators can be used in the where condition:
=     Equal to
>     Greater than
<     Less than
>=    Greater than or equal to
<=    Less than or equal to
<>    Not equal to
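A short sketch of the where clause with a comparison operator, reusing the EMPLOYEE sample table assumed in the later examples:
Select emp_id, emp_name, salary from employee where salary > 25000;
This returns only the rows whose salary value satisfies the condition.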
SQL operators:
SQL statements generally contain some reserved words or characters that are used to perform
operations such as arithmetic and logical operations. These reserved words are known as
operators.
SQL arithmetic operator:
We can use various arithmetic operators on the data stored in tables.
Arithmetic operators are:
+ Addition
- Subtraction
/ Division
* Multiplication
% modulus
1. Addition (+):
It is used to perform addition operation on data items.
Sample table:
EMP_ID EMP_NAME SALARY
1 Alex 25000
2 John 55000
3 Daniel 52000
4 Sam 12312
2. Subtraction (-):
It is used to perform subtraction on the data items.
Example:
Select emp_id, emp_name, salary, salary-100 as “salary-100” from
subtraction;
EMP_ID  EMP_NAME  SALARY  SALARY-100
1       Alex      25000   24900
2       John      55000   54900
3       Daniel    52000   51900
4       Sam       90000   89900
Here we have done a subtraction of 100 from each employee's salary.
3. Division (/):
The division operator performs integer division (x is divided by y); an integer value
is returned.
Example:
Select emp_id, emp_name, salary, salary/100 as "salary/100" from division;
5. Modulus (%):
It is used to get remainder when one data is divided by another.
Select emp_id, emp_name, salary, salary%25000 as “salary%25000”
from modulus;
Output:
EMP_ID  EMP_NAME  SALARY  SALARY%25000
1       Alex      25000   0
2       John      55000   5000
3       Daniel    52000   2000
4       Sam       90000   15000
Here we have done a modulus operation on each employee's salary.
Logical operations:
Logical operations allow you to test for the truth of a condition.
The following table illustrates the SQL logical operator.
OPERATOR   MEANING
ALL        Returns true if all comparisons are true
AND        Returns true if both expressions are true
ANY        Returns true if any one of the comparisons is true
BETWEEN    Returns true if the operand is within a range
IN         Returns true if the operand is equal to one of the values in a list
EXISTS     Returns true if the subquery contains any rows
1. AND:
The AND operator allows you to construct multiple conditions in the WHERE clause of an
SQL statement such as select.
The following example finds all employees whose salaries are greater than 5000 and less than
7000.
Select first_name, last_name, salary from employees where salary > 5000 AND
salary < 7000 order by salary;
Output:
FIRST_NAME LAST_NAME SALARY
John Wesley 6000
Eden Daniel 6000
Luis Popp 6900
Shanta Suji 6500
1. ALL:
The ALL operator compares a value to all values in another value set.
The following example finds all employees whose salaries are greater than or equal to all salaries
of employees in department 8.
EX:
select first_name, last_name, salary from employees where salary >= ALL (select salary from
employees where department_id = 8) order by salary DESC;
Output:
FIRST_NAME  LAST_NAME  SALARY
Steven      King       24000
John        Russel     17000
Neena       Kochhar    14000
2. ANY:
The ANY operator compares a value to any value in a set according to the condition.
The following example finds all employees whose salaries are greater than the average
salary of at least one department.
EX:
select first_name, last_name, salary from employees where salary > ANY (select avg (salary) from
employees group by department_id) order by first_name, last_name;
Output:
FIRST_NAME  LAST_NAME  SALARY
Alexander   Hunold     9000.00
Charles     Johnson    6200.00
David       Austin     4800.00
Eden        Flip       9000.00
1. Between:
The between operator searches for values that are within a set of values.
For example, the following statement finds all employees whose salaries are between 9000 and
12000.
EX: select first_name, last_name, salary from employees where salary between 9000 AND
12000 order by salary;
Output:
FIRST_NAME  LAST_NAME  SALARY
Alexander   Hunold     9000.00
Den         Richards   10000.00
Nancy       Prince     12000.00
2. IN:
The IN operator compares a value to a list of specified values. The IN operator returns true
if the compared value matches at least one value in the list.
The following statement finds all employees who work in department_id 8 or 9.
EX: select first_name, last_name, department_id from employees where department_id IN (8, 9);
Output:
FIRST_NAME  LAST_NAME    DEPARTMENT_ID
John        Russel       8
Jack        Livingstone  8
Steven      King         9
Neena       Kochhar      9
3. Exists:
The EXISTS operator tests whether a subquery returns any rows.
For example, the following statement finds all employees who have dependents.
select first_name, last_name from employees e where EXISTS (select 1 from dependent d
where d.employee_id = e.employee_id);
FIRST_NAME  LAST_NAME
Steven      King
Neena       Kochhar
Alexander   Hunold
Some important date and time functions are below:
Output: 05-DEC-2021.
ADD_MONTHS: This function returns a date after adding the specified number of months to a date. EX:
Output: 31-MAR-17.
Output: 05-MAR-22.
Output: 05-DEC-2021.
NEXT_DAY: This function takes a date and a day of the week, and returns the date of the next
occurrence of that day.
Output: 07-DEC-21.
MONTHS_BETWEEN: It is used to find the number of months between two given dates.
Output: -4.
ROUND: It gives the nearest value or round off value for the argument pass. (or) It returns a date
rounded to a specific unit of measure.
Output: 01-JAN-22.
TRUNC: This function returns the date with the time portion of the day truncated to the unit
specified.
Output: 01-DEC-21.
TO_DATE: This function converts date which is in the character string to a date value.
Output: 01-JAN-17.
Output: 05 12 2021.
LEAST: This function displays the oldest date present in the argument list.
dual;
Output: 01-MAR-21.
GREATEST: This function displays the latest date present in the argument list.
Output: 28-DEC-21.
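The SELECT statements that produced the outputs above are not reproduced here; the following Oracle-style calls are only illustrative of how these functions are typically invoked, with the argument values chosen as examples.
SELECT SYSDATE FROM dual;
SELECT ADD_MONTHS (SYSDATE, 3) FROM dual;
SELECT NEXT_DAY (SYSDATE, 'TUESDAY') FROM dual;
SELECT MONTHS_BETWEEN (TO_DATE ('01-JAN-22', 'DD-MON-YY'), TO_DATE ('01-MAY-22', 'DD-MON-YY')) FROM dual;
SELECT TO_DATE ('01-JAN-17', 'DD-MON-YY') FROM dual;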
Aggregate Functions:
Aggregate functions take a collection of values as input and return a single value.
1. Count ()  2. Sum ()  3. Avg ()  4. Max ()  5. Min ()
1. Count (): This function returns the number of rows returned by a query.
2. Sum (): It will add / sum all the column values in the query.
3. Avg (): The avg function is used to calculate the average value of a set of rows.
4. Max (): This function is used to find the maximum value from a set of values.
5. Min (): This function is used to find the minimum value from a set of values.
Syntax: Select aggregate_function (column) from table_name where condition;
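A short sketch applying all five aggregate functions at once, assuming the EMPLOYEE sample table (with a salary column) used in the arithmetic-operator examples:
Select count (*), sum (salary), avg (salary), max (salary), min (salary) from employee;
The query returns a single row containing one value per aggregate.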
SQL NUMERIC FUNCTIONS:
Numeric functions are used to perform operations on numbers and return numbers. The following are the numeric functions:
1. ABS (): It returns the absolute value of a number. EX: select ABS (-243.5) from dual;
OUTPUT: 243.5
2. ACOS (): It returns the arc cosine of a number. EX: select ACOS (0.25) from dual;
OUTPUT: 1.318116071652818
3. ASIN (): It returns the arc sine of a number. EX: select ASIN (0.25) from dual;
OUTPUT: 0.252680255142
4. CEIL (): It returns the smallest integer value that is greater than or equal to a number. EX:
select CEIL (25.77) from dual;
OUTPUT: 26
5. FLOOR (): It returns the largest integer value that is less than or equal to a number. EX: select
FLOOR (25.75) from dual;
OUTPUT: 25
6. TRUNCATE (): It returns the number truncated to the given number of places to the right of
the decimal point (this function is not available in SQL Server).
EX: select TRUNCATE (7.53635, 2) from dual;
OUTPUT: 7.53
7. MOD (): It returns the remainder when two numbers are divided. EX: select MOD (55,2)
from dual;
OUTPUT: 1.
8. ROUND (): This function rounds the given value to given number of digits of precision. EX:
select ROUND (14.5262,2) from dual;
OUTPUT: 14.53.
9. POWER (): This function gives the value of m raised to the power of n. EX: select
POWER (4,9) from dual;
OUTPUT: 262144.
10. SQRT (): This function gives the square root of the given value n.EX: Select SQRT
(576) from dual;
OUTPUT: 24.
11. LEAST (): This function returns least integer from given set of integers.EX: select LEAST
OUTPUT: 1.
12. GREATEST (): This function returns greatest integer from given set of integers. EX: select
OUTPUT: 22
STRING CONVERSION FUNCTIONS OF SQL:
String Functions are used to perform an operation on input string and return the output string.
Following are the string functions
1. CONCAT (): This function is used to add two words (or) strings.
2. INSTR (): This function is used to find the occurrence of an alphabet. EX: instr
3. LOWER (): This function is used to convert the given string into lowercase. EX: select
OUTPUT: database
4. UPPER (): This function is used to convert the lowercase string into uppercase. EX: select
OUTPUT: DATABASE
5. LPAD (): This function is used to make the given string as long as the given size by adding the given symbol on the left.
OUTPUT: 00system
6. RPAD (): This function is used to make the given string as long as the given size by
adding the given symbol on the right.
OUTPUT: system00
7. LTRIM (): This function removes the given characters from the left side of the string. EX:
OUTPUT: base
8. RTRIM (): This function removes the given characters from the right side of the string. EX:
OUTPUT: data.
9. INITCAP (): This function returns the string with the first letter of each word in uppercase.
10. LENGTH (): This function returns the length of the given string.
OUTPUT: 11.
11. SUBSTR (): This function returns a portion of a string beginning at the character position.
EX:
select SUBSTR ('MY WORLD IS AMAZING', 12, 3) from dual;
OUTPUT: AM.
TRANSLATE (): This function returns a string after replacing some set of characters with another set.
EX: select TRANSLATE ('Delhi is the capital of India', 'i', 'a') from dual;
OUTPUT: Delha as the capatal of Indaa
When we want to create tables with relationships, we need to use referential integrity constraints. The
referential integrity constraint enforces the relationship between tables.
EX: SQL> CREATE TABLE marks (sid VARCHAR2(4), marks NUMBER(3), PRIMARY KEY (sid));
Data constraints: All businesses of the world run on business data being gathered, stored and
analyzed. Business managers determine a set of business rules that must be applied to their data
prior to it being stored in the database/table, to ensure its integrity.
For instance, no employee in the sales department can have a salary of less than Rs.1000/-. Such
rules have to be enforced on the data stored. If not, inconsistent data is maintained in the database.
Integrity constraints are the rules in real life, which are to be imposed on the data. If the data is not
satisfying the constraints then it is considered as inconsistent. These rules are to be enforced on
data because of the presence of these rules in real life. These rules are called integrity constraints.
Every DBMS software must enforce integrity constraints, otherwise inconsistent data is generated.
Example for Integrity Constraints :-
1. Domain integrity constraints - A domain means a set of values assigned to a column, i.e.,
a set of permitted values. Domain constraints are handled by:
Defining a proper data type
Specifying a NOT NULL constraint
Specifying a CHECK constraint
Specifying a DEFAULT constraint
Column level :-
The constraint is declared along with the column definition.
A composite key cannot be defined at column level.
Table level :-
The constraint is declared after declaring all the columns.
Use table level to declare a constraint for a combination of columns (i.e., a composite key).
NOT NULL cannot be defined at table level. Constraints can also be added later at the ALTER level.
To add these constraints, we can declare a constraint with a label or without a label. There are two basic forms:
Declaring Constraint at “Alter” level (Constraints with label)
Syntax:
ALTER TABLE <table_name> ADD CONSTRAINT cont_label NAME_OF_THE_CONSTRAINT (column);
vi) Declaring Constraint at “Alter” level (Constraints without label)
Syntax:
ALTER TABLE <table_name> ADD NAME_OF_THE_CONSTRAINT (column);
Note: The 'CONSTRAINT' clause is not required when constraints are declared without a label.
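As an illustration (the table name demo_table, the column sname and the constraint name un_sname are assumptions):
SQL> ALTER TABLE demo_table ADD CONSTRAINT un_sname UNIQUE (sname);   -- with label
SQL> ALTER TABLE demo_table ADD UNIQUE (sname);                       -- without label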
NOT NULL:
It ensures that a table column cannot be left empty.
Column declared with NOT NULL is a mandatory column i.e data must be entered.
The NOT NULL constraint can only be applied at column level.
It allows DUPLICATE data.
It is used to avoid null values in columns.
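A minimal sketch of a table matching the sample data shown below (the table and column names are assumptions):
SQL> CREATE TABLE not_null_demo(sid NUMBER(4) NOT NULL, sname VARCHAR2(10));
Here sid must always be supplied, while sname may be left null.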
SID   SNAME
501   GITA
502   RAJU
503
504
Here, the SID column does not allow any null values but it can allow duplicate values, whereas the SNAME column allows nulls.
CHECK :
Used to impose a conditional rule on a table column.
It defines a condition that each row must satisfy.
Check constraint validates data based on a condition .
Value entered in the column should not violate the condition.
Check constraint allows null values.
Check constraint can be declared at table level or column level.
There is no limit to the number of CHECK constraints that can be defined on a column.
Limitations :-
The condition cannot refer to columns of other tables.
The condition cannot use SYSDATE, USER or other environment functions.
The condition cannot refer to pseudo columns such as CURRVAL, NEXTVAL, LEVEL or ROWNUM.
Here, sid should start with 'C' and the length of sid is exactly 4 characters, and sname should end
with the letter 'A'.
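A sketch of a table definition matching these rules (the table name check_table is taken from the query below; the exact original definition is not shown):
SQL> CREATE TABLE check_table(
       sid   VARCHAR2(4)  CHECK ( sid LIKE 'C%' AND LENGTH(sid) = 4 ),
       sname VARCHAR2(10) CHECK ( sname LIKE '%A' ));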
SQL> SELECT *FROM check_table;
SID SNAME
@ ALTER LEVEL // with label
Here, we add a CHECK constraint to an existing table.
SQL> ALTER TABLE check_alter ADD CONSTRAINT ck CHECK ( sid LIKE 'C%');
DEFAULT
-If values are not provided for table column , default will be considered.
-This prevents NULL values from entering the columns, if a row is inserted without a value for a
column.
-The default value can be a literal, an expression, or a SQL function.
-The default expression must match the data type of the column.
- The DEFAULT constraint is used to provide a default value for a column.
-The default value will be added to all new records if no other value is specified.
This defines what value the column should use when no value has been supplied explicitly while
inserting a record into the table.
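The syntax is to add DEFAULT <value> to the column definition. A minimal sketch (the table and column names are illustrative):
SQL> CREATE TABLE orders_demo(order_id NUMBER(4), order_date DATE DEFAULT SYSDATE);
SQL> INSERT INTO orders_demo(order_id) VALUES (1);   -- order_date is automatically set to SYSDATE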
UNIQUE
Columns declared with the UNIQUE constraint do not accept duplicate values.
One table can have a number of unique keys.
Unique key can be defined on more than one column i.e composite unique key
A composite UNIQUE key is always defined at the table level only.
By default UNIQUE columns accept null values unless declared with NOT
NULL constraint
Oracle automatically creates UNIQUE index on the column declared with
UNIQUE constraint
UNIQUE constraint can be declared at column level and table level.
SQL> CREATE TABLE table_unique( sid NUMBER(4) UNIQUE, sname VARCHAR2(10));
//UNIQUE @ TABLE LEVEL
SYNTAX: UNIQUE(COLUMN_LIST);
SQL> CREATE TABLE table_unique2(
sid NUMBER(4),
sname VARCHAR2(10),
UNIQUE(sid, sname));
SID   SNAME
401   RAMU
402   SITA
402   GITHA   // these two records are distinct, not the same
403   GITHA
404   RAMU
Now we have removed the unique constraint, so the table contains duplicate data.
//UNIQUE @ ALTER LEVEL (here, the table contains duplicates, so adding the constraint does not work)
//delete data from table_unique2
SQL> DELETE FROM table_unique2;
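Once the duplicate rows are deleted, the constraint can be added back at the ALTER level, for example (the constraint name is illustrative):
SQL> ALTER TABLE table_unique2 ADD CONSTRAINT un_sid_sname UNIQUE (sid, sname);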
PRIMARY KEY
There should be at most one primary key (or composite primary key) per table.
A PK column does not accept null values.
A PK column does not accept duplicate values.
RAW,LONG RAW,VARRAY,NESTED TABLE,BFILE columns cannot be declared
with PK
If PK is composite then uniqueness is determined by the combination of
columns. A composite primary key cannot have more than 32 columns
It is recommended that PK column should be short and numeric.
Oracle automatically creates Unique Index on PK column
EX:
Table altered.
CASE 3 : ADD PRIMARY KEY @ TABLE LEVEL
Here, we can create simple and composite primary keys, as sketched below.
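Possible sketches of a table-level primary key, simple and composite (the table and column names are assumed for illustration):
SQL> CREATE TABLE pk_demo1(sid NUMBER(4), marks NUMBER(3), PRIMARY KEY (sid));            -- simple PK
SQL> CREATE TABLE pk_demo2(sid NUMBER(4), cid NUMBER(4), marks NUMBER(3),
                           PRIMARY KEY (sid, cid));                                       -- composite PK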
FOREIGN KEY Constraint:-
SYNTAX:
CREATE TABLE <tablename>(
col_name1 datatype[size],
col_name2 datatype[size],
:
col_name n datatype[size],
FOREIGN KEY (col_name) REFERENCES <parent_table>(col_name));
SQL> ALTER TABLE marks3 ADD CHECK ( marks > 0 AND marks <= 100 );
Note :-
PRIMARY KEY cannot be dropped if it is referenced by any FOREIGN KEY constraint.
If PRIMARY KEY is dropped with the CASCADE option, then along with the PRIMARY KEY the referencing FOREIGN KEY is also dropped.
A PRIMARY KEY column cannot be dropped if it is referenced by some FOREIGN KEY.
A PRIMARY KEY table cannot be dropped if it is referenced by some FOREIGN KEY.
A PRIMARY KEY table cannot be truncated if it is referenced by some FOREIGN KEY.
Note: Once the primary key and foreign key relationship has been created, you cannot
remove any parent record if dependent child records exist.
By using the ON DELETE CASCADE clause you can remove the parent record even if child records exist, because whenever
you remove a parent record Oracle automatically removes all its dependent records from the child
table, if this clause is present while creating the foreign key constraint.
Ex: Consider two tables, dept (parent) and emp (child).
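A minimal sketch of such a pair with ON DELETE CASCADE (the column names are illustrative):
SQL> CREATE TABLE dept(deptno NUMBER(2) PRIMARY KEY, dname VARCHAR2(10));
SQL> CREATE TABLE emp(empno NUMBER(4) PRIMARY KEY,
                      deptno NUMBER(2) REFERENCES dept(deptno) ON DELETE CASCADE);
SQL> DELETE FROM dept WHERE deptno = 10;   -- also removes the matching emp rows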
Disable constraint
Perform the DML operation
Enable constraint
Disabling Constraint :-
Syntax :-
ALTER TABLE <tabname> DISABLE CONSTRAINT <constraint_name> ;
Example :-
SQL> ALTER TABLE student1 DISABLE CONSTRAINT ck ;
SQL> ALTER TABLE mark1 DISABLE PRIMARY KEY CASCADE;
NOTE:-
If constraint is disabled with CASCADE then PK is disabled with FK.
The number of columns and data types of the columns being selected must be identical in all the
SELECT statements used in the query. The names of the columns need not be identical.
All SET operators have equal precedence. If a SQL statement contains multiple SET operators, the
oracle server evaluates them from left (top) to right (bottom) if no parentheses explicitly specify
another order.
Introduction
SQL set operators allow combining results from two or more SELECT statements. At first sight this looks
similar to SQL joins, although there is a big difference. SQL joins tend to combine columns, i.e., with
each additionally joined table it is possible to select more and more columns. SQL set operators, on
the other hand, combine rows from different queries, with strong preconditions:
each involved SELECT must retrieve the same number of columns, and
the data types of corresponding columns in each involved SELECT must be
compatible (either the same, or implicitly convertible to the data
types of the first SELECT statement).
UNION -- returns all rows selected by either query; it returns all rows from
multiple tables and eliminates any duplicate rows.
UNION ALL -- returns all rows from multiple tables, including duplicates.
INTERSECT -- returns all rows common to multiple queries.
MINUS -- returns rows from the first query that are not present in the second query.
Note: Whenever these operators are used, each SELECT statement must retrieve the same number of columns with compatible data types.
Syntax :-
SELECT statement 1
UNION / UNION ALL / INTERSECT / MINUS
SELECT statement 2 ;
Rules :-
1. UNION
Example :-
SQL> SELECT job,sal FROM emp WHERE deptno=10
UNION
SELECT job,sal FROM emp WHERE deptno=20 ORDER BY sal ;
NOTE:-
ORDER BY clause must be used with the last query.
2. UNION ALL
This will combine the records of multiple tables having the same structure, but including duplicates.
It is similar to UNION but it includes duplicates.
Example :-
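A minimal sketch, following the pattern of the UNION example above:
SQL> SELECT job,sal FROM emp WHERE deptno=10
UNION ALL
SELECT job,sal FROM emp WHERE deptno=20;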
3. INTERSECT
This will give the common records of multiple tables having the same structure.
INTERSECT operator returns common values from the result of two SELECT statements.
Example:-
Display common jobs belonging to the 10th and 20th departments?
SQL> SELECT job FROM emp WHERE deptno=10
INTERSECT
SELECT job FROM emp WHERE deptno=20;
4. MINUS
This will give the records of a table whose records are not in other tables having the same structure.
MINUS operator returns values present in the result of the first SELECT statement and not present in
the result of the second SELECT statement.
Example:-
Display jobs in 10th dept and not in 20th dept ?
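A minimal sketch for this question, following the same pattern as the examples above:
SQL> SELECT job FROM emp WHERE deptno=10
MINUS
SELECT job FROM emp WHERE deptno=20;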
UNION vs JOIN :-
UNION                                       JOIN
Union combines data                         Join relates data
Union is performed on similar structures    Join can also be performed on dissimilar structures
SQL JOINS
A SQL JOIN is an operation used to retrieve data from multiple tables. It is performed whenever
two or more tables are joined in a SQL statement. So, the SQL JOIN clause is used to combine records
from two or more tables in a database. A JOIN is a means for combining fields from two tables by
using values common to each. Several operators can be used to join tables,
such as =, <>, <=, >=, !=, BETWEEN, LIKE, and NOT; all of these can be used to join tables.
However, the most common operator is the equal symbol.
SQL Join Types:
There are different types of joins available in SQL:
INNER JOIN: Returns rows when there is a match in both tables.
OUTER JOIN: Returns all rows, whether or not there is a match in the tables.
- LEFT JOIN/LEFT OUTER JOIN: Returns all rows from the left table,
even if there are no matches in the right table.
- RIGHT JOIN/RIGHT OUTER JOIN: Returns all rows from the right table, even if there
are no matches in the left table.
- FULL JOIN/FULL OUTER JOIN: Returns rows when there is a match in one of the tables.
SELF JOIN: It is used to join a table to itself as if the table were two tables, temporarily
renaming at least one table in the SQL statement.
CARTESIAN JOIN or CROSS JOIN : It returns the Cartesian product of the sets of
records from the two or more joined tables.
Based on Operators, The Join can be classified as
- Inner join or Equi Join
- Non-Equi Join
NATURAL JOIN: It is performed only when the common column names are the same. In this case, there is no
need to specify the join condition explicitly; ORACLE automatically performs the join
operation on the columns with the same name.
1. SQL INNER JOIN (simple join)
It is the most common type of SQL join. SQL INNER JOINS return all rows from multiple
tables where the join condition is met.
Syntax
SELECT columns FROM table1 INNER JOIN table2 ON table1.column = Table2.column;
Visual Illustration
In this visual diagram, the SQL INNER JOIN returns the shaded area:
The SQL INNER JOIN would return the records where table1 and table2 intersect. Let's
look at some data to explain how INNER JOINs work with an example.
We have a table called SUPPLIERS with two fields (supplier_id and supplier_name). It
contains the following data:
supplier_id   supplier_name
10000         IBM
10001         Hewlett Packard
10002         Microsoft
10003         NVIDIA
We have another table called ORDERS with three fields (order_id, supplier_id, and
order_date).
It contains the following data:
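A sketch of the INNER JOIN query being discussed, using the column names from the surrounding text:
SQL> SELECT suppliers.supplier_id, suppliers.supplier_name, orders.order_date
     FROM suppliers INNER JOIN orders ON suppliers.supplier_id = orders.supplier_id;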
The rows for Microsoft and NVIDIA from the supplier table would be omitted, since the
supplier_id's 10002 and 10003 do not exist in both tables.
The row for 500127 (order_id) from the orders table would be omitted, since
the supplier_id 10004 does not exist in the suppliers table.
2.OUTER JOIN:
An inner/equi join returns only the matching records from both tables, not the unmatched records. An
outer join retrieves all rows, even those for which the join condition is not met.
Types of outer join:
1. LEFT JOIN/LEFT OUTER JOIN
The SQL LEFT OUTER JOIN would return all records from table1 and only those
records from table2 that intersect with table1.
Example
SELECT suppliers.supplier_id, suppliers.supplier_name, orders.order_date FROM
suppliers LEFT OUTER JOIN orders ON suppliers.supplier_id = orders.supplier_id;
This LEFT OUTER JOIN example would return all rows from the suppliers table and only
those rows from the orders table where the joined fields are equal.
The rows for Microsoft and NVIDIA would be included because a LEFT OUTER JOIN was
used. However, you will notice that the order_date field for those records contains a
<null> value.
SQL RIGHT OUTER JOIN
This type of join returns all rows from the RIGHT-hand table specified in the ON
condition and only those rows from the other table where the joined fields are equal
(the join condition is met).
Syntax
SELECT columns FROM table1 RIGHT [OUTER] JOIN table2 ON table1.column =
table2.column;
In some databases, the RIGHT OUTER JOIN keywords are replaced with RIGHT JOIN.
Visual Illustration
In this visual diagram, the SQL RIGHT OUTER JOIN returns the shaded area:
The SQL RIGHT OUTER JOIN would return all records from table2 and only those
records from table1 that intersect with table2.
Example
SELECT orders.order_id, orders.order_date, suppliers.supplier_name FROM suppliers
RIGHT OUTER JOIN orders ON suppliers.supplier_id = orders.supplier_id;
The row for 500127 (order_id) would be included because a RIGHT OUTER JOIN was used.
However, you will notice that the supplier_name field for that record contains a
<null> value.
SQL FULL OUTER JOIN
This type of join returns all rows from the LEFT-hand table and RIGHT-hand table with
nulls in place where the join condition is not met.
Syntax
SELECT columns FROM table1 FULL [OUTER] JOIN table2 ON table1.column = table2.column;
In some databases, the FULL OUTER JOIN keywords are replaced with FULL JOIN.
Visual Illustration
In this visual diagram, the SQL FULL OUTER JOIN returns the shaded area:
The SQL FULL OUTER JOIN would return all records from both table1 and table2.
Example
Here is an example of a SQL FULL OUTER JOIN:
Query : Find supplier id, supplier name and order date of suppliers who have ordered.
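A sketch of such a query, following the pattern of the earlier LEFT and RIGHT OUTER JOIN examples:
SQL> SELECT suppliers.supplier_id, suppliers.supplier_name, orders.order_date
     FROM suppliers FULL OUTER JOIN orders ON suppliers.supplier_id = orders.supplier_id;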
This FULL OUTER JOIN example would return all rows from the suppliers table and all
rows from the orders table and whenever the join condition is not met, <nulls> would be
extended to those fields in the result set.
If a supplier_id value in the suppliers table does not exist in the orders table, all fields in the
orders table will display as <null> in the result set. If a supplier_id value in the orders table
does not exist in the suppliers table, all fields in the suppliers table will display as
<null> in the result set.
Equi join :
When the join condition is based on the EQUALITY (=) operator, the join is said to be an equi join. It
is also called an inner join.
Syntax
Select col1, col2, ... From <table 1>, <table 2> Where <join condition with '='> ;
Ex. Query : Find supplier id, supplier name and order date of suppliers who have ordered.
select s.supplier_id, s.supplier_name, o.order_date from suppliers s, orders o where s.supplier_id
= o.supplier_id;
Query : Find supplier id, supplier name and order date for orders above 500126.
sql> select s.supplier_id, s.supplier_name, o.order_date from suppliers s, orders o where o.order_id
> 500126;
Self Join :-
Joining a table to itself is called Self Join.
Self Join is performed when tables have self-referential integrity.
To perform Self Join same table must be listed twice with different alias.
Self Join is Equi Join within the table.
It is used to join a table to itself as if the table were two tables, temporarily renaming at least one
table in the SQL statement.
Syntax :
(Here T1 and T2 refers same table)
SELECT <collist> FROM Table1 T1, Table1 T2 WHERE T1.Column1 = T2.Column2;
Example:
select s1.supplier_id ,s1.supplier_name ,s2.supplier_id from suppliers s1, suppliers s2 where
s1.supplier_id=s2.supplier_id ;
supplier_id   supplier_name   supplier_id
-----------   -------------   -----------
CROSS JOIN:
It returns the Cartesian product of the sets of records from the two or more joined tables. In
Cartesian product, each element of one set is combined with every element of another set to form
the resultant elements of Cartesian product.
Syntax: SELECT * FROM <tablename1> CROSS JOIN <tablename2>;
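A minimal sketch (using the sailors and reserves tables referenced later in this section):
SQL> SELECT * FROM sailors CROSS JOIN reserves;
Every sailors row is paired with every reserves row, so the result has (rows in sailors) x (rows in reserves) rows.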
NATURAL JOIN:
NATURAL JOIN is possible in ANSI SQL/92 standard.
NATURAL JOIN is similar to EQUI JOIN.
NATURAL JOIN is performed only when the common column names are the same.
In NATURAL JOIN there is no need to specify the join condition explicitly;
ORACLE automatically performs the join operation on the columns with the same name.
Syntax: SELECT <column list> FROM table1 NATURAL JOIN table2;
Example: ( Sailors table)
SELECT sid,sname,sid FROM sailors NATURAL JOIN reserves ; //both tables
have the same column name.
VIEWS
A view in SQL is a logical subset of data from one or more tables. A view is used to restrict data
access. Data abstraction is usually required after a table is created and populated with data. Data held
by some tables might require restricted access to prevent all users from accessing all columns of a
table, for data security reasons. Such a security issue can be solved by creating several tables with
appropriate columns and assigning specific users to each such table, as required. This answers data
security requirements very well but gives rise to a great deal of redundant data being resident in
tables in the database. To reduce redundant data to the minimum possible, Oracle provides virtual
tables, which are views.
View Definition :-
A View is a virtual table based on the result returned by a SELECT query.
The most basic purpose of a view is restricting access to specific column/rows from a table
thus allowing different users to see only certain rows or columns of a table.
Composition Of View:-
A view is composed of rows and columns, very similar to table. The fields in a view are
fields from one or more database tables in the database.
SQL functions, WHERE clauses and JOIN statements can be applied to a view in the same
manner as they are applied to a table.
View storage:-
Oracle does not store the view data. It recreates the data, using the view’s SELECT statement,
every time a user queries a view.
Advantages Of View:-
Security:- Each user can be given permission to access only a set of views that contain specific data.
Query simplicity:- A view can be drawn from several different tables and present the data as a single table,
turning multiple-table queries into single-table queries against the view.
Data Integrity:- If data is accessed and entered through a view, the DBMS can automatically check
the data to ensure that it meets specified integrity constraints.
Disadvantage of View:-
Performance:- Views only create the appearance of a table, but the RDBMS must still translate
queries against the views into queries against the underlying source tables. If the view is
defined on a complex multiple-table query, then even a simple query against the view becomes a
complicated join and takes a long time to execute.
Types of Views :-
Simple Views
Complex Views
Simple Views :-
A view based on a single table is called a simple view.
Syntax:-
CREATE VIEW <ViewName> AS
SELECT <ColumnName1>, <ColumnName2>, ... FROM <TableName>
[WHERE <cond>] [WITH CHECK OPTION] [WITH READ ONLY]
Example :-
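A minimal sketch (the view name emp_v matches the one dropped later in this section; the column list and condition are assumptions):
SQL> CREATE VIEW emp_v AS SELECT empno, ename, sal FROM emp WHERE deptno = 10;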
Views can also be used for manipulating the data that is available in the base tables, i.e. the
user can perform Insert, Update and Delete operations through a view.
Views on which data manipulation can be done are called Updateable Views.
If an Insert, Update or Delete SQL statement is fired on a view, modifications to data in the
view are passed to the underlying base table.
UpdatingaView:
A view can be updated under certain conditions:
The SELECT clause may not contain the keyword DISTINCT.
The SELECT clause may not contain summary functions.
The SELECT clause may not contain set functions.
The SELECT clause may not contain set operators.
The SELECT clause may not contain an ORDER BY clause.
The FROM clause may not contain multiple tables.
The WHERE clause may not contain subqueries.
The query may not contain GROUP BY or HAVING.
Calculated columns may not be updated.
All NOT NULL columns from the base table must be included in the view in order for
the INSERT query to function.
So if a view satisfies all the above-mentioned rules then you can update the view.
If a view is created WITH CHECK OPTION and a DML operation through that view
violates the WHERE condition, then that DML operation returns an error.
NON- UPDATABLE VIEWS:
We cannot perform insert, update or delete operations on the base table through complex views.
Complex views are not updatable views.
Example 2 :-
SQL> CREATE VIEW V2 AS
SELECT deptno, SUM(sal) AS sumsal FROM
emp GROUP BY deptno;
Destroying a View:-
The DROP VIEW command is used to destroy a view from the database.
Syntax:- DROP VIEW <viewName>
Example :-
SQL> DROP VIEW emp_v;
DIFFERENCES BETWEEN SIMPLE AND COMPLEX VIEWS:
SIMPLE                               COMPLEX
Created from one table               Created from one or more tables
Does not contain functions           Contains functions
Does not contain groups of data      Contains groups of data
A materialized view in Oracle is a database object that contains the results of a query. Materialized views are
local copies of data located remotely, or are used to create summary tables based on
aggregations of a table's data. Materialized views, which store data based on remote tables, are also
known as snapshots.
A materialized view can query tables, views, and other materialized views. Collectively these are
called master tables (a replication term) or detail tables (a data warehouse term).
For replication purposes, materialized views allow you to maintain copies of remote data on your
local node. These copies are read-only. If you want to update the local copies, you have to use the
Advanced Replication feature. You can select data from a materialized view as you would from a
table or view.
For data warehousing purposes, the materialized views commonly created are aggregate
views, single-table aggregate views, and join views.
In replication environments, the materialized views commonly created are primary key, rowid, and
subquery materialized views.
SYNTAX:
Example:
The following statement creates the rowid materialized view on table emp located on a remote
database:
SQL> CREATE MATERIALIZED VIEW mv_emp_rowid REFRESH WITH ROWID
AS SELECT * FROM emp@remote_db;
Materialized view log created.
ORDERING
This is used to order column data (ascending or descending). Syntax1: (simple form)
select * from <table_name> order by <col> desc;
If you want the output in descending order you have to use the desc keyword after the
column. Ex: SQL> select * from student order by no; SQL> select * from student order by
no desc;
The order of rows returned in a query result is undefined. The ORDER BY clause can be used
to sort the rows. If you use the ORDER BY clause, it must be the last clause of the SQL statement.
You can specify an expression, an alias, or a column position in the ORDER BY clause.
ORDER BY : specifies the order in which the retrieved rows are displayed.
ASC : orders the rows in ascending order (the default order).
DESC : orders the rows in descending order.
Ordering of Data :-
Numeric values are displayed with the lowest values first, for example 1 to 999.
Date values are displayed with the earliest value first for example 01-JAN-92 before 01-
JAN-95.
Character values are displayed in alphabetical order—for example, A first and Z last.
Null values are displayed last for ascending sequences and first for descending sequences.
Examples :-
Arrange employee records in ascending order of their sal ?
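A minimal sketch of this query:
SQL> SELECT * FROM emp ORDER BY sal;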
HAVING group-qualification
The expression appearing in the group-qualification in the HAVING clause must have
a single value per group.
Ex: SQL> select deptno, sum(sal) from emp group by deptno;
DEPTNO   SUM(SAL)
-------  --------
10       8750
20       10875
30       9400
SQL> select deptno, job, sum(sal) from emp group by deptno, job;
Q: Find the age of the youngest sailor for each rating level.
Q: Find the age of the youngest sailor who is eligible to vote for each rating level with at least two
such sailors.
Q: For each red boat, find the number of reservations for this boat.
Q: Find the average age of sailors for each rating level that has at least two sailors.
(Sketches of these queries are given below.)
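Possible sketches of the four queries above, assuming the usual Sailors(sid, sname, rating, age), Boats(bid, bname, color) and Reserves(sid, bid, day) schema and taking "eligible to vote" as age >= 18:
SQL> SELECT S.rating, MIN(S.age) FROM sailors S GROUP BY S.rating;
SQL> SELECT S.rating, MIN(S.age) AS minage FROM sailors S WHERE S.age >= 18
     GROUP BY S.rating HAVING COUNT(*) > 1;
SQL> SELECT B.bid, COUNT(*) AS reservationcount FROM boats B, reserves R
     WHERE R.bid = B.bid AND B.color = 'red' GROUP BY B.bid;
SQL> SELECT S.rating, AVG(S.age) AS avgage FROM sailors S
     GROUP BY S.rating HAVING COUNT(*) > 1;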
AGGREGATION
It is a group operation that works on all records of a table. To do this, group
functions process a group of rows and return one value for that group.
Aggregate functions - max(),min(),sum(),avg(),count(),count(*).
Group functions will be applied on all the rows but produces single output.
a) SUM
This will give the sum of the values of the specified column.
Syntax: sum (column)
Ex: SQL> select sum(sal) from emp;
b) AVG
This will give the average of the values of the specified column.
Syntax: avg (column)
Ex: SQL> select avg(sal) from emp;
c) MAX
This will give the maximum of the values of the specified column.
Syntax: max (column)
Ex: SQL> select max(sal) from emp;
d) MIN
This will give the minimum of the values of the specified column.
Syntax: min (column)
Ex: SQL> select min(sal) from emp;
e) COUNT
This will give the count of the values of the specified column.
Syntax: count (column)
Ex: SQL> select count(sal), count(*) from emp;
SUB QUERIES
A subquery is a SQL query nested inside a larger query.
A subquery may occur in :
- A SELECT clause
- A FROM clause
- A WHERE clause
You can use the comparison operators, such as >, <, or =. The comparison operator can
also be a multiple-row operator, such as IN, ANY, or ALL.
A subquery is also called an inner query or inner select, while the statement containing a
subquery is also called an outer query or outer select.
The inner query executes first before its parent query so that the results of inner query
can be passed to the outer query.
You can use a subquery in a SELECT, INSERT, DELETE, or UPDATE statement to perform
the following tasks :
Compare an expression to the result of the query.
Determine if an expression is included in the results of the query.
Check whether the query selects any rows.
Syntax :
The subquery (inner query) executes once before the main query (outer query) executes.
The main query (outer query) uses the subquery result.
Consider the following marks table:
SID    TOTALMARKS
v001   95
v002   80
v003   74
v004   81
Now we want to write a query to identify all students who get better marks than the
student whose StudentID is 'V002', but we do not know the marks of 'V002'.
- To solve this problem, we require two queries.
One query returns the marks (stored in Totalmarks field) of 'V002' and a second query identifies
the students who get better marks than the result of the first query.
SQL> select * from marks where sid='v002';
SID    TOTALMARKS
----   ----------
v002   80
The result of the query is 80.
- Using the result of this query, here we have written another query to identify the students who
get better marks than 80.
Second query :
SQL> select s.sid,s.name,m.totalmarks from student1 s, marks m where s.sid=m.sid and
m.totalmarks>80;
SID NAME TOTALMARKS
---- ---------- ----------
v001 abhi 95
v004 anand 81
The above two queries identified students who get better marks than the student whose StudentID is
'V002' (Abhi).
You can combine the above two queries by placing one query inside the other. The subquery (also
called the 'inner query') is the query inside the parentheses. See the following code and query result:
SQL> select s.sid,s.name,m.totalmarks from student1 s,marks m where s.sid=m.sid and
m.totalmarks >(select totalmarks from marks where sid='v002');
SID NAME TOTALMARKS
---- ---------- ----------
v001 Abhi 95
v004 Anand 81
Subqueries: Guidelines
There are some guidelines to consider when using subqueries :
-A subquery must be enclosed in parentheses.
-A subquery must be placed on the right side of the comparison operator.
-Subqueries cannot manipulate their results internally, therefore an ORDER BY clause cannot be
added to a subquery. You can use an ORDER BY clause in the main SELECT
statement (outer query), which will be the last clause.
-Use single-row operators with single-row subqueries.
-If a subquery (inner query) returns a null value to the outer query, the outer query will
not return any rows when using certain comparison operators in a WHERE clause.
Type of Subqueries
Single row subquery : Returns zero or one row.
Multiple row subquery : Returns one or more rows.
Multiple column subquery : Returns one or more columns.
Correlated subqueries : Reference one or more columns in the outer SQL statement. The
subquery is known as a correlated subquery because the subquery is related to the outer SQL
statement.
Nested subqueries : Subqueries placed within another subquery.
1) SINGLE ROW SUB QUERIES:- Returns zero or one row.
If inner query returns only one row then it is called single row subquery.
Syntax :-
Example2: (on SAILORS _BOAT_RESERVATION DATABASE )
Q: Find the sailor’s ID whose name is equal to ‘DUSTIN’
SQL> SELECT SID FROM SAILORS WHERE SID = (SELECT SID FROM
SAILORS WHERE SNAME='DUSTIN');
SID
----------
22
Q: Find the rating of the sailor whose name is 'DUSTIN'.
SQL> SELECT RATING FROM SAILORS WHERE SID = (SELECT SID FROM SAILORS
WHERE SNAME='DUSTIN');
RATING
7
Q: Find the sailors records whose sid is greater than that of 'DUSTIN'.
SQL> SELECT * FROM SAILORS WHERE SID > (SELECT SID FROM SAILORS
WHERE SNAME='DUSTIN');
Q: Find the sailors records having the maximum rating.
SQL> SELECT * FROM SAILORS WHERE RATING = (SELECT MAX(RATING)
FROM SAILORS);
SID   SNAME   RATING   AGE
----  ------  -------  ----
58    RUSTY   10       35
71    ZOBRA   10       16
MULTI ROW SUB QUERIES:
If the inner query returns more than one row then it is called a multi-row subquery.
Syntax :-
To test for values in a specified list of values, use the IN operator. The IN operator can be used with
any data type. If characters or dates are used in the list, they must be enclosed in single quotation
marks ('').
Syntax:-
IN (V1,V2,V3---------);
Note :-
Example :-
Q:Find the name of sailors who have reserved a red boat
SQL> SELECT S.SNAME FROM SAILORS S WHERE S.SID IN (SELECT R.SID FROM
RESERVES R WHERE R.BID IN (SELECT B.BID FROM BOATS B WHERE
B.COLOR='RED'));
SNAME
----------
DUSTIN
LUBBER
HORATIO
Q:Find the names of sailors who have not reserved a red boat.
SELECT S.SNAME FROM SAILORS S WHERE S.SID NOT IN (SELECT R.SID FROM
RESERVES R WHERE R.BID IN (SELECT B.BID FROM BOATS B WHERE B.COLOR
= 'RED'));
SNAME
----------
BRUTUS
CANDY
RUSTY
ZOBRA
HORATIO
ART
BOB
Using EXISTS Operator :-
EXISTS operator returns TRUE or FALSE.
If the inner query returns at least one record then EXISTS returns TRUE, otherwise it returns
FALSE. ORACLE recommends the EXISTS and NOT EXISTS operators instead of IN and NOT
IN.
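A query matching the output below (find the names of sailors who have reserved boat 103) could be written with EXISTS as:
SQL> SELECT S.SNAME FROM SAILORS S WHERE EXISTS (SELECT * FROM
RESERVES R WHERE R.BID=103 AND R.SID = S.SID);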
SNAME
----------
DUSTIN
LUBBER
HORATIO
Q:Find the name of sailors who have not reserved boat 103
SQL> SELECT S.SNAME FROM SAILORS S WHERE NOT EXISTS (SELECT *FROM
RESERVES R WHERE R.BID=103 AND R.SID = S.SID) ;
SNAME
----------
BRUTUS
CANDY
RUSTY
HORATIO
ZOBRA
ART
BOB
ANY operator:-
Compares a value to each value in a list or returned by a query. Must be preceded by =, !=, >, <,
<=, >=. It evaluates to FALSE if the query returns no rows.
Example: Select employees whose salary is greater than any salesman's salary.
Example: Find sailors whose rating is greater than the rating of any sailor named 'HORATIO', as below.
SQL> SELECT S.SID FROM SAILORS S WHERE S.RATING > ANY ( SELECT
S2.RATING FROM SAILORS S2 WHERE S2.SNAME='HORATIO') ;
SID
58
71
74
31
32
ALL operator :-
Compares a value to every value in a list or returned by a query. Must be preceded by =, !=, >,
<, <=, >=. It evaluates to TRUE if the query returns no rows.
Example:-
SQL> SELECT S.SID FROM SAILORS S WHERE S.RATING > ALL ( SELECT
S2.RATING FROM SAILORS S2 WHERE S2.SNAME='HORATIO') ;
SID
58
71
Multi Column Subqueries:-
If the inner query returns more than one column value then it is called a MULTI COLUMN subquery.
Example :-
Display employee names earning maximum salaries in their dept ?
SQL>SELECT ename FROM emp WHERE (deptno,sal) IN (SELECT deptno,MAX(sal)
FROM emp GROUP BY deptno) ;
SQL> SELECT SNAME FROM SAILORS WHERE (RATING,AGE) IN (SELECT
RATING,MAX(AGE) FROM SAILORS GROUP BY RATING);
SNAME
----------
DUSTIN
BRUTUS
LUBBER
RUSTY
HORATIO
BOB
SQL> SELECT SID,SNAME FROM SAILORS WHERE (RATING,AGE) IN (SELECT
RATING,MAX(AGE) FROM SAILORS GROUP BY RATING);
SID SNAME
Nested Queries:-
A subquery embedded in another subquery is called a NESTED query.
Example :-
Display employee name earning second maximum salary ?
SQL> SELECT ename FROM emp
WHERE sal = (SELECT MAX(sal) FROM emp
WHERE sal < (SELECT MAX(sal) FROM emp));
Q:Find the names of sailors who have not reserved a red boat.
SELECT S.SNAME FROM SAILORS S WHERE S.SID NOT IN (SELECT R.SID FROM
RESERVES R WHERE R.BID IN (SELECT B.BID FROM BOATS B WHERE B.COLOR
= 'RED'));
SNAME
CORRELATED SUB QUERIES:
In a correlated subquery, the parent query is executed first and, based on the output of the outer
query, the inner query executes.
If the parent query returns N rows, the inner query is executed N times.
If a subquery references one or more columns of the parent query, it is called a CORRELATED
subquery because it is related to the outer query. This subquery executes once for each and every
row of the main query.
Example1 :-
Example 2: Find sailors whose rating is more than the avg(rating) for their sid.
SQL> SELECT S.SNAME FROM SAILORS S WHERE RATING > (SELECT AVG(RATING)
FROM SAILORS WHERE SID=S.SID);
no rows selected.
SUB QUERIES WITH SET OPERATORS:
Q1) Find the names of sailors who have reserved a red or a green boat?
SQL> Select s.sname from sailors s, reserves r, boats b where s.sid=r.sid and r.bid=b.bid
and (b.color = 'red' or b.color = 'green');
Or
SQL> Select s.sname from sailors s, reserves r, boats b where s.sid=r.sid and r.bid=b.bid and
b.color='red'
UNION
Select s.sname from sailors s, reserves r, boats b where s.sid=r.sid and r.bid=b.bid and
b.color='green';
SNAME
Dustin
Lubber
Horatio
Q2) Find the names of sailors who have reserved a red and a green boat?
SQL> Select s.sname from sailors s, reserves r, boats b where s.sid=r.sid and r.bid=b.bid and
b.color='red'
INTERSECT
Select s.sname from sailors s, reserves r, boats b where s.sid=r.sid and r.bid=b.bid and
b.color='green';
SNAME
Dustin
Lubber
Horatio
56
Q3) Find the names of sailors who have reserved a red boat but not a green boat?
SQL> Select s.sname from sailors s, reserves r, boats b where s.sid=r.sid and r.bid=b.bid and
b.color='red'
MINUS
Select s.sname from sailors s, reserves r, boats b where s.sid=r.sid and r.bid=b.bid and
b.color='green';
NO ROWS SELECTED
Q4) Find all sids of sailors who have a rating of 10 or have reserved boat 104?
SQL> Select s.sid from sailors s where s.rating=10
UNION
Select r.sid from reserves r where r.bid=104;
SID
22
31
58
71
UNIT-IV
Schema Refinement(Normalization)
Purpose of Normalization:
Normalization is the process of reducing data redundancy in a table and improving data integrity. Data normalization is a
technique used in databases to organize data efficiently.Data normalization ensures that your data remains clean, consistent,
and error-free by breaking it into smaller tables and linking them through relationships. This process reduces redundancy,
improves data integrity, and optimizes database performance. Then why do you need it? If there is no normalization in SQL,
there will be many problems, such as:
Insert Anomaly: This happens when we cannot insert a piece of data into the table without also inserting other, unrelated data.
Update Anomaly: This is due to data inconsistency caused by data redundancy and data update.
Delete Anomaly: Occurs when some attributes are lost due to the deletion of other attributes.
So normalization is a way of organizing data in a database. Normalization involves organizing the columns and
tables in the database to ensure that their dependencies are correctly implemented using database constraints.
Normalization is the process of organizing data properly. It is used to minimize the duplication of various
relationships in the database. It is also used to troubleshoot exceptions such as inserts, deletes, and updates in the
table. It helps to split a large table into several small normalized tables. Relationships and links between them are used to
reduce redundancy. Normalization, also known as database normalization or data normalization, is an important
part of relational database design because it helps to improve the speed, accuracy, and efficiency of the database.
Need For Normalization
It eliminates redundant data.
It reduces chances of data error.
Normalization is important because it allows the database to take up less disk space.
It also helps in increasing performance.
It improves the data integrity and consistency.
Advantages
By using normalization redundancy of database or data duplication can be resolved.
We can minimize null values by using normalization.
Results in a more compact database (due to less data redundancy/zero).
Minimize/avoid data modification problems.
It simplifies queries.
The database structure is clearer and easier to understand.
The database can be expanded without affecting existing data.
Finding, sorting, and indexing can be faster because the table is small and more rows can be accommodated on the
data page
Normalization (or schema refinement) is a critical process in database design that aims to eliminate redundancy and
improve the integrity of data. Its primary purpose is to organize the data in a way that reduces duplication and ensures
consistency, making databases more efficient, reliable, and scalable. Here are the key purposes of normalization and schema
refinement:
1. Reduce Data Redundancy
Problem: Without normalization, data might be repeated in multiple places within a database, leading to
inconsistencies and storage inefficiencies.
Solution: Normalization breaks down tables into smaller, related ones, ensuring that each piece of information
is stored in only one place, thereby reducing duplication.
2. Avoid Update Anomalies
Problem: When redundant data exists, updating a piece of information may require multiple updates across
different tables or records, which can lead to errors if one update is missed.
Solution: Normalization ensures that changes only need to be made in one place. This reduces the chances
of inconsistencies arising from partial updates.
Problem: With redundant data, it's possible to have inconsistencies where data might contradict itself across
different parts of the database.
Solution: By organizing data into related, smaller tables, normalization enforces integrity constraints (such as
referential integrity) that maintain the accuracy and consistency of the database.
Problem: When data is stored in an unnormalized form, certain operations (like inserting, deleting, or updating
records) might lead to anomalies (e.g., having a null value where it shouldn't be, or accidentally deleting
important information).
Solution: Normalization helps design the schema in a way that these operations are less likely to result in
inconsistencies or errors.
Problem: Storing redundant data in a single table can waste disk space.
Solution: Through normalization, storage is optimized because each piece of information is stored once, and
related data is grouped appropriately.
Problem: Unnormalized databases with redundant data may require more complex queries to fetch or update
information, as relationships between data elements are not as clearly defined.
Solution: In a normalized schema, relationships between tables are clear, and querying the database is more
efficient, as it allows for better indexing and streamlined data retrieval.
Problem: A poorly structured schema may need significant rework as the database evolves or grows,
especially if new data relationships need to be introduced.
Solution: Normalization encourages flexibility. It provides a clear structure that makes it easier to add new
attributes or relationships without major changes to the existing schema.
Levels of Normalization
Normalization typically occurs in stages, known as normal forms (NF), each addressing different issues:
1NF (First Normal Form): Ensures that the table has no repeating groups and all attributes are atomic.
2NF (Second Normal Form): Addresses partial dependencies, ensuring that non-key attributes depend on the
entire primary key.
3NF (Third Normal Form): Ensures that no transitive dependencies exist between non-key attributes and the
primary key.
Higher normal forms like BCNF, 4NF, and 5NF address more specific cases of dependency.
Functional dependency
A functional dependency occurs when one attribute uniquely determines another attribute within a relation. It is a
constraint that describes how attributes in a table relate to each other. If attribute A functionally determines attribute B, we
write this as A → B.
Functional dependencies are used to mathematically express relations among database entities and are very important to
understanding advanced concepts in Relational Database Systems.
Example:
roll_no   name   dept_name   dept_building
42        abc    CO          A4
43        pqr    IT          A3
44        xyz    CO          A4
45        xyz    IT          A3
46        mno    EC          B2
47        jkl    ME          B2
From the above table we can conclude some valid functional dependencies:
roll_no → { name, dept_name, dept_building },→ Here, roll_no can determine values of fields name,
dept_name and dept_building, hence a valid Functional dependency
roll_no → dept_name , Since, roll_no can determine whole set of {name, dept_name, dept_building}, it
can determine its subset dept_name also.
dept_name → dept_building , Dept_name can identify the dept_building accurately, since departments with
different dept_name will also have a different dept_building
More valid functional dependencies: roll_no → name, {roll_no, name} → {dept_name, dept_building},
etc.
Here are some invalid functional dependencies:
name → dept_name Students with the same name can have different dept_name, hence this is not a valid
functional dependency.
dept_building → dept_name There can be multiple departments in the same building. Example, in
the above table departments ME and EC are in the same building B2, hence dept_building → dept_name is an
invalid functional dependency.
More invalid functional dependencies: name → roll_no, {name, dept_name} → roll_no, dept_building →
roll_no, etc.
1. Trivial Functional Dependency
In a trivial functional dependency, the dependent is a subset of the determinant.
roll_no   name   age
42        abc    17
43        pqr    18
44        xyz    18
Here, {roll_no, name} → name is a trivial functional dependency, since the dependent name is a subset of the determinant
set {roll_no, name}. Similarly, roll_no → roll_no is also an example of a trivial functional dependency.
2. Non-Trivial Functional Dependency
In a non-trivial functional dependency, the dependent is not a subset of the determinant.
roll_no   name   age
42        abc    17
43        pqr    18
44        xyz    18
Here, roll_no → name is a non-trivial functional dependency, since the dependent name is not a subset of the
determinant roll_no. Similarly, {roll_no, name} → age is also a non-trivial functional dependency, since age is not a
subset of {roll_no, name}.
3. Multivalued Functional Dependency
In Multivalued functional dependency, entities of the dependent set are not dependent on each other. i.e. If a → {b,
c} and there exists no functional dependency between b and c, then it is called a multivalued functional dependency.
For example,
roll_no   name   age
42        abc    17
43        pqr    18
44        xyz    18
45        abc    19
Here, roll_no → {name, age} is a multivalued functional dependency, since the dependents name & age are not
dependent on each other(i.e. name → age or age → name doesn’t exist !)
4. Transitive Functional Dependency
In transitive functional dependency, dependent is indirectly dependent on determinant.
i.e. If a → b & b → c, then according to axiom of transitivity, a → c. This is a transitive functional dependency.
For example,
enrol_no name dept building_no
42 abc CO 4
43 pqr EC 2
44 xyz IT 1
45 abc EC 2
Here, enrol_no → dept and dept → building_no. Hence, according to the axiom of transitivity, enrol_no →
building_no is a valid functional dependency. This is an indirect functional dependency, hence called Transitive
functional dependency.
5. Fully Functional Dependency
In full functional dependency an attribute or a set of attributes uniquely determines another attribute or set of
attributes. If a relation R has attributes X, Y, Z with the dependencies X->Y and X->Z which states that those
dependencies are fully functional.
6. Partial Functional Dependency
In partial functional dependency a non key attribute depends on a part of the composite key, rather than the
whole key. If a relation R has attributes X, Y, Z where X and Y are the composite key and Z is non key attribute. Then X-
>Z is a partial functional dependency in RBDMS.
1. Normalization
Functional dependencies play a vital role in normalization. With the help of functional dependencies we are able to identify the primary key and candidate key in a table,
which in turn helps in normalization.
2. Query Optimization
With the help of functional dependencies we are able to decide the connectivity between the tables and the necessary
attributes need to be projected to retrieve the required data from the tables. This helps in query optimization and improves
performance.
3. Consistency of Data
Functional dependencies ensures the consistency of the data by removing any redundancies or inconsistencies that may
exist in the data. Functional dependency ensures that the changes made in one attribute does not affect inconsistency in
another set of attributes thus it maintains the consistency of the data in database.
4. Data Quality Improvement
Functional dependencies ensure that the data in the database to be accurate, complete and updated. This helps to improve
the overall quality of the data, as well as it eliminates errors and inaccuracies that might occur during data analysis and
decision making, thus functional dependency helps in improving the quality of data in database.
NORMALIZATION
Anomalies in DBMS
There are three types of anomalies that occur when the database is not normalized. These are – Insertion, update
and deletion anomaly. Let’s take an example to understand this.
Example:
Suppose a manufacturing company stores the employee details in a table named employee that has four attributes:
emp_id for storing employee’s id, emp_name for storing employee’s name, emp_address for storing employee’s
address and emp_dept for storing the department details in which the employee works. At some point of time the table
looks like this:
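An illustrative employee table consistent with the discussion below (the values are hypothetical, except that Rick belongs to two departments and Maggie only to D890):
emp_id   emp_name   emp_address   emp_dept
101      Rick       Delhi         D001
101      Rick       Delhi         D002
123      Maggie     Agra          D890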
The above table is not normalized. We will see the problems that we face when a table is not normalized.
Update anomaly:
In the above table we have two rows for employee Rick as he belongs to two departments of the company. If we
want to update the address of Rick then we have to update the same in two rows or the data will become inconsistent. If
somehow, the correct address gets updated in one department but not in other then as per the database, Rick would be
having two different addresses, which is not correct and would lead to inconsistent data.
Insert anomaly:
Suppose a new employee joins the company, who is under training and currently not assigned to any department then
we would not be able to insert the data into the table if emp_dept field doesn’t allow nulls.
Delete anomaly:
Suppose, if at a point of time the company closes the department D890 then deleting the rows that are having
emp_dept as D890 would also delete the information of employee Maggie since she is assigned only to this department.
Normalization
First Normal Form (1NF)
As per the rule of first normal form, an attribute (column) of a table cannot hold multiple values. It should hold only
atomic values.
Example: Suppose a company wants to store the names and contact details of its employees. It creates a table that looks
like this:
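An illustrative version of that table, based on the 1NF data shown further below (values are assumed; the point is that two mobile numbers share one field):
emp_id   emp_name   emp_address   emp_mobile
102      Jon        Kanpur        8812121212, 9900012222
103      Ron        Chennai       7778881212
104      Lester     Bangalore     9990000123, 8123450987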
Two employees (Jon & Lester) are having two mobile numbers so the company stored them in the same field as you can
see in the table above.
This table is not in 1NF as the rule says "each attribute of a table must have atomic (single) values"; the emp_mobile values
for employees Jon & Lester violate that rule.
To make the table comply with 1NF we should have the data like this:
emp_id   emp_name   emp_address   emp_mobile
102      Jon        Kanpur        8812121212
102      Jon        Kanpur        9900012222
103      Ron        Chennai       7778881212
104      Lester     Bangalore     9990000123
104      Lester     Bangalore     8123450987
Second Normal Form (2NF)
A table is said to be in 2NF if it is in 1NF and no non-prime attribute is dependent on a proper subset of any candidate key of the table.
An attribute that is not part of any candidate key is known as a non-prime attribute.
Example: Suppose a school wants to store the data of teachers and the subjects they teach. They create a table that looks
like this: Since a teacher can teach more than one subject, the table can have multiple rows for the same teacher.
teacher_subject table:
teacher_id subject
111 Maths
111 Physics
222 Biology
333 Physics
Now the tables comply with Second normal form (2NF).
Third Normal Form (3NF)
A table is in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on any candidate key.
In other words, 3NF can be explained like this: A table is in 3NF if it is in 2NF and for each functional
dependency X -> Y at least one of the following conditions holds:
X is a super key of the table, or
Y is a prime attribute (each element of Y is part of some candidate key).
An attribute that is not part of any candidate key is known as a non-prime attribute; an attribute that is a part of one of the candidate keys is known as a prime attribute.
Example: Suppose a company wants to store the complete address of each employee, they create a table named
employee_details that looks like this:
Here, emp_state, emp_city & emp_district are dependent on emp_zip. And emp_zip is dependent on emp_id, which makes
non-prime attributes (emp_state, emp_city & emp_district) transitively dependent on super key (emp_id). This violates
the rule of 3NF.
To make this table comply with 3NF we have to break the table into two tables to remove the transitive dependency:
employee table:
employee_zip table:
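Based on the dependencies described above, the column lists of the two tables would be roughly:
employee table: emp_id, emp_name, emp_zip
employee_zip table: emp_zip, emp_state, emp_city, emp_district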
Boyce-Codd Normal Form (BCNF)
It is an advanced version of 3NF, which is why it is also referred to as 3.5NF. BCNF is stricter than 3NF. A table complies
with BCNF if it is in 3NF and, for every functional dependency X -> Y, X is a super key of the table.
Example: Suppose there is a company wherein employees work in more than one department. They store the data like
this:
The table is not in BCNF as neither emp_id nor emp_dept alone are keys.
To make the table comply with BCNF we can break the table in three tables like this:
emp_nationality table:
emp_id emp_nationality
1001 Austrian
1002 American
emp_dept table:
emp_dept_mapping table:
emp_id emp_dept
1001 Production and planning
1001 stores
1002 design and technical support
1002 Purchasing department
Functional dependencies:
Candidate keys:
A lossless join decomposition ensures that when a relation is decomposed into smaller relations, the original relation
can be reconstructed by performing a natural join (or an equivalent join) on the decomposed relations. In other
words, no information is lost during the decomposition process.
Characteristics:
Reconstructibility: You can join the decomposed relations back to get the original relation.
No Data Loss: The decomposition is "lossless" because it preserves all the original tuples from the relation.
To ensure that a decomposition is lossless, one of the following conditions must be met:
If the intersection of the decomposed relations (the common attributes) forms a candidate key of at least one
of the decomposed relations, the decomposition is lossless.
o Example: If you decompose a relation R into R1(A, B) and R2(B, C) and B is a candidate key in either
R1 or R2, the decomposition is lossless.
Example:
Given the relation R(A, B, C) with the functional dependency A → B, you can decompose it into two smaller relations,
R1(A, B) and R2(A, C).
Here, the common attribute is A. Since A is a key attribute in R1, the decomposition is lossless.
R1(A, B) R2(A, C)
A1, B1 A1, C1
A2, B2 A2, C2
By performing a natural join on R1 and R2 using the common attribute A, we can reconstruct the original relation
R(A, B, C).
2. Dependency-Preserving Decomposition
A dependency-preserving decomposition ensures that all the original functional dependencies (FDs) can still be
enforced in the decomposed relations. This is important for maintaining data integrity.
Characteristics:
Preservation of Dependencies: The decomposed relations must still enforce the original
functional dependencies.
Efficiency: Dependency-preserving decompositions help to avoid the overhead of recomputing dependencies
after joining decomposed relations.
Example:
Consider a relation R(A, B, C) with the functional dependencies:
A → B
B → C
If we decompose R into:
R1(A, B) (preserving A → B)
R2(B, C) (preserving B → C)
Decomposition and Normal Forms
Decomposition is an essential tool in achieving higher normal forms (such as 2NF, 3NF, and BCNF). The
normalization process involves decomposing a relation to remove redundancies and dependencies that violate the
conditions of the higher normal forms.
For example:
Steps in Decomposition
Let’s take a simple relation R(A, B, C) with the following functional dependencies:
A→B
B→C
A → C (transitive dependency)
Step 1: Identify the Violations
The relation is in 2NF (the key A is a single attribute, so there are no partial dependencies), but it is not in 3NF
because B → C is a transitive dependency (the non-key attribute C depends on
another non-key attribute B, which in turn depends on the key A).
Step 2: Decompose the Relation
We can decompose the relation into two smaller relations that satisfy 3NF:
R1(A, B) (because A → B)
R2(B, C) (because B → C)
Step 3: Verify Lossless Join and Dependency Preservation
Lossless Join: We can join R1(A, B) and R2(B, C) on B to get the original relation R(A, B, C).
Dependency Preservation: The dependencies A → B and B → C are preserved in R1 and R2.
Properties of Decomposition
1. Lossless Join:
The decomposition must allow the original relation to be reconstructed through a natural join of the
decomposed relations without losing any information.
2. Dependency Preservation:
The decomposition must ensure that all functional dependencies in the original relation can be enforced in the
decomposed relations. Ideally, the original functional dependencies should be expressible in the decomposed
relations without requiring joins.
Surrogate Key
A surrogate key, also called a synthetic primary key, is generated automatically by the database
when a new record is inserted into a table, and it can be declared as the primary key of that table. It is a sequential
number, outside of the business data, that is made available to the user and the application, or it acts as an object that is present in the database but is not
visible to the user or application.
We can say that, in case we do not have a natural primary key in a table, then we need to artificially create
one in order to uniquely identify a row in the table, this key is called the surrogate key or synthetic primary key of
the table. However, the surrogate key is not always the primary key. Suppose we have multiple objects in a database
that are connected to the surrogate key, then we will have a many-to-one association between the primary keys and
the surrogate key and the surrogate key cannot be used as the primary key.
Features of the Surrogate Key
It is automatically generated by the system.
It holds an anonymous integer.
It contains a unique value for all records of the table.
The value can never be modified by the user or application.
The surrogate key is called the factless key as it is added just for our ease of identification of unique values
and contains no relevant fact(or information) that is useful for the table.
Example:
Suppose we have two tables of two different schools having the same column registration_no, name, and percentage, each
table having its own natural primary key, that is registration_no.
Table of school A:
registration_no name percentage
210101 Harry 90
210102 Maxwell 65
210103 Lee 87
210104 Chris 76
Table of school B:
registration_no name percentage
CS107 Taylor 49
CS108 Simon 86
CS109 Sam 96
CS110 Andy 58
Now, suppose we want to merge the details of both the schools in a single table. Resulting table
will be:
surr_no registration_no name percentage
1 210101 Harry 90
2 210102 Maxwell 65
3 210103 Lee 87
4 210104 Chris 76
5 CS107 Taylor 49
6 CS108 Simon 86
7 CS109 Sam 96
8 CS110 Andy 58
As we can observe from the above table, registration_no cannot be the primary key of the merged table because its values do not
follow a single consistent format across all the records, even though it holds unique values. Now, in this case, we have to
artificially create a primary key for this table. We can do this by adding a column surr_no to the table that contains
anonymous integers and has no direct relation with the other columns. This additional column surr_no is the surrogate key
of the table.
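A minimal sketch of such a table in Oracle 12c or later, using an identity column as the surrogate key (the table name and column sizes are assumptions):
SQL> CREATE TABLE all_students(
       surr_no         NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
       registration_no VARCHAR2(10),
       name            VARCHAR2(20),
       percentage      NUMBER(3));
Each INSERT that omits surr_no automatically receives the next sequential value.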
To determine the highest normal form of a given relation R with functional dependencies, the first step is to check
whether the BCNF condition holds. If R is found to be in BCNF, it can be safely deduced that the relation is also in 3NF,
2NF, and 1NF as the hierarchy shows. The 1NF has the least restrictive constraint – it only requires a relation R to have
atomic values in each tuple. The 2NF has a slightly more restrictive constraint.
The 3NF has a more restrictive constraint than the first two normal forms but is less restrictive than the BCNF. In this
manner, the restriction increases as we traverse down the hierarchy.
Examples
Here, we are going to discuss some basic examples which let you understand the properties of BCNF. We will discuss
multiple examples here.
Example 1
Let us consider the student database, in which data of the student are mentioned.
Stu_ID Stu_Branch Stu_Course Branch_Number Stu_Course_No
Candidate Key for this table: Stu_ID.
Stu_Course Table
Stu_Course Branch_Number Stu_Course_No
101 201
101 202
102 401
102 402
Multivalued Dependency (MVD)
In Database Management Systems (DBMS), multivalued dependency (MVD) deals with complex attribute
relationships in which an attribute may have many independent values while yet depending on another attribute or group
of attributes. It improves database structure and consistency and is essential for data integrity and database normalization.
MVD or multivalued dependency means that for a single value of attribute ‘a’, multiple independent values of attribute ‘b’ exist. We write it as,
a --> --> b
It is read as “a multidetermines b”, i.e. b is multivalued dependent on a. Suppose a person named Geeks is working on 2 projects, Microsoft and Oracle, and has 2 hobbies, namely Reading and Music. This can be expressed in a tabular format in the following way.
Example
Project and Hobby are multivalued attributes as they have more than one value for a single person i.e., Geeks.
When one attribute in a database depends on another attribute and has many independent values, it is said to
have multivalued dependency (MVD). It supports maintaining data accuracy and managing intricate data interactions.
Formally, for a --> --> b to hold in a relation R, whenever two tuples t1 and t2 agree on a, there must also exist tuples t3 and t4 that agree with them on a, where t3 takes its b-value from t1 and its remaining attributes from t2, and t4 takes its b-value from t2 and its remaining attributes from t1 (in the simplest case t3 = t2 and t4 = t1). To check for an MVD in a given table, we apply these conditions to the values in the table.
Fourth Normal Form (4NF)
The Fourth Normal Form (4NF) is a level of database normalization in which a relation has no non-trivial multivalued dependencies other than those whose determinant is a candidate key. It builds on the first three normal forms (1NF, 2NF, and 3NF) and the Boyce-Codd Normal Form (BCNF): in addition to meeting the requirements of BCNF, a relation must not contain any non-trivial multivalued dependency.
Properties
A relation R is in 4NF if and only if the following conditions are satisfied:
1. It should be in the Boyce-Codd Normal Form (BCNF).
2. The table should not have any Multi-valued Dependency.
A table with a multivalued dependency violates the normalization standard of the Fourth Normal Form (4NF)
because it creates unnecessary redundancies and can contribute to inconsistent data. To bring this up to 4NF, it is
necessary to break this information into two tables.
Example:
Consider the database table of a class that has two relations R1 contains student ID(SID) and student name
(SNAME) and R2 contains course id(CID) and course name (CNAME).
Table R1
SID SNAME
S1 A
S2 B
Table R2
CID CNAME
C1 C
C2 D
Table R1 × R2 (all combinations of students and courses)
SID SNAME CID CNAME
S1 A C1 C
S1 A C2 D
S2 B C1 C
S2 B C2 D
Multivalued dependencies (MVD) are:
SID->->CID; SID->->CNAME; SNAME->->CNAME
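The decomposition above loses no information: because SID ->-> CID holds, the combined table can always be rebuilt as the cross product of R1 and R2. A small sketch using the table values from this example (the variable names are only illustrative):

```python
from itertools import product

r1 = [("S1", "A"), ("S2", "B")]          # (SID, SNAME)
r2 = [("C1", "C"), ("C2", "D")]          # (CID, CNAME)

# Every student is paired with every course, so the combined relation
# is exactly the cross product of the two smaller tables.
combined = [(sid, sname, cid, cname)
            for (sid, sname), (cid, cname) in product(r1, r2)]

for row in combined:
    print(row)   # ('S1', 'A', 'C1', 'C'), ('S1', 'A', 'C2', 'D'), ...
```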
Join Dependency
Join decomposition is a further generalization of multivalued dependencies. For a relation R(A, B, C, D), if the join of R1(A, B, C) and R2(C, D) over the common attribute C is equal to R, then a join dependency (JD) exists and R1, R2 form a lossless decomposition of R. In general, a JD ⋈ {R1, R2, …, Rn} is said to hold over a relation R if R1, R2, …, Rn is a lossless-join decomposition of R. Thus *(A, B, C), (C, D) is a JD of R if the join of the projections over the common attribute is equal to the relation R. The notation *(R1, R2, R3) indicates that relations R1, R2, R3, and so on form a JD of R. Let R be a relation schema and R1, R2, R3, …, Rn be a decomposition of R; r(R) is said to satisfy the join dependency if and only if
⋈ { πR1(r), πR2(r), …, πRn(r) } = r
Example:
Table R1
Company Product
C1 Pendrive
C1 mic
C2 speaker
C2 speaker
Company->->Product
Table R2
Agent Company
Aman C1
Aman C2
Mohan C1
Agent->->Company
Table R3
Agent Product
Aman Pendrive
Aman Mic
Aman speaker
Mohan speaker
Agent->->Product
Table R1⋈R2⋈R3
Company Product Agent
C1 Pendrive Aman
C1 mic Aman
C2 speaker Aman
Now consider the relation ACP(Agent, Company, Product):
Agent Company Product
A1 PQR Nut
A1 PQR Bolt
A1 XYZ Nut
A1 XYZ Bolt
A2 PQR Nut
The relation ACP is again decomposed into 3 relations. Now, the natural Join of all three relations will be shown as:
Table R1
Agent Company
A1 PQR
A1 XYZ
A2 PQR
Table R2
Agent Product
A1 Nut
A1 Bolt
A2 Nut
Table R3
Company Product
PQR Nut
PQR Bolt
XYZ Nut
XYZ Bolt
The result of the natural join of R1 and R3 over ‘Company’, followed by the natural join of R13 and R2 over ‘Agent’ and ‘Product’, is Table ACP.
Hence, in this example, all the redundancies are eliminated, and the decomposition of ACP is a lossless join decomposition.
Therefore, the relation is in 5NF as it does not violate the property of lossless join.
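The lossless-join check described above can be sketched directly on this example. The following illustrative snippet takes the ACP rows shown earlier, projects them onto R1, R2, and R3, recomputes the joins, and confirms that the result equals ACP; the variable names are assumptions, not part of the material:

```python
# ACP and its three projections, as sets of tuples.
acp = {("A1", "PQR", "Nut"), ("A1", "PQR", "Bolt"),
       ("A1", "XYZ", "Nut"), ("A1", "XYZ", "Bolt"),
       ("A2", "PQR", "Nut")}                       # (Agent, Company, Product)

r1 = {(a, c) for a, c, p in acp}                   # Agent, Company
r2 = {(a, p) for a, c, p in acp}                   # Agent, Product
r3 = {(c, p) for a, c, p in acp}                   # Company, Product

# Natural join of R1 and R3 over Company, then join with R2 over (Agent, Product).
r13 = {(a, c, p) for a, c in r1 for c2, p in r3 if c == c2}
joined = {(a, c, p) for a, c, p in r13 if (a, p) in r2}

print(joined == acp)   # True: the decomposition is lossless, so the JD holds
```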
Conclusion
Multivalued dependencies are removed by 4NF, and join dependencies are removed by 5NF.
The highest degrees of database normalization, 4NF and 5NF, may not be required for every application.
Normalizing to 4NF and 5NF can result in more complicated database structures and slower queries, but it
also improves data accuracy, consistency, and reliability.
UNIT-V
Transaction Concept
A transaction in DBMS is a set of logically related operations executed as a single unit. These operations modify data while maintaining its integrity and consistency, and transactions are executed in such a way that concurrent actions from different users do not leave the database in an inconsistent state. Transfer of money from one account to another in a bank management system is the classic example of a transaction.
A transaction passes through several states during its lifetime. These states describe the current status of the transaction and determine how its further processing will proceed; they govern the rules that decide whether the transaction will be committed or aborted. A transaction log is also used during this processing.
Transaction States in DBMS
A transaction log is a file maintained by the recovery management component to record all the activities of the transaction. After the commit is done, the transaction log file is removed.
In DBMS, a transaction passes through various states such as active, partially committed, committed, failed, and aborted. Understanding these transaction states is crucial for database management and for ensuring the consistency of data.
These are different types of Transaction States :
1. Active State – When the instructions of the transaction are running then the transaction is in active state. If all the
‘read and write’ operations are performed without any error then it goes to the “partially committed state”; if any
instruction fails, it goes to the “failed state”.
2. Partially Committed – After completion of all the read and write operations, the changes are held in main memory or a local buffer. If the changes are made permanent in the database then the state changes to the “committed state”; in case of failure it goes to the “failed state”.
3. Failed State – When any instruction of the transaction fails, or a failure occurs while making the
changes permanent in the database, the transaction goes to the “failed state”.
4. Aborted State – After any type of failure the transaction moves from the “failed state” to the “aborted
state”. Since in the previous states the changes were made only to the local buffer or main memory,
these changes are deleted or rolled back.
5. Committed State – This is the state in which the changes are made permanent in the database; the
transaction is then complete and moves on to the “terminated state”.
6. Terminated State – If there is no rollback, or the transaction comes from the “committed state”,
then the system is consistent and ready for a new transaction, and the old transaction is terminated.
ACID Properties
ACID properties in DBMS are necessary for maintaining data consistency, integrity, and reliability while performing transactions in the database. Let’s explore them.
A transaction is a single logical unit of work that accesses and possibly modifies the contents of a
database. Transactions access data using read-and-write operations. To maintain consistency in a database,
before and after the transaction, certain properties are followed. These are called ACID properties.
Atomicity:
By this, we mean that either the entire transaction takes place at once or doesn’t happen at all. There is no
midway i.e. transactions do not occur partially. Each transaction is considered as one unit and either runs
to completion or is not executed at all. It involves the following two operations.
— Abort : If a transaction aborts, changes made to the database are not visible.
— Commit : If a transaction commits, changes made are visible.
Atomicity is also known as the ‘All or nothing rule’.
Consider the following transaction T consisting of T1 and T2: transfer of 100 from account X to account Y. Here T1 performs the debit (read(X), X = X - 100, write(X)) and T2 performs the credit (read(Y), Y = Y + 100, write(Y)).
If the transaction fails after completion of T1 but before completion of T2 .( say, after write(X)
but before write(Y) ), then the amount has been deducted from X but not added to Y . This results in an
inconsistent database state. Therefore, the transaction must be executed in its entirety in order to ensure
the correctness of the database state.
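A minimal sketch of this all-or-nothing behaviour, using an in-memory dictionary in place of the database and an initial balance of 500 in X and 200 in Y (as in the consistency example that follows); the function name and the simulated failure point are only illustrative:

```python
accounts = {"X": 500, "Y": 200}   # consistent state before the transaction

def transfer(db, src, dst, amount, fail_after_debit=False):
    """Debit src and credit dst as one atomic unit: commit both or roll back."""
    snapshot = dict(db)            # remember the state for a possible rollback
    try:
        db[src] -= amount          # T1: write(X)
        if fail_after_debit:
            raise RuntimeError("crash between write(X) and write(Y)")
        db[dst] += amount          # T2: write(Y)
    except RuntimeError:
        db.clear()
        db.update(snapshot)        # rollback: no partial effect remains visible

transfer(accounts, "X", "Y", 100, fail_after_debit=True)
print(accounts)                    # {'X': 500, 'Y': 200} -- unchanged, not half-updated
transfer(accounts, "X", "Y", 100)
print(accounts)                    # {'X': 400, 'Y': 300} -- both updates applied
```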
Consistency:
This means that integrity constraints must be maintained so that the database is consistent before
and after the transaction. It refers to the correctness of a database. Referring to the example above,
The total amount before and after the transaction must be maintained. Total
before T occurs = 500 + 200 = 700 .
Total after T occurs = 400 + 300 = 700 .
Therefore, the database is consistent . Inconsistency occurs in case T1 completes but T2 fails. As a result,
T is incomplete.
Isolation:
This property ensures that multiple transactions can occur concurrently without leading to inconsistency
of the database state. Transactions occur independently, without interference. Changes occurring in a
particular transaction will not be visible to any other transaction until that change has been written to
memory or committed. This property ensures that concurrent execution of transactions results in a state
that is equivalent to the state obtained if they were executed serially in some order. Let X = 500, Y = 500.
Consider two transactions T and T”.
Here T reads X, multiplies it by 100 and writes it back, then reads Y, subtracts 50 and writes it back, while T’’ reads X and Y and computes their sum. Suppose T has been executed till Read(Y) and then T’’ starts. As a result, interleaving of operations takes place, due to which T’’ reads the new value of X but the old value of Y, and the sum computed by
T’’: (X + Y = 50,000 + 500 = 50,500)
is thus not consistent with the sum at the end of the transaction:
T: (X + Y = 50,000 + 450 = 50,450)
This results in database inconsistency, due to a loss of 50 units. Hence, transactions must take
place in isolation and changes should be visible only after they have been made to the main memory.
Durability:
This property ensures that once the transaction has completed execution, the updates and modifications to
the database are stored in and written to disk and they persist even if a system failure occurs. These updates
now become permanent and are stored in non-volatile memory. The effects of the transaction, thus, are
never lost.
Some important points:
Property        Responsibility for maintaining the property
Atomicity       Transaction Manager
Consistency     Application programmer
Isolation       Concurrency Control Manager
Durability      Recovery Manager
The ACID properties, in totality, provide a mechanism to ensure the correctness and consistency of a
database in a way such that each transaction is a group of operations that acts as a single unit, produces
consistent results, acts in isolation from other operations, and updates that it makes are durably stored.
ACID properties are the four key characteristics that define the reliability and consistency of a transaction
in a Database Management System (DBMS). The acronym ACID stands for Atomicity, Consistency,
Isolation, and Durability. Here is a brief description of each of these properties:
1. Atomicity: Atomicity ensures that a transaction is treated as a single, indivisible unit of work.
Either all the operations within the transaction are completed successfully, or none of them are. If
any part of the transaction fails, the entire transaction is rolled back to its original state, ensuring
data consistency and integrity.
2. Consistency: Consistency ensures that a transaction takes the database from one consistent state to
another consistent state. The database is in a consistent state both before and after the transaction is
executed. Constraints, such as unique keys and foreign keys, must be maintained to ensure data
consistency.
3. Isolation: Isolation ensures that multiple transactions can execute concurrently without interfering
with each other. Each transaction must be isolated from other transactions until it is completed.
This isolation prevents dirty reads, non-repeatable reads, and phantom reads.
4. Durability: Durability ensures that once a transaction is committed, its changes are permanent
and will survive any subsequent system failures. The transaction’s changes are saved to the
database permanently, and even if the system crashes, the changes remain intact and can be
recovered.
Overall, ACID properties provide a framework for ensuring data consistency, integrity, and reliability in
DBMS. They ensure that transactions are executed in a reliable and consistent manner, even in the presence
of system failures, network issues, or other problems. These properties make DBMS a reliable and efficient
tool for managing data in modern organizations.
Advantages of ACID Properties in DBMS
1. Data Consistency: ACID properties ensure that the data remains consistent and accurate after
any transaction execution.
2. Data Integrity: ACID properties maintain the integrity of the data by ensuring that any
changes to the database are permanent and cannot be lost.
3. Concurrency Control: ACID properties help to manage multiple transactions occurring
concurrently by preventing interference between them.
4. Recovery: ACID properties ensure that in case of any failure or crash, the system can recover
the data up to the point of failure or crash.
Disadvantages of ACID Properties in DBMS
1. Performance: The ACID properties can cause a performance overhead in the system, as they
require additional processing to ensure data consistency and integrity.
2. Scalability: The ACID properties may cause scalability issues in large distributed systems where
multiple transactions occur concurrently.
3. Complexity: Implementing the ACID properties can increase the complexity of the system and
require significant expertise and resources.
Overall, the advantages of ACID properties in DBMS outweigh the disadvantages. They provide a
reliable and consistent approach to data management, ensuring data integrity, accuracy, and
reliability. However, in some cases, the overhead of implementing ACID properties can cause
performance and scalability issues. Therefore, it’s important to balance the benefits of ACID
properties against the specific needs and requirements of the system.
Concurrency Control
Concurrency control is a very important concept in DBMS; it ensures that the simultaneous
execution or manipulation of data by several processes or users does not result in data inconsistency.
Concurrency control deals with the interleaved execution of more than one transaction.
What is Transaction?
A transaction is a collection of operations that performs a single logical function in a database application.
Each transaction is a unit of both atomicity and consistency. Thus, we require that transactions do not violate
any database consistency constraints. That is, if the database was consistent when a transaction started, the
database must be consistent when the transaction successfully terminates. However, during the execution of a
transaction, it may be necessary temporarily to allow inconsistency since, for example in a funds transfer, either the debit of account A or the credit of account B must be done before the other. This temporary inconsistency, although necessary, may lead to difficulty if a failure occurs.
It is the programmer’s responsibility to define properly the various transactions, so that each
preserves the consistency of the database. For example, the transaction to transfer funds from the account
of department A to the account of department B could be defined to be composed of two separate
programs: one that debits account A, and another that credits account B. The execution of these two
programs one after the other will indeed preserve consistency. However, each program by itself does not
transform the database from a consistent state to a new consistent state. Thus, those programs are not
transactions.
The concept of a transaction has been applied broadly in database systems and applications.
While the initial use of transactions was in financial applications, the concept is now used in real-time
applications in telecommunication, as well as in the management of long-duration activities such as
product design or administrative workflows.
A set of logically related operations is known as a transaction. The main operations of a transaction are:
Read(A): Read operations Read(A) or R(A) reads the value of A from the database and stores
it in a buffer in the main memory.
Write (A): Write operation Write(A) or W(A) writes the value back to the database from the
buffer.
Let us take a debit transaction from an account that consists of the following operations:
1. R(A);
2. A = A - 1000;
3. W(A);
Assume A’s value before starting the transaction is 5000.
The first operation reads the value of A from the database and stores it in a buffer.
The second operation decreases its value by 1000, so the buffer will contain 4000.
The third operation writes the value from the buffer to the database, so A's final value will be 4000.
But it may also be possible that the transaction may fail after executing some of its operations. The failure
can be because of hardware, software or power, etc. For example, if the debit transaction discussed above
fails after executing operation 2, the value of A will remain 5000 in the database which is not acceptable
by the bank. To avoid this, Database has two important operations:
Commit: After all instructions of a transaction are successfully executed, the changes made
by a transaction are made permanent in the database.
Rollback: If a transaction is not able to execute all operations successfully, all the changes made
by a transaction are undone.
Properties of a Transaction
Atomicity: As a transaction is a set of logically related operations, either all of them should be
executed or none. A debit transaction discussed above should either execute all three
operations or none. If the debit transaction fails after executing operations 1 and 2 then its new value of
4000 will not be updated in the database which leads to inconsistency.
Consistency: If operations of debit and credit transactions on the same account are executed concurrently,
it may leave the database in an inconsistent state.
For Example, with T1 (debit of Rs. 1000 from A) and T2 (credit of 500 to A) executing
concurrently, the database reaches an inconsistent state.
Let us assume the Account balance of A is Rs. 5000. T1 reads A(5000) and stores the value in its
local buffer space. Then T2 reads A(5000) and also stores the value in its local buffer space.
T1 performs A=A-1000 (5000-1000=4000) and 4000 is stored in T1 buffer space. Then T2 performs
A=A+500 (5000+500=5500) and 5500 is stored in the T2 buffer space. T1 writes the value from its
buffer back to the database.
A’s value is updated to 4000 in the database and then T2 writes the value from its buffer back to
the database. A’s value is updated to 5500 which shows that the effect of the debit transaction is
lost and the database has become inconsistent.
To maintain consistency of the database, we need concurrency control protocols which will be
discussed in the next article. The operations of T1 and T2 with their buffers and database have
been shown in Table 1.
Table 1: the operations of T1 and T2, the contents of their buffer spaces, and the value of A in the database (A is 5000 initially and 5500 at the end).
Isolation: The result of a transaction should not be visible to others before the transaction is committed. For
example, let us assume that A’s balance is Rs. 5000 and T1 debits Rs. 1000 from A. A’s new balance will
be 4000. If T2 credits Rs. 500 to A’s new balance, A will become 4500, and after this T1 fails. Then we
have to roll back T2 as well because it is using the value produced by T1. So transaction results are not
made visible to other transactions before it commits.
Durable: Once the database has committed a transaction, the changes made by the transaction should be
permanent. e.g.; If a person has credited $500000 to his account, the bank can’t say that the update has
been lost. To avoid this problem, multiple copies of the database are stored at different locations.
What is a Schedule?
A schedule is a series of operations from one or more transactions. A schedule can be of two types:
Serial Schedule: When one transaction completely executes before starting another transaction, the
schedule is called a serial schedule. A serial schedule is always consistent. e.g.; If a schedule S has debit
transaction T1 and credit transaction T2, possible serial schedules are T1 followed by T2 (T1->T2) or T2
followed by T1 (T2->T1). A serial schedule has low throughput and less resource utilization.
Concurrent Schedule: When operations of a transaction are interleaved with operations of other
transactions of a schedule, the schedule is called a Concurrent schedule. e.g.; the Schedule of debit and
credit transactions shown in Table 1 is concurrent. But concurrency can lead to inconsistency in the
database. The above example of a concurrent schedule is also inconsistent.
Difference between Serial Schedule and Serializable Schedule
Serial Schedule                                      Serializable Schedule
Serial schedules are less efficient.                 Serializable schedules are more efficient.
In a serial schedule, only one transaction           In a serializable schedule, multiple transactions
executes at a time.                                  can be executed at a time.
A serial schedule takes more time for execution.     In a serializable schedule, execution is fast.
Concurrency Control in DBMS
Executing a single transaction at a time will increase the waiting time of the other transactions
which may result in delay in the overall execution. Hence for increasing the overall throughput
and efficiency of the system, several transactions are executed.
Concurrency control provides a procedure that is able to control concurrent execution of the
operations in the database.
The fundamental goal of database concurrency control is to ensure that concurrent execution of
transactions does not result in a loss of database consistency. The concept of serializability can be
used to achieve this goal, since all serializable schedules preserve consistency of the database.
However, not all schedules that preserve consistency of the database are serializable.
In general it is not possible to perform an automatic analysis of low-level operations by
transactions and check their effect on database consistency constraints. However, there are simpler
techniques. One is to use the database consistency constraints as the basis for a split of the
database into subdatabases on which concurrency can be managed separately.
Another is to treat some operations besides read and write as fundamental low-level
operations and to extend concurrency control to deal with them.
Concurrency Control Problems
There are several problems that arise when numerous transactions are executed simultaneously in a
random manner. The database transaction consist of two major operations “Read” and “Write”. It is very
important to manage these operations in the concurrent execution of the transactions in order to maintain
the consistency of the data.
Dirty Read Problem(Write-Read conflict)
The dirty read problem occurs when one transaction updates an item but then fails due to some unexpected event, and before that transaction performs its rollback, some other transaction reads the updated value. This creates an inconsistency in the database. The dirty read problem comes under the scenario of a Write-Read conflict between the transactions in the database.
1. The dirty read problem can be illustrated with the below scenario between two transactions T1 and T2.
2. Transaction T1 modifies a database record without committing the changes.
3. T2 reads the uncommitted data changed by T1.
4. T1 performs a rollback.
5. T2 has already read the uncommitted data of T1, which is no longer valid, thus creating inconsistency in the database.
Lost Update Problem
Lost update problem occurs when two or more transactions modify the same data, resulting in the update
being overwritten or lost by another transaction. The lost update problem can be illustrated with the below
scenario between two transactions T1 and T2.
1. T1 reads the value of an item from the database.
2. T2 starts and reads the same database item.
3. T1 updates the value of that data and performs a commit.
4. T2 updates the same data item based on its initial read and performs commit.
5. This results in the modification made by T1 being overwritten and lost by T2's write, which causes the lost update problem in the database.
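A small sketch of this lost update scenario, with a debit of 1000 by T1 and a credit of 500 by T2, each transaction keeping its own local buffer; the variable names are only illustrative:

```python
database = {"A": 5000}

# Both transactions read A into their own buffers before either one writes.
t1_buffer = database["A"]        # step 1: T1 reads A
t2_buffer = database["A"]        # step 2: T2 reads the same item

t1_buffer -= 1000                # T1 debits 1000
database["A"] = t1_buffer        # step 3: T1 writes and commits -> A = 4000

t2_buffer += 500                 # T2 credits 500, based on its stale read of 5000
database["A"] = t2_buffer        # step 4: T2 writes and commits -> A = 5500

print(database["A"])             # 5500: T1's debit has been lost (should be 4500)
```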
Concurrency Control Protocols
Concurrency control protocols are the set of rules which are maintained in order to solve the concurrency
control problems in the database. It ensures that the concurrent transactions can execute properly while
maintaining the database consistency. The concurrent execution of a transaction is provided with atomicity,
consistency, isolation, durability, and serializability via the concurrency control protocols.
Locked based concurrency control protocol
Timestamp based concurrency control protocol
Locked based Protocol
In locked based protocol, each transaction needs to acquire locks before they start accessing or modifying
the data items. There are two types of locks used in databases.
Shared Lock : Shared lock is also known as read lock which allows multiple transactions to read
the data simultaneously. The transaction which is holding a shared lock can only read the data
item but it can not modify the data item.
Exclusive Lock : Exclusive lock is also known as the write lock. Exclusive lock allows a
transaction to update a data item. Only one transaction can hold the exclusive lock on a data item
at a time. While a transaction is holding an exclusive lock on a data item, no other transaction is
allowed to acquire a shared/exclusive lock on the same data item.
There are two kind of lock based protocol mostly used in database:
Two Phase Locking Protocol : Two phase locking is a widely used technique which ensures strict
ordering of lock acquisition and release. Two phase locking protocol works in two phases.
o Growing Phase : In this phase, the transaction acquires the locks it needs before
performing any modification on the data items. New locks may be acquired in this
phase, but no lock can be released.
o Shrinking Phase : In this phase, the transaction releases the acquired locks once it
has performed its modifications on the data items. Once the transaction starts releasing
locks, it cannot acquire any further locks.
Strict Two Phase Locking Protocol : It is almost similar to the two phase locking protocol; the
only difference is that in two phase locking a transaction can release its locks before it commits,
whereas in strict two phase locking a transaction is allowed to release its exclusive locks only
when it commits.
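A minimal sketch of a lock table that honours the shared/exclusive rules just described; deadlock handling and the full two-phase discipline are left out, and the class and method names are only illustrative:

```python
# A tiny lock table: any number of shared holders, or exactly one exclusive holder.
# An incompatible request simply fails here instead of blocking the transaction.
class LockManager:
    def __init__(self):
        self.locks = {}                      # item -> {"mode": "S"/"X", "holders": set()}

    def acquire(self, txn, item, mode):
        entry = self.locks.get(item)
        if entry is None:
            self.locks[item] = {"mode": mode, "holders": {txn}}
            return True
        if mode == "S" and entry["mode"] == "S":
            entry["holders"].add(txn)        # readers may share the lock
            return True
        if entry["holders"] == {txn}:        # sole holder may upgrade S -> X
            entry["mode"] = "X" if mode == "X" else entry["mode"]
            return True
        return False                         # incompatible: the transaction must wait

    def release_all(self, txn):
        for item in list(self.locks):
            entry = self.locks[item]
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.locks[item]

lm = LockManager()
print(lm.acquire("T1", "A", "S"))   # True  -- shared lock granted
print(lm.acquire("T2", "A", "S"))   # True  -- shared locks are compatible
print(lm.acquire("T2", "A", "X"))   # False -- exclusive conflicts with T1's shared lock
lm.release_all("T1")                # under 2PL this happens only in the shrinking phase
print(lm.acquire("T2", "A", "X"))   # True
```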
Timestamp based Protocol
In this protocol each transaction has a timestamp attached to it. Timestamp is nothing but the
time in which a transaction enters into the system.
The conflicting pairs of operations can be resolved by the timestamp ordering protocol through the
utilization of the timestamp values of the transactions, thereby guaranteeing that the transactions
take place in the correct order.
Advantages of Concurrency
In general, concurrency means, that more than one transaction can work on a system. The advantages
of a concurrent system are:
Waiting Time: The time a process spends in the ready state before it gets the system to execute
is called waiting time. Concurrency leads to less waiting time.
Response Time: The time taken to get the first response from the system is called response time.
Concurrency leads to less response time.
Resource Utilization: The extent to which the system's resources are kept busy is called resource
utilization. Multiple transactions can run in parallel in a system, so concurrency leads to more
resource utilization.
Efficiency: The amount of output produced in comparison to the given input is called
efficiency. Concurrency leads to more efficiency.
Disadvantages of Concurrency
Overhead: Implementing concurrency control requires additional overhead, such as acquiring and
releasing locks on database objects. This overhead can lead to slower performance and increased
resource consumption, particularly in systems with high levels of concurrency.
Deadlocks: Deadlocks can occur when two or more transactions are waiting for each other to
release resources, causing a circular dependency that can prevent any of the transactions from
completing. Deadlocks can be difficult to detect and resolve, and can result in reduced throughput
and increased latency.
Reduced concurrency: Concurrency control can limit the number of users or applications that can
access the database simultaneously. This can lead to reduced concurrency and slower performance
in systems with high levels of concurrency.
Complexity: Implementing concurrency control can be complex, particularly in distributed systems
or in systems with complex transactional logic. This complexity can lead to increased development
and maintenance costs.
Inconsistency: In some cases, concurrency control can lead to inconsistencies in the database.
For example, a transaction that is rolled back may leave the database in an inconsistent state, or
a long-running transaction may cause other transactions to wait for extended periods, leading to
data staleness and reduced accuracy.
Serializability
In this section, we explain the concept of serializability and how it affects the DBMS, illustrate it with some examples, and conclude with a note on its importance. A well-designed database is the foundation of most modern applications; when we design it properly, it provides high performance and reliable storage for our application.
What is a serializable schedule, and what is it used for?
If a non-serial schedule can be transformed into an equivalent serial schedule, it is said to be serializable. Simply put, a non-serial schedule is referred to as a serializable schedule if it yields the same results as some serial schedule.
Non-serial Schedule
A non-serial schedule is a schedule in which the operations of different transactions overlap or are interleaved. Such schedules carry out actual database operations with multiple transactions running at once, and those transactions may be working on the same data. Therefore, it is crucial that non-serial schedules be serializable in order for our database to be consistent both before and after the transactions are executed.
Example:
Transaction-1 Transaction-2
R(a)
W(a)
R(b)
W(b)
R(b)
R(a)
W(b)
W(a)
We can observe that Transaction-2 begins its execution before Transaction-1 is finished, and they are both
working on the same data, i.e., “a” and “b”, interchangeably. Where “R”-Read, “W”-Write
Serializability testing
We can use the Serialization Graph or Precedence Graph to examine a schedule's serializability. A serialization graph organizes all the transactions of a schedule into a directed graph.
Precedence Graph
It can be described as a graph G(V, E) with vertices V = {T1, T2, T3, …, Tn} and directed edges E = {E1, E2, E3, …, En}. An edge is added for each pair of conflicting READ or WRITE operations performed by the transactions: Ti -> Tj means transaction Ti performs its conflicting read or write before transaction Tj does.
Types of Serializability
There are two ways to check whether any non-serial schedule is serializable.
Types of Serializability – Conflict & View
1. Conflict serializability
Conflict serializability is a subset of serializability that focuses on maintaining the consistency of a database by ensuring that conflicting operations on the same data items are executed in a consistent order. Conflicting operations must appear in the same relative order as in some serial schedule; otherwise the schedule is not conflict serializable.
For example, consider an order table and a customer table as two instances: each order is associated with one customer, even though a single customer may place many orders. The conditions for two operations to conflict are:
1. The two operations should belong to different transactions.
2. The identical data item should be accessed by both operations.
3. At least one of the two operations should be a write operation.
Example
Three transactions—t1, t2, and t3—are active on a schedule “S” at once. Let’s create a graph of
precedence.
Transaction – 1 (t1) Transaction – 2 (t2) Transaction – 3 (t3)
R(a)
R(b)
R(b)
W(b)
W(a)
W(a)
R(a)
W(a)
It is a conflict serializable schedule as well as a serial schedule because the graph (a DAG) has no loops.
We can also determine the order of transactions because it is a serial schedule.
DAG of transactions
As there is no incoming edge on Transaction 1, Transaction 1 will be executed first. T3 will run second
because it only depends on T1. Due to its dependence on both T1 and T3, t2 will finally be executed.
Therefore, the serial schedule’s equivalent order is: t1 –> t3 –> t2
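A short sketch of the precedence-graph test itself; the schedule below is a small made-up interleaving (not the exact table above), and the function names are only illustrative:

```python
def precedence_graph(schedule):
    """schedule: list of (txn, op, item) with op in {"R", "W"}, in execution order."""
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            conflict = ti != tj and item_i == item_j and "W" in (op_i, op_j)
            if conflict:
                edges.add((ti, tj))          # ti's conflicting operation comes first
    return edges

def has_cycle(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    def reachable(start, goal, seen=()):
        return any(n == goal or (n not in seen and reachable(n, goal, seen + (n,)))
                   for n in graph.get(start, ()))
    return any(reachable(t, t) for t in graph)

# R/W steps of a sample interleaved schedule on items a and b.
s = [("t1", "R", "a"), ("t2", "R", "b"), ("t1", "W", "a"),
     ("t2", "W", "b"), ("t1", "R", "b"), ("t1", "W", "b")]
edges = precedence_graph(s)
print(edges)                                             # {('t2', 't1')}
print("conflict serializable:", not has_cycle(edges))    # True: the graph is a DAG
```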
Note: A schedule is unquestionably consistent if it is conflict serializable. A schedule that is not
conflict serializable, on the other hand, might or might not be serializable. We employ the idea of View
Serializability to further examine its serial behavior.
2. View Serializability
View serializability is a form of serializability in which a schedule is considered correct if each transaction reads the same data and produces the same results as it would in some serial execution of the same transactions. View serializability, in contrast to conflict serializability, is concerned only with the values read and the final writes, and is therefore a weaker (more permissive) way of avoiding database inconsistency.
To further understand view serializability in DBMS, we compare two schedules S1 and S2 built over the same transactions, say T1 and T2. For the schedules to be treated as equivalent, they must satisfy the three conditions listed below.
1. The first prerequisite is that the same set of transactions must appear in both schedules. The schedules are not equal to one another if one schedule commits a transaction that does not appear, or is not committed, in the other schedule.
2. The second requirement is that the schedules should not use different read or write operations. For example, we say that two schedules are not similar if schedule S1 has two write operations on a data item whereas schedule S2 has only one. The number of write operations must be the same in both schedules; however, there is no issue if the number of read operations differs.
3. The last requirement is that the two schedules must not conflict in the write order on any single data item. Assume, for instance, that in schedule S1 transaction T1 performs the final write on data item A, while in schedule S2 transaction T2 performs the final write on A; then the schedules are not equal. The schedules are referred to as equivalent only if the writes on each data item occur in the same way in both.
What is view equivalency?
Schedules (S1 and S2) must satisfy these two requirements in order to be viewed as equivalent:
1. The same piece of data must be read for the first time. For instance, if transaction t1 is reading
“A” from the database in schedule S1, then t1 must also read A in schedule S2.
2. The same piece of data must be used for the final write. As an illustration, if transaction t1 updated
A last in S1, it should also conduct final write in S2.
3. The intermediate reads and writes need to follow suit. As an illustration, if in S1 t1 reads A and then t2 updates A, then in S2 t1 should read A and t2 should update A.
View Serializability refers to the process of determining whether a schedule’s views are equivalent.
Example
We have a schedule “S” with two concurrently running transactions, “t1” and “t2.”
Schedule – S:
Transaction-1 (t1) Transaction-2 (t2)
R(a)
W(a)
R(a)
W(a)
R(b)
W(b)
R(b)
W(b)
By switching between both transactions’ mid-read-write operations, let’s create its view equivalent
schedule (S’).
Schedule – S’:
Transaction-1 (t1) Transaction-2 (t2)
R(a)
W(a)
R(b)
W(b)
R(a)
W(a)
R(b)
W(b)
Recoverability
Recoverability is a property of database systems that ensures that, in the event of a failure or
error, the system can recover the database to a consistent state. Recoverability guarantees that all
committed transactions are durable and that their effects are permanently stored in the database, while the
effects of uncommitted transactions are undone to maintain data consistency.
The recoverability property is enforced through the use of transaction logs, which record all
changes made to the database during transaction processing. When a failure occurs, the system uses the
log to recover the database to a consistent state, which involves either undoing the effects of
uncommitted transactions or redoing the effects of committed transactions.
There are several levels of recoverability that can be supported by a database system:
No-undo logging: This level of recoverability only guarantees that committed transactions are durable, but
does not provide the ability to undo the effects of uncommitted transactions.
Undo logging: This level of recoverability provides the ability to undo the effects of uncommitted
transactions but may result in the loss of updates made by committed transactions that occur after the
failed transaction.
Redo logging: This level of recoverability provides the ability to redo the effects of committed
transactions, ensuring that all committed updates are durable and can be recovered in the event of failure.
Undo-redo logging: This level of recoverability provides both undo and redo capabilities, ensuring that
the system can recover to a consistent state regardless of whether a transaction has been committed or
not.
In addition to these levels of recoverability, database systems may also use techniques such as
checkpointing and shadow paging to improve recovery performance and reduce the overhead associated
with logging.
Overall, recoverability is a crucial property of database systems, as it ensures that data is
consistent and durable even in the event of failures or errors. It is important for database administrators
to understand the level of recoverability provided by their system and to configure it appropriately to
meet their application’s requirements.
Recoverable Schedules:
Schedules in which transactions commit only after all transactions whose changes they read commit
are called recoverable schedules. In other words, if some transaction Tj is reading value updated or
written by some other transaction Ti, then the commit of Tj must occur after the commit of Ti.
Example 1:
S1: R1(x), W1(x), R2(x), R1(y), R2(y),
W2(x), W1(y), C1, C2;
The given schedule follows the order Ti -> Tj => C1 -> C2. Transaction T2 reads the value of x only after T1 has written it (R2(x) appears after W1(x)), and T1 commits before T2, i.e., the transaction that performed the first update on data item x completes first; hence the given schedule is recoverable.
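A small sketch of how this check can be automated for schedule S1, assuming the schedule is encoded as a list of (operation, transaction, item) steps; the function name is only illustrative:

```python
def recoverable(schedule):
    """schedule: steps like ("R", "T2", "x"), ("W", "T1", "x"), ("C", "T1")."""
    last_writer = {}                       # item -> transaction that last wrote it
    reads_from = set()                     # (reader, writer) pairs
    commit_pos = {}
    for pos, step in enumerate(schedule):
        if step[0] == "W":
            last_writer[step[2]] = step[1]
        elif step[0] == "R":
            writer = last_writer.get(step[2])
            if writer and writer != step[1]:
                reads_from.add((step[1], writer))
        elif step[0] == "C":
            commit_pos[step[1]] = pos
    # A reader may commit only after every transaction it read from has committed.
    return all(r not in commit_pos or
               commit_pos.get(w, float("inf")) < commit_pos[r]
               for r, w in reads_from)

# S1: R1(x), W1(x), R2(x), R1(y), R2(y), W2(x), W1(y), C1, C2
s1 = [("R", "T1", "x"), ("W", "T1", "x"), ("R", "T2", "x"), ("R", "T1", "y"),
      ("R", "T2", "y"), ("W", "T2", "x"), ("W", "T1", "y"), ("C", "T1"), ("C", "T2")]
print(recoverable(s1))   # True: T2 reads x from T1, and C1 comes before C2
```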
Example 2: Consider the following schedule involving two transactions T1 and T2.
T1 T2
R(A)
W(A)
W(A)
R(A)
commit
commit
This is a recoverable schedule since T1 commits before T2, that makes the value read by T2
correct.
Irrecoverable Schedule: The table below shows a schedule with two transactions: T1 reads and writes A,
and that value is read and written by T2. T2 then commits, but later T1 fails, so we have to roll back T1.
Since T2 has read the value written by T1, it should also be rolled back, but we have already committed
it. So this schedule is an irrecoverable schedule. When Tj reads the value updated by Ti and Tj commits
before Ti commits, the schedule is irrecoverable.
Recoverable with Cascading Rollback: The table below shows a schedule with two transactions: T1
reads and writes A, and that value is read and written by T2. But later T1 fails, so we have to roll back
T1. Since T2 has read the value written by T1, it should also be rolled back; as it has not committed, we
can roll back T2 as well. So it is recoverable with cascading rollback. Therefore, if Tj reads a value
updated by Ti and the commit of Tj is delayed till the commit of Ti, the schedule is called recoverable
with cascading rollback.
Cascadeless Recoverable Rollback: The table below shows a schedule with two transactions: T1 reads
and writes A and commits, and that value is then read by T2. If T1 fails before its commit, no other
transaction has read its value, so there is no need to roll back any other transaction. So this is a
cascadeless recoverable schedule. Thus, if Tj reads a value updated by Ti only after Ti has committed,
the schedule is cascadeless recoverable.
Implementation of Isolation
The levels of transaction isolation in DBMS determine how concurrently running transactions behave and, therefore, how data consistency is balanced against performance. There are four basic levels, Read Uncommitted, Read Committed, Repeatable Read, and Serializable, which provide different degrees of data protection, ranging from fast access with possible inconsistency to strict accuracy at the cost of performance. Choosing the right one depends on whether the need is speed or data integrity.
What is the Transaction Isolation Level?
In a database management system, transaction isolation levels define the degree to which the
operations in one transaction are isolated from the operations of other concurrent transactions. In other
words, it defines how and when the changes made by one transaction are visible to others to assure data
consistency and integrity.
As we know, to maintain consistency in a database, it follows ACID properties. Among these four
properties (Atomicity, Consistency, Isolation, and Durability) Isolation determines how transaction
integrity is visible to other users and systems. It means that a transaction should take place in a system in
such a way that it is the only transaction that is accessing the resources in a database system.
Isolation levels define the degree to which a transaction must be isolated from the data modifications made
by any other transaction in the database system. A transaction isolation level is defined by the following
phenomena:
Dirty Read – A Dirty read is a situation when a transaction reads data that has not yet been
committed. For example, Let’s say transaction 1 updates a row and leaves it uncommitted,
meanwhile, Transaction 2 reads the updated row. If transaction 1 rolls back the change, transaction
2 will have read data that is considered never to have existed.
Non Repeatable Read – A non-repeatable read occurs when a transaction reads the same row twice
and gets a different value each time. For example, suppose transaction T1 reads data. Due to
concurrency, another transaction T2 updates the same data and commits; now if transaction T1
rereads the same data, it will retrieve a different value.
Phantom Read – Phantom Read occurs when two same queries are executed, but the rows
retrieved by the two, are different. For example, suppose transaction T1 retrieves a set of rows that
satisfy some search criteria. Now, Transaction T2 generates some new rows that match the search
criteria for Transaction T1. If transaction T1 re-executes the statement that reads the rows, it gets a
different set of rows this time.
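The first of these phenomena can be sketched with a tiny in-memory model: a "committed" store plus a per-transaction write buffer. The names and values below are purely illustrative:

```python
committed = {"row": 100}            # data visible after commit
t1_writes = {"row": 150}            # transaction 1 has updated the row, not yet committed

def read(item, isolation):
    if isolation == "READ UNCOMMITTED" and item in t1_writes:
        return t1_writes[item]      # dirty read: sees the uncommitted value
    return committed[item]          # READ COMMITTED and above: only committed data

print(read("row", "READ UNCOMMITTED"))  # 150 -- becomes invalid if T1 rolls back
print(read("row", "READ COMMITTED"))    # 100
```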
Based on these phenomena, The SQL standard defines four isolation levels:
1. Read Uncommitted – Read Uncommitted is the lowest isolation level. In this level, one
transaction may read not yet committed changes made by other transactions, thereby allowing
dirty reads. At this level, transactions are not isolated from each other.
2. Read Committed – This isolation level guarantees that any data read is committed at the
moment it is read. Thus it does not allow dirty read. The transaction holds a read or write lock on
the current row, and thus prevents other transactions from reading, updating, or deleting it.
3. Repeatable Read – This is a more restrictive isolation level. The transaction holds read locks
on all rows it references and write locks on referenced rows for update and delete actions. Since
other transactions cannot read, update or delete these rows, it consequently avoids non-repeatable
reads.
4. Serializable – This is the highest isolation level. A serializable execution is defined to be an
execution of operations in which concurrently executing transactions appear to be executing
serially.
The table given below depicts the relationship between the isolation levels and the read phenomena:
Isolation Level      Dirty Read      Non-repeatable Read    Phantom Read
Read Uncommitted     Possible        Possible               Possible
Read Committed       Not possible    Possible               Possible
Repeatable Read      Not possible    Not possible           Possible
Serializable         Not possible    Not possible           Not possible
Anomaly serializability is not the same as serializability: it is necessary, but not sufficient, for a
serializable schedule to be free of all three phenomena types. Transaction isolation levels are used in
database management systems (DBMS) to control the level of interaction between concurrent transactions.
The four standard isolation levels are:
1. Read Uncommitted: This is the lowest level of isolation where a transaction can see
uncommitted changes made by other transactions. This can result in dirty reads, non-
repeatable reads, and phantom reads.
2. Read Committed: In this isolation level, a transaction can only see changes made by other
committed transactions. This eliminates dirty reads but can still result in non-repeatable reads and
phantom reads.
3. Repeatable Read: This isolation level guarantees that a transaction will see the same data
throughout its duration, even if other transactions commit changes to the data. However, phantom
reads are still possible.
4. Serializable: This is the highest isolation level where a transaction is executed as if it were the
only transaction in the system. All transactions must be executed sequentially, which ensures that
there are no dirty reads, non-repeatable reads, or phantom reads.
The choice of isolation level depends on the specific requirements of the application. Higher isolation levels
offer stronger data consistency but can also result in longer lock times and increased contention, leading to
decreased concurrency and performance. Lower isolation levels provide more concurrency but can result
in data inconsistencies.
In addition to the standard isolation levels, some DBMS may also support additional custom isolation levels
or features such as snapshot isolation and multi-version concurrency control (MVCC) that provide
alternative solutions to the problems addressed by the standard isolation levels.
Advantages of Transaction Isolation Levels
Improved concurrency: Transaction isolation levels can improve concurrency by allowing
multiple transactions to run concurrently without interfering with each other.
Control over data consistency: Isolation levels provide control over the level of data
consistency required by a particular application.
Reduced data anomalies: The use of isolation levels can reduce data anomalies such as dirty
reads, non-repeatable reads, and phantom reads.
Flexibility: The use of different isolation levels provides flexibility in designing
applications that require different levels of data consistency.
Disadvantages of Transaction Isolation Levels
Increased overhead: The use of isolation levels can increase overhead because the database
management system must perform additional checks and acquire more locks.
Decreased concurrency: Some isolation levels, such as Serializable, can decrease concurrency
by requiring transactions to acquire more locks, which can lead to blocking.
Limited support: Not all database management systems support all isolation levels, which can
limit the portability of applications across different systems.
Complexity: The use of different isolation levels can add complexity to the design of
database applications, making them more difficult to implement and maintain.
Lock Based Concurrency Control Protocol in DBMS
In a database management system (DBMS), lock-based concurrency control is used to control the
access of multiple transactions to the same data item. This protocol helps to maintain data consistency
and integrity across multiple users.
In the protocol, transactions gain locks on data items to control their access and prevent conflicts
between concurrent transactions. This article will look deep into the Lock Based Protocol in detail.
Lock Based Protocols
A lock is a variable associated with a data item that describes the status of the data item to possible
operations that can be applied to it. They synchronize the access by concurrent transactions to the database
items. It is required in this protocol that all the data items must be accessed in a mutually exclusive manner.
Let me introduce you to two common locks that are used and some terminology followed in this protocol.
Types of Lock
1. Shared Lock (S): Shared Lock is also known as Read-only lock. As the name suggests it can be
shared between transactions because while holding this lock the transaction does not have the
permission to update data on the data item. S-lock is requested using lock-S instruction.
2. Exclusive Lock (X): With an exclusive lock, a data item can be both read and written. This lock is
exclusive and cannot be held simultaneously on the same data item by more than one transaction.
An X-lock is requested using the lock-X instruction.
Lock Compatibility Matrix
          S      X
    S     Yes    No
    X     No     No
A transaction may be granted a lock on an item if the requested lock is compatible with locks already held
on the item by other transactions. Any number of transactions can hold shared locks on an item, but if any
transaction holds an exclusive (X) lock on the item, no other transaction may hold any lock on the item. If a
lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other
transactions have been released. Then the lock is granted.
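The compatibility matrix can be expressed directly as a small lookup, as in the following illustrative sketch; the function name can_grant is an assumption, not a DBMS API:

```python
# Lock compatibility: a requested lock is granted only if it is compatible
# with every lock already held on the item by other transactions.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested, held_modes):
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

print(can_grant("S", ["S", "S"]))   # True  -- many shared locks may coexist
print(can_grant("X", ["S"]))        # False -- exclusive must wait for the shared lock
print(can_grant("X", []))           # True  -- no locks held, grant immediately
```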
2. Pre-Claiming Lock Protocol
Pre-claiming Lock Protocols assess transactions to determine which data elements require locks. Before
executing the transaction, it asks the DBMS for a lock on all of the data elements. If all locks are given,
this protocol will allow the transaction to start. When the transaction is finished, it releases all locks. If all of
the locks are not provided, this protocol allows the transaction to be reversed and waits until all of the locks
are granted.
3. Two-phase locking (2PL)
A transaction is said to follow the Two-Phase Locking protocol if Locking and Unlocking can be done
in two phases
Growing Phase: New locks on data items may be acquired but none can be released.
Shrinking Phase: Existing locks may be released but no new locks can be acquired. For
more detail refer the published article Two-phase locking (2PL).
4. Strict Two-Phase Locking Protocol
Strict Two-Phase Locking requires that, in addition to 2-PL, all Exclusive (X) locks held by the
transaction not be released until after the transaction commits. For more details refer to the published
article Strict Two-Phase Locking Protocol.
Upgrade / Downgrade locks
A transaction that holds a lock on an item A is allowed, under certain conditions, to change the lock
from one state to another. Upgrade: an S(A) can be upgraded to X(A) if Ti is the only transaction holding
the S-lock on element A. Downgrade: we may downgrade X(A) to S(A) when we no longer need to write
data item A; as we were already holding the X-lock on A, we need not check any conditions.
So, by now we have been introduced to the types of locks and how to apply them. But wait: if our
problems could be avoided just by applying locks, life would be simple! If you have studied Process
Synchronization under OS, you must be familiar with two persistent problems: starvation and deadlock.
We'll discuss them shortly, but note that locks must follow a set of protocols to avoid such undesirable
problems. Shortly we'll use Two-Phase Locking (2-PL), which uses the concept of locks to guarantee
serializability. Applying simple locking alone may not always produce serializable results, and it may
lead to deadlock or inconsistency.
Problem With Simple Locking
Consider the Partial Schedule:
S.No   T1              T2
1      lock-X(B)
2      read(B)
3      B := B - 50
4      write(B)
5                      lock-S(A)
6                      read(A)
7                      lock-S(B)
8      lock-X(A)
9      ......          ......
1. Deadlock
In deadlock consider the above execution phase. Now, T1 holds an Exclusive lock over B, and T2 holds
a Shared lock over A. Consider Statement 7, T2 requests for lock on B, while in Statement 8 T1 requests
lock on A. This as you may notice imposes a deadlock as none can proceed with their execution.
Deadlock
2. Starvation
Starvation is also possible if concurrency control manager is badly designed. For example: A transaction
may be waiting for an X-lock on an item, while a sequence of other transactions request and are granted an
S-lock on the same item. This may be avoided if the concurrency control manager is properly designed.
Cascading aborts may still occur. The main advantage and disadvantage of the Basic Timestamp Ordering (TO) protocol are:
Advantage – The Timestamp Ordering protocol ensures serializability, since the precedence graph it produces contains edges only from older to newer transactions and therefore has no cycles.
Disadvantage – Timestamp Allocation: allocating unique timestamps for each transaction can be challenging, especially in distributed systems where transactions may be initiated at different locations.
Deadlock
In database management systems (DBMS), a deadlock occurs when two or more transactions are
unable to proceed because each transaction is waiting for another to release locks on resources.
This situation creates a cycle of dependencies in which no transaction can continue, leading to a
standstill in the system. Deadlocks can severely impact the performance and reliability of a DBMS,
making it crucial to understand and manage them effectively.
A deadlock is a condition in a multi-user database environment where transactions are unable
to complete because they are each waiting for resources held by other transactions. This results in a
cycle of dependencies where no transaction can proceed.
Basically, deadlocks occur when two or more transactions wait indefinitely for resources held by
each other. Mastering how to detect and resolve deadlocks is vital for database efficiency.
Characteristics of Deadlock
Mutual Exclusion: Only one transaction can hold a particular resource at a time.
Hold and Wait: The Transactions holding resources may request additional resources held by
others.
No Preemption: The Resources cannot be forcibly taken from the transaction holding them.
Circular Wait: A cycle of transactions exists where each transaction is waiting for the
resource held by the next transaction in the cycle.
In a database management system (DBMS), a deadlock occurs when two or more transactions are
waiting for each other to release resources, such as locks on database objects, that they need to complete
their operations. As a result, none of the transactions can proceed, leading to a situation where they are
stuck or “deadlocked.”
Deadlocks can happen in multi-user environments when two or more transactions are running
concurrently and try to access the same data in a different order. When this happens, one transaction may
hold a lock on a resource that another transaction needs, while the second transaction may hold a lock on a
resource that the first transaction needs. Both transactions are then blocked, waiting for the other to release
the resource they need.
DBMSs often use various techniques to detect and resolve deadlocks automatically. These techniques
include timeout mechanisms, where a transaction is forced to release its locks after a certain period of
time, and deadlock detection algorithms, which periodically scan the transaction log for deadlock cycles
and then choose a transaction to abort to resolve the deadlock.
It is also possible to prevent deadlocks by careful design of transactions, such as always acquiring locks
in the same order or releasing locks as soon as possible. Proper design of the database schema and
application can also help to minimize the likelihood of deadlocks.
In a database, a deadlock is an unwanted situation in which two or more transactions are waiting indefinitely for one another to give up locks. Deadlock is said to be one of the most feared complications in a DBMS, as it brings the whole system to a halt.
Example: Let us understand the concept of deadlock with an example. Suppose Transaction T1 holds a lock on some rows in the Students table and needs to update some rows in the Grades table.
Simultaneously, Transaction T2 holds locks on those very rows in the Grades table (which T1 needs to update) but needs to update the rows in the Students table held by Transaction T1.
Now the main problem arises. Transaction T1 will wait for Transaction T2 to give up the lock, and similarly, Transaction T2 will wait for Transaction T1 to give up the lock. As a consequence, all activity comes to a halt and remains at a standstill forever unless the DBMS detects the deadlock and aborts one of the transactions.
Deadlock Avoidance
When a database is stuck in a deadlock, it is always better to avoid the deadlock rather than restart or abort the database. The deadlock avoidance method is suitable for smaller databases, whereas the deadlock prevention method is suitable for larger databases.
One method of avoiding deadlock is using application-consistent logic. In the above example, transactions that access Students and Grades should always access the tables in the same order. In this way, in the scenario described above, Transaction T1 simply waits for Transaction T2 to release the lock on Grades before it begins. When Transaction T2 releases the lock, Transaction T1 can proceed freely.
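For illustration only, here is a minimal Python sketch of consistent lock ordering; the two locks and transaction functions below are hypothetical stand-ins for the Students and Grades tables, not actual DBMS internals.

import threading

# Both locks are always acquired in the same global order:
# students_lock first, then grades_lock, so a circular wait cannot form.
students_lock = threading.Lock()
grades_lock = threading.Lock()

def transaction_t1():
    # T1 updates Students and then Grades, following the agreed lock order.
    with students_lock:
        with grades_lock:
            print("T1: updated Students and Grades")

def transaction_t2():
    # T2 needs both tables too; it follows the same order instead of
    # taking grades_lock first, which is what removes the deadlock risk.
    with students_lock:
        with grades_lock:
            print("T2: updated Grades and Students")

t1 = threading.Thread(target=transaction_t1)
t2 = threading.Thread(target=transaction_t2)
t1.start(); t2.start()
t1.join(); t2.join()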
Another method for avoiding deadlock is to apply both the row-level locking mechanism and the READ COMMITTED isolation level. However, this does not guarantee the complete removal of deadlocks.
Deadlock Detection
When a transaction waits indefinitely to obtain a lock, the database management system should detect whether the transaction is involved in a deadlock or not.
The wait-for graph is one of the methods for detecting a deadlock situation. This method is suitable for smaller databases. In this method, a graph is drawn based on the transactions and the locks they hold or request on resources. If the graph contains a closed loop or a cycle, then there is a deadlock. For the scenario above, the wait-for graph would have an edge from T1 to T2 and an edge from T2 to T1, forming a cycle.
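As a rough illustration of wait-for-graph detection, here is a minimal Python sketch; the graph representation (a dictionary mapping each transaction to the set of transactions it waits for) and the transaction names are hypothetical, not a real DBMS API.

def has_deadlock(wait_for):
    # Depth-first search for a cycle in the wait-for graph.
    visited, on_stack = set(), set()

    def dfs(txn):
        visited.add(txn)
        on_stack.add(txn)
        for waited in wait_for.get(txn, ()):
            if waited in on_stack:   # back edge found: a cycle, hence a deadlock
                return True
            if waited not in visited and dfs(waited):
                return True
        on_stack.discard(txn)
        return False

    return any(dfs(txn) for txn in wait_for if txn not in visited)

# The scenario above: T1 waits for T2 and T2 waits for T1.
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))  # True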
Deadlock Prevention
For a large database, the deadlock prevention method is suitable. A deadlock can be prevented if the resources are allocated in such a way that a deadlock never occurs. The DBMS analyzes the operations to determine whether they can create a deadlock situation; if they can, that transaction is never allowed to be executed.
Deadlock prevention mechanisms propose two schemes:
Wait-Die Scheme: In this scheme, if a transaction requests a resource that is locked by another transaction, the DBMS checks the timestamps of both transactions and allows the older transaction to wait until the resource is available.
Suppose there are two transactions T1 and T2, and let the timestamp of any transaction T be TS(T). If T2 holds a lock on some resource and T1 requests that resource, the DBMS performs the following actions:
It checks whether TS(T1) < TS(T2). If T1 is the older transaction and T2 has held the resource, then T1 is allowed to wait until the resource is available. That is, if a younger transaction has locked a resource and an older transaction is waiting for it, the older transaction is allowed to wait.
If T1 is the older transaction and holds the resource, and the younger transaction T2 is waiting for it, then T2 is killed and restarted later with a small delay but with the same timestamp. That is, if the older transaction holds a resource and the younger transaction requests it, the younger transaction is killed and restarted with its original timestamp.
This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme: In this scheme, if an older transaction requests a resource held by a younger transaction, the older transaction forces the younger transaction to abort (it is 'wounded') and release the resource. The younger transaction is restarted with a small delay but with the same timestamp. If a younger transaction requests a resource that is held by an older one, the younger transaction is asked to wait until the older one releases it.
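A minimal Python sketch of the two decisions is given below, assuming each hypothetical transaction is represented only by its timestamp (a smaller timestamp means an older transaction); this is just an illustration of the rules, not a DBMS implementation.

def wait_die(requester_ts, holder_ts):
    # Older requester waits; younger requester dies (is rolled back).
    return "WAIT" if requester_ts < holder_ts else "DIE"

def wound_wait(requester_ts, holder_ts):
    # Older requester wounds (aborts) the younger holder; younger requester waits.
    return "WOUND HOLDER" if requester_ts < holder_ts else "WAIT"

# T1 started at time 5 (older), T2 at time 9 (younger).
print(wait_die(5, 9))    # WAIT  -> older T1 waits for younger T2
print(wait_die(9, 5))    # DIE   -> younger T2 is rolled back
print(wound_wait(5, 9))  # WOUND HOLDER -> older T1 aborts younger T2
print(wound_wait(9, 5))  # WAIT  -> younger T2 waits for older T1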
The following lists the differences between the Wait-Die and Wound-Wait prevention schemes:
Wait-Die: Older transactions must wait for younger ones to release their data items. The number of aborts and rollbacks is higher.
Wound-Wait: Older transactions never wait for younger transactions. The number of aborts and rollbacks is lower.
Impact of Deadlocks
1. Delayed Transactions: Deadlocks can cause transactions to be delayed, as the resources they
need are being held by other transactions. This can lead to slower response times and longer wait
times for users.
2. Lost Transactions: In some cases, deadlocks can cause transactions to be lost or aborted, which
can result in data inconsistencies or other issues.
3. Reduced Concurrency: Deadlocks can reduce the level of concurrency in the system, as
transactions are blocked waiting for resources to become available. This can lead to slower
transaction processing and reduced overall throughput.
4. Increased Resource Usage: Deadlocks can result in increased resource usage, as
transactions that are blocked waiting for resources to become available continue to
consume system resources. This can lead to performance degradation and increased
resource contention.
5. Reduced User Satisfaction: Deadlocks can lead to a perception of poor system performance and
can reduce user satisfaction with the application. This can have a negative impact on user adoption
and retention.
Features of Deadlock in a DBMS
1. Mutual Exclusion: Each resource can be held by only one transaction at a time, and other
transactions must wait for it to be released.
2. Hold and Wait: Transactions can request resources while holding on to resources already
allocated to them.
3. No Preemption: Resources cannot be taken away from a transaction forcibly, and the
transaction must release them voluntarily.
4. Circular Wait: Transactions are waiting for resources in a circular chain, where each
transaction is waiting for a resource held by the next transaction in the chain.
5. Indefinite Blocking: Transactions are blocked indefinitely, waiting for resources to
become available, and no transaction can proceed.
6. System Stagnation: Deadlock leads to system stagnation, where no transaction can
proceed, and the system is unable to make any progress.
7. Inconsistent Data: Deadlock can lead to inconsistent data if transactions are unable to
complete and leave the database in an intermediate state.
8. Difficult to Detect and Resolve: Deadlock can be difficult to detect and resolve, as it may involve
multiple transactions, resources, and dependencies.
Disadvantages
1. System downtime: Deadlock can cause system downtime, which can result in loss of
productivity and revenue for businesses that rely on the DBMS.
2. Resource waste: When transactions are waiting for resources, these resources are not being
used, leading to wasted resources and decreased system efficiency.
3. Reduced concurrency: Deadlock can lead to a decrease in system concurrency, which can result
in slower transaction processing and reduced throughput.
4. Complex resolution: Resolving deadlock can be a complex and time-consuming process,
requiring system administrators to intervene and manually resolve the deadlock.
5. Increased system overhead: The mechanisms used to detect and resolve deadlock, such as
timeouts and rollbacks, can increase system overhead, leading to decreased performance.
Failure Classification
Failure in terms of a database can be defined as its inability to execute the specified transaction or
loss of data from the database. A DBMS is vulnerable to several kinds of failures and each of these failures
needs to be managed differently. There are many reasons that can cause database failures, such as network failure, system crash, natural disasters, carelessness, sabotage (corrupting the data intentionally), software errors, etc.
Transaction Failure:
If a transaction is not able to execute or it comes to a point from where the transaction becomes incapable of
executing further then it is termed as a failure in a transaction.
Reason for a transaction failure in DBMS:
1. Logical error: A logical error occurs if a transaction is unable to execute because of some mistakes
in the code or due to the presence of some internal faults.
2. System error: The termination of an active transaction is done by the database system itself due to some system issue, or because the database management system is unable to proceed with the transaction. For example, the system ends an active transaction if it reaches a deadlock condition or if resources are unavailable.
System Crash:
A system crash usually occurs when there is some sort of hardware or software breakdown. Some other
problems which are external to the system and cause the system to abruptly stop or eventually crash include
failure of the transaction, operating system errors, power cuts, main memory crash, etc.
These types of failures are often termed soft failures and are responsible for the data losses in the volatile memory. It is assumed that a system crash does not have any effect on the data stored in the non-volatile storage; this is known as the fail-stop assumption.
Data-transfer Failure:
When a disk failure occurs amid data-transfer operation resulting in loss of content from disk storage then
such failures are categorized as data-transfer failures. Some other reasons for disk failure include disk head crash, disk unreachability, formation of bad sectors, read-write errors on the disk, etc.
In order to quickly recover from a disk failure that occurs amid a data-transfer operation, the backup copy of the data stored on other tapes or disks can be used. Thus it is a good practice to back up your data frequently.
Indexing
Indexing is a data structure technique to efficiently retrieve records from the database files based on
some attributes on which the indexing has been done. Indexing in database systems is similar to what we
see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following types −
Primary Index − Primary index is defined on an ordered data file. The data file is ordered
on a key field. The key field is generally the primary key of the relation.
Secondary Index − Secondary index may be generated from a field which is a candidate key
and has a unique value in every record, or a non-key with duplicate values.
Clustering Index − Clustering index is defined on an ordered data file. The data file is ordered on
a non-key field.
Ordered Indexing is of two types −
Dense Index
Sparse Index
Dense Index
In a dense index, there is an index record for every search-key value in the database. This makes searching faster but requires more space to store the index records themselves. Index records contain the search-key value and a pointer to the actual record on the disk.
Sparse Index
In a sparse index, index records are not created for every search key. An index record here contains a search key and an actual pointer to the data on the disk. To search a record, we first proceed by the index record and reach the actual location of the data. If the data we are looking for is not where we directly reach by following the index, the system starts a sequential search until the desired data is found.
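To make the contrast concrete, here is a minimal Python sketch comparing dense and sparse index lookups over a small sorted data file; the record layout, key values, and the one-entry-per-two-records sparse index are hypothetical choices for this example.

import bisect

data_file = [(10, "rec10"), (20, "rec20"), (30, "rec30"), (40, "rec40")]

# Dense index: one entry per search-key value, pointing at the record's slot.
dense_index = {key: pos for pos, (key, _) in enumerate(data_file)}

# Sparse index: one entry per "block" (here, every second record).
sparse_index = [(key, pos) for pos, (key, _) in enumerate(data_file) if pos % 2 == 0]

def dense_lookup(key):
    pos = dense_index.get(key)
    return data_file[pos][1] if pos is not None else None

def sparse_lookup(key):
    # Find the largest indexed key <= search key, then scan sequentially.
    i = bisect.bisect_right([k for k, _ in sparse_index], key) - 1
    if i < 0:
        return None
    pos = sparse_index[i][1]
    while pos < len(data_file) and data_file[pos][0] <= key:
        if data_file[pos][0] == key:
            return data_file[pos][1]
        pos += 1
    return None

print(dense_lookup(30), sparse_lookup(30))  # rec30 rec30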
Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is stored on the disk
along with the actual database files. As the size of the database grows, so does the size of the indices.
There is an immense need to keep the index records in the main memory so as to speed up the search
operations. If single-level index is used, then a large size index cannot be kept in memory which leads
to multiple disk accesses.
Multi-level Index helps in breaking down the index into several smaller indices in order to make the
outermost level so small that it can be saved in a single disk block, which can easily be
accommodated anywhere in the main memory.
Introduction of B+ Tree
B+ tree is a variation of the B-tree data structure. In a B+ tree, data pointers are stored only at the leaf nodes of the tree. In a B+ tree, the structure of a leaf node differs from the structure of the internal nodes. The leaf nodes have an entry for every value of the search field, along with a data pointer to the record (or to the block that contains this record). The leaf nodes of the B+ tree are linked together to provide ordered access on the search field to the records. Internal nodes of a B+ tree are used to guide the search. Some search field values from the leaf nodes are repeated in the internal nodes of the B+ tree.
Features of B+ Trees
Balanced: B+ Trees are self-balancing, which means that as data is added or removed from
the tree, it automatically adjusts itself to maintain a balanced structure. This ensures that the
search time remains relatively constant, regardless of the size of the tree.
Multi-level: B+ Trees are multi-level data structures, with a root node at the top and one or
more levels of internal nodes below it. The leaf nodes at the bottom level contain the actual
data.
Ordered: B+ Trees maintain the order of the keys in the tree, which makes it easy to
perform range queries and other operations that require sorted data.
Fan-out: B+ Trees have a high fan-out, which means that each node can have many child
nodes. This reduces the height of the tree and increases the efficiency of searching and
indexing operations.
Cache-friendly: B+ Trees are designed to be cache-friendly, which means that they can
take advantage of the caching mechanisms in modern computer architectures to improve
performance.
Disk-oriented: B+ Trees are often used for disk-based storage systems because they are
efficient at storing and retrieving data from disk.
Why Use B+ Tree?
B+ Trees are the best choice for storage systems with sluggish data access because they
minimize I/O operations while facilitating efficient disc access.
B+ Trees are a good choice for database systems and applications needing quick data
retrieval because of their balanced structure, which guarantees predictable performance for a
variety of activities and facilitates effective range-based queries.
Difference Between B+ Tree and B Tree
Some differences between B+ Tree and B Tree are stated below.
Structure: In a B+ Tree, separate leaf nodes store the data and internal nodes are used for indexing; in a B Tree, nodes store both keys and data values.
Leaf Nodes: In a B+ Tree, leaf nodes form a linked list for efficient range-based queries; in a B Tree, leaf nodes do not form a linked list.
Key Duplication: A B+ Tree typically allows key duplication in the leaf nodes; a B Tree usually does not allow key duplication.
Memory Usage: A B+ Tree requires more memory for internal nodes; a B Tree requires less memory, as keys and values are stored in the same node.
Implementation of B+ Tree
In order, to implement dynamic multilevel indexing, B-tree and B+ tree are generally employed. The
drawback of the B-tree used for indexing, however, is that it stores the data pointer (a pointer to the
disk file block containing the key value), corresponding to a particular key value, along with that key
value in the node of a B-tree. This technique greatly reduces the number of entries that can be packed
into a node of a B-tree, thereby contributing to the increase in the number of levels in the B-tree,
hence increasing the search time of a record. B+ tree eliminates the above drawback by storing data
pointers only at the leaf nodes of the tree. Thus, the structure of the leaf nodes of a B+ tree is quite
different from the structure of the internal nodes of the B tree. It may be noted here that, since data
pointers are present only at the leaf nodes, the leaf nodes must necessarily store all the key values
along with their corresponding data pointers to the disk file block, in order to access them.
Moreover, the leaf nodes are linked to provide ordered access to the records. The leaf nodes therefore form the first level of the index, with the internal nodes forming the other levels of a multilevel index. Some of the key values of the leaf nodes also appear in the internal nodes, simply to act as a medium to control the searching of a record. From the above discussion, it is apparent that a B+ tree, unlike a B-tree, has two orders, 'a' and 'b', one for the internal nodes and the other for the external (or leaf) nodes.
Structure of B+ Trees
Structure of Internal Node
The Structure of the Leaf Nodes of a B+ Tree of Order 'b' is as Follows
Each leaf node is of the form <<K1, D1>, <K2, D2>, ....., <Kc-1, Dc-1>, Pnext>, where c <= b; each Di is a data pointer (i.e., it points to the actual record on the disk whose key value is Ki, or to the disk file block containing that record); each Ki is a key value; and Pnext points to the next leaf node in the B+ tree (see Diagram II for reference).
Every leaf node satisfies K1 < K2 < .... < Kc-1, with c <= b.
Each leaf node has at least ⌈b/2⌉ values.
All leaf nodes are at the same level.
Structure of Leaf Node
Diagram II: Using the Pnext pointer it is possible to traverse all the leaf nodes, just like a linked list, thereby achieving ordered access to the records stored on the disk.
Searching a Record in B+ Trees
Suppose we have to find 58 in a B+ tree. We start at the root node and move down to the leaf node that may contain a record for 58. Since 58 lies between 50 and 70 in the example, we follow the corresponding pointer, reach the third leaf node, and find 58 there. If we are unable to find the value, we return a 'record not found' message.
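As a rough illustration, here is a minimal Python sketch of that search; the simplified node class, key values, and tree shape are made up for this example and omit details such as node splitting and the Pnext leaf links.

import bisect

class Node:
    def __init__(self, keys, children=None, data=None, is_leaf=False):
        self.keys = keys            # sorted key values
        self.children = children    # child nodes (internal nodes only)
        self.data = data            # data pointers (leaf nodes only)
        self.is_leaf = is_leaf

def bplus_search(node, key):
    # Walk down the internal nodes until a leaf is reached.
    while not node.is_leaf:
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    # Scan the leaf for the key.
    for k, d in zip(node.keys, node.data):
        if k == key:
            return d
    return "record not found"

# Tiny example tree with root keys [50, 70] and three leaves.
leaf1 = Node([10, 30], data=["r10", "r30"], is_leaf=True)
leaf2 = Node([50, 58], data=["r50", "r58"], is_leaf=True)
leaf3 = Node([70, 90], data=["r70", "r90"], is_leaf=True)
root = Node([50, 70], children=[leaf1, leaf2, leaf3])

print(bplus_search(root, 58))  # r58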
Insertion in B+ Trees
Insertion in B+ Trees is done via the following steps.
Every element in the tree has to be inserted into a leaf node. Therefore, it is necessary to go
to a proper leaf node.
Insert the key into the leaf node in increasing order if there is no overflow. For more, refer to Insertion in B+ Trees.
Deletion in B+Trees
Deletion in B+ trees is not just deletion; it is a combined process of searching, deletion, and balancing. In the last step of the deletion process, it is mandatory to balance the B+ tree, otherwise it violates the properties of B+ trees.
For more, refer to Deletion in B+ Trees.
Advantages of B+Trees
A B+ tree with 'l' levels can store more entries in its internal nodes than a B-tree with the same 'l' levels. This significantly improves the search time for any given key. Having fewer levels and the presence of Pnext pointers means that the B+ tree is very quick and efficient in accessing records from disk.
Data stored in a B+ tree can be accessed both sequentially and directly.
It takes an equal number of disk accesses to fetch records.
B+ trees store some search keys redundantly: key values that appear in the internal nodes are repeated in the leaf nodes.
Disadvantages of B+ Trees
The major drawback of B-tree is the difficulty of traversing the keys sequentially. The B+
tree retains the rapid random access property of the B-tree while also allowing rapid
sequential access.
Application of B+ Trees
Multilevel Indexing
Faster operations on the tree (insertion, deletion, search)
Database indexing
Hashing in DBMS
Hashing in DBMS is a technique to quickly locate a data record in a database, irrespective of the size of the database. For larger databases containing thousands or millions of records, the indexing data structure technique becomes inefficient because searching for a specific record through the index consumes more time. This does not align with the goals of a DBMS, where performance should be high and data retrieval time should be minimized. So, to counter this problem, the hashing technique is used. In this section, we will learn about various hashing techniques.
What is Hashing?
The hashing technique utilizes an auxiliary hash table to store the data records using a hash function. The key components in hashing are:
Hash Table: A hash table is an array or data structure whose size is determined by the total volume of data records present in the database. Each memory location in a hash table is called a 'bucket' or hash index; it stores a data record's exact location and can be accessed through a hash function.
Bucket: A bucket is a memory location (index) in the hash table that stores the data record. These buckets generally store a disk block, which in turn stores multiple records. It is also known as the hash index.
Hash Function: A hash function is a mathematical equation or algorithm that takes a data record's primary key as input and computes the hash index as output.
Hash Function
A hash function is a mathematical algorithm that computes the index or the location where the
current data record is to be stored in the hash table so that it can be accessed efficiently later. This
hash function is the most crucial component that determines the speed of fetching data.
Working of Hash Function
The hash function generates a hash index through the primary key of the data record.
Now, there are 2 possibilities:
1. The hash index generated isn’t already occupied by any other value. So, the address of the data
record will be stored here.
2. The hash index generated is already occupied by some other value. This is called collision so to
counter this, a collision resolution technique will be applied.
3. Now, whenever we query a specific record, the hash function is applied and returns the data record comparatively faster than indexing, because we can directly reach the exact location of the data record through the hash function rather than searching through indices one by one.
Types of Hashing in DBMS
There are two primary hashing techniques in DBMS.
1. Static Hashing
In static hashing, the hash function always generates the same bucket address for a given key. For example, suppose we have a data record with employee_id = 106 and the hash function is h(x) = x mod 5, where x is the id. Then the operation will take place like this:
h(106) = 106 mod 5 = 1.
This indicates that the data record should be placed or searched in the 1st bucket (or 1st hash index) in the hash table.
The primary key is used as the input to the hash function and the hash function generates the output
as the hash index (bucket’s address) which contains the address of the actual data record on the disk
block.
Static Hashing has the following Properties
Data Buckets: The number of buckets in memory remains constant. The size of the hash
table is decided initially and it may also implement chaining that will allow handling some
collision issues though, it’s only a slight optimization and may not prove worthy if the
database size keeps fluctuating.
Hash function: It uses the simplest hash function to map the data records to their appropriate buckets. It is generally a modulo hash function.
Efficient for known data size: It is very efficient when we know the data size and its distribution in the database.
It is inefficient and inaccurate when the data size varies dynamically, because we have limited space and the hash function always generates the same value for every specific input. When the data size fluctuates very often, it is not useful at all because collisions keep happening, resulting in problems like bucket skew, insufficient buckets, etc.
To resolve this problem of bucket overflow, techniques such as chaining and open addressing are used. Here is a brief description of chaining:
1. Chaining
Chaining is a mechanism in which the hash table is implemented using an array of type nodes, where
each bucket is of node type and can contain a long chain of linked lists to store the data records. So, even
if a hash function generates the same value for any data record it can still be stored in a bucket by adding
a new node.
However, this can give rise to the problem of bucket skew: if the hash function keeps generating the same value again and again, then hashing becomes inefficient because the remaining data buckets stay unoccupied or store minimal data.
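Here is a minimal Python sketch of static hashing with chaining; the fixed bucket count, the keys, and the record values are hypothetical choices for this example.

NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]    # each bucket holds a chain of records

def hash_index(key):
    return key % NUM_BUCKETS                  # simple modulo hash function

def insert(key, record):
    buckets[hash_index(key)].append((key, record))   # collisions just extend the chain

def search(key):
    for k, record in buckets[hash_index(key)]:
        if k == key:
            return record
    return None

insert(106, "employee 106")
insert(111, "employee 111")   # 111 mod 5 == 1, so it chains into the same bucket as 106
print(search(106), search(111))  # employee 106 employee 111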
2. Dynamic Hashing
Dynamic hashing, also known as extendible hashing, is used to handle databases whose data sets change frequently. This method offers a way to add and remove data buckets on demand, dynamically. This way, as the number of data records varies, the buckets also grow and shrink in size whenever a change is made.
Properties of Dynamic Hashing
The buckets vary in size dynamically as changes are made, offering more flexibility.
Dynamic hashing aids in improving overall performance by minimizing or completely preventing collisions.
It has the following major components: data buckets, a flexible hash function, and directories.
A flexible hash function means that it will generate more dynamic values and will keep changing periodically according to the requirements of the database.
Directories are containers that store pointers to buckets. If bucket overflow or bucket skew-like problems occur, then bucket splitting is done to maintain efficient retrieval time of data records. Each directory has a directory id.
Global Depth: It is defined as the number of bits in each directory id. The more records there are, the more bits are needed.
Working of Dynamic Hashing
Example: If the global depth is k = 2, the keys are mapped to hash indices using the k bits of their hash values starting from the LSB. That leaves us with the following 4 possible directory ids: 00, 01, 10, 11.
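As a small illustration, here is a minimal Python sketch of this mapping step; the keys and the global depth value are hypothetical, and directory splitting and bucket management are omitted.

def directory_id(key, global_depth):
    # Take the global_depth least significant bits of the key as the directory id.
    return format(key & ((1 << global_depth) - 1), "b").zfill(global_depth)

global_depth = 2
for key in (4, 5, 6, 7):
    print(key, "->", directory_id(key, global_depth))
# 4 -> 00, 5 -> 01, 6 -> 10, 7 -> 11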