DBMS CSE Digital Notes 2020-2021 - March 9th 2021
Prepared by
Mrs. Ch.Anitha
Asst. Professor
Course Outcomes:
Demonstrate the basic elements of a relational database management system.
Ability to identify the data models for relevant problems.
Ability to design entity relationship model and convert entity relationship diagrams into
RDBMS and formulate SQL queries on the data.
Apply normalization for the development of application software.
Course PO PO PO PO PO PO PO PO PO PO PO PO
Outcome 1 2 3 4 5 6 7 8 9 10 11 12
CO1 H H H H H H M
CO2 H H H H H M
CO3 H H H H H H M
CO4 H H H H H M
H – High
M – Medium
L - Low
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN
(1805PC06) DATABASE MANAGEMENT SYSTEMS
B.Tech. II Year II Sem    L T P C
                          3 0 0 3
Course Objectives:
To understand the basic concepts and the applications of database systems.
To master the basics of SQL and construct queries using SQL.
To understand the relational database design principles.
To become familiar with the basic issues of transaction processing and concurrency control.
To become familiar with database storage structures and access techniques.
Course Outcomes:
Demonstrate the basic elements of a relational database management system and Ability to
identify the data models for relevant problems.
Ability to design entity relationship model and convert entity relationship diagrams into
RDBMS and formulate SQL queries on the data.
Apply normalization for the development of application software.
UNIT – II: Relational Model: Introduction to the Relational Model, Integrity Constraints over
Relations, Enforcing Integrity constraints, Querying relational data, Logical data base Design: ER
to Relational, Introduction to Views, Destroying /Altering Tables and Views.
Relational Algebra and Calculus: Preliminaries, Relational Algebra, Relational calculus – Tuple
relational Calculus, Domain relational calculus.
UNIT – III: SQL: Queries, Constraints, Triggers: Form of Basic SQL Query, UNION,
INTERSECT, and EXCEPT, Nested Queries, Aggregate Operators, NULL values, Natural
JOINS, Complex Integrity Constraints in SQL, Triggers and Active Databases.
UNIT – V: Storage and Indexing: Overview of Storage and Indexing: Data on External Storage,
File Organization and Indexing, Index Data Structures, Comparison of File Organizations. Tree-
Structured Indexing: Intuition for tree Indexes, Indexed Sequential Access Method (ISAM), B+
Trees: A Dynamic Index Structure, Search, Insert, Delete.
TEXT BOOKS:
1. Database Management Systems, Raghu Ramakrishnan, Johannes Gehrke, McGraw Hill
Education (India) Private Limited, 3rd Edition. (Part of UNIT-I, UNIT-II, UNIT-III, UNIT-V)
2. Database System Concepts, A. Silberschatz, Henry F. Korth, S. Sudarshan, McGraw Hill
Education (India) Private Limited, 6th edition. (Part of UNIT-I, UNIT-IV)
REFERENCE BOOKS:
1. Database Systems, 6th edition, R Elmasri, Shamkant B.Navathe, Pearson Education.
2. Database System Concepts, Peter Rob & Carlos Coronel, Cengage Learning.
3. Introduction to Database Management, M. L. Gillenson and others, Wiley Student Edition.
4. Database Development and Management, Lee Chao, Auerbach Publications, Taylor &
Francis Group.
5. Introduction to Database Systems, C. J. Date, Pearson Education.
We begin with the entity relationship (E-R) model, which provides a high-level view of design issues.
To design a database we need to follow a proper approach, and that approach is called a data model.
We will see how to use the E-R model to design a database.
What is a Database?
To find out what database is, we have to start from data, which is the basic building
block of any DBMS.
Data: Facts, figures, statistics etc. having no particular meaning (e.g. 1, ABC, 19 etc).
Record: Collection of related data items, e.g. in the above example the three data items had no
meaning. But if we organize them in the following way, then they collectively represent
meaningful information.
Roll Name Age
1 ABC 19
The columns of this relation are called Fields, Attributes or Domains. The rows are called
Tuples or Records.
Database: Collection of related relations.
Database-management system (DBMS) is a collection of interrelated data and a set of
programs to access those data. This is a collection of related data with an implicit meaning and
hence is a database.
DBMS is software which is used to manage the collection of interrelated data.
1.1 Database Management System (DBMS) and Its Applications:
A Database management system is a computerized record-keeping system. It is a repository or a
container for collection of computerized data files. The overall purpose of DBMS is to allow the
users to define, store, retrieve and update the information contained in the database on demand.
Information can be anything that is of significance to an individual or organization.
Database systems are designed to manage large bodies of information. Management of data
involves both defining structures for storage of information and providing mechanisms for the
manipulation of information. In addition, the database system must ensure the safety of the
information stored, despite system crashes or attempts at unauthorized access. If data are to be
shared among several users, the system must avoid possible anomalous results.
Advantages of DBMS:
Because information is so important in most organizations, computer scientists have developed a
large body of concepts and techniques for managing data. These concepts and techniques form the
focus of these notes.
Instance: The collection of information stored in the database at a particular moment is called an
instance of the data base.
Schema: The database schema is the skeleton structure of the database and represents the logical view of the
entire database. It describes how the data is organized and how the relations among the data are associated.
Data independence:
The ability to modify a schema definition at one level without affecting the schema definition at the
next higher level is called data independence. It comes in two forms:
physical data independence and logical data independence.
Data models:
Underlying the structure of a database is the data model: a collection of conceptual tools for
describing data, data relationships, and data semantics.
There are three broad types of data models.
Data Definition Language (DDL) statements are used to define the database structure or
schema. It is a type of language that allows the DBA or user to describe and name the entities,
attributes, and relationships that are required for the application, along with any associated
integrity and security constraints. Typical DDL commands are CREATE, ALTER, DROP, TRUNCATE and RENAME.
Data Manipulation Language (DML) offers a set of operations to support the fundamental
manipulation of the data held in the database. DML statements are used to manage data within
schema objects. Typical DML commands are SELECT, INSERT, UPDATE and DELETE.
Data Control Language (DCL) deals with privileges, which are of two kinds:
System – creating a session, creating a table, etc. are all types of system privilege.
Object – any command or query that works on tables comes under object privilege.
DCL is used to define two commands:
Grant – gives a user access privileges to the database.
Revoke – takes back permissions from a user.
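For illustration, a minimal GRANT/REVOKE sketch (the table student_details and the user ravi are hypothetical names, not from these notes):
-- give ravi permission to read and insert rows in student_details
GRANT SELECT, INSERT ON student_details TO ravi;
-- later, take the INSERT permission back
REVOKE INSERT ON student_details FROM ravi;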
A relational database is a set of formally described tables from which data can be accessed or
reassembled in many different ways without having to reorganize the database tables. The
standard user and application programming interface (API) of a relational database is
the Structured Query Language (SQL). SQL statements are used both for interactive queries for
information from a relational database and for gathering data for reports.
When creating a relational database, you can define the domain of possible values in a data
column and further constraints that may apply to that data value. For example, a domain of
possible customers could allow up to 10 possible customer names but be constrained in one table
to allowing only three of these customer names to be specifiable.
Database design is the process of producing a detailed data model of a database. This data
model contains all the needed logical and physical design choices and physical storage
parameters needed to generate a design in a data definition language, which can then be used to
create a database.
The architecture of a database system is greatly influenced by the underlying computer system
on which the database system runs. Database systems can be centralized, or client-server, where
one server machine executes work on behalf of multiple client machines. Database systems can
also be designed to exploit parallel computer architectures. Distributed databases span multiple
geographically separated machines.
At a very high level, a database system can be viewed as shown in the diagram below. Let us look at
its parts in detail.
Applications: - These can be considered as user-friendly front ends (for example, a web page) where the user
enters requests. The user simply enters the details that he or she needs and presses buttons to get the
data.
Information Retrieval - the ability to query a computer system to return relevant results.
The most widely used example is the Google web search engine.
Data Mining - the ability to retrieve information from one or more data sources in order to
combine it, cluster it, visualize it and discover patterns in the data.
Big Data - the ability to manipulate huge volumes of data (that far exceed the capacity of a
single machine) in order to perform data mining techniques on that data.
Text/data mining currently involves analyzing a large collection of often unrelated digital items
in a systematic way and to discover previously unknown facts, which might take the form of
relationships or patterns that are buried deep in an extensive collection. These relationships
would be extremely difficult, if not impossible, to discover using traditional manual-based
search and browse techniques. Both text and data mining build on the corpus of past
publications and build not so much on the shoulders of giants as on the breadth of past
published knowledge and accumulated mass wisdom.
Database users are the ones who really use and take the benefits of the database. There will be
different types of users depending on their needs and their way of accessing the database.
Database Users:
Application Programmers - They are the developers who interact with the database by means
of DML queries. These DML queries are written in the application programs like C, C++,
JAVA, Pascal etc. These queries are converted into object code to communicate with the
database. For example, writing a C program to generate the report of employees who are
working in a particular department will involve a query to fetch the data from the database. It will
include an embedded SQL query in the C program.
Sophisticated Users - They are database developers, who write SQL queries to
select/insert/delete/update data. They do not use any application or programs to request the
database. They directly interact with the database by means of query language like SQL. These
users will be scientists, engineers, analysts who thoroughly study SQL and DBMS to apply the
concepts in their requirement. In short, we can say this category includes designers and
developers of DBMS and SQL.
Specialized Users - These are also sophisticated users, but they write special database
application programs. They are the developers who develop the complex programs to the
requirement.
Stand-alone Users - These users will have a stand-alone database for their personal use. These
kinds of databases come as ready-made database packages with menus and graphical
interfaces.
Naive Users - these are the users who use an existing application to interact with the database.
For example, online library systems, ticket booking systems, ATMs, etc., which have an existing
application that users use to interact with the database to fulfill their requests.
Database Administrators: The life cycle of a database starts from designing and implementing it to
administering it. A database for any kind of requirement needs to be designed carefully, so the
DBA has many responsibilities. A well-performing database is in the hands of the DBA.
Installing and upgrading the DBMS servers: - The DBA is responsible for installing a new
DBMS server for new projects. He is also responsible for upgrading these servers as new
versions come into the market or as requirements change. If there is any failure while upgrading
the existing servers, he should be able to revert the changes back to the older version, thus
keeping the DBMS working. He is also responsible for applying service packs/hot
fixes/patches to the DBMS servers.
Design and implementation: - Designing the database and implementing is also DBA’s
responsibility. He should be able to decide proper memory management, file organizations,
error handling, log maintenance etc for the database.
Performance tuning: - Since the database is huge and will have lots of tables, data, constraints
and indices, there will be variations in performance from time to time. Also, because of
some design issues or data growth, the database may not work as expected. It is the
responsibility of the DBA to tune the database performance. He is responsible for making sure all
the queries and programs run in a fraction of a second.
Migrate database servers: - Sometimes, users using Oracle would like to shift to SQL Server
or Netezza. It is the responsibility of the DBA to make sure that the migration happens without any
failure, and that there is no data loss.
Backup and Recovery: - Proper backup and recovery programs need to be developed by the
DBA and maintained by him. This is one of the main responsibilities of the DBA.
Data/objects should be backed up regularly so that if there is any crash, they can be recovered
without much effort and without data loss.
Documentation: - The DBA should properly document all his activities so that if he quits or
a new DBA comes in, the newcomer can understand the database without much effort. He
should basically document all his installation, backup, recovery and security methods, and keep
various reports about database performance.
A Database Management System allows a person to organize, store, and retrieve data from a
computer. It is a way of communicating with a computer’s “stored memory.” In the very early
years of computers, “punch cards” were used for input, output, and data storage. Punch cards
offered a fast way to enter data, and to retrieve it. Herman Hollerith is given credit for adapting
the punch cards used for weaving looms to act as the memory for a mechanical tabulating
machine, in 1890. Much later, databases came along.
Databases (or DBs) have played a very important part in the recent evolution of computers. The
first computer programs were developed in the early 1950s, and focused almost completely on
coding languages and algorithms. At the time, computers were basically giant calculators and
data (names, phone numbers) was considered the leftovers of processing information.
Computers were just starting to become commercially available, and when business people
started using them for real-world purposes, this leftover data suddenly became important.
The CODASYL approach was a very complicated system and required substantial training. It
depended on a “manual” navigation technique using a linked data set, which formed a large
network. Searching for records could be accomplished by one of three navigation techniques.
NoSQL databases, by contrast, typically offer:
Higher scalability
A distributed computing system
Lower costs
A flexible schema
Can process unstructured and semi-structured data
Has no complex relationship
Unfortunately, NoSQL does come with some problems. Some NoSQL databases can be quite
resource intensive, demanding high RAM and CPU allocations. It can also be difficult to find
tech support if your open source NoSQL system goes down.
The database design process can be divided into six steps. The ER model is most relevant to the
first three steps.
Requirements Analysis:
The very first step in designing a database application is to understand what data is to be stored
in the database, what applications must be built on top of it, and what operations are most
frequent and subject to performance requirements. In other words, we must find out what the
users want from the database.
Logical Database Design: We must choose a DBMS to implement our database design, and convert the conceptual
database design into a database schema in the data model of the chosen DBMS.
Physical Database Design: In this step we must consider typical expected workloads that our database must support and
further refine the database design to ensure that it meets desired performance criteria. This step
may simply involve building indexes on some tables and clustering some tables, or it may
involve a substantial redesign of parts of the database schema obtained from the earlier design
steps.
Security Design:
In this step, we identify different user groups and different roles played by various users (e.g.,
the development team for a product, the customer support representatives, the product manager).
For each role and user group, we must identify the parts of the database that they must be able
to access and the parts of the database that they should not be allowed to access, and take steps
to ensure that they can access only the parts they need.
1.12 Entities:
The entity relationship (E-R) data model is based on a perception of a real world that consists of
a set of basic objects called entities, and of relationships among these objects.
Rectangles- which represent entity sets
Ellipse-which represent attributes
Diamonds-which represent relationship sets
Lines-which link attributes to entity sets and entity sets to relationship sets
Double ellipses-which represent multivalued attributes
Double lines- which indicate total participation of an entity in a relationship set
The appropriate mapping cardinality for a particular relationship set is obviously dependent on
the real-world situation that is being modeled by the relationship set. The overall logical
structure of a database can be expressed graphically by an E-R diagram.
Entity: An entity is a real-world object or concept which is distinguishable from other objects.
It may be something tangible, such as a particular student or building. It may also be somewhat
more conceptual, such as CS A-341, or an email address.
Attributes: These are used to describe a particular entity (e.g. name, SS#, height).
Entity set: a collection of similar entities (i.e., those which are distinguished using the same set
of attributes). As an example, I may be an entity, whereas Faculty might be an entity set to
which I belong. Note that entity sets need not be disjoint. I may also be a member of Staff or
of Softball Players.
Key: a minimal set of attributes for an entity set, such that each entity in the set can be uniquely
identified. In some cases, there may be a single attribute (such as SS#) which serves as a key,
but in some models you might need multiple attributes as a key ("Bob from Accounting").
There may be several possible candidate keys. We will generally designate one such key as
the primary key.
ER diagrams:
It is often helpful to visualize an ER model via a diagram. There are many variant conventions
for such diagrams; we will adapt the one used in the text.
Diagram conventions
ER Model
Entity relationship model defines the conceptual view of database. It works around real world
entity and association among them. At view level, ER model is considered well for designing
databases.
Entities are represented by means of their properties, called attributes. All attributes have
values. For example, a student entity may have name, class, age as attributes.
There exists a domain or range of values that can be assigned to attributes. For example, a
student's name cannot be a numeric value. It has to be alphabetic. A student's age cannot be
negative, etc.
Types of Attributes
Simple attribute
Simple attributes are atomic values, which cannot be divided further. For example, student's
phone-number is an atomic value of 10 digits.
Composite attribute
Composite attributes are made of more than one simple attribute. For example, a student's
complete name may have first_name and last_name.
Derived attribute
Derived attributes are attributes which do not exist physically in the database, but whose values
are derived from other attributes present in the database. For example, average_salary in a
department need not be saved in the database; instead it can be derived. As another example, age can
be derived from date_of_birth.
Single-valued attribute
Single-valued attributes contain only a single value. For example: Social_Security_Number.
Multi-valued attribute
A multi-valued attribute may contain more than one value. For example, a person can have more
than one phone number, email address, etc.
o Super Key: Set of attributes (one or more) that collectively identifies an entity in an
entity set.
o Candidate Key: A minimal super key is called a candidate key, that is, a super key for
which no proper subset is a super key. An entity set may have more than one candidate
key.
o Primary Key: This is one of the candidate keys, chosen by the database designer to
uniquely identify the entity set.
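As a small sketch of these key notions in SQL (the table and column names here are illustrative, not from these notes):
CREATE TABLE Students (
    student_id INTEGER PRIMARY KEY,   -- the candidate key chosen as primary key
    email      VARCHAR(50) UNIQUE,    -- another candidate key (an alternate key)
    name       VARCHAR(50),
    age        INTEGER
);
Both student_id and email are candidate keys; {student_id, name} is a super key but not a candidate key, because student_id alone already identifies a row.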
The association among entities is called a relationship. For example, the employee entity has the relationship
works_at with department. Another example is a student who enrolls in some course. Here,
Works_at and Enrolls are called relationships.
Relationship Set
Relationship of similar type is called relationship set. Like entities, a relationship too can have
attributes. These attributes are called descriptive attributes.
Degree of Relationship
The number of participating entities in a relationship defines the degree of the relationship.
o Binary = degree 2
o Ternary = degree 3
o n-ary = degree n
Mapping Cardinalities
Cardinality defines the number of entities in one entity set which can be associated to the
number of entities of other set via relationship set.
o One-to-one: one entity from entity set A can be associated with at most one entity of
entity set B and vice versa.
o One-to-many: One entity from entity set A can be associated with more than one
entity of entity set B, but an entity from entity set B can be associated with at most
one entity of A.
o Many-to-one: More than one entity from entity set A can be associated with at most
one entity of entity set B, but one entity from entity set B can be associated with more
than one entity from entity set A.
o Many-to-many: one entity from A can be associated with more than one entity from B
and vice versa.
Ternary Relationship Set
A relationship set need not be an association of precisely two entities;
it can involve three or more when applicable. Here is another example from the text, in which a
store has multiple locations.
If you took a 'snapshot' of the relationship set at some instant in time, we would call this
an instance.
If both entity sets of a relationship set have key constraints, we would call this a "one-to-one"
relationship set. In general, note that key constraints can apply to relationships between more
than two entities, as in the following example.
Participation Constraints
Recall that a key constraint requires that each entity of a set be required to participate in at most
one relationship. Dual to this, we may ask whether each entity of a set be required to participate
in at least one relationship.
If this is required, we call this a total participation constraint; otherwise the participation
is partial. In our ER diagrams, we will represent a total participation constraint by using
a thick line.
Weak Entities
There are times you might wish to define an entity set even though its attributes do not formally
contain a key (recall the definition for a key).
Together, this assures us that we can uniquely identify each entity from the weak set by
considering the primary key of its identifying owner together with a partial key from the weak
entity.
In our ER diagrams, we will represent a weak entity set by outlining the entity and the
identifying relationship set with dark lines. The required key constraint and total participation
are diagrammed with our existing conventions. We underline the partial key with a dotted line.
Class Hierarchies
Dually, we can ask whether every entity in a superclass be required to lie in (at least) one
subclass. By default we will not assume not, but we can specify a covering constraint if
desired. (e.g. "Motorboats AND Cards COVER Motor_Vehicles")
Aggregation
Thus far, we have defined relationships to be associations between two or more entities.
However, it sometimes seems desirable to define a new relationship which associates some
entity with some other existing relationship. To do this, we will introduce a new feature to our
model called aggregation. We identify an existing relationship set by enclosing it in a larger
dashed box, and then we allow it to participate in another relationship set.
A motivating example follows:
It is most important to recognize that there is more than one way to model a given situation.
Our next goal is to start to compare the pros and cons of common choices.
Consider the scenario in which we want to add address information to the Employees entity set. We
might choose to add a single attribute address to the entity set. Alternatively, we could
introduce a new entity set, Addresses and then a relationship associating employees with
addresses. What are the pros and cons?
Adding a new entity set makes the model more complex. It should only be done when there is a need for
the complexity. For example, if some employees have multiple addresses to be associated, then
the more complex model is needed. Also, representing addresses as a separate entity would
allow a further breakdown, for example by zip code or city.
What if we wanted to modify the Works_In relationship to have both a start and end date, rather
than just a start date. We could add one new attribute for the end date; alternatively, we could
create a new entity set Duration which represents intervals, and then the Works_In relationship
can be made ternary (associating an employee, a department and an interval). What are the pros
and cons?
Consider a situation in which a manager controls several departments. Let's presume that a
company budgets a certain amount (budget) for each department. Yet it also wants managers to
have access to some discretionary budget (dbudget). There are two corporate models. A
discretionary budget may be created for each individual department; alternatively, there may be
a discretionary budget for each manager, to be used as she desires.
Which scenario is represented by the following ER diagram? If you want the alternate
interpretation, how would you adjust the model?
Suppose we did not need the until or since attributes. In this case, we could model the identical setting
using the following ternary relationship:
Let's compare these two models. What if we wanted to add an additional constraint to
each, that each sponsorship (of a project by a department) be monitored by at most one
employee? Can you add this constraint to either of the above models?
Domain Constraints: A relation schema specifies the domain of each field or column in the relation instance.
These domain constraints in the schema specify an important condition that we want each
instance of the relation to satisfy: the values that appear in a column must be drawn from the
domain associated with that column. Thus, the domain of a field is essentially the type of that
field, in programming language terms, and restricts the values that can appear in the field.
Key Constraints
A Key Constraint is a statement that a certain minimal subset of the fields of a relation is a
unique identifier for a tuple.
Super Key: An attribute, or set of attributes, that uniquely identifies a tuple within a
relation. However, a super key may contain additional attributes that are not necessary
for a unique identification.
Example: The customer_id of the relation customer is sufficient to distinguish one tuple
from another. Thus, customer_id is a super key. Similarly, the combination
of customer_id and customer_name is a super key for the relation customer. Here
the customer_name is not a super key, because several people may have the same
name. We are often interested in super keys for which no proper subset is a super key.
Such minimal super keys are called candidate keys.
Candidate Key: A super key such that no proper subset is a super key within the
relation. There are two parts of the candidate key definition:
o Two distinct tuples in a legal instance cannot have identical values in all the
fields of a key.
o No subset of the set of fields in a candidate key is a unique identifier for a
tuple. A relation may have several candidate keys.
Primary Key: The candidate key that is selected to identify tuples uniquely within the
relation. Out of all the available candidate keys, a database designer chooses
a primary key. The candidate keys that are not selected as the primary key are called
alternate keys.
Example: For the student relation, we can choose student_id as the primary key.
Foreign Key: Foreign keys represent the relationships between tables. A foreign key is a
column (or a group of columns) whose values are derived from the primary key of some
other table. The table in which the foreign key is defined is called a Foreign table or Details
table. The table that defines the primary key and is referenced by the foreign key is
called the Primary table or Master table.
General Constraints
Domain, primary key, and foreign key constraints are considered to be a fundamental part of
the relational data model. Sometimes, however, it is necessary to specify more general
constraints.
Example: we may require that student ages be within a certain range of values. Given such an
IC, the DBMS rejects inserts and updates that violate the constraint.
Current database systems support such general constraints in the form of table
constraints and assertions. Table constraints are associated with a single table and checked
whenever that table is modified. In contrast, assertions involve several tables and are checked
whenever any of these tables is modified.
Example: a table constraint which ensures that the salary of an employee is always above 1000:
CREATE TABLE employee (eid integer, ename varchar2(20), salary real,
CHECK (salary > 1000));
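By contrast, an assertion constrains several tables at once. A sketch in SQL-92 syntax (note that many DBMSs do not actually implement CREATE ASSERTION, and the department/employee tables and the did column here are hypothetical):
CREATE ASSERTION small_departments
CHECK (NOT EXISTS (SELECT * FROM department d
                   WHERE (SELECT COUNT(*) FROM employee e
                          WHERE e.did = d.did) > 100));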
The referential integrity constraint states that if a relation refers to a key attribute of a different or
the same relation, that key element must exist.
To ensure that only bonafide students can enroll in courses, any value that appears in the sid
field of an instance of the Enrolled relation should also appear in the sid field of some tuple in
the Students relation
The sid field of Enrolled is called a foreign key and refers to Students. The foreign key in the
referencing relation (Enrolled, in our example) must match the primary key of the referenced
relation (Students), i.e., it must have the same number of columns and compatible data types,
although the column names can be different.
Specifying Foreign Key Constraints in SQL
CREATE TABLE Enrolled ( sid CHAR(20), cid CHAR(20), grade CHAR(10), PRIMARY
KEY (sid, cid), FOREIGN KEY (sid) REFERENCES Students )
Data integrity refers to the correctness and completeness of data within a database. To enforce
data integrity, you can constrain or restrict the data values that users can insert, delete, or update
in the database. For example, the integrity of data in the pubs2 and pubs3 databases requires
that a book title in the titles table must have a publisher in the publishers table. You cannot
insert books that do not have a valid publisher into titles, because it violates the data integrity
of pubs2 or pubs3.
Transact-SQL provides several mechanisms for integrity enforcement in a database such as
rules, defaults, indexes, and triggers. These mechanisms allow you to maintain these types of
data integrity:
Requirement – requires that a table column must contain a valid value in every row; it
cannot allow null values. The create table statement allows you to restrict null values for a
column.
Check or validity – limits or restricts the data values inserted into a table column. You
can use triggers or rules to enforce this type of integrity.
Consider the instance S1 of Students shown in Figure 3.1. The following insertion violates the
primary key constraint because there is already a tuple with the sid 53688, and it will be
rejected by the DBMS:
INSERT INTO Students (sid,name,login,age,gpa)VALUES (53688,‘Mike’,‘mike@ee’, 17, 3.4)
The following insertion violates the constraint that the primary key cannot contain null:
INSERT INTO Students (sid,name, login,age, gpa)VALUES (null, ‘Mike’, ‘mike@ee’, 17, 3.4)
The symbol * means that we retain all fields of selected tuples in the result. The condition S.age
< 18 in the WHERE clause specifies that we want to select only tuples in which the age field has
a value less than 18.
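The query being described would look like the following (a sketch in the textbook's style, over the Students instance):
SELECT *
FROM Students S
WHERE S.age < 18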
It is highly recommended that every table should start with its primary key attribute
conventionally named as TablenameID.
The initial relational schema is expressed in the following format writing the table names with
the attributes list inside a parentheses as shown below for
Persons( personid , name, lastname, email )
Persons and Phones are Tables. name, lastname, are Table Columns (Attributes).
If you have a multi-valued attribute, take the attribute and turn it into a new entity or table of its
own. Then make a 1:N relationship between the new entity and the existing one. In
simple words: 1. Create a table for the attribute. 2. Add the primary (id) column of the parent
entity as a foreign key within the new table, as shown below:
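A sketch of this conversion, continuing the Persons example (the Phones table name and its columns are illustrative):
Persons( personid , name, lastname, email )
Phones( phoneid , personid, phone_number )
Here personid in Phones is a foreign key referencing Persons, which captures the 1:N relationship between a person and his or her phone numbers.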
To keep it simple, and for better performance at data retrieval, I would personally
recommend using attributes to represent such a relationship. For instance, let us consider the case
where the Person has, or optionally has, one wife. You can place the primary key of the wife
within the table of the Persons, which we call in this case a foreign key, as shown below.
It should convert to :
Persons( personid , name, lastname, email )
House ( houseid , num , address, personid)
5. N:N Relationships
We normally use tables to express such type of relationship. This is the same for N − ary
relationship of ER diagrams. For instance, The Person can live or work in many countries.
Also, a country can have many people. To express this relationship within a relational schema
we use a separate table as shown below:
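A sketch of such a junction table (the table and column names are illustrative):
Persons( personid , name, lastname, email )
Countries( countryid , country_name )
Lives_in( personid , countryid )
Each row of Lives_in pairs one person with one country, so together the rows express the N:N relationship.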
To destroy tables, use the DROP TABLE command. For example, DROP TABLE Students
RESTRICT destroys the Students table unless some view or integrity constraint refers to
Students; if so, the command fails. If the keyword RESTRICT is replaced by CASCADE,
Students is dropped and any referencing views or integrity constraints are (recursively) dropped
as well; one of these two keywords must always be specified. A view can be dropped using the
DROP VIEW command, which is just like DROP TABLE.
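For instance, a brief sketch (the view name and definition are illustrative):
CREATE VIEW YoungStudents (sid, name)
    AS SELECT S.sid, S.name FROM Students S WHERE S.age < 18;
DROP VIEW YoungStudents;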
ALTER TABLE modifies the structure of an existing table. To add a column called maiden-name to
Students, for example, we would use the following command:
ALTER TABLE Students ADD COLUMN maiden-name CHAR(10)
The definition of Students is modified to add this column, and all existing rows are padded with
null values in this column. ALTER TABLE can also be used to delete columns and to add or
drop integrity constraints on a table.
Overview: The Relational Model defines two root languages for accessing a relational database -
- Relational Algebra and Relational Calculus. Relational Algebra is a low-level, operator-
oriented language. Creating a query in Relational Algebra involves combining relational
operators using algebraic notation. Relational Calculus is a high-level, declarative language.
Creating a query in Relational Calculus involves describing what results are desired.
Relational algebra is one of the two formal query languages associated with the relational
model. Queries in algebra are composed using a collection of operators. A fundamental property
is that every operator in the algebra accepts (one or two) relation instances as arguments and
returns a relation instance as the result. This property makes it easy to compose operators to form
a complex query: a relational algebra expression is recursively defined to be a relation, a
unary algebra operator applied to a single expression, or a binary algebra operator applied to two
expressions. We describe the basic operators of the algebra (selection, projection, union, cross-
product, and difference).
Relational algebra includes operators to select rows from a relation (σ)and to project columns
(π).
These operations allow us to manipulate data in a single relation. Consider the instance of the
Sailors relation shown in Figure 4.2, denoted as S2. We can retrieve rows corresponding to
expert sailors by using the σ (selection) operator. The expression σrating>8(S2) evaluates to the relation shown in
Figure 4.4. The subscript rating>8 specifies the selection criterion to be applied while retrieving
tuples.
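In the same spirit, a projection example (using the same instance S2): the expression πsname,rating(S2) returns a relation that contains only the sname and rating fields of S2, with duplicates eliminated.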
Set Operations: The following standard operations on sets are also available in relational
algebra: union (∪), intersection (∩), set-difference (−), and cross-product (×).
Union: R ∪ S returns a relation instance containing all tuples that occur in either relation
instance R or relation instance S (or both). R and S must be union-compatible, and the schema of
the result is defined to be identical to the schema of R.
Intersection: R ∩ S returns a relation instance containing all tuples that occur in both R and S.
The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.
Set-difference: R − S returns a relation instance containing all tuples that occur in R but not in
S. The relations R and S must be union-compatible, and the schema of the result is defined to be
identical to the schema of R.
Cross-product: R × S returns a relation instance whose schema contains all the fields of R
(in the same order as they appear in R) followed by all the fields of S (in the same order as they
appear in S). The result of R × S contains one tuple ⟨r, s⟩ (the concatenation of tuples r and s) for
each pair of tuples r ∈ R, s ∈ S. The cross-product operation is sometimes called Cartesian
product.
Joins: The join operation is one of the most useful operations in relational algebra and is the
most commonlyused way to combine information from two or more relations. Although a join
can be defined as a cross-product followed by selections and projections, joins arise much more
frequently in practice than plain cross-products.
Condition Joins
The most general version of the join operation accepts a join condition c and a pair of relation
instances as arguments, and returns a relation instance. The join condition is identical to a
selection condition in form.
The operation is defined as follows:
Notation − σp(r)
Where σ stands for the selection predicate and r stands for the relation. p is a propositional logic formula
which may use connectives like and, or, and not. These terms may use relational operators like
=, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − Selects tuples from books where subject is 'database'.
For example −
Notation − r ∪ s
r ∪ s = { t | t ∈ r or t ∈ s }
Where r and s are either database relations or relation result set (temporary relation).
Notation − r − s
Notation − r × s
r × s = { q t | q ∈ r and t ∈ s }
Notation − ρ x (E)
Set intersection
Assignment
Natural join
The variant of the calculus that we present in detail is called the tuple relational calculus
(TRC). Variables in TRC take on tuples as values. In another variant, called the domain
relational calculus (DRC), the variables range over field values.
Syntax of TRC Queries: Let Rel be a relation name, R and S be tuple variables, a an attribute of
R, and b an attribute of S. Let op denote an operator in the set {<, >, =, ≤, ≥, ≠}. An atomic
formula is one of the following:
R ∈ Rel
R.a op S.b
R.a op constant, or constant op R.a
A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(R) denotes a formula in which the variable R appears:
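(The recursive cases are built from atomic formulas using the connectives ∧, ∨ and ¬ and the quantifiers ∃R(p(R)) and ∀R(p(R)).) As a small example in this notation, using the Sailors relation from earlier, the sailors with a rating above 7 can be retrieved with the TRC query
{S | S ∈ Sailors ∧ S.rating > 7}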
2.10 Domain Relational Calculus: A domain variable is a variable that ranges over the values
in the domain of some attribute (e.g., the variable can be assigned an integer if it appears in an
attribute whose domain is the set of integers). A DRC query has the form {⟨x1, x2, ..., xn⟩ |
p(x1, x2, ..., xn)}, where each xi is either a domain variable or a constant and p(x1, x2, ..., xn) denotes
a DRC formula whose only free variables are the variables among the xi, 1 ≤ i ≤ n. The result
of this query is the set of all tuples ⟨x1, x2, ..., xn⟩ for which the formula evaluates to true.
A DRC formula is defined in a manner that is very similar to the definition of a TRC formula.
The main difference is that the variables are now domain variables. Let op denote an operator in
the set {<, >, =, ≤, ≥, ≠} and let X and Y be domain variables. An atomic formula is one of the following:
X op Y
X op constant, or constant op X
A formula is recursively defined to be one of the following, where p and q are themselves
formulas, and p(X) denotes a formula in which the variable X appears:
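A small example in DRC notation (assuming the Sailors schema with fields sid, sname, rating, age): the sailors with a rating above 7 are given by
{⟨I, N, T, A⟩ | ⟨I, N, T, A⟩ ∈ Sailors ∧ T > 7}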
Notation − {T | Condition}
Output − Returns tuples with 'name' from Author who has written article on 'database'.
TRC can be quantified. We can use Existential (∃) and Universal Quantifiers (∀).
For example −
Notation − { a1, a2, ..., an | P(a1, a2, ..., an) }
Where a1, a2, ..., an are attributes and P stands for formulae built using inner attributes.
For example −
Just like TRC, DRC can also be written using existential and universal quantifiers. DRC also
involves relational operators.
The expressive power of Tuple Relational Calculus and Domain Relational Calculus is equivalent
to that of Relational Algebra.
UNIT-III
SQL, Schema Refinement and Normal Forms
3.1The Form of a Basic SQL Query:
SQL is the language used to query all databases. It's simple to learn and appears to do very little
but is the heart of a successful database application. Understanding SQL and using it efficiently
is highly imperative in designing an efficient database application. The better your understanding
of SQL, the more versatile you'll be in getting information out of databases. A SQL SELECT
statement can be broken down into numerous elements, each beginning with a keyword.
Although it is not necessary, common convention is to write these keywords in all capital letters.
In this article, we will focus on the most fundamental and common elements of a SELECT
statement, namely
SELECT
FROM
WHERE
ORDER BY
If we want only specific columns (as is usually the case), we can/should explicitly specify them
in a comma-separated list, as in
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
Explicitly specifying the desired fields also allows us to control the order in which the fields are
returned, so that if we wanted the last name to appear before the first name, we could write
SELECT EmployeeID, LastName, FirstName, HireDate, City FROM Employees
The WHERE Clause
The next thing we want to do is to start limiting, or filtering, the data we fetch from the database.
By adding a WHERE clause to the SELECT statement, we add one (or more) conditions that
must be met by the selected data. This will limit the number of rows that answer the query and
are fetched. In many cases, this is where most of the "action" of a query takes place.
Examples
We can continue with our previous query, and limit it to only those employees living in London:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London'
If you wanted to get the opposite, the employees who do not live in London, you would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City <> 'London'
It is not necessary to test for equality; you can also use the standard equality/inequality operators
that you would expect. For example, to get a list of employees who were hired on or after a given
date, you would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE HireDate >= '1-july-1993'
Of course, we can write more complex conditions. The obvious way to do this is by having
multiple conditions in the WHERE clause. If we want to know which employees were hired
between two given dates, we could write
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE (HireDate >= '1-june-1992') AND (HireDate <= '15-december-1993')
Note that SQL also has a special BETWEEN operator that checks to see if a value is between
two values (including equality on both ends). This allows us to rewrite the previous query as
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE HireDate BETWEEN '1-june-1992' AND '15-december-1993'
We could also use the NOT operator, to fetch those rows that are not between the specified dates:
SELECT EmployeeID, FirstName, LastName, HireDate, City
FROM Employees
WHERE HireDate NOT BETWEEN '1-june-1992' AND '15-december-1993'
Let us finish this section on the WHERE clause by looking at two additional, slightly more
sophisticated, comparison operators.
What if we want to check if a column value is equal to more than one value? If it is only 2
values, then it is easy enough to test for each of those values, combining them with the OR
operator and writing something like
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City = 'London' OR City = 'Seattle'
However, if there are three, four, or more values that we want to compare against, the above
approach quickly becomes messy. In such cases, we can use the IN operator to test against a set
of values. If we wanted to see if the City was either Seattle, Tacoma, or Redmond, we would
write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City IN ('Seattle', 'Tacoma', 'Redmond')
As with the BETWEEN operator, here too we can reverse the results obtained and query for
those rows where City is not in the specified list:
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE City NOT IN ('Seattle', 'Tacoma', 'Redmond')
Finally, the LIKE operator allows us to perform basic pattern-matching using wildcard
characters. For Microsoft SQL Server, the wildcard characters are defined as follows:
Wildcard Description
% matches any string of zero or more characters.
_ (underscore) matches any single character.
[] matches any single character within the specified range (e.g. [a-f])
or set (e.g. [abcdef]).
[^] matches any single character not within the specified range (e.g.
[^a-f]) or set (e.g. [^abcdef]).
Here too, we can opt to use the NOT operator: to find all of the employees whose first name
does not start with 'M' or 'A', we would write
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
WHERE (FirstName NOT LIKE 'M%') AND (FirstName NOT LIKE 'A%')
If we want the sort order for a column to be descending, we can include the DESC keyword after
the column name.
The ORDER BY clause is not limited to a single column. You can include a comma-delimited
list of columns to sort by—the rows will all be sorted by the first column specified and then by
the next column specified. If we add the Country field to the SELECT clause and want to sort
by Country and City, we would write:
SELECT EmployeeID, FirstName, LastName, HireDate, Country, City
FROM Employees
ORDER BY Country, City DESC
Note that to make it interesting, we have specified the sort order for the City column to be
descending (from highest to lowest value). The sort order for the Country column is still
ascending. We could be more explicit about this by writing
SELECT EmployeeID, FirstName, LastName, HireDate, Country, City
FROM Employees
ORDER BY Country ASC, City DESC
It is important to note that a column does not need to be included in the list of selected (returned)
columns in order to be used in the ORDER BY clause. If we don't need to see/use the Country
values, but are only interested in them as the primary sorting field we could write the query as
SELECT EmployeeID, FirstName, LastName, HireDate, City FROM Employees
ORDER BY Country ASC, City DESC
SQL provides three set-manipulation constructs that extend the basic query form. Since the
answer to a query is a multiset of rows, it is natural to consider the use of operations such as
union, intersection, and difference. SQL supports these operations under the names UNION,
INTERSECT, and EXCEPT.
Union:
Eg: Find the names of sailors who have reserved a red or a green boat.
SELECT S.sname FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND
R.bid = B.bid AND B.color = ‘red’
union
SELECT S2.sname FROM Sailors S2, Boats B2, Reserves R2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’
This query says that we want the union of the set of sailors who have reserved red boats and
the set of sailors who have reserved green boats.
Intersect:
Eg:Find the names of sailors who have reserved both a red and a green boat.
SELECT S.sname FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND
R.bid = B.bid AND B.color = ‘red’
intersect
SELECT S2.sname FROM Sailors S2, Boats B2, Reserves R2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’
Except:
Eg:Find the sids of all sailors who have reserved red boats but not green boats.
SELECT S.sid FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid =
B.bid AND B.color = ‘red’
Except
SELECT S2.sid FROM Sailors S2, Reserves R2, Boats B2 WHERE S2.sid = R2.sid AND
R2.bid = B2.bid AND B2.color = ‘green’
SQL also provides other set operations: IN (to check if an element is in a given set), op ANY, op
ALL (to compare a value with the elements in a given set, using comparison operator op), and
EXISTS (to check if a set is empty). IN and EXISTS can be prefixed by NOT, with the obvious
modification to their meaning.
We cover UNION, INTERSECT, and EXCEPT in this section, and the other operations later in this unit.
Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along
with the operators like =, <, >, >=, <=, IN, BETWEEN etc.
A subquery can have only one column in the SELECT clause, unless multiple columns
are in the main query for the subquery to compare its selected columns.
An ORDER BY cannot be used in a subquery, although the main query can use an
ORDER BY. The GROUP BY can be used to perform the same function as the ORDER
BY in a subquery.Subqueries that return more than one row can only be used with
multiple value operators, such as the IN operator.
The SELECT list cannot include any references to values that evaluate to a BLOB,
ARRAY, CLOB, or NCLOB.
The sub query can refer to variables from the surrounding query, which will act as constants
during any one evaluation of the sub query.
This simple example is like an inner join on col2, but it produces at most one output row for
each tab1 row, even if there are multiple matching tab2 rows:
SELECT col1
FROM tab1
WHERE EXISTS (SELECT 1
FROM tab2
WHERE col2 = tab1.col2);
SELECT name
FROM stud
WHERE EXISTS (SELECT 1
              FROM assign
              WHERE assign.stud = stud.id);
The right-hand side of this form of IN is a parenthesized sub query, which must return exactly
one column. The left-hand expression is evaluated and compared to each row of the sub query
result. The result of IN is TRUE if any equal sub query row is found.
ALL
The right-hand side of this form of ALL is a parenthesized sub query, which must return exactly
one column. The left-hand expression is evaluated and compared to each row of the sub query
result using the given operator, which must yield a Boolean result. The result of ALL is TRUE if
all rows yield TRUE (including the special case where the sub query returns no rows). NOT IN
is equivalent to <> ALL.
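As a brief sketch of ANY and ALL in use (reusing the tab1/tab2 tables from the EXISTS example above; treating col2 as comparable with > is an assumption):
SELECT col1
FROM tab1
WHERE col2 > ANY (SELECT col2 FROM tab2);   -- true if tab1.col2 exceeds at least one tab2.col2 value
SELECT col1
FROM tab1
WHERE col2 > ALL (SELECT col2 FROM tab2);   -- true only if tab1.col2 exceeds every tab2.col2 value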
Row-wise comparison
The left-hand side is a list of scalar expressions. The right-hand side can be either a list of scalar
expressions of the same length, or a parenthesized sub query, which must return exactly as many
columns as there are expressions on the left-hand side. Furthermore, the sub query cannot return
more than one row. (If it returns zero rows, the result is taken to be NULL.) The left-hand side is
evaluated and compared row-wise to the single sub query result row, or to the right-hand
expression list. Presently, only = and <> operators are allowed in row-wise comparisons. The
result is TRUE if the two rows are equal or unequal, respectively.
A nested query is a query that has another query embedded within it; the embedded query is
called a subquery.
SQL provides other set operations: IN (to check if an element is in a given set),NOT IN(to
check if an element is not in a given set).
Eg: 1. Find the names of sailors who have reserved boat 103.
SELECT S.sname FROM Sailors S
WHERE S.sid IN (SELECT R.sid FROM Reserves R WHERE R.bid = 103)
The nested subquery computes the (multi)set of sids for sailors who have reserved boat 103,
and the top-level query retrieves the names of sailors whose sid is in this set. The IN operator
allows us to test whether a value is in a given set of elements; an SQL query is used to generate
the set to be tested.
2. Find the names of sailors who have not reserved a red boat.
SELECT S.sname FROM Sailors S
WHERE S.sid NOT IN (SELECT R.sid FROM Reserves R
                    WHERE R.bid IN (SELECT B.bid FROM Boats B WHERE B.color = 'red'))
In the nested queries that we have seen, the inner subquery has been completely independent of
the outer query. In general the inner subquery could depend on the row that is currently being
examined in the outer query .
Eg: Find the names of sailors who have reserved boat number 103.
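A correlated version of this query, sketched in the textbook's style with EXISTS:
SELECT S.sname FROM Sailors S
WHERE EXISTS (SELECT * FROM Reserves R
              WHERE R.bid = 103 AND R.sid = S.sid)
Here the inner subquery refers to S, the row currently being examined in the outer query.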
The EXISTS operator is another set comparison operator, such as IN. It allows us to test
whether a set is nonempty.
Set-Comparison Operators
SQL also supports op ANY and op ALL, where op is one of the arithmetic comparison
operators {<, <=, =, <>, >=,>}.
Eg:1. Find sailors whose rating is better than some sailor called Horatio.
If there are several sailors called Horatio, this query finds all sailors whose rating is better
than that of some sailor called Horatio.
2.Find the sailors with the highest rating.
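Sketches of these two queries in the textbook's style (assuming the usual Sailors schema):
SELECT S.sid FROM Sailors S
WHERE S.rating > ANY (SELECT S2.rating FROM Sailors S2 WHERE S2.sname = 'Horatio')
SELECT S.sid FROM Sailors S
WHERE S.rating >= ALL (SELECT S2.rating FROM Sailors S2)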
Comparison Operators:Comparison operators are used to compare the column data with
specific values in a condition.Comparison Operators are also used along with the SELECT
statement to filter data based on specific conditions.
SQL Comparison Keywords:There are other comparison keywords available in sql which are
used to enhance the search capabilities of a sql query. They are "IN", "BETWEEN...AND", "IS
NULL", "LIKE".
Comparison Operators Description
LIKE column value is similar to specified character(s).
IN column value is equal to any one of a specified set of values.
BETWEEN...AND column value is between two values, including the end values
specified in the range.
IS NULL column value does not exist.
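An example using the LIKE operator with the student_details table (the table used in the examples that follow):
SELECT first_name, last_name
FROM student_details
WHERE first_name LIKE 'S%';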
The above select statement searches for all the rows where the first letter of the column
first_name is 'S' and rest of the letters in the name can be any character.
There is another wildcard character you can use with LIKE operator. It is the underscore
character, ' _ ' . In a search string, the underscore signifies a single character.
To display all the names with 'a' second character,
SELECT first_name, last_name
FROM student_details
WHERE first_name LIKE '_a%';
NOTE: Each underscore acts as a placeholder for only one character, so you can use more than
one underscore. Eg: '__i%' - this has two underscores towards the left; 'S__j%' - this has two
underscores between the characters 'S' and 'j'.
To find the names of the students between age 10 to 15 years, the query would be like,
SELECT first_name, last_name, age
FROM student_details
WHERE age BETWEEN 10 AND 15;
SQL IN Operator: The IN operator is used when you want to compare a column with more than
one value. It is similar to an OR condition.
If you want to find the names of students who are studying either Maths or Science, the query
would be like,
SELECT first_name, last_name, subject
FROM student_details
WHERE subject IN ('Maths', 'Science');
You can include more subjects in the list like ('maths','science','history')
If you want to find the names of students who do not participate in any games, the query would
be as given below
SELECT first_name, last_name
FROM student_details
WHERE games IS NULL
There would be no output, as every student in the table student_details participates in a game;
otherwise, the names of the students who do not participate in any games would be
displayed.
Example
In the following example, aggregate functions are applied to the employee_count column of the branch
table, with region_nbr as the level of grouping. Here are the contents of the table:
Table: BRANCH
branch_nbr branch_name region_nbr employee_count
108 New York 100 10
110 Boston 100 6
212 Chicago 200 5
404 San Diego 400 6
415 San Jose 400 3
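A sketch of an aggregate query over this table, grouping by region_nbr; the particular aggregate functions shown are only illustrative:
SELECT region_nbr,
       COUNT(branch_nbr) AS branch_count,
       SUM(employee_count) AS total_employees,
       MAX(employee_count) AS max_employees
FROM branch
GROUP BY region_nbr;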
Syntax:
The basic syntax of NULL while creating a table:
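A minimal sketch of such a statement, with an illustrative table and columns:
CREATE TABLE customers (
    id      INT         NOT NULL,
    name    VARCHAR(20) NOT NULL,
    address VARCHAR(50),            -- no NOT NULL constraint, so this column may be NULL
    PRIMARY KEY (id)
);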
The following table describes how the logical "OR" operator selects a row.
Column1 Satisfied?    Column2 Satisfied?    Row Selected
YES                   YES                   YES
YES                   NO                    YES
NO                    YES                   YES
NO                    NO                    NO
Example: To find the names of the students between 10 and 15 years of age, the query would be
like:
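One possible way to write this, as a sketch using the student_details table (equivalent to the BETWEEN form shown earlier):
SELECT first_name, last_name, age
FROM student_details
WHERE age >= 10 AND age <= 15;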
The following table describes how the logical "AND" operator selects a row; it is analogous to
the table above, except that a row is selected only when both conditions are satisfied.
Example: If you want to find out the names of the students who do not play football, the query
would be like:
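A sketch with NOT, assuming student_details has a games column as in the IS NULL example above:
SELECT first_name, last_name, games
FROM student_details
WHERE NOT games = 'Football';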
OUTER JOINS
All joins mentioned above, that is, Theta Join, Equi Join, and Natural Join, are called inner joins.
An inner join includes only tuples with matching attributes; the rest are discarded in the
resulting relation. There exist methods, called outer joins, by which all tuples of either relation
are included in the resulting relation.
Left
A B
100 Database
101 Mechanics
102 Electronics
Right
A B
100 Alex
102 Maya
104 Mira
A B C D
100 Database 100 Alex
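A sketch of a left outer join over the two tables above (renamed LeftRel and RightRel here, since LEFT and RIGHT are SQL keywords); tuples of the left relation with no match appear with NULLs in the result:
SELECT L.A, L.B, R.A, R.B
FROM LeftRel L LEFT OUTER JOIN RightRel R ON L.A = R.A;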
Employee ID    Employee Name    Age    Gender    Location    Salary
1001           Henry            54     Male      New York    100000
1002           Tina             36     Female    Moscow      80000
1003           John             24     Male      London      40000
1006           Sophie           29     Female    London      60000
Default values are also subject to integrity constraint checking (defaults are included as part of
an INSERT statement before the statement is parsed.)
If the results of an INSERT or UPDATE statement violate an integrity constraint, the statement
will be rolled back.
Integrity constraints are stored as part of the table definition (in the data dictionary).
If multiple applications access the same table, they will all adhere to the same rule.
NOT NULL
UNIQUE
CHECK constraints for complex integrity rules
PRIMARY KEY
FOREIGN KEY integrity constraints, with referential integrity actions: ON UPDATE, ON DELETE, ON DELETE CASCADE, ON DELETE SET NULL
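A minimal sketch of a table definition combining these constraint types; the Employee and Department tables and their columns are illustrative (Department is assumed to exist already):
CREATE TABLE Employee (
    emp_id  INT         NOT NULL,
    email   VARCHAR(50) UNIQUE,
    age     INT         CHECK (age >= 18),
    dept_id INT,
    PRIMARY KEY (emp_id),
    FOREIGN KEY (dept_id) REFERENCES Department(dept_id) ON DELETE CASCADE
);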
Constraint States
The current status of an integrity constraint can be changed to any of the following 4 options
using the CREATE TABLE or ALTER TABLE statement.
Eg: The trigger called init_count initializes a counter variable before every execution of an
INSERT statement that adds tuples to the Students relation. The trigger called incr_count
increments the counter for each inserted tuple that satisfies the condition age < 18.
CREATE TRIGGER init_count BEFORE INSERT ON Students /* Event */
DECLARE
    count INTEGER;
BEGIN /* Action */
    count := 0;
END

CREATE TRIGGER incr_count AFTER INSERT ON Students /* Event */
WHEN (new.age < 18) /* Condition; 'new' denotes the inserted tuple */
FOR EACH ROW
BEGIN /* Action */
    count := count + 1;
END
Overview: Constructing the tables alone does not make an efficient database design; solving the
redundant-data problem does. For this we use functional dependencies and normal forms, which
are discussed in this chapter.
We now present an overview of the problems that schema refinement is intended to address and
a refinement approach based on decompositions. Redundant storage of information is the root
cause of these problems. Although decomposition can eliminate redundancy, it can lead to
problems of its own and should be used with caution.
Problems Caused by Redundancy
Storing the same information redundantly, that is, in more than one place within a database, can
lead to several problems:
Redundant storage: Some information is stored repeatedly.
Use of Decompositions
Redundancy arises when a relational schema forces an association between attributes that is not
natural. Functional dependencies can be used to identify such situations and to suggest
refinements to the schema. The essential idea is that many problems arising from redundancy can
be addressed by replacing a relation with a collection of ‘smaller’ relations. Each of the smaller
relations contains a subset of the attributes of the original relation. We refer to this process as
decomposition of the larger relation into the smaller relations.
Problems Related to Decomposition: Decomposing a relation schema can create more
problems than it solves. Two important questions must be asked repeatedly:
Do we need to decompose a relation?
What problems (if any) does a given decomposition cause?
To help with the first question, several normal forms have been proposed for relations. If a
relation schema is in one of these normal forms, we know that certain kinds of problems cannot
arise. Considering the normal form of a given relation schema can help us to decide whether or
not to decompose it further. If we decide that a relation schema must be decomposed further, we
must choose a particular decomposition.
With respect to the second question, two properties of decompositions are of particular interest.
The lossless-join property enables us to recover any instance of the decomposed relation from
corresponding instances of the smaller relations. The dependency preservation property enables
us to enforce any constraint on the original relation by simply enforcing some constraints on
each of the smaller relations. That is, we need not perform joins of the smaller relations to check
whether a constraint on the original relation is violated.
Functional Dependency Set: Functional Dependency set or FD set of a relation is the set of all
FDs present in the relation. For Example, FD set for relation STUDENT shown in table 1 is:
{ STUD_NO->STUD_NAME, STUD_NO->STUD_PHONE, STUD_NO->STUD_STATE,
STUD_NO->STUD_COUNTRY,
STUD_NO -> STUD_AGE, STUD_STATE->STUD_COUNTRY }
Attribute Closure: Attribute closure of an attribute set can be defined as set of attributes which
can be functionally determined from it.
How to find attribute closure of an attribute set?
To find attribute closure of an attribute set:
Add elements of attribute set to the result set.
Recursively add elements to the result set which can be functionally determined from the
elements of the result set.
Using FD set of table 1, attribute closure can be determined as:
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_STATE)+ = {STUD_STATE, STUD_COUNTRY}
How to find Candidate Keys and Super Keys using Attribute Closure?
If attribute closure of an attribute set contains all attributes of relation, the attribute set
will be super key of the relation.
If no subset of this attribute set can functionally determine all attributes of the relation,
the set will be candidate key as well. For Example, using FD set of table 1,
(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO, STUD_NAME) will be a super key but not a candidate key, because the closure of its
subset, (STUD_NO)+, already equals the set of all attributes of the relation. So, STUD_NO will be a candidate key.
3.10 Normalization:
In general, database normalization involves splitting tables whose columns hold different
types of data (and perhaps even unrelated data) into multiple tables, each with fewer columns that
describe the attributes of a single concept, physical object, or being.
The goal of normalization is to prevent the problems (called modification anomalies) that plague
a poorly designed relation (table).
Suppose, for example, that you have a table with resort guest ID numbers, activities the guests
have signed up to do, and the cost of each activity, all together in the following
GUEST-ACTIVITY-COST table:
Each row in the table represents a guest who has signed up for the named activity and paid the
specified cost. Assuming that the cost depends only on the activity (that is, a specific activity
costs the same for all guests), if you delete the row for GUEST-ID 2587, you lose not only the
fact that guest 2587 signed up for scuba diving, but also the fact that scuba diving costs $250.00
per outing. This is called a deletion anomaly: when you delete a row, you lose more information
than you intended to remove.
In the current example, a single deletion resulted in the loss of information on two entities: what
activity a guest signed up to do and how much a particular activity costs.
Now, suppose the resort adds a new activity such as horseback riding. You cannot enter the
activity name (horseback riding) or cost ($190.00) into the table until a guest decides to sign up
for it. The unnecessary restriction of having to wait until someone signs up for an activity before
you can record its name and cost is called an insertion anomaly.
In the current example, each insertion adds facts about two entities. Therefore, you cannot
INSERT a fact about one entity until you have an additional fact about the other entity.
Conversely, each deletion removes facts about two entities. Thus, you cannot DELETE the
information about one entity while leaving the information about the other in the table.
You can eliminate modification anomalies through normalization – that is, splitting the single
table with rows that have attributes about two entities into two tables, each of which has rows
with attributes that describe a single entity.
You will be able to remove the aromatherapy appointment for guest 1269 without losing the
fact that an aromatherapy session costs $75.00. Similarly, you can now add the fact that
horseback riding costs $ 190.00 per day to the ACTIVITY – COST table without having to wait
for a guest to sign up for the activity.
During the development of relational database systems in the 1970s, relational theorists kept
discovering new modification anomalies. Someone would find an anomaly, classify it, and then
figure out a way to prevent it by adding additional design criteria to the definition of a "well-formed"
relation. These design criteria are known as normal forms. Not surprisingly, E. F. Codd (of
the 12-rule database definition fame) defined the first, second, and third normal forms (1NF,
2NF, and 3NF).
After Codd postulated 3NF, relational theorists formulated Boyce-Codd normal form (BCNF)
and then fourth normal form (4NF) and fifth normal form (5NF).
First Normal Form:
Normalization is a process by which database designers attempt to eliminate modification
anomalies such as the:
Deletion anomaly:
The inability to remove a single fact from a table without removing other (unrelated) facts you
want to keep.
Insertion anomaly:
The inability to insert one fact without inserting another (and sometimes unrelated) fact.
Update anomaly:
Changing a fact in one column creates a false fact in another set of columns. Modification
anomalies are a result of functional dependencies among the columns in a row (or tuple, to use
the precise relational database term).
A functional dependency means that if you know the value in one column or set of columns, you
can always determine the value of another. To put the table in first normal form (1NF), you could
break up the student number list in the STUDENTS column of each row such that each row had
only one of the student IDs in the STUDENTS column. Doing so would change the table's
structure and rows. The value given by the combination (CLASS, SECTION, STUDENT) is
the composite key for the table, because it makes each row unique and all columns atomic. Now
that the table in the current example is in 1NF, each column has a single, scalar value.
Unfortunately, the table still exhibits modification anomalies:
Deletion anomaly:
If professor SMITH goes to another school and you remove his rows from the table, you also
lose the fact that STUDENTS 1005, 2110 and 3115 are enrolled in a history class.
Insertion anomaly:
If the school wants to add an English class (E100), it cannot do so until a student signs up for the
course (remember, no part of a primary key can have a NULL value).
Update anomaly:
If STUDENT 4587 decides to sign up for the SECTION 1, CS100 CLASS instead of his math
class, updating the CLASS and SECTION columns in the row for STUDENT 4587 to reflect the
change will cause the table to show TEACHER RAWLINS as being in both the MATH and the
COMP-SCI departments.
Thus, 'flattening' a table's columns to put it into first normal form (1NF) does not solve any of
the modification anomalies. All it does is guarantee that the table satisfies the requirements for a
table defined as "relational" and that there are no multi-valued dependencies between the
columns in each row.
When a table is in second normal form, it must be in first normal form (no multi-valued
dependencies) and have no partial key dependencies.
A partial key dependency is a situation in which the value in part of a key can be used to
determine the value of another attribute (column). Thus, a table is in 2NF when the value in all
nonkey columns depends on the entire key. Or, said another way, you cannot determine the value
of any of the columns by using only part of the key. With (CLASS, SECTION, STUDENT) as its
primary key, and given that the university has two rules about taking classes (no student can sign
up for more than one section of the same class, and a student can have only one major), the table,
while in 1NF, is not in 2NF.
Given the value of (STUDENT, CLASS), you can determine the value of SECTION, since
no student can sign up for two sections of the same class. Similarly, since students can sign up
for only one major, knowing STUDENT determines the value of MAJOR. In both instances, the
value of a third column can be deduced (or is determined) by the value in a portion of the key
(CLASS, SECTION, STUDENT) that makes each row unique.
To put the table in the current example in 2NF requires that it be split into three tables,
described by:
Courses (Class, Section, Teacher, Department)
PRIMARY KEY (Class, Section)
Enrollment (Student, Class, Section)
PRIMARY KEY (Student, class)
Students (student, major)
PRIMARY KEY (Student)
Unfortunately, putting a table in 2NF does not eliminate modification anomalies.
Suppose, for example, that professor Jones leaves the university. Removing his row from the
COURSES table would eliminate the entire ENGINEERING department, since he is currently
the only professor in the department.
Similarly, if the university wants to add a music department, it cannot do so until it hires a
professor to teach in the department.
Understanding Third Normal Form :
To be a third normal form (3NF) a table must satisfy the requirements for INF (no multi valued
dependencies) and 2NF ( all nonkey attributes must depend on the entire key). In addition, a
table in 3NF has no transitive dependencies between nonkey columns.
Given a table with columns (A, B, C), a transitive dependency is one in which A determines B, and
B determines C, and therefore A determines C; or, expressed using relational theory notation:
If A → B and B → C, then A → C.
When a table is in 3NF, the value in every nonkey column of the table can be determined by
using the entire key and only the entire key. Therefore, given a table in 3NF with columns
(A, B, C), if A is the PRIMARY KEY, you could not use the value of B (a nonkey column) to
determine the value of C (another nonkey column). As such, A determines B (A → B), and A
determines C (A → C). However, knowing the value of column B does not tell you the value in
column C; that is, it is not the case that B → C.
Suppose, for example, that you have a COURSES table with columns and PRIMARY KEY
described by
Courses (Class, section, teacher, department , department head)
PRIMARY KEY (Class, Section)
That contains the data:
Class    Section    Teacher    Department    Department Head
H100     1          Smith      History       Smith
H100     2          Riley      History       Smith
Given that a TEACHER can be assigned to only one DEPARTMENT and that a
DEPARTMENT can have only one department head, the table has multiple transitive
dependencies.
For example, the value of TEACHER is dependent on the PRIMARY KEY (CLASS,
SECTION), since a particular SECTION of a particular CLASS can have only one teacher; that is,
A → B. Moreover, since a TEACHER can be in only one DEPARTMENT, the value in
DEPARTMENT is dependent on the value in TEACHER; that is, B → C. However, since the
PRIMARY KEY (CLASS, SECTION) determines the value of TEACHER, it also determines
the value of DEPARTMENT; that is, A → C. Thus, the table exhibits the transitive dependency in
which A → B and B → C, and therefore A → C.
The problem with a transitive dependency is that it makes the table subject to the deletion
anomaly. When Smith retires and we remove his row from the table, we lose not only the fact
that Smith taught SECTION 1 of H100 but also the fact that SECTION 1 of H100 was a class
that belonged to the HISTORY department.
To put a table with transitive dependencies between nonkey columns into 3NF requires that the
table be split into multiple tables. To do so for the table in the current example, we would need to
split it into tables described by:
Courses (Class, Section, Teacher)
PRIMARY KEY (class, section)
Teachers (Teacher, department)
PRIMARY KEY (teacher)
Departments (Department, Department head)
PRIMARY KEY (department )
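A sketch of this 3NF decomposition as SQL table definitions; the column types and lengths are assumed for illustration:
CREATE TABLE Departments (
    department      VARCHAR(30) PRIMARY KEY,
    department_head VARCHAR(30)
);

CREATE TABLE Teachers (
    teacher    VARCHAR(30) PRIMARY KEY,
    department VARCHAR(30) REFERENCES Departments(department)
);

CREATE TABLE Courses (
    class   VARCHAR(10),
    section INT,
    teacher VARCHAR(30) REFERENCES Teachers(teacher),
    PRIMARY KEY (class, section)
);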
After Normalization
SID    Sname    CID    Cname    Fee
S1     A        C1     C        5k
S2     A        C1     C        5k
S1     A        C2     C        10k
S3     B        C2     C        10k
S3     B        C2     JAVA     15k
Primary Key (SID, CID)
Here all the data is stored in a single table, which causes redundancy of data (anomalies), as SID
and Sname are repeated for the same CID.
3.12 OTHER KINDS OF DEPENDENCIES:
Finish-to-Start Dependencies: The most common type of dependency is the finish-to-start
relationship (FS). This relationship means that the first task, the predecessor, must be finished
before the next task, the successor, can start. On the Gantt chart it is usually represented as
follows:
Start-to-Start Dependencies
The next type of dependency is the start-to-start relationship (SS). This relationship means that
the successor task cannot start until the predecessor task starts. On the Gantt chart, it is usually
represented as follows:
Finish-to-Finish Dependencies
The third type of dependency is the finish-to-finish relationship (FF). This relationship means
that the successor task cannot finish until the predecessor task finishes. On the Gantt chart, it is
usually represented as follows:
Start-to-Finish Dependencies
The start-to-finish relationship (SF) is the least common task relationship and means that the
successor cannot finish until the predecessor starts. On the Gantt chart, it is usually represented
as follows:
Of course tasks sometimes overlap – this is termed lead (or lead time). Tasks can also be delayed
(for example, to wait while concrete dries) which is called lag (or lag time).
UNIT-IV
Overview:
In this unit we introduce two topics. The first is concurrency control: stored data is accessed by
many users, and if two or more users try to access the same data at the same time, it may cause
data inconsistency; concurrency control methods were invented to solve this. The second is
recovery, which is used to preserve the data without loss in the event of power failure, software
failure, or hardware failure.
4.1 Transactions
Collections of operations that form a single logical unit of work are called Transactions. A
database system must ensure proper execution of transactions despite failures – either the entire
transaction executes, or none of it does.
4.2 Transaction Concept:
A transaction is a unit of program execution that accesses and possibly updates various data
items. Usually, a transaction is initiated by a user program written in a high level data
manipulation language or programming language ( for example SQL, COBOL, C, C++ or
JAVA), where it is delimited by statements ( or function calls) of the form Begin transaction and
end transaction. The transaction consists of all operations executed between the begin transaction
and end transaction.
To ensure integrity of the data, we require that the database system maintain the following
properties of the transaction.
Atomicity: Either all operations of the transaction are reflected properly in the database, or none
are.
Consistency: Execution of a transaction in isolation ( that is, with no other transaction executing
concurrently) preserves the consistency of the database.
Isolation: Even though multiple transactions may execute concurrently, the system guarantees
that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution
before Ti started, or Tj started execution after Ti finished. Thus, each transaction is unaware of
other transactions executing concurrently in the system.
Durability: After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures.
Transaction state:
In the absence of failures, all transactions complete successfully. However, a transaction may not
always complete its execution successfully; such a transaction is termed aborted. If we are to
ensure the atomicity property, an aborted transaction must have no effect on the state of the database.
Thus, any changes that the aborted transaction made to the database must be undone. Once the
changes caused by an aborted transaction have been undone, we say that the transaction has been
rolled back. It is part of the responsibility of the recovery scheme to manage transaction aborts.
Once a transaction has committed, we cannot undo its effects by aborting it. The only way to undo
the effects of a committed transaction is to execute a compensating transaction. For instance, if a
transaction added $20 to an account, the compensating transaction would subtract $20 from the
account. However, it is not always possible to create such a compensating transaction. Therefore,
the responsibility of writing and executing a compensating transaction is left to the user and is
not handled by the database system. A transaction must be in one of the following states:
Active:
The initial state; the transaction stays in this state while it is executing.
Partially committed:
After the final statement has been executed.
Failed:
After the discovery that normal execution can no longer proceed.
Aborted:
After the transaction has been rolled back and the database has been restored to its state prior to
the start of the transaction.
Committed:
After successful completion.
We say that a transaction has committed only if it has entered the committed state. Similarly, we
say that a transaction has aborted only if it has entered the aborted state. A transaction is said to
have terminated if it has either committed or aborted.
A transaction starts in the active state. When it finishes its final statement, it enters the
partially committed state. At this point, the transaction has completed its execution, but it is still
possible that it may have to be aborted, since the actual output may still be temporarily residing
in main memory, and thus a hardware failure may preclude its successful completion.
The database system then writes out enough information to disk that, even in the event of
a failure, the updates performed by the transaction can be recreated when the system restarts after
the failure. When the last of this information is written out, the transaction enters the committed
state.
A transaction enters the failed state after the system determines that the transaction can no
longer proceed with its normal execution (for example, because of hardware or logical errors).
Such a transaction must be rolled back. Then, it enters the aborted state. At this point, the system
has two options.
It can restart the transaction, but only if the transaction was aborted as a result of some hardware
or software error that was not created through the internal logic of the transaction. A restarted
transaction is considered to be a new transaction.
It can kill the transaction. It usually does so because of some internal logical error that can be
corrected only by rewriting the application program, or because the input was bad, or because the
desired data were not found in the database.
We must be cautious when dealing with observable external writes, such as writes to a terminal
or printer. Once such a write has occurred, it cannot be erased, since it may have been seen
external to the database system. Most systems allow such writes to take place only after the
transaction has entered the committed state.
These properties are often called the ACID properties; the acronym is derived from the first letter
of each of the four properties.
Volatile Memory
These are the primary memory devices in the system and are placed along with the CPU. These
memories can store only a small amount of data, but they are very fast, e.g., main memory, cache
memory, etc. These memories cannot endure system crashes; data in them will be lost
on failure.
Non-Volatile memory
These are secondary memories and are huge in size, but slow in processing, e.g., flash memory,
hard disks, magnetic tapes, etc. These memories are designed to withstand system crashes.
Stable Memory
This is said to be a third form of memory structure, but it is essentially non-volatile memory in
which copies of the same data are stored at different places. This is because, in case of any crash
and data loss, the data can be recovered from the other copies. This even helps if one of the
non-volatile memories is lost due to fire or flood: the data can be recovered from another network
location. But there can be failures while taking the backup of the DB onto different stable storage
devices; the transfer may partially copy the data to the remote devices, or fail to store the data in
stable memory altogether. Hence extra caution has to be taken while copying data from one stable
memory to another. There are different methods of copying the data. One of them is to copy the
data in two phases: copy the data blocks to the first storage device, and if that succeeds, copy
them to the second storage device. The copy is complete only when the second copy finishes
successfully. But the second copy may fail partway through copying the blocks; in such a case,
each data block in the first and second copies would need to be compared for inconsistencies.
Verifying every block would be a very costly task, as there may be a huge number of data blocks.
A better way to identify the failed block is to identify the block that was in progress during the
failure, take only this block, compare the data, and correct the mismatches.
Failure Classification
When a transaction is being executed in the system, it may fail to execute due to various reasons.
The failure can be because of a system program, a bug in a program, a user action, or a system
crash. These failures can be broadly classified into three categories.
Transaction Failure: This type of failure affects only a few tables or processes. It is the
condition in which a transaction can no longer continue its execution. The failure can be caused
by the user or by the executing program/transaction. The user may cancel the transaction while it
is executing, by pressing a cancel button or aborting it using DB commands. The transaction may
also fail because of constraints on the tables (violation of constraints). It can even fail if there is
concurrent processing of multiple transactions and there is a lack of resources for all of them, or
a deadlock situation. All of these cause the transaction to stop processing in the middle of its
execution. When a transaction fails or stops in the middle, it may have partially changed the DB,
and it needs to be rolled back to the previous consistent state. In the ATM withdrawal example, if
the user cancels his transaction after step (i), the system should be able to stop further processing
of the transaction; if he cancels it after step (ii), the system should be strong enough to update his
balance in his account. The system may also cancel the transaction due to insufficient balance. In
short, the failure can be because of errors in the code (logical errors) or because of system errors
like deadlock or unavailability of system resources to execute the transactions.
System Crash: This can be because of hardware or software failure, or because of external
factors like power failure; that is, the failure of the system because of a bug in the software or the
failure of the system processor. This crash mainly affects the data in primary memory. If it
affects only the primary memory, the actual data will not really be affected, and recovery from
this failure is easy, because primary memories are temporary storage that would not yet have
updated the actual database; the system therefore remains in the consistent state it was in before
the transaction. But when secondary memory crashes, there may be a loss of data, and serious
actions are needed to recover the lost data, because secondary memory contains the actual DB
data. Recovering it from a crash is a little tedious and requires more effort. The DB recovery
system provides strong mechanisms to recover the system from a crash and maintain the
atomicity of the transactions. In most cases, data in secondary memory is not affected by this
kind of crash, because the database has many integrity checkpoints to prevent data loss from
secondary memory.
Disk Failure: These are issues with hard disks, like the formation of bad sectors, a disk head crash,
unavailability of the disk, etc. Data can even be lost because of fire, flood, theft, etc. This mainly
affects the secondary memory, where the actual data lies. In these cases, we need alternative
ways of storing the DB: we can create backups of the DB on a regular basis and store them
separately from the memory where the DB is stored, or maintain multiple copies of the DB at
different network locations to recover from failure.
To gain a better understanding of ACID properties and the need for them, consider a simplified
banking system consisting of several accounts and a set of transactions that access and update
those accounts.
Read (X) which transfers the data item X from the database to a local buffer belonging to the
transaction that executed the read operation
Write (X), which transfers the data item X from the local buffer of the transaction that executed
the write back to the database.
In a real database system, the write operation does not necessarily result in the immediate update
of the data on the disk; the write operation may be temporarily stored in memory and executed
on the disk later.
For now, however, we shall assume that the write operation updates the database immediately.
Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be
defined as
Ti: read(A);
    A := A - 50;
    write(A);
    read(B);
    B := B + 50;
    write(B).
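For illustration only, the same transfer written as an SQL transaction, assuming an account(acct_no, balance) table; the statement that starts a transaction varies across systems:
START TRANSACTION;  -- some systems use BEGIN or BEGIN TRANSACTION
UPDATE account SET balance = balance - 50 WHERE acct_no = 'A';
UPDATE account SET balance = balance + 50 WHERE acct_no = 'B';
COMMIT;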
Consistency:
Execution of a transaction in isolation ( that is, with no other transaction executing concurrently)
preserves the consistency of the database.
The consistency requirement here is that the sum of A and B be unchanged by the execution of
the transaction. Without the consistency requirement, money could be created or destroyed by
the transaction. It can be verified easily that, if the database is consistent before an execution of
the transaction, the database remains consistent after the execution of the transaction.
Ensuring consistency for an individual transaction is the responsibility of the application
programmer who codes the transaction. This task may be facilitated by automatic testing of
integrity constraints.
Atomicity:
Suppose that, just before the execution of transaction Ti, the values of accounts A and B are
$1000 and $2000, respectively.
Now suppose that, during the execution of transaction Ti, a failure occurs that prevents Ti from
completing its execution successfully.
Examples of such failures include power failures, hardware failures, and software errors
Further, suppose that the failure happened after the write (A) operation but before the write (B)
operation. In this case, the values of amounts A and B reflected in the database are $950 and
$2000. The system destroyed $50 as a result of this failure.
In particular, we note that the sum A + B is no longer preserved. Thus, because of the failure, the
state of the system no longer reflects a real state of the world that the database is supposed to
capture. We term such a state an inconsistent state. We must ensure that such inconsistencies are
not visible in a database system.
Note, however, that the system must at some point be in an inconsistent state. Even if transaction
Ti is executed to completion, there exists a point at which the value of account A is $950 and
the value of account B is $2000, which is clearly an inconsistent state.
This state, however is eventually replaced by the consistent state where the value of account A is
$ 950, and the value of account B is $ 2050.
Thus, if the transaction never started or was guaranteed to complete, such an inconsistent state
would not be visible except during the execution of the transaction.
If the atomicity property is present, all actions of the transaction are reflected in the database or
none are.
Ensuring atomicity is the responsibility of the database system itself; specifically, it is handled by
a component called the transaction management component.
4.6 Serializability:
When multiple transactions are being executed by the operating system in a multiprogramming
environment, there are possibilities that instructions of one transaction are interleaved with those
of some other transaction.
Serial Schedule − It is a schedule in which transactions are aligned in such a way that
one transaction is executed first. When the first transaction completes its cycle, then the
next transaction is executed. Transactions are ordered one after the other. This type of
schedule is called a serial schedule, as transactions are executed in a serial manner.
To resolve this problem, we allow parallel execution of a transaction schedule, if its transactions
are either serializable or have some equivalence relation among them.
Equivalence Schedules
An equivalence schedule can be of the following types −
Result Equivalence
If two schedules produce the same result after execution, they are said to be result equivalent.
They may yield the same result for some value and different results for another set of values.
That's why this equivalence is not generally considered significant.
View Equivalence
Two schedules are said to be view equivalent if the transactions in both the schedules perform
similar actions in a similar manner.
For example −
If T reads the initial data in S1, then it also reads the initial data in S2.
If T reads the value written by J in S1, then it also reads the value written by J in S2.
If T performs the final write on the data value in S1, then it also performs the final write
on the data value in S2.
Conflict Equivalence
Two operations are said to be conflicting if they belong to different transactions, access the same
data item, and at least one of them is a write operation. Two schedules are conflict equivalent if
their conflicting operations appear in the same order in both schedules.
Concurrency Control:
4.7. Lock-Based protocols:
A DBMS must be able to ensure that only serializable, recoverable schedules are allowed, and
that no actions of committed transactions are lost while undoing aborted transactions. A
DBMS typically uses a locking protocol to achieve this. A locking protocol is a set of rules to
be followed by each transaction, in order to ensure that even though actions of several
transactions might be interleaved, the net effect is identical to executing all transactions in
some serial order.
Strict Two-Phase Locking (Strict 2PL):
The most widely used locking protocol, called Strict Two-Phase Locking, or Strict 2PL,
has two rules. The first rule is:
1. If a transaction T wants to read (respectively, modify) an object, it first requests a shared
(respectively, exclusive) lock on the object.
Of course, a transaction that has an exclusive lock can also read the object; an additional shared
lock is not required. A transaction that requests a lock is suspended until the DBMS is able to
grant it the requested lock. The DBMS keeps track of the locks it has granted and ensures that if
a transaction holds an exclusive lock on an object, no other transaction holds a shared or exclusive
lock on the same object.
2. All locks held by a transaction are released when the transaction is completed.
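A small illustration of the effect of Strict 2PL in a typical SQL system; the table, rows, and exact locking behavior are assumptions, since details vary by DBMS:
-- Session 1
START TRANSACTION;
UPDATE Sailors SET rating = 9 WHERE sid = 22;   -- acquires an exclusive lock on this row
-- Session 2 (running concurrently)
START TRANSACTION;
UPDATE Sailors SET rating = 7 WHERE sid = 22;   -- blocks, waiting for the conflicting lock
-- Session 1
COMMIT;                                         -- locks released only now; Session 2 can proceed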
If a transaction accesses several pages of a file, it should lock the entire file, and if it accesses
just a few pages, it should lock just those pages. Similarly, if a transaction accesses several
records on a page, it should lock the entire page, and if it accesses just a few records, it should
lock just those records.
The question to be addressed is how a lock manager can efficiently ensure that a page,
for example, is not locked by a transaction while another transaction holds a conflicting lock on
the file containing the page.
The recovery manager of a DBMS is responsible for ensuring two important properties of
transactions: atomicity and durability. It ensures atomicity by undoing the actions of transactions
that do not commit and durability by making sure that all actions of committed transactions
survive system crashes, (e.g., a core dump caused by a bus error) and media failures (e.g., a
disk is corrupted).
The Log: The log, sometimes called the trail or journal, is a history of actions executed by the
DBMS. Physically, the log is a file of records stored in stable storage, which is assumed to
survive crashes; this durability can be achieved by maintaining two or more copies of the log on
different disks, so that the chance of all copies of the log being simultaneously lost is negligibly
small.
The most recent portion of the log, called the log tail, is kept in main memory and is
periodically forced to stable storage. This way, log records and data records are written to disk at
the same granularity.
Every log record is given a unique id called the log sequence number (LSN). As with
any record id, we can fetch a log record with one disk access given the LSN. Further, LSNs
should be assigned in monotonically increasing order; this property is required for the ARIES
recovery algorithm. If the log is a sequential file, in principle growing indefinitely, the LSN can
simply be the address of the first byte of the log record.
The transaction is considered to have committed at the instant that its commit log record is
written to stable storage
Abort: When a transaction is aborted, an abort type log record containing the transaction id is
appended to the log, and Undo is initiated for this transaction.
End: As noted above, when a transaction is aborted or committed, some additional actions must
be taken beyond writing the abort or commit log record. After all these additional steps are
completed, an end type log record containing the transaction id is appended to the log.
Undoing an update: When a transaction is rolled back (because the transaction is aborted, or
during recovery from a crash), its updates are undone. When the action described by an update
log record is undone, a compensation log record, or CLR, is written.
Dirty page table: This table contains one entry for each dirty page in the buffer pool, that is,
each page with changes that are not yet reflected on disk. The entry contains a field recLSN,
which is the LSN of the first log record that caused the page to become dirty. Note that this LSN
identifies the earliest log record that might have to be redone for this page during restart from a
crash.
Checkpoint
A checkpoint is like a snapshot of the DBMS state, and by taking checkpoints periodically, as
we will see, the DBMS can reduce the amount of work to be done during restart in the event of a
subsequent crash.
We assume that each transaction Ti executes in two or three different phases in its lifetime,
depending on whether it is a read-only or an update transaction. The phases are, in order,
1. Read phase. During this phase, the system executes transaction Ti. It reads the values of the
various data items and stores them in variables local to Ti. It performs all write operations on
temporary local variables, without updates of the actual database.
2. Validation phase. Transaction Ti performs a validation test to determine whether it can copy
to the database the temporary local variables that hold the results of write operations without
causing a violation of serializability.
3. Write phase. If transaction Ti succeeds in validation (step 2), then the system applies the
actual updates to the database. Otherwise, the system rolls back Ti.
Each transaction must go through the three phases in the order shown. However, all three phases
of concurrently executing transactions can be interleaved.
To perform the validation test, we need to know when the various phases of transaction Ti took
place. We shall, therefore, associate three different timestamps with transaction Ti:
1. Start(Ti), the time when Ti started its execution.
2. Validation(Ti ), the time when Ti finished its read phase and started its validation phase.
3. Finish(Ti), the time when Ti finished its write phase.
We determine the serializability order by the timestamp-ordering technique, using the value of
the timestamp Validation(Ti). Thus, the value TS(Ti) = Validation(Ti) and, if TS(Tj ) < TS(Tk ),
then any produced schedule must be equivalent to a serial schedule in which
transaction Tj appears before transaction Tk . The reason we have chosen Validation(Ti), rather
than Start(Ti), as the timestamp of transaction Ti is that we can expect faster response time
provided that conflict rates among transactions are indeed low.
The validation test for transaction Tj requires that, for all transactions Ti with TS(Ti) < TS(Tj ),
one of the following two conditions must hold:
1. Finish(Ti) < Start(Tj ). Since Ti completes its execution before Tj started, the serializability
order is indeed maintained.
2. The set of data items written by Ti does not intersect with the set of data items read by Tj ,
and Ti completes its write phase before Tj starts its validation phase
(Start(Tj ) < Finish(Ti) < Validation(Tj )). This condition ensures that
the writes of Ti and Tj do not overlap. Since the writes of Ti do not affect the read of Tj , and
since Tj cannot affect the read of Ti, the serializability order is indeed maintained.
As an illustration, consider again transactions T14 and T15. Suppose that TS(T14) < TS(T15).
Then, the validation phase succeeds in the schedule 5 in Figure 16.15. Note that the writes to the
actual variables are performed only after the validation phase of T15. Thus, T14 reads the old
values of B and A, and this schedule is serializable.
The validation scheme automatically guards against cascading rollbacks, since the actual writes
take place only after the transaction issuing the write has committed.
However, there is a possibility of starvation of long transactions, due to a sequence of conflicting
short transactions that cause repeated restarts of the long transaction.
To avoid starvation, conflicting transactions must be temporarily blocked, to enable the long
transaction to finish.
This validation scheme is called the optimistic concurrency control scheme since transactions
execute optimistically, assuming they will be able to finish execution and validate at the end. In
contrast, locking and timestamp ordering are pessimistic in that they force a wait or a rollback
whenever a conflict is detected, even though there is a chance that the schedule may be conflict
serializable.
Recovery System:
Crash Recovery:
A DBMS is a highly complex system with hundreds of transactions being executed every second.
The durability and robustness of a DBMS depends on its complex architecture and its underlying
hardware and system software. If it fails or crashes amid transactions, it is expected that the
system would follow some sort of algorithm or techniques to recover lost data.
Non-volatile storage − These memories are made to survive system crashes. They are
huge in data storage capacity, but slower in accessibility. Examples may include hard-
disks, magnetic tapes, flash memory, and non-volatile (battery backed up) RAM.
When a system crashes, it may have several transactions being executed and various files opened
for them to modify the data items. Transactions are made of various operations, which are atomic
in nature. But according to ACID properties of DBMS, atomicity of transactions as a whole must
be maintained, that is, either all the operations are executed or none.
When a DBMS recovers from a crash, it should maintain the following −
It should check the states of all the transactions, which were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the
atomicity of the transaction in this case.
It should check whether the transaction can be completed now or it needs to be rolled
back.
No transactions would be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques, which can help a DBMS in recovering as well as maintaining
the atomicity of a transaction −
Maintaining the logs of each transaction, and writing them onto some stable storage
before actually modifying the database.
Maintaining shadow paging, where the changes are done on a volatile memory, and later,
the actual database is updated.
Log-based Recovery
Log is a sequence of records, which maintains the records of actions performed by a transaction.
It is important that the logs are written prior to the actual modification and stored on a stable
storage media, which is failsafe.
Log-based recovery works as follows −
The log file is kept on a stable storage media.
When a transaction enters the system and starts execution, it writes a log record about it:
<Tn, Start>
When the transaction modifies an item X, it writes a log record of the form <Tn, X, V1, V2>,
where V1 is the value of X before the write and V2 is the value after the write.
The recovery system reads the logs backwards from the end to the last checkpoint.
It maintains two lists, an undo-list and a redo-list.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn,
Commit>, it puts the transaction in the redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log record, it
puts the transaction in the undo-list.
All the transactions in the undo-list are then undone and their logs are removed. All the
transactions in the redo-list are redone, and their logs are saved again.
4.13 Recovery Algorithm:
Introduction to ARIES
ARIES is a recovery algorithm that is designed to work with a steal, no-force approach. When
the recovery manager is invoked after a crash, restart proceeds in three phases:
1. Analysis: Identifies dirty pages in the buffer pool and active transactions at the time of the
crash.
2. Redo: Repeats all actions, starting from an appropriate point in the log, and restores the
database state to what it was at the time of the crash.
3. Undo: Undoes the actions of transactions that did not commit, so that the database reflects only
the actions of committed transactions.
There are three main principles behind the ARIES recovery algorithm:
Write-ahead logging: Any change to a database object is first recorded in the log; the record in
the log must be written to stable storage before the change to the database object is written to
disk.
Repeating history during Redo: Upon restart following a crash, ARIES retraces all actions of
the DBMS before the crash and brings the system back to the exact state that it was in at the time
of the crash. Then, it undoes the actions of transactions that were still active at the time of the
crash.
Logging changes during Undo: Changes made to the database while undoing a transaction are
logged in order to ensure that such an action is not repeated in the event of repeated restarts.
Buffer Management:
A DBMS must manage a huge amount of data, and in the course of processing, the space required
for the blocks of data will often be greater than the memory space available. For this reason there
is a need to manage a memory area in which to load and unload the blocks. The buffer manager is
primarily responsible for managing the operations involved in saving and loading blocks. The
operations that the buffer manager provides are these:
FIX: This command tells the buffer manager to load a block from disk and return a pointer to the
memory where it is loaded. If the block was already in memory, the buffer manager needs only to
return the pointer; otherwise it must load the block from disk and bring it into memory. If the
buffer memory is full, two situations are possible:
There is the possibility of releasing a portion of memory that is occupied by transactions already
completed. In this case, before freeing the area, its contents are written to disk if any block in the
area has been changed.
There is the possibility that the remaining memory is occupied by transactions still ongoing. In
this case, the buffer manager can work in two ways: in the first mode (STEAL), it frees buffer
memory occupied by a transaction that is still active, possibly saving its changes to disk; in the
second mode (NO STEAL), the transaction requesting the block is made to wait until memory
becomes free.
SET DIRTY: Invoking this command marks a block of memory as modified.
Before introducing the last two commands, note that the DBMS can operate in two modes:
FORCE and NO FORCE. When working in FORCE mode, the write to disk is synchronous with
the commit of a transaction. When working in NO FORCE mode, the write is carried out from
time to time in an asynchronous manner. Typically, commercial databases operate in NO FORCE
mode because this allows an increase in performance: a block may undergo multiple changes in
memory before being saved, and the saves can be made when the system load is low.
FORCE: This command causes the buffer manager to perform the write synchronously with the
completion (commit) of the transaction.
FLUSH: This command causes the buffer manager to perform the save when running in NO
FORCE mode.
The system uses the most recent dump in restoring the database to a previous consistent state.
Once this restoration has been accomplished, the system uses the log to bring the database system
to the most recent consistent state.
More precisely, no transaction may be active during the dump procedure, and a procedure similar
to checkpointing must take place:
1. Output all log records currently residing in main memory onto stable storage.
2. Output all buffer blocks onto the disk.
3. Copy the contents of the database to stable storage.
4. Output a log record <dump> onto the stable storage.
Steps 1, 2, and 4 correspond to the three steps used for checkpoints in Section 17.4.3.
To recover from the loss of nonvolatile storage, the system restores the database to disk by using
the most recent dump. Then, it consults the log and redoes all the transactions that have
committed since the most recent dump occurred. Notice that no undo operations need to be
executed.
A dump of the database contents is also referred to as an archival dump, since we can archive
the dumps and use them later to examine old states of the database.
Dumps of a database and checkpointing of buffers are similar.
The simple dump procedure described here is costly for the following two reasons.
First, the entire database must be copied to stable storage, resulting in considerable data
transfer. Second, since transaction processing is halted during the dump procedure, CPU cycles
are wasted. Fuzzy dump schemes have been developed, which allow transactions to be active
while the dump is in progress. They are similar to fuzzy checkpointing schemes; see the
bibliographical notes for more details.
A remote, online, or managed backup service, sometimes marketed as cloud backup or backup-
as-a-service, is a service that provides users with a system for the backup, storage, and recovery
of computer files. Online backup providers are companies that provide this type of service to end
users (or clients). Such backup services are considered a form of cloud computing.
Online backup systems are typically built around a client software program that runs on a
schedule. Some systems run once a day, usually at night while computers aren't in use. Other
newer cloud backup services run continuously to capture changes to user systems nearly in real-
time. The online backup system typically collects, compresses, encrypts, and transfers the data to
the remote backup service provider's servers or off-site hardware.
There are many products on the market – all offering different feature sets, service levels, and
types of encryption. Providers of this type of service frequently target specific market segments.
High-end LAN-based backup systems may offer services such as Active Directory, client remote
control, or open file backups. Consumer online backup companies frequently have beta software
offerings and/or free-trial backup services with fewer live support options.
UNIT-V
The lowest layer of the software deals with management of space on disk, where the
data is to be stored. Higher layers allocate, deallocate, read, and write pages through
(routines provided by) this layer, called the disk space manager.
On top of the disk space manager, we have the buffer manager, which partitions the
available main memory into a collection of pages, called frames. The purpose of the buffer
manager is to bring pages in from disk to main memory as needed in response to read
requests from transactions.
The next layer includes a variety of software for supporting the concept of a file, which,
in a DBMS, is a collection of pages or a collection of records. This layer typically
supports a heap file, or file of unordered pages, as well as indexes. In addition to
keeping track of the pages in a file, this layer organizes the information within a page.
The code that implements relational operators sits on top of the file and access methods
layer. These operators serve as the building blocks for evaluating queries posed against
the data.
When a user issues a query, the query is presented to a query optimizer, which uses
information about how the data is stored to produce an efficient execution plan for
evaluating the query. An execution plan is usually represented as a tree of relational
operators (with annotations that contain additional detailed information about which
access methods to use).
Data in a DBMS is stored on storage devices such as disks and tapes; the disk space
manager is responsible for keeping track of available disk space. The file manager, which
provides the abstraction of a file of records to higher levels of DBMS code, issues requests
to the disk space manager to obtain and relinquish space on disk.
When a record is needed for processing, it must be fetched from disk to main memory.
The page on which the record resides is determined by the file manager.
Sometimes, the file manager uses auxiliary data structures to quickly identify the page
that contains a desired record. After identifying the required page, the file manager
issues a request for the page to a layer of DBMS code called the buffer manager.
Primary Storage − The memory storage that is directly accessible to the CPU
comes under this category. CPU's internal memory (registers), fast memory
(cache), and main memory (RAM) are directly accessible to the CPU, as they are
all placed on the motherboard or CPU chipset. This storage is typically very
small, ultra-fast, and volatile. Primary storage requires continuous power supply
in order to maintain its state. In case of a power failure, all its data is lost.
Secondary Storage − Secondary storage devices are used to store data for future
use or as backup. Secondary storage includes memory devices that are not a part
of the CPU chipset or motherboard, for example, magnetic disks, optical disks
(DVD, CD, etc.), hard disks, flash drives, and magnetic tapes.
Tertiary Storage − Tertiary storage is used to store huge volumes of data. Since
such storage devices are external to the computer system, they are the slowest in
speed. These storage devices are mostly used to back up an entire
system. Optical disks and magnetic tapes are widely used as tertiary storage.
Memory Hierarchy
A computer system has a well-defined hierarchy of memory. A CPU has direct access to
its main memory as well as its inbuilt registers. Main memory access is obviously slower
than the CPU's own speed. To minimize this speed mismatch, cache memory is
introduced. Cache memory provides the fastest access time, and it contains data that is
most frequently accessed by the CPU.
The memory with the fastest access is the costliest one. Larger storage devices offer
slow speed and they are less expensive, however they can store huge volumes of data as
compared to CPU registers or cache memory.
Magnetic Disks
Hard disk drives are the most common secondary storage devices in present computer
systems. These are called magnetic disks because they use the concept of magnetization
to store information. Hard disks consist of metal platters coated with magnetizable material, mounted on a spindle. A read/write head moves between the platters and is used to magnetize or de-magnetize the spot under it. A magnetized spot is interpreted as 0 (zero) or 1 (one).
Hard disks are formatted in a well-defined order to store data efficiently. A hard disk platter has many concentric circles on it, called tracks. Every track is further divided into sectors. A sector on a hard disk typically stores 512 bytes of data.
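As a quick illustration of this geometry (all numbers below are hypothetical, not taken from any particular drive), the raw capacity of a disk can be estimated by multiplying the number of recording surfaces, tracks per surface, sectors per track, and bytes per sector:

# Hypothetical disk geometry -- illustrative values only.
SECTOR_BYTES = 512            # typical sector size mentioned above
sectors_per_track = 400
tracks_per_surface = 10_000
recording_surfaces = 8        # e.g. 4 platters, both sides used

capacity_bytes = (recording_surfaces * tracks_per_surface
                  * sectors_per_track * SECTOR_BYTES)
print(f"Raw capacity: {capacity_bytes / 2**30:.1f} GiB")   # about 15.3 GiB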
Redundant Array of Independent Disks
RAID or Redundant Array of Independent Disks, is a technology to connect multiple
secondary storage devices and use them as a single storage media.
RAID consists of an array of disks in which multiple disks are connected together to
achieve different goals. RAID levels define the use of disk arrays.
RAID 0
In this level, a striped array of disks is implemented. The data is broken down into
blocks and the blocks are distributed among disks. Each disk receives a block of data to
write/read in parallel. It enhances the speed and performance of the storage device.
There is no parity and backup in Level 0.
RAID 1
RAID 1 uses mirroring techniques. When data is sent to a RAID controller, it sends a
copy of data to all the disks in the array. RAID level 1 is also called mirroring and
provides 100% redundancy in case of a failure.
RAID 2
RAID 2 records an Error Correction Code (ECC), based on Hamming codes, for its data, which is striped across different disks. Like level 0, each data bit in a word is recorded on a separate disk, and the ECC codes of the data words are stored on a different set of disks. Due to its complex structure and high cost, RAID 2 is not commercially available.
RAID 3
RAID 3 stripes the data onto multiple disks. The parity bit generated for each data word is stored on a dedicated parity disk. This technique makes it possible to recover from single-disk failures.
RAID 4
In this level, an entire block of data is written onto data disks and then the parity is
generated and stored on a different disk. Note that level 3 uses byte-level striping,
whereas level 4 uses block-level striping. Both level 3 and level 4 require at least three
disks to implement RAID.
RAID 5
RAID 5 writes whole data blocks onto different disks, but the parity bits generated for each block stripe are distributed among all the disks rather than being stored on a dedicated parity disk.
RAID 6
RAID 6 is an extension of level 5. In this level, two independent parities are generated
and stored in distributed fashion among multiple disks. Two parities provide additional
fault tolerance. This level requires at least four disk drives to implement RAID.
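The parity used by RAID levels 3 to 6 is essentially a bitwise XOR over the data blocks of a stripe: if any single disk fails, XOR-ing the parity with the surviving blocks regenerates the lost block. A minimal sketch of the idea (the block contents and sizes are made up for illustration):

from functools import reduce

def parity(blocks):
    """Bitwise XOR of equal-sized blocks (one block per disk)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread over three data disks (hypothetical 4-byte blocks).
stripe = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
p = parity(stripe)

# Simulate losing disk 1: XOR of the parity block and the surviving blocks
# reconstructs the missing block.
recovered = parity([p, stripe[0], stripe[2]])
assert recovered == stripe[1]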
File Organization
File Organization defines how file records are mapped onto disk blocks. Common ways of organizing file records include heap (unordered), sequential (sorted), hash, and clustered file organization. The operations performed on file records fall into two broad categories:
Update Operations
Retrieval Operations
Update operations change the data values by insertion, deletion, or update. Retrieval operations, on the other hand, do not alter the data but retrieve them after optional conditional filtering. In both types of operations, selection plays a significant role. Other than creation and deletion of a file, there are several other operations that can be performed on files (a small sketch using an ordinary file API follows this list):
Open − A file can be opened in one of the two modes, read mode or write
mode. In read mode, the operating system does not allow anyone to alter data. In
other words, data is read only. Files opened in read mode can be shared among
several entities. Write mode allows data modification. Files opened in write
mode can be read but cannot be shared.
Locate − Every file has a file pointer, which tells the current position where the
data is to be read or written. This pointer can be adjusted accordingly. Using find
(seek) operation, it can be moved forward or backward.
Read − By default, when files are opened in read mode, the file pointer points to
the beginning of the file. There are options where the user can tell the operating
system where to locate the file pointer at the time of opening a file. The very
next data to the file pointer is read.
Write − The user can choose to open a file in write mode, which enables them to edit its contents; the edit can be a deletion, insertion, or modification. The file pointer can be positioned at the time of opening or can be changed dynamically if the operating system allows it.
Close − This is the most important operation from the operating system’s point of view. When a request to close a file is generated, the operating system
o removes all the locks (if in shared mode),
o saves the data (if altered) to the secondary storage media, and
o releases all the buffers and file handles associated with the file.
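A minimal sketch of these operations using an ordinary operating-system file API (Python's built-in file calls; the file name and record contents are hypothetical):

# Write mode: create the file and append fixed-length "records".
with open("students.dat", "wb") as f:                 # open for writing
    for rec in [b"0001,Asha  ", b"0002,Ravi  ", b"0003,Meena "]:
        f.write(rec + b"\n")                          # 11 bytes + newline = 12 bytes per record

# Read mode: the file pointer starts at the beginning of the file.
with open("students.dat", "rb") as f:                 # read-only; can be shared
    print(f.read(12))                                 # read the first record
    f.seek(2 * 12)                                    # locate (seek) to the third record
    print(f.read(12))                                 # read it
# leaving each 'with' block closes the file and releases its handle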
The organization of data inside a file plays a major role here. The process of locating the file pointer at a desired record inside a file varies based on whether the records are arranged sequentially or clustered. We know that data is stored in the form of records, and every record has a key field, which helps it to be identified uniquely.
Indexing is a data structure technique to efficiently retrieve records from the database
files based on some attributes on which the indexing has been done. Indexing in
database systems is similar to what we see in books.
Indexing is defined based on its indexing attributes. Indexing can be of the following
types −
Primary Index − Primary index is defined on an ordered data file. The data file
is ordered on a key field. The key field is generally the primary key of the
relation.
Secondary Index − Secondary index may be generated from a field which is a
candidate key and has a unique value in every record, or a non-key with
duplicate values.
Clustering Index − Clustering index is defined on an ordered data file. The data
file is ordered on a non-key field.
Ordered Indexing is of two types −
Dense Index
Sparse Index
Dense Index
In a dense index, there is an index record for every search key value in the database. This makes searching faster but requires more space to store the index records themselves. Index records contain the search key value and a pointer to the actual record on the disk.
Sparse Index
In a sparse index, index records are not created for every search key. An index record here contains a search key and an actual pointer to the data on the disk. To search a record, we first follow the index to reach the approximate location of the data. If the record we are looking for is not where the index entry points, the system performs a sequential search from that point until the desired data is found.
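A minimal sketch of a sparse index over a sorted file (the page layout, keys, and record values are invented for illustration): the index holds one entry per page, a binary search on the index picks the page, and a short sequential search inside the page finds the record.

from bisect import bisect_right

# Data file as a list of sorted "pages"; each page holds a few (key, record) pairs.
pages = [
    [(5, "rec5"), (10, "rec10"), (15, "rec15")],
    [(20, "rec20"), (25, "rec25"), (30, "rec30")],
    [(35, "rec35"), (40, "rec40")],
]

# Sparse index: one (first key on page, page number) entry per page --
# far fewer entries than a dense index, which would index every key.
sparse_index = [(page[0][0], i) for i, page in enumerate(pages)]

def lookup(key):
    """Follow the sparse index to the right page, then scan that page."""
    first_keys = [k for k, _ in sparse_index]
    page_no = bisect_right(first_keys, key) - 1    # last index entry with key <= search key
    if page_no < 0:
        return None                                # key smaller than everything in the file
    for k, rec in pages[page_no]:                  # sequential search inside the page
        if k == key:
            return rec
    return None

print(lookup(25))   # -> rec25
print(lookup(26))   # -> None (no such record)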
Multilevel Index
Index records comprise search-key values and data pointers. Multilevel index is stored
on the disk along with the actual database files. As the size of the database grows, so
does the size of the indices. It is highly desirable to keep the index records in main memory so as to speed up search operations. If a single-level index is used, a large index cannot be kept in memory, which leads to multiple disk accesses. A multi-level index breaks the index down into several smaller indices, making the outermost level so small that it can be stored in a single disk block, which can easily be accommodated anywhere in main memory.
The costs of some simple operations for three basic file organizations are described next. In the cost formulas, B denotes the number of data pages, R the number of records per page, D the average time to read or write a disk page, and C the average time to process a record.
Scan:
Fetch all records in the file. The pages in the file must be fetched from disk into the buffer pool. There is also a CPU overhead per record for locating the record on the page (in the buffer pool).
Search with Equality Selection:
Fetch all records that satisfy an equality selection, for example, "find the Students record for the student with sid 23". Pages that contain qualifying records must be fetched from disk, and qualifying records must be located within the retrieved pages.
Search with Range Selection:
Fetch all records that satisfy a range selection, for example, "find all Students records with name alphabetically after 'Smith'".
Insert:
Insert a given record into the file. We must identify the page in the file into which the new record must be inserted, fetch that page from disk, modify it to include the new record, and then write back the modified page. Depending on the file organization, we may have to fetch, modify, and write back other pages as well.
Delete:
Delete a record that is specified using its record identifier (rid). We must identify the page that contains the record, fetch it from disk, modify it, and write it back. Depending on the file organization, we may have to fetch, modify, and write back other pages as well.
Heap files:
Scan:
The cost is B(D+RC) because we must retrieve each of the B pages, taking time D per page, and for each page, process R records, taking time C per record.
Search with Equality Selection:
For each retrieved data page, we must check all records on the page to see whether any of them is the desired record; on average, half the file is scanned, so the cost is 0.5B(D+RC). If no record satisfies the selection, we must scan the entire file to verify this.
Delete:
First find the record, remove it from the page, and write the modified page back. For simplicity, we assume that no attempt is made to compact the file to reclaim the free space created by deletions. The cost is the cost of searching plus C+D. If the record to be deleted is specified using its record id (rid), the page id can easily be obtained from the rid, so we can read in the page directly; the cost of searching is then just D.
Sorted files:
Files sorted on a sequence of fields are known as sorted files.
The various operations on sorted files are:
Scan:
The cost is B(D+RC) because all pages must be examined; the order in which records are retrieved corresponds to the sort order.
Search with Equality Selection:
Here we assume that the equality selection is specified on the field by which the file is sorted; if not, the cost is identical to that for a heap file. We can locate the first page containing the desired record or records, should any qualifying records exist, with a binary search in log2 B steps. Each step requires a disk I/O and two comparisons. Once the page is known, the first qualifying record can again be located by a binary search within the page at a cost of Clog2 R. The total cost is Dlog2 B + Clog2 R, a significant improvement over searching heap files.
Insert:
To insert a record while preserving the sort order, we must first find the correct position in the file, add the record, and then fetch and rewrite all subsequent pages. On average, we can assume that the inserted record belongs in the middle of the file; thus we read the latter half of the file and then write it back after adding the new record. The cost is therefore the cost of searching to find the position of the new record plus 2 * (0.5B(D+RC)), that is, search cost plus B(D+RC).
Delete:
First search for the record, remove the record from the page, and write the modified page back. We must also read and write all subsequent pages, because all records that follow the deleted record must be moved up to compact the free space. The cost is the search cost plus B(D+RC). Given the record identifier (rid) of the record to delete, we can fetch the page containing the record directly.
Hashed files:
A hashed file has an associated search key, which is a combination of one or more fields of the file. It enables us to locate records with a given search key value quickly; for example, to "find the Students record for Joe", if the file is hashed on the name field, we can retrieve the record quickly.
This organization is called a static hashed file; its main drawback is that long chains of overflow pages can develop, which can affect performance because all pages in a bucket have to be searched.
The various operations on hashed files are:
Scan:
In a hashed file, pages are kept at about 80% occupancy (in order to leave some space for future insertions and to minimize overflow pages as the file expands). This is achieved by adding a new page to a bucket when each existing page is 80% full, when records are initially organized into a hashed file structure. Thus the number of pages, and therefore the cost of scanning all the data pages, is about 1.25 times the cost of scanning an unordered file, that is, 1.25B(D+RC).
Search with Equality Selection:
The hash function associated with a hashed file maps a record to a bucket based on the values in all the search key fields; if the value for any one of these fields is not specified, we cannot tell which bucket the record belongs to. Thus, if the selection is not an equality condition on all the search key fields, we have to scan the entire file. If the selection does specify all the search key fields, we can hash directly to the correct bucket and fetch its page(s).
Search with Range Selection:
The hash structure offers no help at all; even if the range selection is on the search key, the entire file must be scanned. The cost is 1.25B(D+RC).
Insert:
The appropriate page must be located, modified, and then written back. The cost is thus the cost of search plus C+D.
Delete:
We must search for the record, remove it from the page, and write the modified page back. The cost is again the cost of search plus C+D (for writing the modified page).
The following summary compares the costs of these operations for the three file organizations ("search" denotes the cost of the corresponding equality search):
File Type: Scan / Equality Search / Range Search / Insert / Delete
Heap: B(D+RC) / 0.5B(D+RC) / B(D+RC) / 2D+C / search + C + D
Sorted: B(D+RC) / Dlog2 B + Clog2 R / Dlog2 B + Clog2 R + matching pages / search + B(D+RC) / search + B(D+RC)
Hashed: 1.25B(D+RC) / ~ D + 0.5RC / 1.25B(D+RC) / search + C + D / search + C + D
A heap file has good storage efficiency and supports fast scan, insertion, and deletion of records. However, it is slow for searches.
A sorted file also offers good storage efficiency, but insertion and deletion of records are slow. It is quick for searches, and in particular, it is the best structure for range selections.
A hashed file does not utilize space quite as well as a sorted file, but insertions and deletions are fast, and equality selections are very fast. However, the structure offers no support for range selections, and full file scans are a little slower; the lower space utilization means that files contain more pages.
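To make the cost formulas above concrete, the short sketch below plugs hypothetical values of B, R, D and C into the expressions stated in this section (the parameter values are invented for illustration, not measurements):

from math import log2

# Hypothetical parameters: B data pages, R records per page,
# D seconds per page I/O, C seconds of CPU time per record.
B, R, D, C = 100_000, 100, 0.015, 1e-7

costs = {
    "heap":   {"scan": B * (D + R * C),
               "equality search": 0.5 * B * (D + R * C)},
    "sorted": {"scan": B * (D + R * C),
               "equality search": D * log2(B) + C * log2(R)},
    "hashed": {"scan": 1.25 * B * (D + R * C),
               "range search": 1.25 * B * (D + R * C)},
}

for organization, operations in costs.items():
    for operation, seconds in operations.items():
        print(f"{organization:6s} {operation:16s} {seconds:12.4f} s")

With these numbers the sorted file answers an equality search in a fraction of a second, while the heap file needs hundreds of seconds, which is exactly the improvement argued for above.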
The potentially large size of the index file motivates the ISAM idea: build an auxiliary index on the index file, and so on recursively, until the final auxiliary index fits on one page. This repeated construction of a one-level index leads to a tree structure in which the data entries of the ISAM index are in the leaf pages of the tree, together with additional overflow pages that are chained to some leaf page. In addition, some systems carefully organize the layout of pages so that page boundaries correspond closely to the physical characteristics of the underlying storage device. The ISAM structure is completely static and facilitates such low-level optimizations.
5.6 B+ Tree:
A B+ tree is a balanced search tree (each node may have many children, so it is not a binary tree) that follows a multi-level index format. The leaf nodes of a B+ tree hold the actual data pointers. A B+ tree ensures that all leaf nodes remain at the same height, and is therefore balanced. Additionally, the leaf nodes are connected by a linked list; therefore, a B+ tree can support sequential access as well as random access.
Structure of B+ Tree
Every leaf node is at the same distance from the root node. A B+ tree is of order n, where n is fixed for a given B+ tree.
Internal nodes −
Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
At most, an internal node can contain n pointers.
Leaf nodes −
Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
At most, a leaf node can contain n record pointers and n key values.
Every leaf node contains one block pointer P to point to next leaf node and forms
a linked list.
B+ Tree Insertion
B+ trees are filled from bottom and each entry is done at the leaf node.
If a leaf node overflows −
o Split node into two parts.
o Partition at i = ⌊(m+1)/2⌋.
o First i entries are stored in one node.
o Rest of the entries (i+1 onwards) are moved to a new node.
o ith key is duplicated at the parent of the leaf.
If a non-leaf node overflows −
o Split the node into two parts.
o Partition the node at i = ⌈(m+1)/2⌉.
o Entries up to i are kept in one node.
o The rest of the entries are moved to a new node.
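A minimal sketch (not a full B+ tree implementation; the node layout and names are assumptions made for this illustration) showing how an overflowing leaf is split according to the rules above: partition at i = ⌊(m+1)/2⌋, keep the first i entries, move the rest to a new leaf, and copy the ith key up to the parent.

from bisect import bisect_left

ORDER = 4   # m: the maximum number of keys a leaf may hold in this sketch

class Leaf:
    def __init__(self):
        self.keys = []     # sorted search-key values
        self.rids = []     # record pointers, parallel to keys
        self.next = None   # block pointer to the next leaf (the leaf-level linked list)

def insert_into_leaf(leaf, key, rid):
    """Insert (key, rid); on overflow, split the leaf and return
    (separator_key, new_leaf) to be posted to the parent, else None."""
    pos = bisect_left(leaf.keys, key)
    leaf.keys.insert(pos, key)
    leaf.rids.insert(pos, rid)
    if len(leaf.keys) <= ORDER:
        return None                                 # no overflow, nothing to post upward
    i = len(leaf.keys) // 2                         # partition point i = floor((m+1)/2)
    new_leaf = Leaf()
    new_leaf.keys, leaf.keys = leaf.keys[i:], leaf.keys[:i]   # first i entries stay
    new_leaf.rids, leaf.rids = leaf.rids[i:], leaf.rids[:i]   # the rest move to the new leaf
    new_leaf.next, leaf.next = leaf.next, new_leaf            # maintain the leaf chain
    return leaf.keys[-1], new_leaf                  # the ith key is copied up to the parent

leaf = Leaf()
for k in (10, 20, 30, 40):
    insert_into_leaf(leaf, k, ("page", k))
sep, right = insert_into_leaf(leaf, 25, ("page", 25))   # fifth key overflows the leaf
print(sep, leaf.keys, right.keys)                       # 20 [10, 20] [25, 30, 40]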
AssignmentQuestions
UNIT – I
UNIT – II
UNIT – III
UNIT – IV
1. Organize a locking protocol? Describe the Strict Two Phase Locking Protocol?
What can you say about the schedules allowed by this protocol?
2. Classify short notes on : a) Multiple granularity b) Serializability c) Complete
schedule d) Serial Schedule.
UNIT – V
1. Illustrate extendable hashing techniques for indexing data records. Consider your
class students data records and roll number as index attribute and show the hash
directory.
2. Is disk cylinder a logical concept? Justify your answer.
3. Formulate the performance implications of disk structure? Explain briefly about
redundant arrays of independent disks.
4. Measure the indexing? Explain what are the differences between trees based
index and Hash based index.
5. Justify extendable hashing? How it is different from linear hashing?
Tutorial Problems
Tutorial-1
4. Elaborate the Trigger? Explain how to implement Triggers in SQL with example.
5. Discuss the following operators in SQL with examples
i) Some ii) Not In iii) In iv) Except
Tutorial -3
1. Consider a relation R with five attributes ABCDE. You are given the following
dependencies: A->B, BC->E and ED->A
i) List all keys for R
ii) Is R in 3NF? If not, explain why not.
Tutorial -4
1. Discuss about log? What is log tail? Explain the concept of checkpoint log
record.
2. Elaborate to test serializability of a schedule? Explain with an example.
3. Construct the concurrency control using time stamp ordering protocol.
4. Demonstrate ACID properties of transactions.
5. Differentiate transaction rollback and restart recovery.
Tutorial -5
Important Questions
Unit-1
List the six design goals for relational database and explain why they are desirable.
What is the composite Attribute? How to model it in the ER diagram? Explain with an
example.
Compare candidate key , primary key and super key.
Unit-2
Write the following queries in Tuple Relational Calculus for the following schema.
(a) Find the names of sailors who have reserved at least one boat
(b) Find the names of sailors who have reserved at least two boats
(c) Find the names of sailors who have reserved all boats.
The key fields are underlined. The Catalog relation lists the prices charged for parts by suppliers. Write the following queries in SQL.
Explain in detail the following:
i. Join operation
ii. Nested-loop join
iii. Block nested-loop join.
Write the SQL expressions for the following relational database:
Sailor Schema (sailor id, Boat id, sailorname, rating, age)
Reserves (Sailor id, Boat id, Day)
Boat Schema (boat id, Boatname, color)
i. Find the age of the youngest sailor for each rating level.
ii. Find the age of the youngest sailor who is eligible to vote for each rating level with at least two such sailors.
iii. Find the number of reservations for each red boat.
iv. Find the average age of sailors for each rating level that has at least 2 sailors.
What is outer join? Explain different types of joins.
What is a trigger and what are its 3 parts? Explain in detail.
What is a view? Explain views in SQL.
Unit-3
Unit-5
Explain the following
a. Cluster indexes
b. Primary and secondary indexes
c. Clustering file organization
Unit-I
Relational calculus is a
(A) Conceptual view. (B) Internal view. (C) External view. (D) Physical view.
Unit-II
An entity set that does not have sufficient attributes to form a primary key is a
(A) strong entity set. (B) weak entity set. (C) simple entity set. (D) primary entity set.
Q.6 The database environment has all of the following components except:
(A) users. (B) separate files. (C) database. (D) database administrator.
Unit-III
A report generator is used to
Conceptual design
(a) is a documentation technique.
(b) needs data volume and processing frequencies to determine the size of the database.
(c) involves modelling independent of the DBMS.
(d) is designing the relational model.
A subschema expresses
(A) the logical view. (B) the physical view. (C) the external view. (D) all of the above.
Unit-IV
An advantage of the database management approach is
(A) Primary key (B) Secondary Key (C) Foreign Key (D) None of these
Unit-V
1. The file organization that provides very fast access to any arbitrary record of a file is
2. DBMS helps achieve
(C) Neither (A) nor (B) (D) both (A) and (B)
Q.4 Which of the following operations is used if we are interested in only certain columns of a table?
Unit-1
11. What is an unsafe query? Give an example and explain why it is important to
disallow such queries?
13. List the six design goals for relational database and explain why they are
desirable.
Unit-2
1. A company database needs to store data about employees, departments and children
of employees. Draw an ER diagram that captures the above data.
5. What is the composite Attribute? How to model it in the ER diagram? Explain with an
example.
Write the following queries in Tuple Relational Calculus for the following schema.
Find the names of sailors who have reserved at least one boat
Find the names of sailors who have reserved at least two boats
The key fields are underlined. The Catalog relation lists the prices charged for parts by suppliers.
13. Write the following queries in SQL.
Find the pnames of parts supplied by raghu supplier and no one else.
i. Find the age of the youngest sailor for each rating level?
15. Find the age of the youngest sailor who is eligible to vote for each rating level with at least two such sailors.
16. Find the No. of reservations for each red boat.
17. Find the average age of sailors for each rating level that has at least 2 sailors.
18. What is outer join? Explain different types of joins?
19. What is a trigger and what are its 3 parts. Explain in detail.
Unit-3
Unit-4
10. What are the merits and demerits of using fuzzy dumps for media recovery?
11. What information do the dirty page table and transaction table contain?
Unit-5
a. Cluster indexes
b. Primary and secondary indexes
c. Clustering file organization
Sample MidPaper
II B.Tech II Sem CSE Database Management Systems I Mid Question Paper
PART-A
1. a) List the responsibilities of DBA?
b) Write brief notes on views?
c) List the primitive operations in relational algebra?
d) What is meant by nested queries?
e) What is Trigger and Active database?
PART-B
4) What are integrity constraints? How these constraints are expressed in SQL?
(or)
5) Explain the operations of relational algebra? What are aggregative operations
and logical operators in SQL?
6) Describe about DDL & DML commands with syntaxes and examples?
(or)
7) What is normalization? Explain 1NF, 2NF and 3NF Normal forms with
examples?
PART-A (25 Marks)
1.a) Discuss about DDL. [2]
b) Write brief notes on altering tables and views. [3]
c) Describe about outer join. [2]
d) What is meant by nested queries? [3]
e) What is second normal form? [2]
f) Describe the inclusion dependencies. [3]
g) What is meant by buffer management? [2]
h) What is meant by remote backup system? [3]
i) Discuss about primary indexes. [2]
j) What is meant by linear hashing? [3]
PART-B (50 Marks)
2. Explain the relational database architecture. [10]
OR
3. State and explain various features of E-R Models. [10]
REFERENCES
3. Database Systems: Design, Implementation, and Management, Peter Rob & Carlos Coronel, 7th Edition.
Websites:-
http://en.wikipedia.org/wiki/Database_management_system
https://www.tutorialspoint.com/dbms
http://helpingnotes.com/notes/msc_notes/dbms_notes/
http://www.geeksforgeeks.org
Journals:-