Relational Database Notes

The document discusses the importance of databases and the benefits of using a Database Management System (DBMS), including data independence, efficient access, and data recovery. It outlines the roles of database designers, administrators, and end-users, along with the capabilities expected of a database, such as integrity constraints and efficient query processing. Additionally, it covers various data models, including network, hierarchical, and object-oriented models, highlighting their advantages and disadvantages.

Week-1

Why are Databases Needed?

Database Approach

• By using the DBMS approach (instead of a file system), we get:

• data independence,
• efficient data access,
• data administration,
• concurrent access and data recovery, and
• no complex programming, which means reduced development time.

Video Transcript:
By having this functionality, the DBMS can offer several benefits to the application program. It can free the application program from worrying about the data: if there is some change to the structure of the data, it is not necessary to change the application program that accesses it. The data structure itself is maintained by the DBMS software, and if the data file or data table becomes very big and access becomes slow, the DBMS software has methods to create additional structures that give faster access. These are referred to as indexing structures. When data is accessed by many users or many processes, some event, say an accident or an application error, may compromise data integrity, so the DBMS will have provisions to protect it, and the DBMS software will have security mechanisms whereby different users get different levels of access. The data is an important resource of the organization, so we want to take backups and make sure that the hardware and the configuration settings related to data storage are maintained; the DBMS software provides mechanisms for this. When many users are accessing the data, particularly in applications where millions of users may be accessing it concurrently, for example a travel reservation system, banking software for a very large bank, or an online store, it is important that concurrent access does not compromise data integrity. At the same time, if there are any accidental events, the data should be recoverable. All of this should happen without burdening the user or the developer with a lot of programming effort.

Data abstraction:
• A data model is used to hide storage details and present the users with a conceptual view of the
database.
• Programs refer to the data model constructs rather than data storage details.

Multiple views of the data:


• Each user may see a different view of the database, which describes only the data of interest to that user.

Sharing of data:
• Allowing a set of concurrent users to retrieve from and to update the database without interference.
• Concurrency control within the DBMS guarantees that each transaction is correctly executed or aborted.
• Recovery subsystem ensures each completed transaction has its effect permanently recorded in the database.
Capabilities Expected of a Database

Database Designers
Database designers communicate with the end-users and understand their needs.

The designer is responsible to define:

• Structure
• Content
• Constraints
• Transactions

Video Transcript:
So let's see. First, the database structure has to be designed. So there is a role for database designers, who work out what the layout of the database should be, and then the various characteristics of the data structures: what should be the structure, what should be the content, what types of constraints should apply, and how the transactions have to be defined. So the DBMS should enable the definition of these various elements so that the data serves the purpose of the application. When you talk about the structure, say there is a university academic system, and it is going to maintain information about students. They certainly want to know the student's name, and they may want to know the age and the gender. There may be a reason why they want to know about your hobbies or your blood group, but it is very unlikely that the university system will require the name of your pet. So the database designer has to make sure that the structure of the data meets the functionality that is expected of the purpose for which the information system is built. Then there will be constraints with respect to the data. If you take the age of a student, the age can never be a negative number. It is also possible to have more intelligent constraints; for example, the university does not admit anyone who is less than, say, twelve years of age or more than 80 years of age. So you may be able to incorporate additional constraints that help ensure the data is of higher quality. Then there are transactions; these have to be defined over the application, because data integrity is maintained only if the transactions are defined properly and the application execution respects the completion of the transactions. The DBMS needs the ability to recognize them and make sure that if there are any intermittent failures during a transaction, the data is restored to a valid state. Then there is the role of the database administrator.

Database Administrators

• Database administrators are referred to as DBAs.


• DBA is responsible to oversee and manage DBMS and its
environment.

DBA is accountable for problems such as:


• security breaches and
• poor DB response time.

Video Transcript:
So if you look at the database administrator's functions, they authorize accesses to the database. That means whenever somebody needs access to the database, the database administrator has to give the suitable command so that the database will allow the user to access it, and of course the user can access only up to the level to which the database administrator has given access rights. The DBA also makes sure that the database processes get enough resources, that database activities are performed at a suitable speed, and that the database is housed in an environment which has the right configuration to perform the operations being done. If required, the DBA acquires additional hardware and possibly additional software. As you know, data is a very important resource, so the organization may want to make sure that even if there is a breakdown of the server, the data is not lost. That means they may want to have software such as backup software, and possibly other software which ensures that the data is connected with the rest of the systems. DBAs are held accountable if there are security breaches, which means they need to make sure that the database software is patched. As you know, the DBMS software provider keeps releasing updates to the software, so the DBA is responsible for ensuring that the DBMS software is in good shape, and also for making sure that all the events in the log are understood, that is, there are no exceptional events happening within the database and the database response is within the reasonable time expected by the applications.

End-Users
End users use the data for queries, reports and update the database content.

End-user categorization:

• Casual: Access database occasionally when needed.

• Naive or Parametric: They make up a large section of the end-user population.


• They use previously well-defined functions against the database.
• Examples are bank-tellers or university secretaries who do this activity for an entire shift of operations.

• Sophisticated:
• These include business analysts, scientists, engineers, and others thoroughly familiar with the system’s
capabilities.
• Many use tools in the form of software packages that work closely with the stored database.

Control Redundancy

• There are many issues if the same data is stored by multiple departments on their own.

• Redundancy leads to duplication of effort and inconsistency.

Restrict Unauthorized Access

• Databases are often shared by a large number of users.

• Most users are not authorized to access all information in DB.

• Users may be given restricted access on data.


• Retrieve
• Update
• Delete
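As a hedged illustration, relational DBMSs typically enforce such restrictions with the SQL GRANT and REVOKE statements; the user names below are hypothetical, and the EMPLOYEE table follows the example schema used later in these notes.

    -- A clerk may retrieve and update employee data, but not delete it
    GRANT SELECT, UPDATE ON EMPLOYEE TO clerk_user;

    -- A reporting user gets read-only access
    GRANT SELECT ON EMPLOYEE TO report_user;

    -- A privilege can be withdrawn later
    REVOKE UPDATE ON EMPLOYEE FROM clerk_user;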

Efficient Query Processing

• Databases are typically stored on disks.

• DBMS provides specialized data structures to speed up disk search.

• Auxiliary files called indexes are often used for the purpose.

• Indexes are typically based on:


• Tree data structure.
• Hash data structure.

• Data needs to be copied to main memory for processing.

• DBMS has a buffering or caching module.
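As a hedged sketch, the auxiliary index structures mentioned above are usually created with a CREATE INDEX statement (most products build a B-tree by default; some also offer hash indexes). The table and column names are illustrative.

    -- Build an index on the last-name column to speed up disk search
    CREATE INDEX idx_employee_lname ON EMPLOYEE (Lname);

    -- A lookup such as this can then use the index instead of scanning the whole table
    SELECT Fname, Lname, Salary
    FROM EMPLOYEE
    WHERE Lname = 'Smith';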

Video Transcript:
Sometimes the data size may be very big. Let's say the Indian government has data for its citizens; we know that there are more than one billion records in the database. If they want to access any particular citizen's information, it should not take a long time, but such data will never fit into the memory of the machine. So unless we have some very smart mechanism, we will not be able to access the data. In such cases, the DBMS provides mechanisms so that you will be able to access it very efficiently. Typically, the DBMS maintains some auxiliary files called indexes. These indexes are much smaller than the actual data, but at the same time they help you access the data at any point in the database very quickly. The typical indexes that DBMSs use are based either on a tree data structure or on a hash data structure. Further to this, we know that database sizes can be huge; they can run into terabytes and much bigger than that. But the machine on which you are processing the data will not have main memory running into the size of the database. So you need to be able to selectively bring in contents from the database, buffer them in main memory, do the processing, and then write them back to the database. So the DBMS should have the ability to work with main memory and with the small pieces of data being processed, and it also needs to recognize which updates have taken place and make sure that those updates are transferred back into the database.

Multiple User Interfaces


• Users with varying levels of technical knowledge use a database.
• DBMS provides a variety of user interfaces.
• Apps for mobile users.
• Query language for users with technical knowledge.
• Programming language interfaces for application developers.
• Menu-driven interface for users.

Represent Relationships Among Data


• Databases are required to represent a variety of relationships among data.
• DB needs to enable the maintenance of relationships.

Enforce Integrity Constraints


• DBMS provides for defining and enforcing integrity constraints on data.
• A simple constraint is type of the data.
• Key constraint.
• Referential Integrity.
• Further, during the execution of a transaction under concurrent access, the DBMS needs the ability to isolate users and transactions and to ensure that consistency is maintained.
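A minimal SQL sketch of these constraint types, assuming the university example from the transcript below; the STUDENT, COURSE, and FACULTY tables and their columns are illustrative.

    CREATE TABLE STUDENT (
      RollNo INT PRIMARY KEY,                   -- key constraint: no two students share a roll number
      Name   VARCHAR(50) NOT NULL,              -- simple type constraint plus a mandatory value
      Age    INT CHECK (Age BETWEEN 12 AND 80)  -- check constraint: rules out negative or out-of-range ages
    );

    CREATE TABLE COURSE (
      CourseNo  INT PRIMARY KEY,
      FacultyId INT REFERENCES FACULTY(FacultyId)  -- referential integrity: the assigned (hypothetical) faculty member must exist
    );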

Video Transcript:
This integrity of the data can be defined in multiple ways. There can be definitions of a more static type: while we are defining the schema or the structure of the data, we may specify some constraints, as simple as saying that the age cannot be a negative number. That is a simple constraint which ensures that such quality issues will not show up in the data. We may have a key constraint; for example, there can be no two students with the same roll number, and we can ensure that with another type of constraint, the key constraint. Then we can have a referential integrity constraint. Here we are saying that if a course is offered and there is a faculty member assigned to this course, that faculty member should be present in the faculty table. Thereby we make sure that there is no spelling mistake, and that the information related to the faculty member is available. Further to this, there is another type of problem that can come up because the data is updated with the help of transactions. In many critical operations, it is possible that many people are trying to make updates. A simple example is a travel reservation system: a large number of users may be trying to get a booking made for the same flight or the same journey. In such cases, it is necessary that the DBMS has the ability to isolate the users and the transactions and ensure that consistency is maintained. That is to say, it should never be possible for two people to make the same booking. The database is in a particular state where a seat is available, two users are trying to get that seat for travel, and the DBMS has to make sure that the transactions of the two users have sufficient isolation and follow a certain sequence and rules, so that only one of them will get the berth or the seat, and only that person will be making the payment for it. This type of integrity is required in a dynamic way while the program is being executed.
Triggers and Stored Procedures

• Many DBMSs provide for associating triggers with tables.


• A trigger is a rule that is activated by an update and performs additional operations on other tables,
sends messages, etc.

• Many DBMSs provide for stored procedures.


• Stored procedures are part of the definition of a database.
• They contain elaborate procedures to enforce rules.
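As a hedged sketch (trigger and stored-procedure syntax differs across products; this version is close to MySQL, and the SALARY_AUDIT table is hypothetical):

    -- Trigger: whenever an EMPLOYEE row is updated, record the salary change in another table
    CREATE TRIGGER log_salary_change
    AFTER UPDATE ON EMPLOYEE
    FOR EACH ROW
      INSERT INTO SALARY_AUDIT (Ssn, OldSalary, NewSalary, ChangedOn)
      VALUES (OLD.Ssn, OLD.Salary, NEW.Salary, CURRENT_TIMESTAMP);

    -- Stored procedure: a rule stored with the database definition and invoked by name
    CREATE PROCEDURE give_raise (IN emp_ssn CHAR(9), IN pct DECIMAL(5,2))
      UPDATE EMPLOYEE
      SET Salary = Salary * (1 + pct / 100)
      WHERE Ssn = emp_ssn;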

Video Transcript:
Further to that, many DBMSs offer more sophisticated mechanisms so that the data can be checked periodically. These are called triggers on table updates. Whenever a table is updated, it is possible to define a trigger that performs the necessary operations that have to be done whenever this table is updated. Similarly, DBMSs also support a feature known as stored procedures. These stored procedures can contain reasonably complex algorithms or logic to make sure that the data that has been updated, or the data in a particular instance of the DBMS, has certain characteristics, and they can also perform complex logical operations.

Varieties of Databases

History of Data Models

• Network model
• Hierarchical model
• Relational model
• Object-oriented data models

Network Model

• The first network DBMS was implemented by Honeywell in 1964-65 (IDS System).
• Adopted heavily due to the support by CODASYL (Conference on Data Systems Languages).
• Later implemented in a large variety of systems such as IDMS (Cullinet), DMS 1100 (Unisys), IMAGE
(Hewlett-Packard), VAX -DBMS (Digital Equipment Corp).

• Advantages:
• Able to model complex relationships and represent semantics of add/delete on the relationships.
• Can handle most situations for modeling using record types and relationship types.
• Language is navigational; uses constructs like FIND, FIND member, FIND owner, FIND NEXT within set,
GET, etc.
• Programmers can do optimal navigation through the database.

• Disadvantages:
• Navigational and procedural nature of processing.
• Database contains a complex array of pointers that thread through a set of records.
• Little scope for automated “query optimization”.

Hierarchical Data Model

• Initially implemented in a joint effort by IBM and North American Rockwell around 1965. Resulted in the
IMS family of systems.
• IBM's IMS product had a very large customer base worldwide.
• Hierarchical model was formalised based on the IMS system.

• Advantages:
• Simple to construct and operate.
• Corresponds to a number of natural hierarchically organized domains, e.g., organisation (“org") chart.
• Language is simple; uses constructs like GET, GET UNIQUE, GET NEXT, GET NEXT WITHIN PARENT,
etc.
• Disadvantages:
• Navigational and procedural nature of processing.
• Database is visualised as a linear arrangement of records.
• Little scope for "query optimisation".

Object-Oriented Data Model

• Models have been proposed to align with object-oriented programming paradigm.

• Some data models emerged out of need to persist objects in programming languages such as C++
(OBJECTSTORE), Smalltalk(GEMSTONE), etc.

• Another development was object-relational models.

• Relational systems incorporate concepts from object databases.


• Supported by major vendors.
• Concepts such as inheritance incorporated in SQL standard.

Relational Model

• Proposed in 1970 by E.F. Codd (IBM); the first commercial system appeared in 1981-82.
• Now in several commercial products (e.g. DB2, ORACLE, MS SQL Server, etc.).
• Several free open-source implementations, e.g. MySQL, PostgreSQL, etc.
• Currently most dominant for developing database applications.
• ANSI maintains SQL relational standards: SQL-89 (SQL1), SQL-92 (SQL2), SQL-99, SQL3, ... (Latest is
SQL:2023).

• All data is stored in the form of relations.


• Database structure (catalog) is also stored in relational form (metadata is stored/found here).
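As an illustration of the catalog itself being relational, the SQL standard (and products such as MySQL, PostgreSQL, and SQL Server) exposes metadata through INFORMATION_SCHEMA views that are queried like any other relation; the exact case of the stored table name can vary by product.

    -- Read the structure of the EMPLOYEE table from the catalog
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'EMPLOYEE';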

Characteristics of RD

• Physical Data Independence

• Logical Data Independence

• Integrity Independence

• Distribution Independence

Video Transcript:
Some important characteristics of the relational model are that it maintains physical data independence. That is to say, the programs need not be changed if we are reworking the block size or changing the index structure of the data. Similarly, there is logical data independence: suppose the structure itself is being changed, but the changes do not impact the application program, for example adding some additional columns or removing some columns which are not being used by the application program. In such cases there is no need to change the application program. The integrity constraints can be defined independent of the program; there are several integrity constraints supported by the DBMS that can be independent of the application, though the application may have some additional integrity constraints of its own, and in that case those are owned by the application. The relational model also provides for distribution independence. That is to say, if the data is stored across multiple machines, the application or the queries need not know about the distribution of the data.
What is NoSQL

• Solution(s) for storage and retrieval of data which is modeled in forms other than the tabular relations
used in relational databases.
• The term "NoSQL" comes from a Twitter a hashtag used for a small conference of programmers/experts
working in non-relational databases.
• Represents wide-ranging solutions/ideas/apps for mobile users.

Why go Beyond Relational

• Structured data is only a small part of all data.


• Relational solutions became too costly and inefficient beyond a certain size.
• Vertical scale vs. horizontal scale.

• First organizations to outgrow relational:


• Google came up with Bigtable.
• Amazon came up with Dynamo.
• Facebook

Video Transcript:
So why is it that some people have gone beyond the relational model?

One thing is that in an enterprise today, structured data is only a small part. There is a lot of data of other forms, such as text and multimedia, and organizations also depend on those forms of data. Sometimes there is semi-structured data such as XML and JSON.

Further, relational solutions become too costly when the data is very big. As you can see, companies like Amazon, Google, and Facebook deal with data that was not imagined a decade ago. Their data sizes have grown exponentially over the period and have become too big and unwieldy for the relational model to be effective for those organizations. Further, the relational model often demands vertical scaling: if you want to handle larger data and a larger user base, you need a more powerful machine. But organizations would like to store the data with horizontal scaling, that is, employ more machines in order to handle larger volumes of data. The first few organizations that outgrew the relational model are Google and Amazon, and in the same line you may include Facebook, other social networking organizations, and organizations which deal with wide varieties of data, for example an e-commerce store or a professional networking site. What happens is that in such cases the data is too big, and the relational way of solving it is going to be simply prohibitive.

NoSQL Characteristics

• Large data volumes


• Scalable replication and distribution
• Potentially thousands of machines
• Potentially distributed around the world
• Queries need to return answers quickly
• Mostly query, few updates
• ACID transaction properties are not needed- BASE
• Mostly open-source development

Video Transcript:
So the NoSQL characteristics are large data volumes, and they would like to use horizontal scaling. That is, they want to store their data over many machines, and they will replicate or synchronize the data potentially among thousands of machines, and possibly those machines will be located all over the world. At the same time, they would like the query response to be reasonably quick. They seldom do updates; most of the time they access the data for the purpose of queries. And instead of the ACID model, they follow the BASE model. What is the difference between the ACID model and the BASE model? ACID implies immediate consistency, whereas BASE means eventual consistency. These NoSQL databases support BASE, while the relational model goes for the ACID way of consistency. And further, most of these NoSQL databases are open source, which has become very popular over the last two or three decades.

Types of NoSQL Databases


• Key-value stores, e.g. Cassandra, Dynamo
• Document databases, e.g. MongoDB, XML, JSON
• Column family stores, e.g. BigTable, Hbase
• Graph databases, e.g. Neo4j

Video Transcript:
Let us look at the types of NoSQL databases. They can be broadly categorized into four types. The first one is key-value: here the data has a key which is of a standard form, but the value can be of various types; it can be a file, a record, or some multimedia element. Then there are document databases. Generally, these documents are not strongly structured, but they are self-describing. They can be of the form of XML or JSON, where each document may have a different structure, but they all carry a self-description. Then there can be data which is based on columns, but these columns can be organized into column families and stored on multiple machines. This is one approach which initially looks like a relational type of structure, but the storage organization is different from the relational one. Then we can have graph databases, which organize the data in the form of nodes and edges; the nodes and edges are stored separately, and the edges can also have a direction. These are again another form of NoSQL databases.
So we have seen that there are several DBMS products which are not relational, and they have become popular over the last two decades. There are more than 200 types of NoSQL databases, and there is no standardization among them.
Week-2

Overview of the Database Design Process and a Sample Database Application

Major Steps in Database Design Process

• Requirement collection and analysis:


• Understanding the domain.
• Identifying the data to be stored.
• Identifying the operations to be performed on data.

• Conceptual design:

• E-R modeling.

• Logical design:
• Designing tables and relationships.
• Database schema.

• Physical design:
• Indexing.
• Clustering.
• Storage formats.

Overview of Database Design Process

Video Transcript:
The first step in the database design process is requirements collection and analysis. This step is all about understanding what our users need from the database. To do this, we interview prospective users and listen carefully to their needs. We then document these requirements in a clear and concise manner, leaving no room for confusion. It involves understanding the domain and identifying the data to be stored. It's important to be as detailed and thorough as possible to ensure we capture all the necessary information. After collecting and analyzing requirements, the next step is conceptual design, where we create a high-level conceptual schema for the database. This schema provides a clear description of user data requirements, including entity types, relationships, and constraints. Think of this modeling as sketching the blueprint of our database. Since it doesn't focus on implementation details, it's easier to understand and communicate with non-technical users. The conceptual schema acts as a reference to ensure all user requirements are met without conflicts. By concentrating on data properties rather than storage details, we can create a robust conceptual database design more effectively. During or after the conceptual schema design, in addition to data requirements, functional requirements can also be considered: the specific tasks users will perform, such as retrieving information or making updates. While there are various techniques like data flow diagrams and sequence diagrams to specify these requirements, we won't discuss any of them here; they are usually described in detail in other courses. The next step in database design is the actual implementation of the database using a commercial DBMS. Most current commercial DBMSs use an implementation data model, such as the relational model. The conceptual schema is transformed from the high-level data model into the implementation data model. This step is called logical design or data model mapping; its result is a database schema in the implementation data model of the DBMS. Data model mapping is often automated or semi-automated within database design tools. The design tools employ algorithmic steps similar to the ones that we discuss in this module for ER-to-relational mapping. The last step is the physical design phase, during which the internal storage structures, file organizations, indexes, access paths, and physical design parameters for the database files are specified. Simultaneously, application programs are designed and implemented that will interact with the database. These programs are developed based on the high-level transaction specifications that were defined earlier.

A Sample Database Application

• Company Database

• Employees
• Departments

• Projects
• Dependents

Example Company Database

• Create a database schema design based on the following requirements of the company database:
• The company is organized into departments.
• Each department has a unique name, unique number and a particular employee who manages the
department.
• We keep track of the start date of the department manager. A department may have several locations.
• Each department controls several projects.
• Each project has a unique name, unique number and is located at a single location.

• The database will store each employee’s social security number, address, salary, sex, and birthdate.
• Each employee works for one department but may work on several projects.
• The DB will keep track of the number of hours per week that an employee works on each project.
• It is required to keep track of the direct supervisor of each employee.

• Each employee may have several dependents.


• For each dependent, the database keeps a record of name, sex, birthdate, and relationship to the
employee.
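A hedged sketch of part of the relational schema these requirements eventually map to (the table and column names follow the usual textbook COMPANY schema; the data types are illustrative, and the remaining tables are omitted for brevity):

    CREATE TABLE DEPARTMENT (
      Dname          VARCHAR(25) NOT NULL UNIQUE,   -- unique department name
      Dnumber        INT PRIMARY KEY,               -- unique department number
      Mgr_ssn        CHAR(9),                       -- the employee who manages the department
      Mgr_start_date DATE                           -- start date of the department manager
    );

    CREATE TABLE EMPLOYEE (
      Ssn       CHAR(9) PRIMARY KEY,
      Fname     VARCHAR(25),
      Lname     VARCHAR(25),
      Address   VARCHAR(60),
      Sex       CHAR(1),
      Salary    DECIMAL(10,2),
      Bdate     DATE,
      Super_ssn CHAR(9) REFERENCES EMPLOYEE(Ssn),    -- direct supervisor of each employee
      Dno       INT REFERENCES DEPARTMENT(Dnumber)   -- each employee works for one department
    );

    CREATE TABLE WORKS_ON (
      Essn  CHAR(9) REFERENCES EMPLOYEE(Ssn),
      Pno   INT,                                     -- project number; would reference PROJECT in the full schema
      Hours DECIMAL(4,1),                            -- hours per week an employee works on each project
      PRIMARY KEY (Essn, Pno)
    );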

An ER Schema Diagram for the Company Database


Components of E-R Model

ER model describes data as:


• Entities.
• Attributes.
• Relationships.

Entities

• Entities are specific objects or things in the mini-world that are represented in the database.
• For example, a particular person, car, house, employee, department, or a job.

Attributes

• Attributes are properties used to describe an entity.


• Example: EMPLOYEE entity may have the attributes Name, SSN, Address, Age, Salary.
• Example: STUDENT entity may have the attributes ID, Name, Address, Age.

Types of Attributes

• Simple
• Composite
• Single
• Multi-valued
• Stored
• Derived

• Simple:
• Each entity has a single atomic value for the attribute. For example, SSN or Sex.

• Composite:
• The attribute may be composed of several components.
• Example: Address (Apt#, House#, Street, City, State, ZipCode, Country).
• Example: Name (FirstName, MiddleName, LastName).

• Single-valued:
• Most attributes have a single value for a particular entity; such attributes are called single-valued.
• Example: SSN, Sex.

• Multi-valued:
• An attribute can have a set of values (more than one) for the same entity called multi-valued.
• Example: Color of a CAR.
• Example: PhoneNumber of an Employee.

• Derived:
• If the attribute can be derived from other attributes, then it's called a derived attribute.
• Example: Age.

• Stored:
• The attributes that are not derivable from other attributes and need to be stored in the database.
• Example: Birth_Date of an Employee.

• Null Values.
• A particular entity may not have an applicable value for an attribute.
• Example: Apartment_number.
• College_degrees.
• Meaning of NULL:
• UNKNOWN (missing or not known)
• Not applicable.
• Complex Attributes.
• Composite and Multivalued attributes can be
nested arbitrarily.
• Example: Address_Emailphone ({PhoneNo},
{Email}, Address {Street_Number, City, State,
Zip}).
• Multivalued attribute is represented by
double oval.

Entity Types and Entity Set

• Entities with the same basic attributes are grouped or typed into an
entity type.
• For example, the entity type EMPLOYEE and PROJECT.
• Each entity type will have a collection of entities stored in the
database.

Video Transcript:
Let's further understand the concepts of entity type and entity set in ER modeling. A database usually contains groups of entities that are similar. For example, a company employing hundreds of employees may want to store similar information concerning each of the employees. These employee entities share the same attributes, but each entity has its own value for each attribute. An entity type defines a collection of entities that have the same attributes. Each entity type in the database is described by its name and attributes. The figure shows an entity type EMPLOYEE and a list of some of its attributes. A few individual entities of each type are also illustrated, along with the values of their attributes. The collection of all entities of a particular entity type in the database at any point in time is called an entity set or entity collection. Also, an entity type describes the schema or intension for a set of entities that share the same structure, and the collection of entities of a particular entity type is grouped into an entity set, which is also called the extension of the entity type. Remember these two key words: intension for the schema, and extension for the collection of entities of this type. The entity set is usually referred to using the same name as the entity type, even though they are two separate concepts. For example, EMPLOYEE refers both to a type of entity and to the current collection of all employee entities in the database.

NOTATION for ER Diagrams

Key Attributes

• Key: An attribute of an entity type for which each entity must have a unique value is called a key attribute of the entity type.
• For example, SSN of PERSON.
• Composite Key: A key attribute may be composite.
• VehicleTagNumber is a key of the CAR entity type with components
(Number, State).
• An Entity type may have more than one key.
• The PERSON entity type may have two keys:
• Passport Number.
• SSN.
• The CAR entity type may have two keys:
• VehicleIdentificationNumber (popularly called VIN).
• VehicleTagNumber (Number, State), aka license plate number.
• Each key is underlined.

Value Sets (or Domain of Values)

• A value set specifies the set of values that may be assigned to that attribute for each individual entity.
• Example: Lastname has a value which is a character string of up to 15 characters, say.
• Date has a value consisting of MM-DD-YYYY where each letter is an integer.

Attributes and Value Sets

• Value sets are similar to data types in most programming languages.
• e.g., integer, character(n), real, etc.

Displaying an Entity type

• In ER diagrams, an entity type is displayed in a rectangular box.


• Attributes are displayed in ovals.
• Each attribute is connected to its entity type.
• Components of a composite attribute are connected to the oval representing the composite attribute.
• Each key attribute is underlined.
• Multivalued attributes displayed in double ovals.

Refining the Initial Design by Introducing Relationships

• The initial design is typically not complete.


• Some aspects in the requirements will be represented as relationships.
• ER model has three main concepts:
• Entities (and their entity types and entity sets)
• Attributes (simple, composite, multivalued)
• Relationships (and their relationship types and relationship sets)

Relationships and Relationship Types

• Relationship: When an attribute of one entity type refers to another entity type.
• A relationship relates two or more distinct entities with a specific meaning.
• For example, EMPLOYEE John Smith works on the ProductX
PROJECT, or EMPLOYEE Franklin Wong manages the
Research DEPARTMENT.
• Relationships of the same type are grouped into a relationship
type.
• For example, consider a relationship type WORKS FOR
between the two entity types EMPLOYEE and DEPARTMENT,
which associates each employee with the department for which
the employee works. Each relationship instance in the
relationship set WORKS_FOR associates one EMPLOYEE
entity and one DEPARTMENT entity.
Relationship Degree

• Degree of a relationship type


• Number of participating entity types
• Binary, ternary
• Both MANAGES and WORKS_ON are binary relationships

Relationship Types, Sets, and Instances

• Each relationship instance in the relationship set WORKS_FOR associates one EMPLOYEE entity and one DEPARTMENT entity.

Role Names and Recursive Relationships

• Role names:
• A role name signifies the role that a participating entity plays in each relationship instance.
• Recursive relationships:
• The same entity type participates more than once in a relationship type in different roles.
• It must specify the role name.
• Recursive relationships or Self-referencing relationships
• Example:

• In a recursive relationship type:


• Both participations are of the same entity type, in different roles.
• For example, SUPERVISION relationships between EMPLOYEE (in the role of supervisor or boss) and (another) EMPLOYEE (in the role of subordinate or worker).
• In the following figure, the first role participation is labeled with 1, and the second role participation is labeled with 2.
• Role names need to be displayed in the ER diagram to distinguish participations.

Recursive Relationship Type is: SUPERVISION (Participation Role Names are Shown)
Refining the COMPANY Database Schema by Introducing Relationships

• By examining the requirements, six relationship types are identified


• Binary relationships (degree 2)
• Given below with their participating entity types:
• WORKS_FOR (between EMPLOYEE, DEPARTMENT)
• MANAGES (also between EMPLOYEE, DEPARTMENT)
• CONTROLS (between DEPARTMENT, PROJECT)
• WORKS_ON (between EMPLOYEE, PROJECT)
• SUPERVISION (between EMPLOYEE (as subordinate), EMPLOYEE (as supervisor))
• DEPENDENTS_OF (between EMPLOYEE, DEPENDENT)

Structural Constraints of Relationship Types

• Constraints on Relationship Types


• Cardinality Ratios for Binary Relationships
• One-to-one (1:1)
• One-to-many (1:N) or Many-to-one (N: 1)
• Many-to-many (M:N)
• Participation Constraints and Existence Dependencies
• Zero (optional participation, not existence- dependent)
• One or more (mandatory participation, existence-dependent)

Cardinality Ratios for Binary Relationships

• Specifies the maximum number of relationship instances that an entity can participate in.
• For example, in the WORKS_FOR binary relationship type,
DEPARTMENT: EMPLOYEE
• Many-to-one (N:1) relationship

Many-to-Many (M:N) Relationship

• Employee can work on several projects and a project can have several
employees.

One-to-One (1:1) Relationship

• An employee can manage at most one department and a department can have at most one manager.

Notation

• Cardinality ratios for binary relationships are represented on ER diagrams by displaying 1, M, and N on the diamonds.

Participation Constraints and Existence Dependencies

• Specifies whether the existence of an entity depends on its being related to another entity via the
relationship type.
• Total
• Partial
• If every employee must work for a department, then an employee entity can exist only if it participates in
at least one WORKS_FOR relationship instance (Total participation).

Partial Participation

• Every employee is not expected to manage a department (Partial participation).


Notation

• Total participation (or existence dependency) is displayed as a double line connecting the participating entity type to the relationship.
• Partial participation is represented by a single line.

Alternative (Min, Max) Notation for Relationship Structural Constraints

• Specified on each participation of an entity type E in a relationship type R.
• Specifies that each entity e in E participates in at least min and at most max relationship instances in R.
• Default (no constraint): min=0, max=n.
• Must have min ≤ max, min ≥ 0, max ≥ 1.
• Derived from the knowledge of mini-world constraints

Example:
• A department has exactly one manager and an
employee can manage at most one department.
• Specify (0,1) for participation of EMPLOYEE in
MANAGES.
• Specify (1,1) for participation of DEPARTMENT in
MANAGES.

Attributes of Relationship Types

• An attribute may be added to a relationship type.

• Case 1:
• Include the date on which an employee started working in a department via an attribute StartDate for the WORKS_FOR relationship type.

• Case 2:
• Include the date on which manager started managing a
department via an attribute StartDate for the MANAGES
relationship type.

• Case 3:
• Include an attribute Hours for the WORKS_ON relationship type
to record the number of hours per week that a particular
employee works on a particular project.

Weak Entity Types

• An entity that does not have a key attribute and that is identification-dependent on another entity type.
• A weak entity must participate in an identifying relationship type with an owner or identifying entity type.
• Entities are identified by the combination of:
• A partial key of the weak entity type.
• The particular entity they are related to in the identifying relationship type.
• Example:
• A DEPENDENT entity
• DEPENDENT is a weak entity type
• EMPLOYEE is its identifying entity type or owner
entity type via the identifying relationship type
DEPENDENT_OF
• Notation: Double Diamond
• A weak entity type always has a total participation
constraint (existence dependency) with respect to
its identifying relationship.
• Partial key: Name of DEPENDENT is the partial key.
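A hedged sketch of how this weak entity is typically mapped to a table: the primary key combines the owner's key with the partial key, and the foreign key to the owner is mandatory, reflecting the total participation (names follow the COMPANY schema and are illustrative).

    CREATE TABLE DEPENDENT (
      Essn           CHAR(9) NOT NULL REFERENCES EMPLOYEE(Ssn),  -- identifying owner: the employee
      Dependent_name VARCHAR(25) NOT NULL,                       -- partial key of the weak entity
      Sex            CHAR(1),
      Bdate          DATE,
      Relationship   VARCHAR(15),
      PRIMARY KEY (Essn, Dependent_name)                         -- owner key + partial key identify a dependent
    );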

The Enhanced Entity-Relationship (EER) Model

The Enhanced Entity-Relationship (EER) Model

• Additional semantic data modeling concepts were incorporated into conceptual data models for
databases that have requirements that are more complex than the more traditional applications.
• EER stands for Enhanced ER or Extended ER.
• EER Model Concepts:
• Includes all modeling concepts of basic ER.
• Additional concepts:
• Subclasses/superclasses
• Specialisation/generalisation
• Attribute and relationship inheritance
• Constraints on specialisation/generalisation
• The additional EER concepts are used to model applications more completely and more accurately.
• EER includes some object-oriented concepts, such as inheritance.

Enhanced ER (EER) Model

• Introduce Class/subclass relationships and type inheritance into the ER model.


• Concepts of specialisation and generalisation.
• Exercise:
• Consider the following requirement, where Dependents of an Employee could be of different types, with their own specific attributes.
• Spouse (Occupation, Name, DOB...)
• Children (Grade, Name, DOB...)
• How can this be incorporated in the Dependent entity? Let's understand the concepts needed to
implement this.

Subclasses and Superclasses


• An entity type may have additional meaningful subgroupings of its entities.
• Example: EMPLOYEE may be further grouped into:
• SECRETARY, ENGINEER, TECHNICIAN,
• Based on the EMPLOYEE's Job
• MANAGER
• EMPLOYEEs who are managers (the role they play)
• SALARIED_ EMPLOYEE, HOURLY EMPLOYEE
• Based on the EMPLOYEE’s method of pay
• EER diagrams extend ER diagrams to represent these additional subgroupings, called subclasses Or
subtypes.

• Each of these subgroupings is a subset of EMPLOYEE entities.


• Each is called a subclass of EMPLOYEE.
• EMPLOYEE is the superclass for each of these subclasses.
• These are called superclass/subclass relationships:
• EMPLOYEE/SECRETARY
• EMPLOYEE/TECHNICIAN
• EMPLOYEE/MANAGER…
• These are also called IS-A relationships:
• SECRETARY IS-A EMPLOYEE
• TECHNICIAN IS-A EMPLOYEE ….

• Note: An entity that is a member of a subclass represents the same real-world entity as some member of the superclass:
• The subclass member is the same entity in a distinct specific role.
• An entity cannot exist in the database merely by being a member of a subclass; it must also be a member of the superclass.
• A member of the superclass can be optionally included as a member of any number of its subclasses.

• Examples:
• A salaried employee who is also an engineer belongs to the two subclasses:
• ENGINEER, and
• SALARIED EMPLOYEE
• A salaried employee who is also an engineering manager belongs to the three subclasses:
• MANAGER,
• ENGINEER, and
• SALARIED_EMPLOYEE
• It is not necessary that every entity in a superclass be a member of some subclass.

Attribute Inheritance in Superclass / Subclass Relationships

• An entity that is a member of a subclass inherits:


• All attributes of the entity as a member of the superclass.
• All relationships of the entity as a member of the superclass.
• Example:
• In the previous slide, SECRETARY (as well as TECHNICIAN and ENGINEER) inherit the attributes
Name, SSN, ..., from EMPLOYEE.
• Every SECRETARY entity will have values for the inherited attributes.

Specialisation

• Specialisation is the process of defining a set of subclasses of a superclass.


• The set of subclasses is based upon some distinguishing characteristics of the entities in the superclass.
• Example: {SECRETARY, ENGINEER, TECHNICIAN} is a specialisation of EMPLOYEE based upon job type.
• Example: MANAGER is a specialisation of EMPLOYEE based on the role the employee plays.
• May have several specialisations of the same superclass.

• Example: Another specialisation of EMPLOYEE based on method of pay is {SALARIED_EMPLOYEE, HOURLY_EMPLOYEE}.
• Superclass/subclass relationships and specialisation can be diagrammatically represented in EER
diagrams.
• Attributes of a subclass are called specific or local attributes.
• For example, the attribute Typing Speed of SECRETARY.

• Example: Another specialisation of EMPLOYEE based on method of pay is {SALARIED_EMPLOYEE, HOURLY_EMPLOYEE}.
• The subclass can also participate in specific relationship types.

Generalisation

• Generalisation is the reverse of the specialisation process.
• Several classes with common features are generalised into a superclass.
• Original classes become its subclasses.
• Example: CAR, TRUCK generalised into VEHICLE.
• Both CAR, TRUCK become subclasses of the
superclass VEHICLE.
• We can view {CAR, TRUCK} as a specialisation
of VEHICLE.
• Alternatively, we can view VEHICLE as a
generalisation of CAR and TRUCK.

Constraints on Specialisation and Generalisation

• If we can determine exactly those entities that will become members of each subclass by a condition, the subclasses are called predicate-defined (or condition-defined) subclasses.
• The condition is a constraint that determines subclass membership.
• A predicate-defined subclass is displayed by writing the predicate condition next to the line attaching the subclass to its superclass.

• If all subclasses in a specialisation have their membership condition on the same attribute of the superclass, the specialisation is called an attribute-defined specialisation.
• The attribute is called the defining attribute of the specialisation.
• Example: JobType is the defining attribute of the specialisation {SECRETARY, TECHNICIAN, ENGINEER} of EMPLOYEE.

• If no condition determines membership, the subclass is called user-defined.

• Membership in a subclass is determined by the database users by applying an operation to add an entity to the subclass.
• Membership in the subclass is specified individually for each entity in the superclass by the user.

• Two basic constraints can apply to a specialisation/generalisation:


• Disjointness Constraint
• Completeness Constraint

• Disjointness Constraint:
• Specifies that the subclasses of the specialisation must be disjoint:
• An entity can be a member of at most one of the subclasses of the specialisation.
• It is specified by d in the EER diagram.
• If not disjoint, the specialisation is overlapping.
• That is, the same entity may be a member of more than one subclass of the specialisation.
• It is specified by o in the EER diagram.
• Completeness Constraint:
• Total specifies that every entity in the superclass must be a member of some subclass in the specialisation/generalisation.
• Shown in EER diagrams by a double line.

• Partial allows an entity not to belong to any of the subclasses.


• Shown in EER diagrams by single line.
• Hence, we have four types of specialisation/generalisation:
• Disjoint total
• Disjoint partial
• Overlapping total
• Overlapping partial

Example of Overlapping Total Specialisation

Displaying an Attribute-Defined Specialisation in EER Diagrams


Week-4

Formal Query Languages

• One of the expectations from relational model is that all data can be queried.

• The formal languages provide query constructs.

• RDBMS providers implement the capabilities of constructs in their products.

Declarative vs Procedural Languages

• Declarative languages provide the ability to state what you want to query without ambiguity.

• For example, Tuple Relational Calculus and Domain Relational Calculus.

• Procedural languages have finer constructs that enable us to state the sequence of steps to obtain query results.

Unary Operations

• Unary operations are done on single relation.


• They help us query any part of a relation.
• Unary operations in relational algebra are:
• SELECT
• PROJECT

Video Transcript:
The two main unary operators that we require are SELECT and PROJECT. So what SELECT is doing is
taking out of the tuples that are there in the relation. It is selecting some of those tuples and showing. And
then, what PROJECT is doing? We take a relation which is consisting of many columns or attributes, and
we require only a subset of them. Then we apply PROJECT, saying that we want to get only part of the set
of the columns. So the both of these apply on a single relation.

Binary Operations

• Binary operations are applied on two relations.


• Binary operations in relational algebra are:
• JOIN
• Several variants of JOIN
• Binary operations combine two relations in multiple ways into a single relation.
Set Operations

• Relational Algebra considers relations as sets.


• Relational algebra includes set operations:
• UNION
• INTERSECTION
• MINUS
• CROSS PRODUCT
• All of the above operations result in a single relation.

Aggregate Functions

• Relational Algebra supports aggregate functions.


• Many queries are for aggregates such as:
• Sum
• Average
• Maximum
• Such queries cannot be answered without aggregate functions.

Relational Algebra - Unary & Set operations

Unary Relational Operation: SELECT

• The symbol σ (sigma) is used to represent the operator.

• The selection condition is a boolean expression on the attributes of relation R.


• The select operation results in a new relation such that:
• Tuples for which the selection condition is true are included.
• Remaining tuples are excluded.
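A small worked illustration, assuming the COMPANY schema used in these notes: to keep only the employee tuples whose salary exceeds 50000, write σ Salary>50000 (EMPLOYEE). The result has the same attributes as EMPLOYEE but contains only the qualifying tuples. The approximate SQL equivalent of this single select operation is:

    SELECT *
    FROM EMPLOYEE
    WHERE Salary > 50000;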
Unary Relational Operation: PROJECT

Combining Relational Operations

• Often a single operation of Relational Algebra is not adequate for real-world queries.

• Relational Algebra (like Algebra) allows us to write complex expressions that combine multiple
operations.

• Like in Algebra, if you think an expression is becoming complex, you can break it into multiple simpler expressions (with a few additional variables), e.g.:
• x = a+(b*(c+d*e)) can be written as:
• y = c+d*e, and x = a+b*y.

• Here is an example:
• Say, we want to obtain the first name, last name, and salary of all employees who work in department number 5.
• This requires both select and project operations, as shown in the sketch below.
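A sketch of that combined expression, using the σ and π notation introduced above (attribute names follow the COMPANY schema and are illustrative): π Fname, Lname, Salary (σ Dno=5 (EMPLOYEE)). The inner σ keeps the tuples of employees in department 5, and the outer π keeps only the three requested attributes. Broken into two simpler steps with an intermediate relation, as suggested above: DEP5_EMPS ← σ Dno=5 (EMPLOYEE), followed by π Fname, Lname, Salary (DEP5_EMPS). The approximate SQL equivalent is:

    SELECT Fname, Lname, Salary
    FROM EMPLOYEE
    WHERE Dno = 5;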
Unary Operation: Rename

Example for Sequence of Operations

Type Compatible Relations


• Relations R(A1,A2,...An) and S(B1,B2,…Bm) are considered type compatible only if:
• m = n, i.e. they have the same number of attributes, and
• Domain (Ai) = Domain (Bi) for all values of i.

Set Operation: UNION

• The UNION operation is represented by the symbol U.


• It is a binary operation on two type compatible relations.
• R U S results in a new relation that contains all tuples in either R or S, or both.
• The tuples that are common to R and S are represented in R U S as a single tuple, since
• Relational Algebra does not allow duplicates in a relation.

Set Operation: INTERSECTION

• The INTERSECTION operation is represented by symbol ∩.


• It is also a binary operation on two type compatible relations.
• R∩S results in a new relation that contains ONLY the tuples that are present in both R and S.

Set Operation: MINUS

• The MINUS operation is represented by the symbol "−".
• It is also referred to as Set Difference.
• It is also a binary operation on two type compatible relations.
• R − S results in a new relation that contains the tuples that are present in R, but NOT in S.
Example of Set Operations
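A hedged SQL illustration of the three set operations, assuming two type compatible relations STUDENT(Name) and INSTRUCTOR(Name); INTERSECT and EXCEPT are standard SQL but not supported by every product (Oracle spells EXCEPT as MINUS).

    SELECT Name FROM STUDENT
    UNION
    SELECT Name FROM INSTRUCTOR;    -- everyone who is a student or an instructor, duplicates removed

    SELECT Name FROM STUDENT
    INTERSECT
    SELECT Name FROM INSTRUCTOR;    -- people who are both a student and an instructor

    SELECT Name FROM STUDENT
    EXCEPT
    SELECT Name FROM INSTRUCTOR;    -- students who are not instructors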

Cartesian Product Operation

• Cartesian Product of Relational Algebra is represented by symbol X.


• Cartesian Product is also referred as Cross Product.
• It is also a binary operation (like U, ∩, and −).
• However, it does not require type compatibility between two participating relations.

Cartesian Product Operation

• Cartesian Product of relations:


• R (A1, A2, ..., An) and
• S (B1, B2, ..., Bm)
• results in a new relation (let us call it Q).
• Q will have the structure of:
• Q (A1, A2, ..., An, B1, B2, ..., Bm).
• Every tuple in R will combine with every tuple of S to create a tuple in Q.
• Thus, if R has nR tuples and S has nS tuples, Q will have nR * nS tuples.

Example of Cartesian Product


Relational Algebra - Binary & Aggregate operations

JOIN Operation

• JOIN operation is represented by ⋈


• JOIN combines Cartesian Product with SELECT such that tuples have a meaningful relation in the result.
• Since most databases have a large number of relations, JOIN is important for getting useful results for most queries.
• JOIN of two relations R and S is written as:
• R ⋈<join condition> S.
• JOIN of relations:
• R (A1, A2, ..., An) and
• S (B1, B2, ..., Bm).
• This results in a new relation (let us call it Q).
• Q will have the structure of:
• Q (A1, A2, ..., An, B1, B2, ..., Bm).

• You may notice that the structure is similar to what you get with Cross Product.

• However, the selection condition keeps only those tuples in Q, which satisfy the <join condition>, which
will be based on attributes in R and S.

• Thus if R has nR tuples and S has nS tuples, Q will have <= nR * nS tuples.

Example of JOIN
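A hedged worked example using the COMPANY schema from Week-2: to pair each department with the employee tuple of its manager, write DEPARTMENT ⋈ Mgr_ssn=Ssn EMPLOYEE. Each DEPARTMENT tuple combines only with the EMPLOYEE tuple whose Ssn equals its Mgr_ssn (this is also an EQUIJOIN, since the condition is an equality check). The approximate SQL equivalent is:

    SELECT D.*, E.*
    FROM DEPARTMENT AS D
    JOIN EMPLOYEE AS E ON D.Mgr_ssn = E.Ssn;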

EQUIJOIN Operation

• In the previous video, we looked at the JOIN operation, which combines the Cartesian Product with a selection condition.
• The selection condition can be any Boolean expression involving attributes.
• When the Boolean expression is based on equality check, the join is called EQUIJOIN.
• This is the most common usage of JOIN.

Natural JOIN Operation

• We have seen JOIN and a special and popular form of JOIN called EQUIJOIN.
• NATURAL JOIN is a special case of EQUIJOIN.
• NATURAL JOIN is denoted by symbol *
• NATURAL JOIN between two relations is an EQUIJOIN over all common attributes in two relations.
• NATURAL JOIN of relations:
• R (A, B, C, D, E) and
• S (C, E, F, G)
• results in a new relation (let us call it Q).
• Q will have the structure of:
• Q (A, B, C, D, E, F, G).
• Each tuple in R is joined with the tuple in S that has the same values for the common attributes C and E.
• Q = R * S
• = R ⋈ R.C=S.C AND R.E=S.E S
• You will notice that Q has only one column for each of the common attributes C and E.

Example for JOIN

OUTER JOIN Operation

• In all the JOIN operations discussed so far, the selection condition determines the tuples to be included in
the resulting relation.
• The tuples that do not match the selection condition are eliminated from the result.
• In OUTER JOIN, tuples that do not match the selection condition are also retained in the result.

OUTER JOIN Variants

Video Transcript:
When we say it is a left outer join, it is represented by the symbol given here. Let's say there is a relation R and a relation S, and we want to take an outer join, say a left outer join. What will happen? R is on the left side of the operation and S is on the right side. When we apply a condition and take the left outer join, we get the tuples that match the condition, and we also get the remaining unmatched tuples from R. The unmatched tuples from S are dropped, because this is a left outer join, so it only keeps the unmatched tuples from the left side of the join operation. Similarly, there is a right outer join. When you apply this operation, it keeps all the tuples from the right side of the join operator, in this case S. If the tuples of R match the condition, they stay; if they do not match the condition, they are dropped. But on the right side, that is, from S, all the tuples stay in the result. When we apply the full outer join (this is the symbol for that), all the unmatched tuples of both the left and the right are retained in the result.

Example for OUTER JOIN


DIVISION Operation

• The DIVISION operation is a binary operation that is denoted by ÷ .


• The DIVISION operation is appropriate for queries involving 'for all' semantics.

Example for DIVISION

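(The original figure is garbled in this copy; the following small instance, with made-up data, illustrates the 'for all' semantics.)
• Let R(Essn, Pno) = {(e1, p1), (e1, p2), (e2, p1)} and S(Pno) = {(p1), (p2)}.
• R ÷ S = {(e1)}: only e1 appears together with every Pno value in S, i.e. only e1 works on all the projects listed in S.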

Aggregate Functions

• Queries over individual tuples alone are not adequate. Aggregates over a group of tuples are often needed
for MIS.
Grouping for Aggregate Functions

• We need a construct for creating a grouping of tuples within a relation.


• Grouping is typically done based on the value of an attribute, e.g.:

• The above expression groups tuples in the EMPLOYEE relation based on DNO (Department No), and for each
department, the number of employees and the average salary are returned (a likely form of the expression is sketched below).
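• (The grouping expression referred to above is not reproduced in this copy. In the usual textbook notation it would likely be of the form DNO ℱ COUNT Ssn, AVERAGE Salary (EMPLOYEE), where the attribute(s) to the left of ℱ define the groups and the function list on the right gives the aggregates computed per group.)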

Example for Aggregate Functions

Example for Aggregate Functions (With Grouping)

With re-naming columns.

Relational Calculus

• In Relational Algebra, a query is expressed by a set of operations required to get results.


• Relational Algebra is procedural.
• Relational Calculus is an alternate formal query language.
• It is declarative or non-procedural.
• In Relational Calculus, a query is expressed using mathematical and logical symbols.
• Relational Calculus and Relational Algebra have equal expressive power.
Variants of Relational Calculus

• There are two variants of Relational Calculus:

• Tuple Relational Calculus - here query is expressed using the characteristics of the tuples expected in
the result.
• SQL is based on Tuple Relational Calculus

• Domain Relational Calculus - here query is expressed using the domains of attributes expected in the
resulting relation.

Tuple Relational Calculus (TRC)

• A TRC query is of the form: {t|P(t)}.

• The query is expecting a relation that consists of tuples (t) matching the predicate P(t).

• An example is {t | EMPLOYEE(t) AND t.Salary > 50000}.

Existential and Universal Quantifiers

• Existential quantifier is denoted by ∃. When we write (∃t), it means 'there exists a tuple t'.

• Universal quantifier is denoted by ∀. When we write (∀t) it means 'for all tuples t'.

• Both the quantifiers help to write queries involving multiple relations, for example:
• {t.FNAME, t.LNAME | EMPLOYEE(t) and (∃d) (DEPARTMENT (d) and d.DNAME='Research' and
d.DNUMBER=t.DNO)}.

• The above query fetches names of employees in the Research department.

• If a tuple variable occurs with ∃ or ∀ quantifiers, it is called a bound variable; otherwise, it is called a free
variable.

Examples for TRC

• List the name of each employee who works on some project controlled by department number 5
{e.Lname, e.Fname | EMPLOYEE(e) AND ((∃x) (∃w) (PROJECT (x) AND WORKS_ON (w) AND x.Dnum=5
AND w.Essn=e.Ssn AND x.Pnumber=w.Pno))}.

• List the names of employees who work on all the projects controlled by department number 5.
{e.Lname, e.Fname | EMPLOYEE (e) AND ((∀x) (NOT (PROJECT (x)) OR NOT (x.Dnum=5) OR ((∃w)
(WORKS_ON (w) AND w.Essn=e.Ssn AND x.Pnumber=w.Pno))))}.

Domain Relational Calculus (DRC)

• Domain Relational Calculus is also non-procedural, like Tuple Relational Calculus.

• It has the same expressive power as Tuple Relational Calculus.

• A DRC query is of the form: {X1, X2, ..., Xn | COND(X1, X2, ..., Xn, Xn+1, Xn+2, ..., Xn+m)}.
• The query is expecting a relation that consists of attributes (X1, X2, ..., Xn) matching the predicate
COND(X1, X2, ..., Xn, Xn+1, Xn+2, ..., Xn+m).

• The condition can include variables that go beyond what is needed in query results.
Example for DRC

• List the birthdate and address of the employee whose name is 'John B. Smith'.
{uv | (∃q) (∃r) (∃s) (∃t) (∃w) (∃x) (∃y) (∃z) (EMPLOYEE(qrstuvwxyz) and q='John' and r='B' and s='Smith')}.

• Here q,r,s,t,u,v,w,x,y,z represent attributes constituting a tuple of EMPLOYEE.

Relational Calculus Example ∃ vs ∀

• List the names of employees who have no dependents.

• {q, s | (∃t) (EMPLOYEE(qrstuvwxyz) AND (NOT (∃l) (DEPENDENT(lmnop) AND t=l)))}.

• The above query can be restated using universal quantifiers instead of the existential quantifiers, as
shown below:

• {q, s | (∃t) (EMPLOYEE(qrstuvwxyz) AND ((∀l) (NOT (DEPENDENT(lmnop)) OR NOT(t=l))))}.


Week-5
Language for Relational Model

• As we have seen earlier, Relational Model was proposed to simplify and generalise data models.

• Earlier models have made assumptions about hierarchy and relations within databases and required
expertise to write queries.

• The relational model allows organising the entire data (and even metadata) as a set of relations, and queries
can connect relations based on the values of columns.

• Researchers at IBM developed a language, eventually called SQL (Structured Query Language),
that was simpler and more natural to work with the Relational model.

SQL - A Declarative Language

• SQL is declarative; users simply state what they want. It is more like Relational Calculus.

• SQL was conceived as a language that can be used by end-users to query data.

• RDBMS software implements an optimiser that finds the best way to execute the query.

SQL - A Standards Based Language

• SQL is supported as query language by all RDBMS providers.

• Standards for SQL are maintained by ANSI.

• RDBMS products try to conform to the standards.

• Usually, RDBMS systems convert an SQL query into a Relational Algebra expression. This allows the query to be
rewritten into a possibly more efficient form before execution.

SQL is not just a Query Language

• SQL statements cover:


• Data Definition.
• Data Manipulation
• Transaction Management.
• Access Control.

SQL - Feature Rich

• SQL early editions focused on working with simple tables.


• Later versions included support for:

• XML
• Object orientation
• Analytical Functions

SQL - Core & Extensions

• Starting from SQL-1999, standards for SQL are divided into Core and Extensions.

• The core has to be implemented by all SQL-compliant RDBMSs.

• The extensions can be implemented as optional modules for specific database applications.
Data Definition Language

What is Schema?

• A relational database schema groups tables and other constructs of a database application.

• An example is Company Schema, which has tables for Employees, Departments, Projects etc.

• An SQL Schema has a name and an authorisation identifier (the person who is authorised to make changes to the
data). A Schema includes:

• Tables
• Constraints
• Domains
• Views

Schema Creation

• An example is Company Schema, which has tables for Employees, Departments, Projects etc.
• SQL provides the following statement for creating a Schema:
• CREATE SCHEMA <schema name> <parameters>, as illustrated below.
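• A minimal sketch (the schema name and authorisation identifier below are illustrative assumptions, not from the original slides):

CREATE SCHEMA COMPANY AUTHORIZATION 'Jsmith';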

What is a Catalog?

• SQL uses the concept of a catalog, which has information on collection of schemas.

• Typically, a catalog is a collection of tables with metadata about databases.

• Schemas within a catalog can share certain elements, such as type definitions.

CREATE Statement

• SQL uses the terms table, row, and column for the formal relational model terms relation, tuple, and attribute.

• CREATE statement is used to create tables (relations), as well as other constructs such as virtual tables (views), triggers etc.

• CREATE statement can be specified with explicit schema or implicit (default) schema:

• CREATE TABLE COMPANY.EMPLOYEE
OR
• CREATE TABLE EMPLOYEE

CREATE TABLE Statement

• CREATE TABLE statement can be specified with all attributes and constraints,
OR
• CREATE TABLE statement can be specified with partial or no information, and attributes and constraints
are added later using the ALTER statement.
CREATE TABLE Example

• CREATE TABLE EMPLOYEE


( Fname VARCHAR(15) NOT NULL,
Minit CHAR,
Lname VARCHAR(15) NOT NULL,
Ssn CHAR(9) NOT NULL,
Bdate DATE,
Address VARCHAR(30),
Sex CHAR,
Salary DECIMAL(10,2),
Super_ssn CHAR(9),
Dno INT NOT NULL,
PRIMARY KEY (Ssn) );

• CREATE TABLE DEPARTMENT


( Dname VARCHAR(15) NOT NULL,
Dnumber INT,
Mgr_ssn CHAR(9) NOT NULL,
Mgr_start_date DATE,
PRIMARY KEY (Dnumber),
UNIQUE (Dname),
FOREIGN KEY (Mgr_ssn) REFERENCES
EMPLOYEE(Ssn));

Circular Reference

• If you looked carefully at the EMPLOYEE table definition, we did not include FOREIGN KEY constraints.

• A foreign key constraint can be specified with an existing table only.

• The foreign key Super_ssn in the EMPLOYEE table refers to the EMPLOYEE table itself.

• The foreign key Dno in the EMPLOYEE table refers to the DEPARTMENT table.

• These constraints can be added later using the ALTER TABLE statement.

Common Constraints

Commonly used constraints include:


• Key constraint
• Referential integrity constraint
• Null constraint
• Attribute domain constraint

Attribute Constraints

• If we do not want an attribute to be NULL, a constraint can be specified accordingly.


• NOT NULL constraint implicitly applies for attributes that are part of the Primary Key.
• DEFAULT value can be specified for an attribute.
• CHECK clause can be applied on an attribute value.
• Attribute datatype can be specified by a domain. A domain definition can include constraints.

Key Constraints

• A relation may have one or more attributes that make up the primary key:
• Dnumber INT PRIMARY KEY or
• PRIMARY KEY (Dnumber, Dlocation)
Key Constraints

• A relation may have a secondary key that can be specified using the UNIQUE clause.

• Referential integrity can be specified in a relation with the FOREIGN KEY clause.

• By default, SQL does not allow FOREIGN KEY violation during insert/delete/update operation.

• A schema designer can provide alternative actions when the constraint is violated.

Referential Integrity Actions

• When the FOREIGN KEY constraint is violated:


• Default Action: Reject the operation that violates the constraint.

• Alternative options are:

• SET NULL
• Value of the referencing attributes is changed to NULL.
• SET DEFAULT
• Value of the referencing attributes is changed to default value.
• CASCADE
• ON DELETE causes to delete all the referencing tuples, ON UPDATE causes to update all the
referencing tuples.

CONSTRAINT Example

CREATE TABLE EMPLOYEE


( ... ,
Dno INT NOT NULL DEFAULT 1,
CONSTRAINT EMPPK
PRIMARY KEY (Ssn),
CONSTRAINT EMPSUPERFK
FOREIGN KEY (Super_ssn) REFERENCES
EMPLOYEE(Ssn)
ON DELETE SET NULL ON UPDATE
CASCADE,
CONSTRAINT EMPDEPTFK
FOREIGN KEY(Dno) REFERENCES
DEPARTMENT(Dnumber)
ON DELETE SET DEFAULT ON UPDATE
CASCADE);

CREATE TABLE DEPARTMENT


( ... ,
Mgr_ssn CHAR(9) NOT NULL DEFAULT
'888665555',
CONSTRAINT DEPTPK
PRIMARY KEY(Dnumber),
CONSTRAINT DEPTSK
UNIQUE (Dname),
CONSTRAINT DEPTMGRFK
FOREIGN KEY (Mgr_ssn) REFERENCES
EMPLOYEE(Ssn)
ON DELETE SET DEFAULT ON UPDATE
CASCADE);
SCHEMA Changes

• SQL offers commands to change a database schema:


• DROP
• ALTER

• Schema changes do NOT require any recompilation of programs or schemas.


Video transcript:
The commands offered by SQL for changing the database schema are DROP and ALTER. When the
schema is changed, we do not need to make changes to any of the SQL commands or if we are
using SQL as an embedded SQL within a program. There is no need to make changes to any of them or to
do any compilations. Of course, if we are dropping something that is used in those SQL statements, then
those SQL statements do not work. But if we are not dropping anything that is referenced within the SQL
statement, we simply do not need to take any action.

DROP Statement

• DROP statement is used to drop named schema elements:


• Tables
• Domains
• Constraint
• DROP uses two options:
• CASCADE: Constraints and views, that reference the table are also dropped
• RESTRICT: Table is dropped only if it is not referenced in any constraints or views

ALTER Statement

• ALTER TABLE statement is used to make changes to a table:


• Add/Drop a column
• Change a column definition
• Add/Drop a constraint

It is also possible to ALTER:


• Schema
• View
• Column

DROP Example

• DROP SCHEMA COMPANY CASCADE;


• This removes the schema and all its elements including tables, views, constraints, etc.

• DROP SCHEMA COMPANY RESTRICT;


• The schema is dropped only if it contains no elements.

• DROP TABLE DEPENDENT CASCADE;


• The DEPENDENT table is dropped from the COMPANY schema and all constraints, view that reference
the table are also dropped.

ALTER Example

• ALTER TABLE COMPANY.EMPLOYEE


ADD COLUMN Job VARCHAR(12);

• Adds a new column but puts no values in the column.

• ALTER TABLE COMPANY.EMPLOYEE


DROP COLUMN Address CASCADE;
• The Address column is dropped from the EMPLOYEE table. All views and constraints that refer to the
column are also dropped.

• ALTER TABLE
COMPANY.DEPARTMENT ALTER
COLUMN Mgr_ssn DROP DEFAULT;

• ALTER TABLE
COMPANY.DEPARTMENT ALTER
COLUMN Mgr_ssn SET DEFAULT
'333445555';
• Above 2 commands affect default values.

• ALTER TABLE COMPANY.EMPLOYEE


DROP CONSTRAINT EMPSUPERFK
CASCADE;
• Drops the constraint named EMPSUPERFK from the EMPLOYEE relation.

SELECT Statement

Retrieval in SQL
• SELECT statement is for retrieving information from a database.
• SQL SELECT is NOT the same as the SELECT operation in Relational Algebra.

• Important distinction between SQL and the formal relational model:


• SQL allows a table (relation) to have two or more tuples that are identical in all their attribute values.

• Thus, an SQL relation (table) is a multi-set (sometimes called a bag) of tuples; it is not a set of tuples.
• SQL relations can be constrained to be sets by having PRIMARY KEY or UNIQUE attributes.

SELECT Statement

• Basic form of the SQL SELECT statement is a SELECT-FROM-WHERE block.

SELECT <attribute list>


FROM <table list>
WHERE <condition>

• <attribute list> is a list of attribute names whose values are to be retrieved by the query.
• <table list> is a list of the relation names required to process the query.
• <condition> is a conditional (Boolean) expression that identifies the tuples to be retrieved by the query.

Example for SELECT Statement


SELECT Statement Example-1

• Let us look at an example of a simple query on one relation:


• Retrieve the birthdate and address of the employee whose name is 'John B. Smith'.

SELECT BDATE, ADDRESS


FROM EMPLOYEE
WHERE FNAME='John' AND MINIT='B'
AND LNAME='Smith'

• Similar to a SELECT-PROJECT pair of relational algebra operations:


• The SELECT clause specifies the projection attributes and the WHERE clause specifies the
selection condition.

SELECT Statement Example-2

• Retrieve the name and address of all employees who work for the ‘Research' department.

SELECT FNAME, LNAME, ADDRESS


FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND
DNUMBER=DNO;

• Similar to a SELECT-PROJECT-JOIN sequence of relational algebra operations.


• (DNAME='Research') is a selection condition (corresponds to a SELECT operation in relational algebra).
• (DNUMBER=DNO) is a join condition (corresponds to a JOIN operation in relational algebra).

SELECT Statement Example-3

• For every project located in 'Stafford', list the project number, the controlling department number, and
the department manager's last name, address, and birthdate.

SELECT PNUMBER, DNUM, LNAME, BDATE, ADDRESS


FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRSSN=SSN
AND PLOCATION='Stafford'

• There are two join conditions in the example.


• The join condition DNUM=DNUMBER relates a project to its controlling department.
• The join condition MGRSSN=SSN relates the controlling department to the employee who manages that
department.
More on SELECT Statement

• If WHERE clause is missing, there is no tuple selection.

• All tuples in relation are retrieved as per SELECT, FROM clauses for a single relation query.
• For a multi-relation query cross-product of relations in FROM clause is used.

• Asterisk(*) in SELECT clause, retrieves all the attribute values of the selected tuples.
• The ORDER BY clause can be inserted to order the tuples in the result of a query by the values of one or
more of the attributes.

Ambiguous Attribute Names

• It is possible to have same attribute name used across multiple relations.

• In a SELECT statement involving multiple relations with the same attribute name, a dot (.) notation can be
used to resolve the ambiguity in reference to the attribute.

• Dnumber is common to both Department and Dept_Locations.


• We can resolve ambiguity in SQL by using fully qualified names Department.Dnumber and
Dept_Locations.Dnumber.

Example for Ambiguous Names
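(The original example slide is not reproduced in this copy; a hedged sketch, assuming the usual DEPT_LOCATIONS table with Dnumber and Dlocation columns.)

SELECT DEPARTMENT.Dname, DEPT_LOCATIONS.Dlocation
FROM DEPARTMENT, DEPT_LOCATIONS
WHERE DEPARTMENT.Dnumber = DEPT_LOCATIONS.Dnumber;

• The fully qualified names make it unambiguous which Dnumber column each reference means.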

Tuple Variables

• Some SQL queries refer to the same relation twice.


• For each employee, retrieve the employee’s name, and the name of his or her immediate supervisor.

SELECT E.Fname, E.Lname, S.Fname,


S.Lname
FROM EMPLOYEE AS E, EMPLOYEE AS S
WHERE E.Super_ssn = S.Ssn;

• Here there is a self-join of EMPLOYEE with itself.


• S and E are tuple variables.
• E represents employees in the role of supervisees or subordinates, while S represents employees in the
role of supervisors.

Example for Tuple Variables


Aliases

• Aliases can be used for resolving ambiguity, improving readability, and providing understandable
labels.

SELECT E.Fname AS First_Name, E.Lname AS Last_Name, E.Address


FROM EMPLOYEE AS E, DEPARTMENT AS D
WHERE D.Dname = 'Research' AND D.Dnumber = E.Dno;

Relations vs Sets

• Recall:
• SQL allows a table (relation) to have two or more tuples that are identical in all their attribute values.
• Thus, an SQL relation (table) is a multiset (sometimes called a bag) of tuples; it is not a set of tuples.
• SQL relations can be constrained to be sets by having PRIMARY KEY or UNIQUE attributes.

Set Operations in SQL

• SQL directly incorporates some of the set operations from set theory.
• Set union (UNION)
• Set difference (EXCEPT)
• Set intersection (INTERSECT)
• Set operations apply only to type-compatible relations.
• Results from these set operations are sets of tuples; that is, duplicate tuples are eliminated from the
result.

Multiset Operations in SQL

• SQL also supports corresponding multiset operations for relations.


• Multiset union (UNION ALL)
• Multiset difference (EXCEPT ALL)
• Multiset intersection (INTERSECT ALL)
• Multiset operations also apply only to type-compatible relations.
• Results from these operations are multisets of tuples; that is, duplicate tuples can exist.

Multiset Example

SQL Set Example
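(The original example slides are not reproduced in this copy; a small hedged sketch contrasting set and multiset behaviour on the COMPANY schema.)

( SELECT Essn FROM WORKS_ON WHERE Pno = 1 )
UNION
( SELECT Essn FROM WORKS_ON WHERE Pno = 2 );

• UNION removes duplicates, so an employee working on both projects appears once.
• Replacing UNION with UNION ALL would keep the duplicate, so such an employee would appear twice.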


SQL Modification Statements

INSERT Statement

• In its simplest form, the INSERT statement is used to insert a tuple into a relation.
• The relation name and a list of values for the tuple have to be specified.

INSERT INTO <table name>


VALUES <values list>

• The values need to be listed in the same order as they were specified during CREATE TABLE.

A Variant of the INSERT Statement

• Another form of INSERT statement lists column names for the relation name and a list of values for the
columns listed.

INSERT INTO <table name> (<column list>)


VALUES <Values list>

• It is sufficient if we include columns for which we want to insert values.


• Remaining columns will have NULL or default values.
• Column list must include columns with NOT NULL constraint, if a default value is not defined.

SQL Insert Example

• INSERT statement for all columns in a relation.

INSERT INTO EMPLOYEE


VALUES ( 'Richard', 'K', 'Marini', '653298653', '1962-12-30', '98
Oak Forest, Katy, TX', 'M', 37000, '653298653', 4 );

• INSERT statement for some columns in a


relation.

INSERT INTO EMPLOYEE (Fname, Lname, Dno, Ssn)


VALUES ('Richard', 'Marini', 4, ‘653298653');

SQL Insert Constraints


• INSERT statement supports all constraints defined during table creation.
• Consider the following statement:

INSERT INTO EMPLOYEE (Fname, Lname, Ssn, Dno)


VALUES ('Robert', 'Hatcher', '980760540', 2);

• The above statement will fail if a referential integrity constraint exists on the department number (Dno) and
no department with number 2 exists in the DEPARTMENT table.

• Consider the following statement:

INSERT INTO EMPLOYEE (Fname, Lname, Dno)


VALUES ('Robert', 'Hatcher', 5);

• The above statement will fail if Ssn is defined as the Primary Key, or a NOT NULL constraint
exists on the column with no default value.
INSERT for Multiple Tuples

• Another form of INSERT statement omits VALUES clause and uses a SELECT statement to obtain query
results and then inserts those tuples into the relation.

INSERT INTO ‹table name> (<column list>)


SELECT Statement

Video Transcript:
Now let us look at inserting the multiple tuples. There are a couple of things that we should know. First
thing is very often the RDBMS providers offer a bulk loading tool depending on the various vendors, so
you're going to get the tool with the appropriate directions. So you can use that tool to make a bulk
loading. And it is also often happens that people load many records at a time using the programs. So what
they can do is that they write embedded SQL within the program and then they load the individual records
in a loop. So the data, say, is provided in the form of a file or CSV, and what happens is the program reads
tuple by tuple and is going to load in the relation. So another way we can do it is that using the SQL itself,
but for this the data should have been already loaded in another table. So we already have the data in
another table and we want to selectively insert it into the new table where we want to upload. So we
want to insert into this table name that we are going to give and then the column list and then we provide
the SQL statement that will be extracting the tuples that should be inserted.

Insert Query Results

• In COMPANY database, we want to create a temporary table that has the employee last name, project
name, and hours per week for each employee working on a project.

CREATE TABLE WORKS_ON_INFO


( Emp_name VARCHAR(15),
Proj_name VARCHAR(15),
Hours_per_week DECIMAL(3,1) );

• INSERT statement for loading above table from EMPLOYEE, PROJECT, and WORKS_ON tables.

INSERT INTO WORKS_ON_INFO


SELECT E.Lname, P.Pname, W.Hours
FROM PROJECT P, WORKS_ON W, EMPLOYEE E
WHERE P.Pnumber = W.Pno AND W.Essn = E.Ssn;

UPDATE Statement

• The UPDATE statement is used to update values of listed columns in a relation.

UPDATE <table name>


SET column_name = value1, ...
WHERE <condition>

• It can include any number of columns from the table.


• Values can be constants or expressions based on other columns.
• WHERE clause restricts tuples to be updated.

UPDATE Considerations

• Integrity constraints declared in the DDL are enforced during UPDATE.


• All new column values should match the declared domains.
• If UPDATE of a column is declared as UNIQUE, it will fail if the value already exists in the table.
• If updated column is a PRIMARY KEY involved in referential integrity constraints with CASCADE option,
tuples referring to the column get updated.
SQL UPDATE Example

• Example: Change the location and controlling department number of project number 10 to 'Bellaire' and
5, respectively.

UPDATE PROJECT
SET PLOCATION = 'Bellaire', DNUM = 5
WHERE PNUMBER = 10

SQL UPDATE Example - 2

• Example: Give all employees in the 'Research' department a 10% raise in salary.

UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO = 5

• In this request, the modified SALARY value depends on the original SALARY value in each tuple.
• The reference to the SALARY attribute on the right of = refers to the old SALARY value before
modification.
• The reference to the SALARY attribute on the left of = refers to the new SALARY value after
modification.

SQL UPDATE Example - 2

• The WHERE clause can include SQL Statement, making it a nested statement.

UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO IN (SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME = 'Research');

SQL UPDATE Example-3

• Example: Change the department number of Research department from 5 to 6.

UPDATE DEPARTMENT
SET DNUMBER = 6
WHERE DNUMBER = 5

• Consider EMPLOYEE Table with FOREIGN KEY reference to DEPARTMENT


CREATE TABLE EMPLOYEE
( ... ,
Dno INT NOT NULL DEFAULT 1,
CONSTRAINT EMPPK
PRIMARY KEY (Ssn),
CONSTRAINT EMPSUPERFK
FOREIGN KEY (Super_ssn) REFERENCES
EMPLOYEE (Ssn) ON DELETE SET NULL ON
UPDATE CASCADE,
CONSTRAINT EMPDEPTFK
FOREIGN KEY (Dno) REFERENCES
DEPARTMENT (Dnumber) ON DELETE SET
DEFAULT ON UPDATE CASCADE );

• When this statement is executed, it will make updates to EMPLOYEE table as per constraint
EMPDEPTFK. All Employees who had 5 as Dno will have it updated to 6.
DELETE Statement

• The DELETE statement is used to delete tuples from a relation.

DELETE FROM <table name>


WHERE <condition>

• WHERE clause controls tuples to be deleted.

DELETE Considerations

• Integrity constraints declared in the DDL are enforced during DELETE.


• The number of tuples to be deleted depends on WHERE condition.
• If the relation has a PRIMARY KEY involved in referential integrity constraints with CASCADE option,
tuples of other relations referring to the relation (in DELETE statement) get deleted.

• A missing WHERE-clause specifies that all tuples in the relation are to be deleted; the table then
becomes an empty table.

Table for DELETE Statement

SQL DELETE Examples

• Example: Delete all employees with last name of "Brown".

DELETE FROM EMPLOYEE


WHERE Lname = 'Brown';

• Since there is no one with the given last name, no deletions happen.
• Example: Delete all employees with SSN of "123456789".

DELETE FROM EMPLOYEE


WHERE Ssn = ‘123456789’;

• Since Ssn is unique, at most one tuple can exist with the value. Since there is one such tuple, it will be
deleted.

• Example: Delete all employees with Dno of 5.

DELETE FROM EMPLOYEE


WHERE Dno = 5;

• There are four employees working in Dno of 5. So 4 tuples will be deleted.

• Example: Delete all employees in the employee table.

DELETE FROM EMPLOYEE;

• Since no WHERE clause is given, all tuples in the table are deleted.
Advanced Queries in SQL

Nested Queries

• A nested query in SQL contains a query within another query.

• The inner query is included as part of WHERE clause of the outer query.

• The inner query is a complete SQL query.

• Nested queries may be correlated.


• The inner query has to be evaluated for every tuple in the outer query.
• We can have several levels of nested queries.

Nesting Constructs

• A nested query can take the form of:

SELECT [column_name]


FROM [table_name]
WHERE expression operator
{ALL | IN | ANY | SOME} (subquery)

• IN construct helps in testing membership.


• Other constructs help with checking membership and also comparison with sets.
• ANY and SOME serve the same purpose.

Nested Query Examples


• Example: Retrieve the name and address of all employees who work for the ‘Research' department.

SELECT FNAME, LNAME, ADDRESS


FROM EMPLOYEE
WHERE DNO IN (SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME = 'Research')

• There is no correlation between inner query and outer query.

• Example: Retrieve the name of each employee who has a dependent with the same first name as the
employee.
SELECT E.FNAME, E.LNAME
FROM EMPLOYEE AS E
WHERE E.SSN IN (SELECT ESSN
FROM DEPENDENT
WHERE ESSN=E.SSN AND
E.FNAME=DEPENDENT_NAME);

• Here we can see correlation between inner and outer queries.

• Example: Retrieve the names of employees whose salary is greater than the salary of all the employees
in department 5.

SELECT Lname, Fname


FROM EMPLOYEE
WHERE Salary > ALL (SELECT Salary
FROM EMPLOYEE WHERE Dno = 5);
EXISTS Function

• Powerful and readable nested queries can be built using EXISTS function.
• The EXISTS function, applied on a query, returns true if the query returns one or more tuples, else it returns false.
• The function is applied on the inner query as part of WHERE clause of the outer query.
• Exists function can be used with Correlated queries for solving complex queries.

EXISTS Function Examples

• Example: Retrieve the names of employees who have no dependents.

SELECT FNAME, LNAME


FROM EMPLOYEE
WHERE NOT EXISTS (SELECT *
FROM DEPENDENT
WHERE SSN = ESSN);

DIVISION with EXISTS Examples

• Example: Retrieve names of employees who work on all the projects controlled by department number
5.

SELECT FNAME, LNAME


FROM EMPLOYEE
WHERE NOT EXISTS (SELECT Pnumber
FROM PROJECT
WHERE Dnum = 5
EXCEPT ( SELECT Pno
FROM WORKS_ON
WHERE Ssn = Essn));

Video Transcript:
So what the inner query is doing is first it is extracting the project numbers of all the projects where the
controlling department is five. So it is getting project numbers. So it is retrieving all the project
numbers. And then we are using the minus operation, the sets. And then we are extracting from the works
on all the project numbers where a specific employee works. This is a correlated query. The SSN is taken from the
employee table. So we are getting all the project numbers where the employee is working. So we are
making a comparison of all the projects which are controlled by department number five with all the
projects where the employee is working. This minus clause is written in SQL as except. So when we
subtract from the first set of all the projects of department number five, and then we subtract the project
numbers where the employee is working. So if we are left with null or no tuples, then it means that the
employee is working in all the projects which are controlled by department number five. So we can simply
use the not exists to check if that is turning out with no members. In that case, we know that this
employee is working on all the projects controlled by department number five.

UNIQUE Function

• The UNIQUE function, applied on a query, returns true if no duplicate tuples are returned, else it returns
false.
• The UNIQUE function returns false only if the result contains two tuples t1 and t2 such that t1 = t2.
• Given the concept of NULL in the relational model, UNIQUE will not give dependable results if NULLs are
present in query results.

UNIQUE Function Examples

• Example: Retrieve the names of employees who have two or more dependents.

SELECT FNAME, LNAME


FROM EMPLOYEE
WHERE NOT UNIQUE (SELECT ESSN FROM DEPENDENT WHERE SSN=ESSN)
Join of Relations

In Relational Algebra, Join of two relations R and


S is written as R ⋈ <join condition> S.
The equivalent expression in SQL is
R JOIN S ON <join condition>

The JOIN and ON are keywords in SQL.


By default SQL treats JOIN as inner join.

OUTER keyword is used to make it outer join.

Variants of JOINs

SQL supports all variants of JOIN


• NATURAL JOIN: apply an equijoin over attributes with the same names in the two relations. This can be
controlled with alias/rename (AS).
• OUTER JOINs (LEFT, RIGHT, FULL)

SQL JOIN Example

Example: Retrieve the name and address of all employees who work for the 'Research'
department.
SELECT FNAME, LNAME, ADDRESS
FROM (EMPLOYEE JOIN DEPARTMENT ON Dno = Dnumber)
WHERE Dname = 'Research'

The same query can be rewritten using NATURAL JOIN.


SELECT FNAME, LNAME, ADDRESS
FROM (EMPLOYEE NATURAL JOIN
(DEPARTMENT AS DEPT (Dname, Dno, Mssn, Msdate)))
WHERE Dname = ‘Research'

There is one common attribute Dno between EMPLOYEE and DEPT (alias of
DEPARTMENT)

Example for OUTER JOIN
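(The original example slide is not reproduced in this copy; a hedged sketch of a LEFT OUTER JOIN on the COMPANY schema.)

SELECT E.Lname AS Employee_name, S.Lname AS Supervisor_name
FROM (EMPLOYEE AS E LEFT OUTER JOIN EMPLOYEE AS S
ON E.Super_ssn = S.Ssn);

• Every employee appears in the result; for employees with no supervisor (Super_ssn is NULL), Supervisor_name is returned as NULL.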


Aggregates in SQL

• The SQL supports aggregate functions to summarise information from multiple tuples.
• The aggregate functions include:
• COUNT
• SUM
• MAX
• MIN
• AVG

GROUP BY and HAVING Clauses

• Often aggregates are needed over groups of tuples rather than over entire query results.
• GROUP BY clause allows us to specify attributes whose values are used for grouping tuples to compute
aggregates.
• SQL provides HAVING clause to retrieve results for only certain groups based on values of aggregates.

SQL Aggregate Example

• Example: Retrieve the sum of the salaries of all employees of the 'Research' department, also values of
the maximum, minimum, and the average of salaries paid in this department.

SELECT SUM (Salary), MAX (Salary), MIN (Salary), AVG (Salary)


FROM (EMPLOYEE JOIN DEPARTMENT ON Dno = Dnumber)
WHERE Dname = ‘Research’

• The following simpler query retrieves the same aggregates for all employees of the company.

SELECT SUM (Salary), MAX (Salary), MIN (Salary), AVG (Salary)


FROM EMPLOYEE

SQL GROUP BY Example

• Example: Retrieve the number of employees, and their average salary for each department.

SELECT Dno, COUNT (*), AVG (Salary)


FROM EMPLOYEE
GROUP BY Dno;

• We have to be watchful if there are employees with NULL values in the Dno column.

SQL HAVING Example

• Example: Retrieve project number, project name, and the number of employees working on projects
having more than 2 employees.

SELECT Pnumber, Pname, COUNT (*)


FROM PROJECT, WORKS_ON
WHERE Pnumber= Pno
GROUP BY Pnumber, Pname
HAVING COUNT (*) > 2;

• In a grouped query, the SELECT clause can list only the columns included in the GROUP BY clause together
with aggregate functions, and the HAVING clause applies conditions on aggregates computed per group.
Views in SQL

• A view is a virtual table that is based on other tables. Typically no data is stored for a view.
• A view can be used in place of a table in a SQL query.
• Results from an SQL query can be considered as a relation. A view is a named entity representing a
query.

Views in SQL

• Example 1: Create a view on time spent by employees (by name) on various projects (by name).

CREATE VIEW WORKS_ON_INFO


AS SELECT Fname, Lname, Pname, Hours
FROM EMPLOYEE, PROJECT, WORKS_ON
WHERE Ssn = Essn AND Pno = Pnumber;

• Example 2: Create a view on count of employees and total salary for each department (by name).

CREATE VIEW DEPT_INFO(Dept_name,


No_of_emps, Total_sal)
AS SELECT Dname, COUNT (*), SUM (Salary)
FROM DEPARTMENT, EMPLOYEE
WHERE Dnumber = Dno
GROUP BY Dname;

Queries Using Views

• Example: Retrieve the last name and first name of employees working on the 'ProductX' project.

SELECT Fname, Lname


FROM WORKS_ON_INFO
WHERE Pname = 'ProductX';

• Example: Retrieve the no of employees and total salary for departments (by name) with more than 3
employees.

SELECT Dept_name, No_of_emps, Total_sal


FROM DEPT_INFO
WHERE No_of_emps > 3;

Materialisation of Views

• Some RDBMS products support materialisation of views i.e. a temporary physical table is created for the
view.
• View materialisation improves performance of queries based on views.
• However, as the base tables get updated, temporary table contents become obsolete.

Update Using Views

• Update using views can become ambiguous.


• Update becomes meaningless on views containing aggregates.
• Update is possible on views based on single table.
• With primary key as part of view.
• No DISTINCT, GROUP BY clauses.
• RDBMS products have varying implementations; a sketch of an update through a simple single-table view is shown below.
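• A minimal sketch, assuming a single-table view that includes the primary key (the view name and data values below are illustrative):

CREATE VIEW DEPT5_EMP
AS SELECT Ssn, Fname, Lname, Salary
FROM EMPLOYEE
WHERE Dno = 5;

UPDATE DEPT5_EMP
SET Salary = Salary * 1.05
WHERE Ssn = '123456789';

• Because the view maps one-to-one onto EMPLOYEE rows, the RDBMS can translate the update into an update of the underlying EMPLOYEE tuple.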
Week-6

Informal Design Guidelines, Update Anomalies, Functional Dependencies, and Inference Rules

Why Good Relational Schema Design?

• To have better clarity in understanding the database.
• To formulate good/efficient queries.

Informal Design Guidelines for Relational Schema

• Making sure that the semantics is clear.
• Reducing redundant information in tuples.
• Reducing the null values in tuples.
• Avoiding the possibility of generating spurious tuples.

Example for Violations of Guidelines

• Issues with the above schema design:


• putting attributes of two real world entity types into one single relation - semantics not clear,
• redundant information, and
• too many null values.

Video transcript:
Let us have an example of violation of these rules. We have employee department table, where I have
employee ID, employee name, age, salary, department number, department name, location, manager
department, and start date. The attribute manager department indicates that the given employee is
manager for which department. Start date attribute indicates what is the start date for that employee as
manager of that department? If we carefully observe this relation, it is obvious that we have
clubbed attributes pertaining to two different entity types, that is department and employee, in one single
relation. That's why the meaning of the attributes is not obvious or not clear to us. Similarly, there is good
amount of redundancy available in this table, because if there are multiple employees working with the
same department, we are redundantly storing the information like department name and the department
location for the department at multiple tuples that leads to complications, and at the same time, it also has
too many null values. For example, only few of the employees are managing the departments, and if there
is a column manager department, which is indicating the given employees managing which department
and then start date, for many of the tuples, they will have a null value for this because only a few of the
employees are managing the departments.

Update Anomalies

• Bad design of relation schemas might lead to update anomalies, classified as:
• Insertion anomalies
• Deletion anomalies
• Modi cation anomalies
Video Transcript:
Now, let's see what is insertion anomaly. If a new department has been introduced into the company and
which does not have any employee, then we cannot insert a tuple into the table, because since there is no
employee associated with that department, and employee ID is the primary key of the table, only
department information that is department number, department name, and location, cannot be inserted
into the database. This is called as insertion anomaly, that is, information pertaining to a
department, which does not have any employee working with it cannot be stored in the database. That is
insertion anomaly. Now, let us see what is deletion anomaly. Let us assume that there is only one
employee working with a given department, and if that employee is fired, that entire row is gone, along
with the employee information, the department information is also lost. This is called as deletion
anomaly. Modification anomaly means, since the department name and the department location of a given
department are located at multiple tuples, if there is a change, for example, if the location of the
department changes from Delhi to Chennai, then we need to figure out all the locations, all the
tuples, where the department name for the department is stored, and we need to modify that. If we miss at
least at one place, then it leads to inconsistency. This is called modification anomaly. All these anomalies
are undesirable properties, which are result of bad design.

What is Functional Dependency (FD)?

• Functional dependency (FD) is a constraint between two sets of attributes from the database.
• Functional dependency X -> Y means:
• X functionally determines Y in a relation schema R, if and only if, whenever two tuples of r(R) agree on
their X values they must necessarily agree on their Y values, but Y -> X need not be valid.
• Note: FDs cannot be inferred from the data; they are specified by the designer.
• FDs represent business rules.

Depicting the FDs

• Department
Dnumber -> {Dname, Mgr_ssn, Mgr_start_date}

• Works_on
{Essn, Pno} -> Hours

Video Transcript:

Here, the department table is there, the schema is given. D number is the primary key. D name, manager
SSN, and manager start date are the attributes. Suppose the business rule says that D number can
determine D name, manager SSN, and manager start date, that is depicted like this. This is the X
part, and then this is Y component, meaning that given D number, I can always determine what is the
department name, or I can always tell what is the manager SSN. Otherwise, I can always tell what is
managers start date. Sometimes the X component can contain more than one attribute. That's ne. If you
take this example, works on table. Here, it indicates which employee with ESSN that is Social Security
number, working on which project, and then for how many hours. The business rule says that one
employee can work with more than one project and one project can have more than one employee
working on it. For each of the combinations involving ESSN and then P number, how many hours he
has applied on that particular project is given by the column hours. In this case, the combination of the X
component contains two attributes, ESSN and PNO. That is, if you give me ESSN and then PNO then we'll
be able to tell what is number of hours that employee has worked on that particular project. That is the
meaning.

Inference Rules for FDs: How is an IR Valid?

Closure of F

• Let F be the set of FDs for a relation R.


• We can find the closure of F, represented as F+, by repeated application of rules IR1 to IR3.
• Hence, F+ includes all FDs given in F and other FDs inferred from F by repeated application of rules
IR1 to IR3.
• These rules IR1 to IR3 are known as Armstrong's inference rules.

Video Transcript:
Now let us discuss what is closure of F. If F is a set of functional dependencies for a given relation R, we
can find the closure of F by repeated application of rules IR_1-IR_3. Hence, F closure includes all functional
dependencies that are already given in F and other functional dependencies inferred from F by repeated
application of IR_1-IR_3. These inference rules, IR_1-IR_3, which we can use to compute the closure of F,
are known as Armstrong's inference rules. In fact, any other inference rule, such as IR_4, IR_5, and IR_6, can be
proved using IR_1-IR_3. IR_1-IR_3 are known as Armstrong's inference rules.
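• A small worked illustration (not from the original slides): for R(A, B, C) with F = {A -> B, B -> C}, IR3 (transitivity) gives A -> C, and IR1 (reflexivity) gives trivial FDs such as AB -> B; all of these belong to F+.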

Introduction to Normalization, Normal Form(NF), Decomposition and Conditions for 1NF, 2NF, 3NF
and BCNF

Normalisation and Normal Forms (NFs)

• Normalisation process was first proposed by Raymond Boyce and Edgar Codd in 1972.
• Normalisation is the process of analysing the goodness of relation schemas based on their FDs and
PKs/Keys, and, if necessary, applying decomposition to minimise the redundancy.
• For a relation schema to be in a given NF, it must satisfy certain conditions w.r.t. Keys and FDs.
• We have 1NF, 2NF, 3NF, BCNF, 4NF and 5NF.

First Normal Form (1NF)

• It states that the domain of any attribute must include only atomic (single/simple/individual) values.
• The relation shown on the slide violates 1NF, as Dloc contains multiple location names.
Second Normal Form (2NF)

• It is based on full functional dependency.


• {X -> A} is a full functional dependency if removing any attribute from X causes the FD to no longer hold.
• Conditions for 2NF:
• all non-key attributes are fully functionally dependent on key or
• no non-key attribute should be dependent on part of key (partial dependency).

Relation Schema not in 2NF

• Not in 2NF.
• Reason: Key is (eid, pnum) together.
• Eid is part of the key and we have a non-key attribute ename
determined by eid.

Decomposing R Into R1, R2 and R3

Video Transcript:
The decomposition is the process
of splitting a bigger relation into
smaller ones with an intention to take
it to the next higher level normal form
so that we can eliminate certain
amount of redundancy. Now, the
previous relation was violating 2NF
condition. Hence, we can say that the
highest normal form satisfied by the
previous relation was 1NF
only. Now, it is time to bring R to 2NF.
What should we do? To bring 1NF
relation to 2NF we need to identify the partial dependencies. That is the functional dependencies of the
form X determines A which violate 2NF. Whatever functional dependencies violate 2NF, you take those
attributes and then put them into one new relation. Now, one single relation R is now split into R1, R2 and
R3. R1 contains eid and the ename because eid determines ename was violating 2NF. Similarly, in R2, we
put pnum and then plocation where pnum determines plocation was partial dependency and remaining
attributes eid, pnum, and Hours, we put in another relation that is R3. Now, if you carefully observe, eid is
the primary key of this R1 and pnum is the primary key of this R2, and then eid and pnum together is
primary key of this R3. If you carefully observe all these relations that is decomposed relations of R, R1,
R2 and R3, they satisfy 2NF condition. That is, there is no partial dependency of any non key attribute in
any of the relations, which is partially dependent on the part of the key of these decomposed
relations. Hence now, R, which was not in 2NF, by applying decomposition, we have brought it to 2NF, by
decomposing R into R1, R2, and R3.

Third Normal Form (3NF)

• It is based on transitive dependency.


• According to this, a relation should not have a non-key attribute functionally determined by another non-
key attribute.
• Hence, there should not be any transitive dependency.

Condition/Rule for 3NF

• For each FD X -> A in the database, either X must be a key or A must be a key attribute.
• The above relation schema is not in 3NF because eid -> dnum and
dnum -> dname.
• Here, eid transitively determines dname.
Boyce Codd Normal Form (BCNF)

• BCNF can be seen as a stricter version of 3NF.


• Condition:
• For each FD, X -> A,
• X must be a key.

Ex. 3NF to BCNF

Video Transcript:
How to bring a 3NF relation to BCNF? If R is a relation
not in BCNF and the FD, alpha determines beta is
violating the BCNF condition, then we can decompose
R into R1 and then R2 like this. R1 contains the attributes
that is, here, alpha and beta; they are a single attribute or a
set of attributes. R1 will contain the attributes alpha union
beta, and R minus (beta minus alpha) will
get into the second relation. Here R is all the attributes of
R. Beta is the attributes in the beta component. Alpha is the
attributes in alpha component. So this is how we can
decompose a relation R which is not in BCNF to BCNF. Let us take an example. We have r. We have
student course and instructor. Student, course together is the key of this relation which can determine
instructor and instructor alone can determine course. If you take student course determines instructor, it is
in BCNF. But if you look at the other functional dependency, instructor determines course where X
component instructor is not a key, hence it is not in BCNF normal form. Instructor determines course is
violating BCNF condition. So here alpha is instructor and then beta component is course. To decompose
this R to take it to BCNF, R1 contains alpha union beta, that is instructor and course which are alpha
and beta will get into one relation and R minus beta minus alpha. Beta minus alpha means course minus
instructor, that is course. R is student, course, instructor. Student, course, instructor minus course is you
will get student and instructor, that will be there in R2, that is, the second relation. For the first
relation, instructor is the key because we have a valid functional dependency. Instructor determines
course that is as it is, and for the second relation, since any one of them can determine the other. Hence,
we will make both the attributes of R2 as the key. So now R after decomposition has been split into two
relations, R1 and R2, and now both of them are in BCNF.

Normal Forms

Example 1
Example 2

• Find the highest NF satisfied by the below relation R, and if not in BCNF, bring it to BCNF.
• R (A,B,C,D) {AB -> C; AB -> D; C -> D}.

• Solution:
• Key: {AB} combined.
• But the FD {C -> D} induces a transitive dependency, hence the highest NF satisfied is 2NF.

• Decomposition to 3NF:
• R1 (A,B,C) Key {AB}; in 3NF and also in BCNF.
• R2 (C,D) Key {C}; in 3NF and also in BCNF.

Example 3

• Find the highest NF satisfied by the below relation R, and if not in BCNF, bring it to BCNF.
• R (A,B,C,D) {AB -> D; AB -> C; C -> B, B -> D}.

• Solution:
• Key is {AB} combined/composite.
• But the FD {B -> D} induces a partial dependency, hence the highest NF satisfied is 1NF.

• Decomposition to 2NF:
• R1 (A,B,C) with key {AB} and is in 3NF.
• R2 (B,D) with key {B} and is in BCNF.

• Decompose R1 to R11 and R12 to bring to BCNF:


• R11 (C,B) with key {C}.
• R12 (A,C) with key {AC}.

Desirable Properties of a Decomposition, Lossless and Dependency Preserving Decomposition

Desirable Properties of a Decomposition

• As we have seen, decomposition (of a bigger relation into smaller ones) is a major step in the process of
normalisation.
• During this activity of decomposition, we need to make sure that the decomposition is Lossless join and
Dependency preserving.

Lossless Join Decomposition


Projection of r on a Decomposed Ri

Video Transcript:
For every tuple, if you consider sid, sname and salary
values, and then generate an instance, it is known as the
projection of r on R_1. Hence, we have sid, sname and then
sal. This is called as projection of r on R_1. Similarly, projection
of r on R_2. Because the decomposition contains sid, sname,
department, and HOD, all the values pertaining to the tuples
under those columns will be there in projection. Similarly,
projection of r on R_3 is this, of course. For the sake of
convenience, this projection of r on R_2 is repeated in this slide
as well. Projection of r on R_3, has only HOD and the city, hence
it contains only this data. Now, what does this lossless join
decomposition rule says? If you take these projections, and then
naturally join them, what is a natural join? If two relations are
joined based on common attributes, it is called as natural join. Can we perform a natural join on projection
of r on R_1, and then projection of r on R_2? Yes, it is possible. What is common? Sname. These are
common. Now, this tuple will be joined with this tuple, and then this tuple is joined with second tuple like
that, and there is no problem. We will get the correct data. When we join, projection of r on R_1, and then
projection of r on R_2 naturally, there is no problem. But if you carefully observe if we join the projection of
r on R_2 and then projection of r on R_3, the common attribute is HOD. Now, this tuple will be joined with
this tuple, and then this tuple is also joined with this tuple because the values are going to be same. Now,
when the first tuple, 121, Kiran, production, Murthy is joined with the first tuple in the projection of r on R_3. You
get 121, Kiran, production, Murthy and Delhi. That is the location of his department. When it is joined with
another tuple, that is also having the HOD name Murthy, because the HOD name can repeat, that is not
unique. Then again, you will get what? 121, Kiran working with production department. His name is
Murthy, and then the department location is Hyd, which is incorrect. A department cannot be located in
two places. It is incorrect. All this is happening because of this wrong decomposition. You are getting
spurious tuples, incorrect tuples. This is going to be the ill effect of having a lossy decomposition. That's
why whenever we perform decomposition, we must make sure that the decomposition is lossless. The
decomposition we have seen is lossy and it will have spurious tuples, which is not desirable. Hence, when
we decompose a relation, it is desirable that it should be lossless and dependency preserving. This
ensures correct decomposition.

The Effect of the Mentioned Decomposition

• The decomposition we have seen will be lossy because:


• The natural join of relation states r2 and r3 of the decomposed relations R2 and R3 will generate incorrect
tuples and hence will not be the same as the original r.
Test for Lossless Join Property (Matrix Approach)

Video Transcript:
For example, SSN is available in
decomposition R1. Hence I mark this with a
one. If the attribute is part of the
decomposed relation, mark it with a. The
subscript 1 indicates that it is column
number. ENAME is part of the relation R1,
hence it is marked with a, the column is
2. PNUMBER, PNAME, PLOCATION,
HOURS are not part of R1, hence they are
marked with b, 1 is row and 3 is column
number. Here, under PNAME for row 1, 1 is
row number and then 4 is the column
number. So it is simple to see that for every
row, each row indicates or represents one
decomposed relation. If the given attribute is part of that decomposed relation, we mark it with
a. Otherwise we mark it with b. Similarly, for R2, that is decomposed relation 2, SSN is not there, hence it
is b, 2 is the row number, 1 is the column number. ENAME is also not there, it is also b, PNUMBER is
there, it is a, PNAME is there, it is marked with a, PLOCATION is there, it is marked with a, HOURS is not
there, it is marked with b. R3 has SSN, PNUMBER, and HOURS, that's why only those columns for R3 are
marked with a, remaining all are bs. This is the initial matrix which captures the information about the
decomposed relations. Ok, now we will consider the original functional dependencies.
The form x determines y. For example, let us take SSN determines ENAME. Here, x component is SSN, y
component is ENAME.
Can we find any one single row which has a's on both sides of this functional dependency, that is, for both x
and y components? Is there any row which is marked with a? If you carefully observe, in row 1, SSN and
ENAME, both are a's. If it is the case, then look for some other row which has a for the x component and b for the
y component. R2 is not satisfying that condition, but R3 satisfies. Here, this is the x component, it is a,
and then y component is b. Now in that case, what I can do, I can replace this b with a, that's what I have
done here. So for R3, the column ENAME, earlier, it was b, now it is changed to a. Ok, now after doing any
such modification, we should look for any row having all a's. No, not yet. Now I can consider some other
functional dependency of the form x determines y. I can consider PNUMBER determines PNAME. P
number determines p name where p number is the x component and p name is the y component. Is there
any row in this matrix, after this modification, having both x and y components marked with a? Yes, I have
R2, both are a's, and I have R3, where the x component is a and the y component is b. Now I can change this
b to a, that's what I have done. After doing this again, I will check whether any single row has all a's. No, I
need to continue further. Now I will consider one more functional dependency of the form x determines
y, that is, where both x and y components are marked a. I can consider p number determines p location. I have R2,
where PNUMBER and PLOCATION, both are a's, and then R3 has only PNUMBER a, and then PLOCATION
was originally b. Now I can change it to a. After this modification, if I carefully observe, all the values
for R3 are changed to a. Now I can stop and then declare that this decomposition is lossless. Even
after doing this repeatedly and exhaust all possibilities, if I can't see any single row which is marked with
all a's for all columns, then I can conclude that it is a lossy decomposition, or I can say that it is not a
lossless decomposition. This is how we can test whether a given relation decomposition is lossy or
not. This is matrix approach.

Video Transcript:
Now let us consider one more example, where I have
R with these attributes decomposed into R1
and R2, these are the original functional
dependencies. That is, for the same relation which
is given in the previous problem, we have another
possibility of decomposition, only R1 and R2. Okay,
now what is the matrix? It will have ssn ename,
etcetera as the columns, and then two rows, one
each for each of the decomposition. Since in R1 we
have only ENAME and PLOCATION, only ENAME
and then PLOCATION are marked with a. And in R2, we have SSN, PNUMBER, HOURS,
PNAME, PLOCATION, that is, excepting ENAME, everything is marked with a. ENAME is b here, marked
with b. Now what changes are possible if I consider the functional dependency SSN determines
ENAME. There is no row which has both x and y components a. PNUMBER and PLOCATION: PNUMBER
can determine PLOCATION. If I take that x, y functional dependency, PNUMBER is the x component, PLOCATION
is the y component. R2 has both a's, but there is no other row which is having both, sorry, at least x has
a. So that is why I cannot make any changes. So even after considering all possibilities, I do not see any
single row that can be marked with all a's. That is why this decomposition can be declared to be a lossy
decomposition. That means the matrix approach can be applied to any sort of decomposition.

Test for Lossless Join Property

• If the decomposition of R is a binary decomposition, we can apply the following test.


• (R1 intersection R2) -> R1 OR (R1 intersection R2) -> R2.
• Example:
• R(A B C D), F = {A -> ABCD; C -> D}, decomposed into:
• R1 (A B C).
• R2 (C D).

Video Transcript:
Now we will see another test which can be applied only to binary decompositions. Let us assume that if R
is decomposed into R1 and then R2, the original set of functional dependencies are F. Then you take the
intersection of R1 and R2. If that set of attributes, which is intersection of R1 and R2, is key of R1, or if it is
key of R2, then I can declare that it is a lossless decomposition. Let us take this example. R has ABCD as
the attributes, and where A is the primary key of the relation R, and A can determine all attributes. And I
also have one more functional dependency called S, C determines D. If the decomposition of R is done
into R1 and R2, where R1 contains ABC and R2 contains C and D, what is the intersection of R1 and
R2? ABC intersection CD, that is C. If we carefully observe, C is the key of R2. Thus this condition is satisfied: if you take the intersection of R1 and R2, that set of attributes must be the key of either R1 or R2. If at least one of them is satisfied, I can conclude that the given decomposition is lossless. In this case it is a lossless decomposition because the intersection of R1 and R2 is C, and C is the key of relation R2.
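The binary test can also be phrased in terms of attribute closures: compute the closure of (R1 intersection R2) under F and check whether it covers R1 or R2. The small Python sketch below illustrates this; the helper names are assumptions, not standard library functions.

def closure(attrs, fds):
    # Attribute closure of attrs under fds (a list of (lhs_set, rhs_set) pairs)
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def binary_lossless(r1, r2, fds):
    common = r1 & r2
    c_plus = closure(common, fds)
    return r1 <= c_plus or r2 <= c_plus   # common attributes are a key of R1 or of R2

# The slide's example: R(A B C D), F = {A -> ABCD; C -> D}, R1(A B C), R2(C D)
F = [({'A'}, {'A', 'B', 'C', 'D'}), ({'C'}, {'D'})]
print(binary_lossless({'A', 'B', 'C'}, {'C', 'D'}, F))  # True: C determines C D, i.e. all of R2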

Projection of F on Decomposed Relations

• Let R be a relation.
• F be the set of dependencies F on R.
• Let Ri be a relation in the decomposition of R.
• Then the projection of F on Ri, denoted by πRi(F), is the set of FDs of the form
• (X -> Y) in F+ such that the attributes of
• (X U Y) are contained in Ri.

Example to Compute Projections of F

• R (A,B,C,D)
• F={AB -> C; C->D}.
• Decomposition:
• R1 (A,B,C).
• R2 (C,D).
• F={AB -> C; C -> D}.
• F+ = {AB -> C; C -> D; AB -> D}.
• R (A,B,C,D) decomposed into:
• R1 (A,B,C) R2 (C,D)
• Projection of F on R1 = πR1(F) = {AB -> C}.
• Projection of F on R2 = πR2(F) = {C -> D}.
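As an illustration of how such projections can be computed mechanically, the brute-force sketch below enumerates the subsets X of Ri and keeps X -> (X+ intersect Ri); the function names are assumptions, and for relations with many attributes a smarter algorithm would be needed.

from itertools import combinations

def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def project_fds(ri, fds):
    ri, projected = set(ri), []
    for k in range(1, len(ri) + 1):
        for xs in combinations(sorted(ri), k):
            x = set(xs)
            rhs = (closure(x, fds) & ri) - x   # non-trivial part of X+ inside Ri
            if rhs:
                projected.append((x, rhs))
    return projected

# The example above: R(A B C D), F = {AB -> C; C -> D}, R1(A B C), R2(C D)
F = [({'A', 'B'}, {'C'}), ({'C'}, {'D'})]
print(project_fds({'A', 'B', 'C'}, F))  # contains AB -> C
print(project_fds({'C', 'D'}, F))       # contains C -> D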
Dependency Preserving Decomposition

• The decomposition of R into {R1, R2, …, Rm} is dependency preserving if the closure of the union of the
projections of F on {R1, R2, …, Rm} is equal to F+.

• If this holds, the decomposition is dependency preserving; otherwise it is NOT.

Test for Dependency Preservation

• Ex.1:
• R (A,B,C,D,E) F={AB -> CD; D ->E; A ->C}.
• Decomposition: R1 (A,B,D), R2(A,C,D,E), R3 (A,B,E).

• Ex.2:
• R (A,B,C,D) F={AB -> CD; C -> D}.
• Decomposition: R1 (A,B,C), R2 (A,B,D).

Example 3

• Consider a relation R(A,B,C,D,E,F) and set of FDs: F={AB -> CE; C -> D;
E -> F}.
• If R is decomposed into:
• R1 (A,B,C,F), R2 (A,B,D) and R3 (C,E).
• We need to check if it is lossless.
Example 4

• Consider a relation R(A,B,C,D,E,F) and set of FDs: F={AB -> CE; C -> D;
E -> F}
• If R is decomposed into:
• R1 (A,B,C,F), R2 (A,B,D) and R3(C,E)
• Now we check if it is dependency preserving.
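These checks can be mechanised as well. The sketch below tests dependency preservation with the usual restricted-closure idea: for each X -> Y in F, grow the closure of X while only ever looking at the attributes of one Ri at a time; Y must end up inside that set. It is only an illustrative sketch (the names are assumptions), shown here with Ex.2 from the earlier slide.

def closure(attrs, fds):
    result, changed = set(attrs), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def preserves_dependencies(decomposition, fds):
    def preserved(x, y):
        z, changed = set(x), True
        while changed:
            old = set(z)
            for ri in decomposition:
                z |= closure(z & ri, fds) & ri   # only attributes visible inside Ri propagate
            changed = (z != old)
        return y <= z
    return all(preserved(x, y) for x, y in fds)

# Ex.2: R(A B C D), F = {AB -> CD; C -> D}, decomposed into R1(A B C) and R2(A B D)
F = [({'A', 'B'}, {'C', 'D'}), ({'C'}, {'D'})]
print(preserves_dependencies([{'A', 'B', 'C'}, {'A', 'B', 'D'}], F))  # False: C -> D is lost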
Week-7
Disk Properties and File Storage Schemes for Storing Data

Criteria for Comparison

• Speed with which data can be accessed.


• Cost per unit of data.
• Reliability.
• Data loss on power failure or system crash.
• Physical failure of the storage device.

Volatile and Non-Volatile Storage

• We can classify storage into:


• Volatile storage: Loses the content when power is switched off.
• Non-volatile storage: Contents persist even when power is switched off.

Hierarchy of Storage

• Primary storage: fastest media but volatile.


• Ex. cache, main memory.
• Secondary storage: next level in hierarchy, non-volatile, moderately fast access time.
• Also called on-line storage.
• Ex. flash memory, magnetic disks.
• Tertiary storage: lowest level in hierarchy, non-volatile, slow access time.
• Also called off-line storage.
• Ex. magnetic tape, optical disk storage.

Primary Storage

• Cache Memory:
• Fastest access, too small and costliest form of storage; volatile.

• Main Memory:
• Fast access; Generally smaller compared to Disk space; too expensive to store the entire database;
volatile.

Secondary Storage

• Flash memory:
• Data survives power failure.
• Reads are roughly as fast as main memory but writes are slow.
• Widely used in embedded devices such as digital cameras.

• Magnetic-disk:
• Data is stored on spinning disk, and read/written magnetically.
• Primary medium for the long-term storage of data.
• Typically stores entire database.
• Survives power failures and system crashes.

Tertiary Storage

• Optical storage: Non-volatile, data is read optically from a spinning disk using a laser; CD-ROM (700 MB)
and DVD (4.5 to 15 GB) most popular forms;
• Reads and writes are slower than with magnetic disk.

• Tape storage: Non-volatile, used primarily for backup (to recover from disk failure), and for archival data;
sequential access - much slower than disk; very high capacity (2.5 to 8.5 TB and even more).
Why Magnetic Disks for Storing Data

• It is not possible to store large volumes of data on main memory.


• It must be permanent.
• Data access should be reasonably faster.
• Cost should be moderate.
• Magnetic Disks are the best suited media.
• Also referred to as Hard Disk Drive (HDD).

Magnetic Disks

• Disks are made of magnetic material shaped as thin circular disks.


• A disk can store data on single side or both sides.
• A collection of disks are arranged into a disk pack.
• Data on a disk surface is stored in concentric circles called tracks.
• In a disk pack, tracks with the same diameter on the various surfaces are called a Cylinder (imaginary
cylinder).
• The operating system divides each track into equal sized disk-blocks (pages), during disk formatting.
• Read-write head - Positioned very close to the platter surface (almost touching it); it reads or writes
magnetically encoded information.



• Different sector organisations on disk:
• Sectors subtending a fixed angle.
• Sectors maintaining a uniform recording density.

Disk Controller

• Disk controller interfaces between the computer system and the disk drive hardware.
• Accepts high-level commands to read or write a sector.
• Initiates actions such as moving the disk arm to the right track and actually reading or writing the data.
• Computes and attaches checksums to each sector to verify that data is read back correctly.

Disk Systems

• In a disk subsystem, multiple disks connected to a computer system through a controller.


• In Storage Area Networks (SAN), a large number of disks are connected by a high-speed network to a
number of servers.

Computing Disk Capacity

• Assume there are 128 tracks on each surface of a disk pack having 8 double-sided disks with
uniform surface configuration.
• Each track has 120 sectors and the sector size is 2KB.
• Now, we calculate (a) The total capacity of each cylinder in MB (b) Total capacity of the disk pack in GB.
• Capacity of each cylinder = (No. surfaces X Track capacity) = (8 X 2) X (120 X 2KB) = 3840 KB = 3.84 MB.
• Total capacity of the disk pack = No. cylinders X Cylinder capacity = (128 X 3.84 MB) = 491.52 MB =
0.49152 GB.
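The same arithmetic can be scripted; the small sketch below simply reproduces the numbers above (it follows the example in treating 1 MB as 1000 KB).

surfaces = 8 * 2                 # 8 double-sided disks
tracks_per_surface = 128         # = number of cylinders
sectors_per_track = 120
sector_kb = 2

track_kb = sectors_per_track * sector_kb            # 240 KB per track
cylinder_kb = surfaces * track_kb                    # 3840 KB = 3.84 MB per cylinder
pack_mb = tracks_per_surface * cylinder_kb / 1000    # 491.52 MB for the whole pack

print(cylinder_kb, pack_mb)   # 3840 491.52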

Disk Access Time


• Access time - The time it takes from when a read or write request is issued to when data transfer begins.
It consists of two components:
• Seek time - Time it takes to reposition the arm over the correct track. Typically 4 to 10 milliseconds on
typical disks.
• Rotational latency - Time it takes for the sector to be accessed to appear under the head. Typically 4
to 11 milliseconds on typical disks (5400 to 15000 r.p.m).

Disk Data Transfer

• Data-transfer rate - The rate at which data can be retrieved from or stored to the disk. Typically 25 to 100
MB per second.
• Mean time to failure (MTTF) - The average time the disk is expected to run continuously without any
failure, typically 3 to 5 years.

Optimisation of Disk-Block Access

• Block - A contiguous sequence of sectors from a single track.


• Data is transferred between disk and main memory in terms of blocks. Space available on main memory
is known as buffer space.
• Smaller blocks - More block transfers from disk.
• Larger blocks - More space wasted due to partially filled blocks.

Disk-Arm Scheduling

• Disk-arm scheduling algorithms order the pending accesses to tracks so that disk arm movement is
minimized.
• Ex. elevator algorithm.

RAID: Redundant Arrays of Independent Disks

• RAID is an advanced disk organisation technique that manages a large number of disks, providing a
view of a single disk of high capacity and high speed by using multiple disks in parallel.
• RAID Levels: Defined by two factors - data striping and redundancy.
• Standard: Level-0 to Level-6.

Files and Records

• A file is a sequence of records, where each record is a collection of data values (or data items).
• Records are stored on disk blocks.
• The blocking factor (bfr) for a file is the (average) number of file records stored in a disk block.
• A file can have fixed-length records or variable-length records.

Files and Records Organisation

• Record organisation - File records can be:


• Unspanned or spanned.
• Fixed-length or variable-length.
• File organisation:
• Unordered files (heap).
• Ordered files (sequential).
Video Transcript:
We have one classification of record organizations: unspanned record organization and spanned record organization. Unspanned means no single record will span across two different blocks. For example, if we have the block size as 512 bytes and the record size is 100 bytes, this is the first record, second record, third record, fourth, and fifth. By then we have used 500 bytes, and towards the end 12 bytes is left over. In unspanned record organization, we will not use this 12 bytes of space towards the end of the block, because it is not possible to fit the entire record, which is of length 100 bytes, in this space. Then we will start inserting the next record, that is, the sixth record of the file, in the next block. So towards the end, we are wasting this 12 bytes of space. This organization is known as unspanned record organization: no single record is spanning across multiple blocks. In spanned record organization, without wasting this 12 bytes of space, I will start inserting the sixth record from there. That is, out of the 100 bytes of the sixth record, 12 bytes are stored in this block and the remaining 88 bytes will be stored in the next block. That is, this particular sixth record of the file is now stored in two different blocks. That is known as spanning. In unspanned record organization, that won't happen. Another classification is fixed-length and variable-length record organization. In a fixed-length organization, we store all records of a file with uniform length; opposite to that is variable-length record organization. Similarly, another classification is ordered files and unordered files. That is, when we insert records into a data file, we maintain a certain order. Say we want to insert all the records of the employee table based on employee id: the record with the lesser employee ID value will come before the record with the higher employee ID value. There is an order, okay? And when this kind of ordered file is maintained, inserting takes time, because when you insert a tuple or a record, first I need to identify or figure out the place where this new record can be inserted. If there is no place available, then we need to make certain adjustments so that this particular new record will get into that particular slot. That means it requires some adjustments; it takes some time. But the advantage of ordered files is that once you insert a record, there is an order, and then searching a record becomes very fast: I can do binary search. Similarly, for unordered files there is no order. That is, whenever a new record comes, I don't look for a particular location; I take that record and then just append it to the file. That is, I will put it at the end. That is, insertion is faster. But when you are searching for a particular record, then you will have to do sequential search; that is expensive. That is the difference between ordered files and unordered files.

Files and Records Organisation

• File operations:
• Open
• Reset
• Find
• Read
• Find next
• Delete
• Modify
• Insert
• Close

Files of Mixed Records

• To store several relations in one file using a multitable clustering file organisation.


• For example, clustering department and instructor records.
• Good for queries involving department and instructor, and for queries involving one single department
and its instructors.
• Bad for queries involving only department.
• Results in variable size records.

Video Transcript:

We can have files of mixed records. Usually, a file contains or stores similar kinds of records. That is, when we create a table, say the employee table, automatically a file is allocated to store data pertaining to the table, and all records in that file belong to the employee table; that is, they have a similar structure. Sometimes it is also possible that I can cluster, that is, I can put different kinds of records in one single file. For example, I can cluster department records and instructor records. If we have two tables, departments in an institute and then instructors, that is, faculty available in the institute, each faculty
belongs to one department. We can club, that is, we can cluster, records pertaining to the department table and then the instructor table in one single file. The advantage is that it is good for queries involving department and instructor. That is, suppose you have a requirement: get the names of all instructors who are working with the CSC department. How do I store the records? First, the record pertaining to the CSC department will come, followed by the records of instructors who are working with that department. Hence getting the details of the instructors working with a particular department is very fast, and queries involving one single department and its instructors are also very fast. But they are not efficient for queries involving only department. If I want to get the head of the department, that is, the names of the HODs of all departments, then first I will get the HOD name in the record pertaining to the CSC department. Then, if there are 100 faculty working in the CSC department, there will be 100 records pertaining to instructors working in that department, so I need to skip all those. Then I will find the record pertaining to the next department. After that I need to skip some records pertaining to the instructors working in that department. Like that, if the query involves only department-related information, they are not efficient. It results in variable-size records. That is, if I store records pertaining to different types, that is, different tables or relations, in one single file, it will definitely result in varying record sizes, because the record size of an instructor will not be the same as the record size of a department record.

Data Dictionary

• The data dictionary (also called system catalog) stores metadata, that is, data about data, such as:
• Information about the structure of the relations.
• User and accounting information.
• Statistical data about relations.
• Physical file organisation information.
• Information about indexes.

Hashing for Database Systems and Static Hashing

What is Hashing for Databases?

• Hashing is used as a type of primary file organisation.


• The organisation is usually called a hash file.
• The search field is called the hash field of the file.
• If the hash field is also a key field of the file, it is called the hash key.

Hashing and Hash Function

• Hashing is used for finding the place to store a record and for
searching a record based on the key value.

Internal Hashing

• Used for internal files (on RAM).


• It is implemented as a hash table through use of an array of
records.
• The most common hash function used is:
• h(k) = K mod M.
• This gives the index of the location in the array.

Collision

• If two or more records are hashed to same location it is known as collision.


• When there is a collision, then we need to find some other location for the new record. This process is
known as collision resolution.

Collision Resolution
• Some well known collision resolution schemes:
• Open addressing.
• Chaining.
• Multiple hashing.
Video Transcript:
In open addressing, whenever collision happens, from the location where the collision has happened, we
will keep searching for some empty location in subsequent places. Whenever I find empty space, then we will insert that record. This is open addressing. In chaining, instead of looking for empty space in subsequent locations, we will allocate some overflow locations. Then the colliding record will be placed in that overflow location, and a pointer will be maintained from the original location to this overflow location, which we call a chain, and so on and so forth. Whenever there is a collision, we keep on maintaining overflow locations. In the third scheme, that is the multiple hashing scheme, when there is a collision occurring with one hash function, then we will apply a second hash function. That means, in this scheme, we have multiple hash functions used. If there is a collision with the first hash function, to find the
new location, we will apply the next hash function. These are the three important popular collision
resolution schemes.
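For illustration, here is a minimal Python sketch of internal hashing with h(k) = k mod M and chaining as the collision-resolution scheme; the table size M = 10 and the sample keys (which all collide in slot 1) are borrowed from the static-hashing example discussed below.

M = 10                      # number of slots
table = [[] for _ in range(M)]

def h(k):
    return k % M

def insert(key):
    table[h(key)].append(key)    # colliding keys are simply chained in the same slot

def search(key):
    return key in table[h(key)]

for k in (321, 761, 91, 981):    # all hash to slot 1, so they form one chain
    insert(k)

print(table[1])     # [321, 761, 91, 981]
print(search(981))  # True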

Good Hashing Scheme

• The goal of a good hashing function is to distribute the records uniformly over the available slots, so as
to minimize collisions while not leaving many unused locations.

External Hashing

• External hashing is used for disk files.


• External hashing is best suited for database systems.
• The target address space is divided into buckets, each of which
can hold multiple records.
• Sometime it could be a cluster of blocks.

How Does it Work?

• A hash function maps a key to relative bucket number, rather


than absolute block address.
• A mapping table, maintained in the header, will map the bucket
number to physical block address.
• Once the disk block is known, the actual search for the record
within the block is carried out in the main memory buffer.

Handling Collision

• Overflow chaining.
• Best suited for database systems.

Video Transcript:
This is how collision can be resolved in static hashing. But if the distribution of keys is skewed, then it is possible that a large number of collisions can happen, and it results in too much chaining. We should be careful about that. Let us look at this example. These are the main buckets. Each bucket can hold three records, and if it is overflowing, this pointer will point to the record in the overflow buckets; and these are the overflow buckets. If you carefully observe, the hash function used in this example is k mod ten. The record with 321 is mapped on to bucket 1. Since there is space, it is inserted. Similarly, the record with 761 is also mapped onto bucket one. Since we have space there, it is also inserted.
Similarly, 91 is also inserted. Now when it is time to insert a record with key value 981, it is also mapped on to bucket one. But since it can hold only three records, the main bucket is already full. So an overflow bucket is allocated, and this record is placed in that overflow bucket. And the record pointer which is coming from bucket one will point to this overflow record which pertains to bucket one. And please note that records overflowing from different main buckets can be placed in the same overflow buckets. That is one scheme. And it is also possible that you can have different overflow buckets for different main buckets. That is also possible.

Deficiencies of Static Hashing

• The above scheme is called static hashing because the number of buckets allocated initially is fixed.
• This is a big constraint for files that are dynamic.
• If we have less records, then the space is wasted.
• If number of records grow drastically, ends up with too much of chaining and performance degrades.

Dynamic Hashing Schemes

Need for Dynamic Hashing

• In static hashing, the hash function is fixed.


• The number of main buckets is fixed.
• For dynamic files, the number of records grows or shrinks with time.
• Small number of buckets - Performance will degrade.
• Large number of buckets - Space will be wasted.
• To overcome this, we need dynamic hashing schemes for storing dynamic files.

Dynamic Hashing

• A dynamic scheme allows us to expand or shrink the hash address space dynamically.
• We study the following schemes:
• Extendible hashing.
• Linear hashing.

Extendible Hashing

• This scheme stores a directory structure in addition to the file.


• This access structure is based on the result of applying the hash function to the search field.
• Each result of applying the hash function is a non-negative integer and hence can be represented with a
binary pattern.
• This is called as hash value of the record.
• Records are distributed among the buckets based on the values of the leading bits in their hash value.
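A small sketch of the 'leading bits' idea: write the hash value in binary and let the first global-depth bits select the directory entry. The 8-bit hash width and the sample keys are assumptions used purely for illustration.

HASH_BITS = 8

def hash_value(key):
    return key % (1 << HASH_BITS)       # any hash function giving a non-negative integer

def directory_index(key, global_depth):
    bits = format(hash_value(key), '0{}b'.format(HASH_BITS))
    return bits[:global_depth]          # the leading global_depth bits pick the bucket

for k in (32, 28, 43):
    print(k, format(hash_value(k), '08b'), directory_index(k, 3))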

Advantages and Overhead

• The performance does not degrade because of chaining.


• Because collisions are minimal. Though we don't completely avoid collisions, the amount of chaining is very minimal.
• No additional space is wasted.
• Additional buckets can be allocated dynamically as needed.
• The only overhead in this scheme is that a directory structure needs to be searched before the buckets
are accessed.
Schemata

• Global depth refers to the number of bits of the hash value we take,


i.e., here we take three bits as the hash value, so G.D. = 3.

Example

• Assume that we need to load some records of the relation


EMP into expandable hash files based on extendible hashing
technique.
• Records are inserted into the file with the following key field
values:
• 32, 28, 43, 15, 66, 27, 86, 54, 35.
• We start with G.D=2 and local depth= 1

• Bfr = 2. We start with GD = 2 and LD = 1. We use (K MOD 10) as the hash function.
• Keys 32, 28, 43, 15, 66, 27, 86, 54, 35.

• Bfr = 2. We start with GD = 2 and LD = 1. We use (K MOD 10) as the hash function.
• Now, if we insert 86, it goes to B10 which is full. Since B10 has LD = 2 (equal to GD), we cannot split it without increasing GD.
• First we must increase GD to 3. Now split B10 into B010 and B110; 86 goes into B110.

Linear Hashing

• It is a dynamic hashing scheme.


• In linear hashing, no directory structure is used.
• Instead of one hash function, multiple hash functions are used.

Collision Resolution in Linear Hashing

• When a collision occurs with one hash function, the bucket that overflows is split into two, and the records
in the original bucket are distributed among the two buckets using the next hash function.
• Hence, we have multiple hash functions.
• Buckets are split in a linear order when the split criterion is satisfied.

Example

• Assume that we use the linear hashing technique in some situation and we use the hash functions h0, h1,
h2, ... as (K mod 2), (K mod 4), (K mod 8) and so on.
• Assume that a bucket (one block) can accommodate 2 records.
• Keys to be inserted are:
• 14, 21, 7, 24, 6, 22, 5, 19.
• Note that a split occurs when the file load factor (f) exceeds 0.7.
• While computing the file load factor (f) we do not consider overflows.
Problem-1

• Assume a Diskpack with uniform surface configuration with the following specifications.
• There are 9 double sided disks in the Diskpack. There are 256 cylinders (assume there are 256 tracks
too then) in the diskpack.
• Each track has 100 blocks. The block size is 1KB.
• Compute the capacity of each cylinder in MB.
• Compute the capacity of each surface in MB.
• Compute the total capacity of the Diskpack in MB.

Solution

• Track capacity= number of blocks on each track X block size= 100 X 1 KB = 100 KB.
• Number of tracks per surface = number of cylinders.
• Capacity of each surface= number of tracks per surface X track capacity = 256 X 100 KB = 25600KB =
25.6 MB.
• Capacity of each cylinder= number of surfaces X track capacity= 18 X 100 KB = 1800 KB = 1.8 MB.
• Diskpack capacity= Surface capacity X no. surfaces= 25.6 MB X 18 = 460.8 MB.
• OR Cylinder capacity X no. cylinders= 1.8 MB X 256 = 460.8 MB.

Problem-2

• Assume that we need to load some records of the


relation EMP into expandable hash files based on
extendible hashing technique.
• Records are inserted into the file with the following key
field values: 16, 31, 3, 19, 35.
Solution

• Bfr = 2. We start with GD = 2 and LD = 1. We use (K MOD 10) as the hash function.

• Now, if we insert 35, it goes to B01 which is full. Since B01 has LD = 2 (equal to GD), we cannot split it without increasing GD. First we
must increase GD to 3 and create B001 and B101.

Problem-2 Linear Hashing

• Each bucket can take 2 records.


• File load factor (f) = no. of records / (no. of blocks X bfr);
bfr = 2.
• For computing f, consider main buckets only;
initially n = 0.
• Split criterion: when f exceeds 0.75.
Week-8

Introduction to Indexing and Basic Concepts

Why Indexing or Hashing?

• We store a large number of data records in databases.


• When we search for specific record(s) like:
• Get details of all employees where age is 50.
• Get employee details for employees with eid 329, etc.
• We need some techniques to retrieve required records in a faster way.
• For this, we have Hashing and Indexing.

Indexing Structures

• Index structures (Access Structures) are used to speed up the search and retrieval of records in
response to queries on certain search conditions.
• In real world databases, indexes may be too large to be handled efficiently.
• Hence some sophisticated techniques are to be used.

Criteria for evaluation

• The criteria for evaluating the efficacy of Hashing or Indexing techniques are:
• Access time.
• Insertion time.
• Deletion time.
• Space overhead.

How is Indexing Different from Hashing

• If we use a Hashing scheme which is well designed, the number of block accesses required to retrieve a
record is constant (1).
• In Indexing, the number of block accesses required depends on the Indexing scheme used and will be
different under different scenarios.

Data and Index Records

• Data record: Records of a relation stored in a single file consisting of blocks.


• Index records: Like data records, index records are also stored in database files. Any index record
normally has two fields:
• Value: the key value.
• Pointer: the location address of the record containing the key.
• The attribute/field used for constructing the index structure for a file is called an 'indexing field'.

Classi cation

• The search field is called the search key or index key.


• Indexes on key(unique) attributes:
• Built on ordering key - Primary index.(ex- eid,ssn)
• Non-ordering Key - Secondary index.
• Index on non-key(non-unique) attributes:
• Ordering non-key - Clustering Index.(ex- department number,
project number)
• Non-ordering non-key attribute- Secondary index.
Primary and Cluster Index

• Primary indexes are built on the ordering key field.


• Cluster indexes are built on ordering non-key attributes.
• Sometimes more than one index may be required for a file that stores the data of a table.
• A file can have at most one primary index or one clustering index, but not both.

Dense and Sparse Index

• Dense Index: We have an index record for every data record.


• Sparse Index: Index records are created only for some data file records. This occupies less space.
• Primary index and clustering index are non-dense.

SQL Command to Create an Index

• Usually, when we define the PK, an index is created automatically.

CREATE INDEX EMP_IND ON EMP(eid);

DROP INDEX EMP_IND;

Primary and Multilevel Indexing

Primary Indexing

• If records are ordered based on some key (usually PK) field, the index built on that key field is known as
Primary index.
• Ex: EMP (eid, ename, sal, adharid).
• Assume records ordered on eid.
• The index built on eid is Primary index.
• Each index record points to a data block.
• The first record of each block is known as the anchor record.
• There exists one record in the index file for the key value in each anchor record.
• That is, one index record for one data block. Hence sparse.

Schemata for Primary Indexing

Video Transcript:

Let us try to understand the concepts of primary


indexing using this diagram. Let us assume that we
have data blocks here which store records. That is
record one, record two, record three, record four. That
is, each data block in the data file can store up to four records. The number of records that can be accommodated in a block is known as the blocking factor, and we represent it as BFR. Here in this example, the blocking factor is 4. For the records, let us assume that the key is employee id. The first record's employee id is two, the second record's is five, like this, and it is in ascending order. That means that in the data file the records of the employee relation are ordered based on employee id, which is the primary key; that means it is the unique key and the ordering field. Now we are going to build a primary index on this. The first record of every block is known as the anchor record. So this becomes the anchor record of the first block. This becomes the anchor record of
second block, so on and so forth. Now, in the index level where we have only one level, every index record
will point to the block.
That means if you look at the index record, it contains two fields, the key field and the pointer field. If the key value is two, where can I find it? This pointer, that is, the block address, will tell me that: go to this block and you can find a record with the given key value.

That means once we find the block address, we can access the block on the disk and transfer the entire block onto the RAM and then do a sequential search within the block for the given key value. That is fine. The second record of the index will point to the second block. That means any record with key value two to 14 can be found in the block which is pointed to by this pointer. Any record with key value from 15 to 24 can be found via the pointer given in the second record of the index structure. Any record between 25 and 29 can be found by accessing the block given in record three of the indexing structure, so on and so forth. That means there exists one index record for the first record available in each data block, that is, the anchor record. Hence, if we carefully observe, the number of index records is much smaller than the number of data records. Hence, primary indexes are sparse in nature; that is, every index record points to only one data block. How many index records are required? It is the same as the number of data blocks, because every index record is going to point to one particular data block, and the records in the file are ordered based on the key value. This is how a primary indexing structure looks.

Multilevel Indexing

• In primary indexing, only one-level of indexing is present.


• Where as in Multilevel indexing, there can be multiple levels of indexing.
• Each intermediate and root level index record points to one index block in the next level.
• The number of block accesses required to retrieve a record, counting the root level, is (number of index levels + 1).

Schemata for Multilevel Indexing

Video Transcript:

Now let us look at the schemata of multilevel


indexing. For the data level and then the first level, it goes similar to primary indexing. You can see here the ordering field is this, and these are the values. So the first record in the indexing structure points to the first block, that is, the anchor record of the data block. The second record points to the second block, so on and so forth. Now, if we stop with only one index level, then it is the same as a primary index, but we do not stop there. To make it more effective and efficient, we build multiple levels. Now, when I am building the indexing structure at the second index level, I consider the index blocks or index records available in the first level of indexing to be data records, and we build an indexing structure at the second level to point to the index blocks contained in the first level. Now you can see here, this is the first level of indexing. When I am building the second level of indexing, this record points to the first block in the first-level indexing, the second record in the second level points to the second block in the first level, so on and so forth. Depending on the need, we can stop at one level, two levels, three levels, or if we continue further, it is possible that we will end up with only one block at the top level, that is, the root level. If that is the case, when we want to access a record in the data file with a given key value, we need to start our search from the root. Usually the database management systems will store this root block in the RAM. So suppose I am searching for a record with key value 30, I go to the root block in the index. I figure out the record with key value 30; there I find a block address which is pointing to the block in the next level. So I come to this block. Within that block I find a record with key value 30. There I find a pointer pointing to the data block. Then it is pointing to this, I come to this, I
transfer the entire data block onto the ram and then do a sequential search to get a record with given key
value. So if there are multiple levels in the indexing structure, let us assume that it is also having root
level. Then the number of block accesses required to retrieve a record with a given key value is equal to
number of levels in the indexing structure plus one. This plus one, as we already mentioned it is for
accessing the data block. So in this example there are two levels. So 2 plus 1 will be the maximum
number of disk blocks to be accessed to retrieve a record. This is multilevel indexing. For example, if I am
searching for a record with key value 38 again, I will start my search from the root block and 38 is taken
care by this record because this record can point to data blocks which can contain the records with key
values between 30 and 59. So I get the block address, I go to the next block in the next indexing level,
that is, the lower level. And here I find a record in the first-level index block with key value 38. This gives me a pointer pointing to the data block which can contain a record with key value 38. This is a successful search. In some cases it can be a failure search as well. For example, if I am searching for a record with key value 26, then I start from here. 26 is possibly taken care of by the first record in the top level. I go to this block, and 26 is possibly taken care of by the last record in the first block of the first level. Then I follow this pointer, I go to this block, I transfer the entire block onto the RAM and then do a sequential search to retrieve the given record. But even after doing the sequential search on the block, I find that I don't have a record with key value 26; it is a failure search. That's fine. So whether it is a failure or success, we
need to reach the data level, and then transfer the entire block onto the ram, and then do a search in a
sequential order to check whether the given record is available or not. This is how the multilevel indexing
works.

Example-1 (Primary Indexing)

Primary index built on an ordered file on disk with 80,000 records stored. Block size is 512 Bytes. Record
length is fixed and it is 70 Bytes. Key field (PK) length is 6 Bytes and block pointer is 4 Bytes. The file uses
unspanned record organisation.

Design Aspects and Performance

Size of disk block=512 Bytes.


Record length = 70 Bytes.
Block pointer = 4 Bytes.
Key field = 6 bytes; total records = 80000.
No. records per block (bfr) = floor(512/70) = floor(7.31) = 7.
No. of data blocks needed = ceil (80000/7) = 11429.

Example-2 (Multilevel Indexing)

A multilevel index built on an ordered file with


80000 records stored on disk. Block size is 512
Bytes. Record length is fixed and it is 70 Bytes.
Key field (PK) length is 6 Bytes and block pointer
is 4 Bytes. The file uses unspanned record
organisation.

Design Aspects and Performance

Size of the disk block = 512 Bytes; record length = 70 Bytes.


Block pointer = 4 Bytes. Key field = 6 bytes; total records = 80000.
No. records per block (bfr) = floor(512/70) = floor(7.31) = 7.
No. of data blocks needed = ceil (80000/7) = 11429.
Index record length = key + pointer = 6+4 = 10 Bytes.
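The slide stops at the index record length; the sketch below continues the arithmetic in the usual way (each index level is built over the blocks of the level below it) to estimate the number of index levels and block accesses. The continuation is an illustration derived from the figures above, not text from the original notes.

import math

block_size, index_record = 512, 10
index_bfr = block_size // index_record       # 51 index records per block

blocks, levels = 11429, 0                    # the first level indexes the data blocks
while blocks > 1:
    blocks = math.ceil(blocks / index_bfr)   # each level indexes the blocks of the level below
    levels += 1

print(index_bfr, levels, levels + 1)         # 51 index records/block, 3 levels, 4 block accesses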
Clustering and Secondary Indexes

What is Clustering Index

• Indexes built on ordering non-key attributes are called as clustering indexes.


• Let us assume that we have a relation:
• EMP (eid, ename, sal, age, address, dno), with the records ordered based on the dno value.
• If we build an index on dno (ordering non-key), it is known as clustering index.

Implementation

• Two popular approaches to implement clustering index are-


• Approach-1: A block can contain records with different key values.
• Approach-2: Separate block clusters for each group of records with same key value.
• Note: Both are non-dense indexes.

Approach-1

Approach-2
ff
What is Secondary Index?

• Indexes built on non-ordering attributes are called as Secondary indexes.


• A secondary index can be built on:
• a key (unique) attribute or
• a non-key attribute.
• EMP (eid, ename, adharID (unique), dno)
• Records ordered on eid.
• For secondary index, we can use adharID or Dno for indexing.

Secondary Index on Non-Key Field

Video Transcript:
Let us look at this schemata for the same employee table. This is employee ID, which is the primary key and the ordering attribute. The records are ordered in the file based on the values of this employee ID. Now, we want to create an index on dno, which is non-unique and non-ordering. Then in the indexing level, we have an index file
which contains index records for distinct values of
department number. In the given example, we
know that the distinct department numbers are
10, 20, 30, and 40. But we know that there can
be multiple records in the employee relation or the
table, for this given department number. For the same department number, there can be multiple
records. How can we solve this problem? Same record cannot directly point to multiple blocks which
contain the records with department number 10. For example, if you take, we want to index all the records
with the department number 10 value. In the indexing level, this is the record pertaining to department
number 10, and this one address cannot point to multiple blocks. It is not possible. Since it is not an
ordering attribute, the records pertaining to department number 10 can be spread across different
blocks. To solve this, instead of directly pointing to one single block, the index record with key value 10 is
pointing to an intermediate bucket. Which can contain pointers to individual blocks, which can contain the
records with the department number value 10. In this example, this particular record with key value 10 is
instead of directly pointing to the data block, it is pointing to intermediate bucket, where it has two
pointers. The first pointer is pointing to the first block, where we have one record with the department number
10, and second pointer in the intermediate bucket is pointing to third block, which is also having a
record with the department number 10. Similarly, the second record in the index for the key value 20, that
is department number 20, it is pointing to an intermediate bucket where the first pointer in the intermediate
bucket is pointing to block number 1, and the second pointer is pointing to block number 2. Block one
and two contain records pertaining to department number 20. So on and so forth, for department number
30, also, this intermediate bucket is pointing to different blocks which may contain the records with
department number 30, and similarly for 40. This is how we can build the indexing that is secondary
indexing for non-ordering non-key attribute.

Secondary Index on Key Field

Video Transcript:
Now, let us understand how we can build secondary
index on non-ordering key attribute. That means the
attribute is unique, but it is not defining the order of the
records in the file. In the same table, that is, the employee
table. Now let us consider that employee ID is the
primary key, and then it is the ordering attribute. Now,
another attribute, adharID is there, which is also
unique, but it doesn't define the order of the
records. Now, if there is a requirement that we need to
build an index on adharID, how do we do that? In the
previous case, since there are multiple records with
department number 10, and we picked up the distinct
values of department number, and those were appearing in the index records; that's why it was a
sparse index. Now, in this case, we know that every record will have a distinct adharID, and they're not
ordered. The records may be spread across different blocks, and there is only one record pertaining to one
adharID. To build the index, because it is built on a non-ordering attribute, it is a secondary index. But in the
index level, for every distinct value of the adharID, I have a record, and since it is unique, there exists only
one data record pertaining to that particular distinct key value. For example, adharID 1181, there is only
one record, this is the record pertaining to 1181 and this is available in this block, so this pointer will point
to this block. Similarly, 1784 is available here in the second block so this pointer will also point to this. If
you take 4170, this is available in block one, so this pointer will point to this block. We know that instead
of pointing to exact records, the addresses, that is, the second field value, that is, the block pointer in the
index record, will point to the block. Once we identify the required block on the disc, we will transfer that
block onto the RAM, and then we can make a sequential search for a given record with the given key value on
the RAM. This is how it works. That's why every data record has an index record; that's why it is a dense
index. Further, to make it more effective, what we can do is build multiple levels on this. That is
possible. But this particular scheme, where we are stopping with one level and where the index is built
on non-ordering key attribute is also known as a secondary index.

B+ Tree Indexing

Search Trees

• A search tree is a special type of tree that is used to guide the search for a record based on the value of
one of the record’s fields.
• The multilevel indexing, which we have seen earlier, can be thought of as a variation of a search tree.
• The index field values at various levels guide the search till we reach the block that contains the actual
data record.
• Search tree is a balanced tree.

B+ Tree

• The primary disadvantage of implementing multilevel indexes is that the performance degrades as the
file grows.
• It can be remedied by reorganisation, but frequent reorganisation is not advisable.
• B+ Tree is a multilevel search tree used to implement dynamic multilevel indexing.
• B+ tree is best suited for multilevel indexing of files, because it is dynamic and self-reorganizing.

B+ Tree

• It is a balanced tree (all leaves are at same level).


• Each internal node is of the form:

B+ - Tree of Order p

• The maximum number of pointers = p.


• Minimum number of pointers is ceil(p/2).
• All leaves are at the same level.
• The root node can have as few as two pointers.
• Every leaf node has pointer to the next leaf node.
Sample B+ Tree

Video Transcript:

See the video, it's better to understand it there than reading


the transcript. Try to do some questions in the book for better
understanding.

B-Tree Structure

Video Transcript:
Now, let us look at how a B tree is different from a B+ tree. If
we carefully look at B+ tree, in the intermediate level or
root level, we don't see any pointer pointing to the data
blocks. That means a pointer to the data blocks which
contain the records with the given key value can be found
only at the leaf level, not at the intermediate level. But if we carefully observe the B tree, you see here, if I'm searching for a record with key value five, I start my search at the root. Yes, there is a key value five, so I don't have to go to any other level in the tree. Directly, there is a pointer, which is pointing to the block which contains the record with key value five. But this is not possible in a B+ tree. Hence, in a B+ tree, all data pointers, that is, the pointers to the data blocks pertaining to a key value, can be found only at the leaf level. Whereas in a B tree, that is possible: the root level or intermediate level nodes can also point to data blocks. That is the fundamental difference between a B tree and a B+ tree.

Difference Between B+ -Tree and B -Tree

• In a B+ tree, record pointer for a record with given key can be found only at leaf node.
• But if it is in case of B-tree it can happen at the intermediate node also.
• Hence in B+ tree search, success or failure can be declared only after reaching leaf level.
• Whereas in a B-tree, the search can be successful at an intermediate level as well.
• On failure we reach the leaf level.

Example on B+ Tree Node Design

• Here is an example for detailing the node structure for a B+ tree indexing built for Student relation, on
student_id attribute as the key of the relation. The attribute student_id is of 4 bytes length. Other
attributes are - student_age (4 bytes), student_name (20 bytes), student_address (40 bytes), and
student_branch (3 bytes).

• The Disk block size is 1024 Bytes. The tree-pointer takes 4 bytes.
• Note: Each internal node is a disk block which contains search key values and pointers to subtrees.

Solution:
• Disk block size = 1024 Bytes.
• Size of B+ tree node= size of disk block.
• Each tree pointer points to disk block and takes 4 Bytes. Each key takes 4 Bytes.
• In a B+ tree node,
• No. of pointers = no. of keys + 1.
• If no. keys = n then the no. pointers = n+1.
• Then, for a node to fit in a block: (no. keys * size of each key) + (no. pointers * size of each pointer) <= 1024.
• (n*4) + (n+1)*4 <= 1024, i.e., 8n + 4 <= 1024.
• 8n <= 1024 - 4 = 1020, so n <= 127.5, i.e., n = 127.
• Hence, in each internal node, no. keys = 127; and no. pointers = 128.
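The same arithmetic as a short sketch, handy for checking similar node-design exercises (the variable names are assumptions):

block_size, key_size, ptr_size = 1024, 4, 4
# Largest n with n*key_size + (n+1)*ptr_size <= block_size
n = (block_size - ptr_size) // (key_size + ptr_size)
print(n, n + 1)   # 127 keys and 128 pointers per internal node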
Week-9

Transaction Model

Transaction

• A transaction is an executing program that forms a logical unit of database processing.


• A transaction includes one or more database access operations read, write, delete etc.
• A transaction is an atomic unit of work that is either completed in its entirety or not done at all.
• For recovery purposes, the system needs to keep track of when the transaction starts, terminates, and
commits or aborts.

Transaction Model

• A transaction is a program unit that accesses and updates several data items.
• Read() and Write() are the basic operations.

• Hence, as a result of a failure, the state of the system will not reflect the state of the real world that the
database is supposed to capture.
• We call that state an inconsistent state.
• It is important to define transactions such that they preserve consistency.

Transaction States

• Active State: Initial state when a transaction starts.


• Partially Committed State: This state is reached when
the last statement is executed, and the outcome is not
completely written to the sys Log.
• Failed State: After discovering that the normal execution
cannot be continued, a transaction is aborted and
reaches failed state.
• Committed State: Is reached after successful completion
of the transaction. All changes are written to the Sys log.
• Terminated State: Is reached after failure or success of
the transaction.

ACID Properties of a Transaction

• Transaction should possess the following properties, called as ACID properties:


• Atomicity
• Consistency Preservation
• Isolation
• Durability

• Atomicity: A transaction is an atomic unit of processing. It is either performed in its entirety or not
performed at all.
• Consistency Preservation: The successful execution of a transaction takes the database from one
consistent state to another.
• Isolation: A transaction should be executed as if it is not interfered by any other transaction.
• Durability: The changes applied to the data by a transaction must be permanent.
Schedules of Transactions

Schedule

• A schedule of transactions is a description that specifies the execution sequence of instructions in a set
of transactions.
• A schedule can describe the execution sequence of more than one transaction.

Serial Schedule

• In a serial schedule, instructions belonging to one single transaction appear


together.
• A serial schedule does not exploit concurrency. Hence, it is less efficient.
• If the transactions are executed concurrently, then the resources can be utilized
more efficiently, hence more throughput is achieved.
• But, a serial schedule always results in a correct database state that reflects the real
world situations.

Concurrent Schedules

• When the instructions of different transactions of a schedule are executed in an


interleaved manner, such schedules are called concurrent schedules.
• This kind of concurrent schedules may result in incorrect database state.

Serial and Concurrent Schedules

• A schedule S is serial if, for every transaction T participating in the schedule, all
the operations of T are executed consecutively in the schedule. Otherwise, the
schedule is called non- serial or concurrent schedule.

Serialisability

• A concurrent schedule S is serialisable if it is equivalent to some serial schedule of the same n


transactions.
• Being serialisable is not the same as being serial.
• Being serialisable implies that the schedule is a correct schedule.
• A serialisable schedule will leave the database in a consistent state.
• The interleaving is appropriate and will result in a state as if the transactions were serially executed yet
will achieve efficiency due to concurrent execution.
• Hence serialisable schedules are safe and efficient.

Conflict Equivalent Schedules

• If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting
instructions, we say that S and S' are conflict equivalent.

• Hence it is evident that if we swap non-conflicting operations of a concurrent schedule, it will not affect
the final result.
Conflict Serialisability

• A schedule S is conflict serialisable if it is conflict equivalent to a serial schedule.

• In the above example, S5 is a serial schedule and is conflict equivalent to S1. Hence S1
is a conflict serialisable schedule.

Video Transcript:
This is schedule S. This is the given schedule. There are T_1 and T_2. T_1 is reading A, T_1 is writing A,
then T_2 is reading A, then T_2 is writing A, then T_1, reading B and T_1 writing B. Then comes T_2
reading B, and then T_2 writing B. It is very evident that it is a concurrent schedule because the
instructions pertaining to T_1 and T_2 are interleaved. They are executed in a concurrent way. Now, if we
carefully observe, in this S, first, I take it as S_1. In this, if we take this pair, where this belongs to T_2 and this belongs to T_1, it is evident that they are a non-conflicting pair of operations because they are working on two different data items. Because they are non-conflicting, I can swap them. That means I can take it up and then I can bring it down. This is the result. S_2 is the result of this first swap. That's why I can say that S_2 is conflict equivalent to S_1, because S_2 is the result of swapping, in this case, one pair of non-conflicting operations. Hence, there is no problem. Executing this will give you the same result, because the swap happened only between non-conflicting pairs. Now, if we take S_2, then we can observe that this pair is non-conflicting, because they are working on two different data items. I can swap them. I can take it up and then bring this down. S_3 is the result of the second swap. Now I can say that S_3 is conflict equivalent to S_2, and S_3 is also conflict equivalent to S_1. Still, it is a concurrent schedule only. Now, if we carefully observe S_3, this is a non-conflicting pair. I can swap. I can take this up and bring this down. S_4 is the result of that swap. S_4 is conflict equivalent to S_3, S_2 and S_1. Still it is concurrent. From S_4, if I consider this pair of operations, they are non-conflicting because they are working on different data items. There is no problem. I can swap them, take it up and then bring this down. This is the result of that swap. S_5 is conflict equivalent to S_4, S_3, S_2 and S_1. But what happened here? If you carefully observe, after this fourth swap, this is equivalent to a serial schedule. That means executing this and executing this, the result will be the same. S_1 is a concurrent schedule, but S_5, which is the result of applying a series of swaps among non-conflicting pairs of operations, is equivalent to a serial schedule. That's why executing S_1

will produce the same result as executing T_1 first and then T_2 next in a serial order. Hence, executing S_1, which is a concurrent schedule, will leave the database in a consistent state. That's why we say that S_1 is a serialisable schedule. Because it is equivalent to some serial schedule which is the result of swapping non-conflicting pairs of operations, we call this serialisability conflict serialisability. This concept is known as conflict serialisability. I say that S_1 is a conflict serialisable schedule, and S_5 is a serial schedule which is conflict equivalent to S_1. Now, if we execute S_1, which is a concurrent schedule, we'll produce the same result as executing T_1 first and then T_2 next. Hence it will always leave the database in some consistent state. We are exploiting concurrency on one hand, and we are also making sure that the database will be in some consistent state, because the execution of S_1 is equivalent to executing T_1 first and executing T_2 next in a serial order. Hence, this is known as conflict serialisability. I say that S_1 is a conflict serialisable schedule. I prefer serialisable schedules because we
can exploit concurrency at the same time. No problem with respect to consistency.
Conflict Serialisability

• In this schedule we cannot perform any swap between instructions of T1


and T2. Hence it is not conflict serialisable.

Precedence Graph

• Let S be a schedule.
• We construct a precedence graph as below.
• Each transaction participating in the schedule will become a node.
• The set of edges consists of all edges Ti -> Tj for which one of the following
three conditions hold:
• Ti executes W(Q) before Tj executes R(Q)
• Ti executes R(Q) before Tj executes W(Q)
• Ti executes W(Q) before Tj executes W(Q)
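A minimal Python sketch of this construction: it collects the conflicting pairs of a schedule into edges, then tests the graph for a cycle with a simple depth-first search (no cycle means conflict serialisable). The (transaction, operation, data item) encoding and the function names are assumptions; the sample schedule is the S1 discussed in the conflict-serialisability transcript above.

def precedence_edges(schedule):
    edges = set()
    for i, (ti, op_i, q_i) in enumerate(schedule):
        for tj, op_j, q_j in schedule[i + 1:]:
            # Conflicting pair: different transactions, same item, at least one write
            if ti != tj and q_i == q_j and 'W' in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(nodes, edges):
    def visit(n, path):
        if n in path:
            return True
        return any(visit(m, path | {n}) for (a, m) in edges if a == n)
    return any(visit(n, set()) for n in nodes)

S1 = [('T1', 'R', 'A'), ('T1', 'W', 'A'), ('T2', 'R', 'A'), ('T2', 'W', 'A'),
      ('T1', 'R', 'B'), ('T1', 'W', 'B'), ('T2', 'R', 'B'), ('T2', 'W', 'B')]
edges = precedence_edges(S1)
print(edges)                           # {('T1', 'T2')}
print(has_cycle({'T1', 'T2'}, edges))  # False, so S1 is conflict serialisable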

Example-1

• Since there is no cycle, it is conflict serializable, meaning it is serializable.

Example-2

• Since there is a cycle, it is not conflict serializable.

Recoverable Schedules

• Recoverable schedule: if a transaction Tj reads a data item previously written by a transaction Ti, then the
commit operation of Ti must appear before the commit operation of Tj.
• The following schedule is not recoverable if T9 commits immediately after the read.

If T8 should abort, T9 would have read (and possibly shown to the user)
an inconsistent database state. Hence, database must ensure that
schedules are recoverable.
Cascading Rollback

• Cascading rollback: a single transaction failure leads to a series of transaction rollbacks. Consider the
following schedule where none of the transactions has yet committed (so the schedule is recoverable).

• If T10 fails, T11 and T12 must also be rolled back.

Concurrency Control

Problems with Concurrency

• The Lost Update Problem.


• The Temporary Update (or Dirty Read) Problem.
• The Incorrect Summary Problem.

Lost Update Problem

Temporary Update (or Dirty Read) Problem

Incorrect Summary Problem


Why Concurrency?

• In a DBMS. multiple transactions are executed concurrently.


• If the transactions are executed concurrently, then the resources can be utilised more efficiently, hence
higher throughput is achieved.
• For transactions, we consider data items as resources because transactions process data by accessing
them.

Need for Concurrency Control

• When multiple transactions access data elements in a concurrent way, this may destroy the consistency
of the database.
• When one transaction accesses a data item, if some other transaction accesses the same item for a conflicting
operation, we may face the issues discussed earlier, leading to inconsistency.
• Hence it is essential to control the degree of concurrency to maintain consistency.

Approaches to Concurrency Control

• One way to ensure serialisability is to allow the transactions to access the data items in a mutually
exclusive manner.
• This is to make sure that when one transaction accesses a data item, no other transaction can modify that
data item.
• The following techniques implement mutual exclusion and control concurrency.
• Lock-based protocols
• Timestamp-based protocols

What is a Lock?

• A lock is a variable associated with a data item.
• A data item may be locked in various modes.
• When a transaction wants to access a data item in read/write mode, it must apply for the lock on that
item.
• If the lock is free, it will be granted.

Different Locks

• Types of locks:
• Shared (denoted by S): if a transaction obtains a shared mode lock on a data item Q, it can read Q but
not modify Q.
• Exclusive (denoted by X): if this lock is obtained, a transaction can read or write the data item.

Lock Compatibility Matrix

• Lock compatibility matrix


• If a transaction A acquires a Shared lock, then other transactions like B and C can request shared
locks, but their requests for exclusive locks will be rejected.
• If a transaction A acquires an Exclusive lock, then other transactions like B and C's requests for shared or
exclusive locks will be rejected.
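The matrix can be captured directly in code. The following Python sketch (illustrative only; the S/X encoding is an assumption) grants a requested lock only if it is compatible with every lock already held on the item by other transactions.

# Sketch: lock compatibility check based on the S/X matrix above.
COMPATIBLE = {('S', 'S'): True, ('S', 'X'): False,
              ('X', 'S'): False, ('X', 'X'): False}

def can_grant(requested_mode, held_modes):
    # held_modes: lock modes already held on the item by other transactions
    return all(COMPATIBLE[(held, requested_mode)] for held in held_modes)

print(can_grant('S', ['S', 'S']))   # True: shared locks can coexist
print(can_grant('X', ['S']))        # False: exclusive conflicts with shared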
How Locking Can Help in Implementing Serialisability?

Lock Manager

• A lock manager can be implemented as a separate process to which transactions send lock and unlock
requests.
• The lock manager replies to a lock request by sending a lock-grant message (or a message asking the
transaction to roll back, in case of a deadlock).
• The requesting transaction waits until its request is answered.
• The lock manager maintains a data structure called a lock table to record granted locks and pending
requests.

Two-phase Locking Protocol

• This is a protocol which ensures conflict-serialisability.

• Phase 1: Growing Phase:


• Transaction may obtain locks.
• But, the transaction is not allowed to release locks.

• Phase 2: Shrinking Phase:


• Transaction may release locks.
• But, the transaction is not permitted to obtain new locks.

• The protocol assures serialisability.


• Two-phase locking does not ensure freedom from deadlocks.
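To make the growing/shrinking idea concrete, here is a small Python sketch (the textual operation format is hypothetical, chosen only for illustration) that checks whether a single transaction's lock/unlock sequence obeys two-phase locking, i.e., no lock request appears after the first unlock.

# Sketch: does a transaction's operation sequence obey two-phase locking?
# ops is assumed to look like ['lock-S(A)', 'lock-X(B)', 'unlock(A)', 'unlock(B)'].

def is_two_phase(ops):
    shrinking = False
    for op in ops:
        if op.startswith('unlock'):
            shrinking = True                 # the shrinking phase has begun
        elif op.startswith('lock') and shrinking:
            return False                     # a lock after an unlock violates 2PL
    return True

print(is_two_phase(['lock-X(A)', 'lock-S(B)', 'unlock(B)', 'unlock(A)']))  # True
print(is_two_phase(['lock-X(A)', 'unlock(A)', 'lock-S(B)', 'unlock(B)']))  # False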

Variants of Two Phase Locking Protocol

• Basic 2PL: Explained earlier. It is deadlock prone.

• Conservative 2PL: A transaction is required to lock all its data items before it begins execution (not the
case with Basic 2PL). Hence it is deadlock free.

• Strict 2PL: A transaction will not release any of its exclusive (X) locks before it commits or aborts. Hence
always recoverable. Most popular.

• Rigorous 2PL: A transaction will not release any of its lock (S or X) before it commits or aborts. Hence
always recoverable.
What is a Deadlock?

• Neither T3 nor T4 can make progress: executing lock-S(B) causes T4 to
wait for T3 to release its lock on B, while executing lock-X(A) causes T3
to wait for T4 to release its lock on A.

• Such a situation is called a deadlock.

• To handle a deadlock, one of T3 or T4 must be rolled back, and its locks
released.

Deadlock Prevention

• Deadlock prevention protocols ensure that the system will never enter into a deadlock state.
• Some prevention strategies require that each transaction locks all its data items before it begins
execution.
• Ex: pre-declaration as done in case of conservative 2PL.

Other Deadlock Prevention Schemes

• Following schemes use transaction timestamps/priority for the sake of preventing deadlock.
• Wait-die scheme: (Non-preemptive)
• An older/high-priority transaction may wait for a younger/low-priority one to release a data item. Younger
transactions never wait for older ones; they are rolled back instead.
• Wound-wait scheme: (Preemptive)
• An older/high-priority transaction wounds (forces the rollback of) a younger/low-priority transaction instead of
waiting for it. Younger transactions may wait for older ones.
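The two schemes differ only in who waits and who is rolled back. A minimal Python sketch follows, assuming (purely for illustration) that a smaller timestamp means an older, higher-priority transaction.

# Sketch: deadlock-prevention decision using timestamps.
# Smaller timestamp means older (higher-priority) transaction.

def wait_die(ts_requester, ts_holder):
    # Non-preemptive: older requester waits, younger requester dies (is rolled back).
    return 'WAIT' if ts_requester < ts_holder else 'ROLLBACK requester'

def wound_wait(ts_requester, ts_holder):
    # Preemptive: older requester wounds (rolls back) the younger holder; younger requester waits.
    return 'ROLLBACK holder' if ts_requester < ts_holder else 'WAIT'

print(wait_die(10, 20))    # older requester  -> WAIT
print(wait_die(20, 10))    # younger requester -> ROLLBACK requester
print(wound_wait(10, 20))  # older requester  -> ROLLBACK holder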

Deadlock Detection

• Wait-for Graph
• Deadlock condition can be determined by a wait-for graph.
• All transactions of the schedule become nodes.
• Draw an edge between two transactions Ti and Tj, if Ti is waiting for Tj to release a lock on a data item.
• If the graph has a cycle then we can say that the schedule will result in a deadlock.

Ex. for Wait-for Graph

(We can draw an edge from T4 to T1


or T4 to T2. Either way is acceptable)

Summary

• When a deadlock occurs, the transactions involved halt and cannot make progress.

• Deadlocks occur due to mutual exclusion (locks).
• Deadlocks can be prevented with Conservative 2PL.
• Wait-die and Wound-wait are used to prevent deadlocks.
• A wait-for graph is used to detect deadlocks.
Use of Timestamps

• Maintaining the ordering between every pair of conflicting transactions is significant.
• If we select the ordering in advance, we can achieve serialisability.
• Time-stamping is a method to fix the ordering.
• Each transaction is assigned a unique fixed timestamp.
• If TS (Ti) < TS (Tj), this implies that Ti should be executed before Tj - here, the timestamps determine
the serialisability order.

Read and Write Timestamps

• Each data item is associated with two timestamp values.


• W-timestamp (Q): represents the largest timestamp of any transaction that successfully executes Write
on Q.
• R-timestamp (Q): which denotes the largest time stamp of any transaction that successfully executed
Read on Q.

• These values are updated whenever read (Q) or write (Q) are executed by transactions.

Timestamp Ordering Protocol

• This protocol operates as follows:

• i) Suppose transaction Ti issues read(Q) (since read with read is not conflicting, we do not consider the R-timestamp):
• If TS (Ti) < W-timestamp (Q), then it implies that Ti needs to read a value of Q which was already overwritten by
a younger transaction.
• Hence the read operation is rejected and Ti is rolled back.
• If TS (Ti) ≥ W-timestamp (Q), then the read operation is executed.

• ii) Suppose Ti issues write(Q) (since write conflicts with both read and write, we consider both the R- and W-timestamps):
• If TS (Ti) < R-timestamp (Q), it implies that the value of Q being produced by Ti was needed earlier
(a younger transaction has already read Q). Hence reject Ti and roll back.
• If TS (Ti) < W-timestamp (Q), Ti is attempting to write an obsolete value of Q.
• Hence reject Ti and roll back.
• Otherwise, the write operation is executed.
• Note: It is guaranteed to be conflict serialisable.
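A minimal Python sketch of these two rules follows; keeping the R- and W-timestamps in plain dictionaries is an assumption made only for illustration.

# Sketch: timestamp ordering protocol checks for read(Q) and write(Q).
R_TS = {}   # largest timestamp of a transaction that read each item
W_TS = {}   # largest timestamp of a transaction that wrote each item

def read_item(ts, q):
    if ts < W_TS.get(q, 0):
        return 'ROLLBACK'                  # Q was overwritten by a younger transaction
    R_TS[q] = max(R_TS.get(q, 0), ts)
    return 'READ OK'

def write_item(ts, q):
    if ts < R_TS.get(q, 0) or ts < W_TS.get(q, 0):
        return 'ROLLBACK'                  # value needed earlier, or obsolete write
    W_TS[q] = ts
    return 'WRITE OK'

# Case 1 from the example: R-TS(A)=300, W-TS(A)=320, transaction with TS=450.
R_TS['A'], W_TS['A'] = 300, 320
print(read_item(450, 'A'))    # READ OK
print(write_item(450, 'A'))   # WRITE OK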

Example

Video Transcript:
First, let us consider this. That is, the read timestamp of A is
300, and write timestamp is 320. That is a transaction
with timestamp 300 has quite recently read that data
item, and the transaction with timestamp 320 has
quite recently written that data item or updated that data
item. Now, let us assume that there is a transaction T_i with
a timestamp 450, making a request for read of A. Then what
should we do? Because it is a request for read, we should
not consider or we should not look at RTS, that is, read timestamp. It is not necessary because read and
read is not con icting. We need to consider only the write timestamp. The write timestamp at this point is
320. That is, an older transaction has updated it, another transaction is now coming and then asking for
read. There is no problem. We can permit it. Let us consider the second case for this first read operation
only. In the second case, RTS is 500, WTS of A is 400 and the timestamp of T_i is 450, and it is requesting a
read. We don't have to consider read timestamp. We need to consider only this. Even in this case, older
transaction has updated this, and the other transaction is now requesting a read operation. There is no
problem at all. We can permit that. If you consider the third case, the read timestamp need not be
considered. But if we look at the write timestamp, it is 480, which is higher than 450. That is, another
transaction has already modified the data item A. Hence, we should not permit that, and then we should
roll back T_i. Similarly for Case 4, because the write timestamp is 490, that means another transaction has
updated it. We should not permit that. T_i should be rolled back. Now, if we consider that the transaction T_i
with timestamp 450 is making a request for write, when it is a write operation, we should consider
both read and write timestamps of the data item. Let us consider Case 1, the read timestamp is 300 and
the write timestamp is 320. Both are less than 450. That is, an older transaction. Or older transactions
have updated and then read this data item. Hence, there is no problem. Now, if we consider Case 2, the
read timestamp is higher than 450. That means another transaction has already read something from the
database pertaining to this data value A. Hence, the older transaction is now coming and then trying to
update it, which is not correct. Hence, in Case 2, we should reject the write operation of T_i, and then T_i
should be rolled back, and the same is the case with Case 3, because the read timestamp is 500, which is
higher than 450. Hence we should not permit. If we consider Case 4, there is no issue with read
timestamp, but there is an issue with the write timestamp. The write timestamp is 490, which is higher than the
timestamp of T_i, which is 450. Hence, it is implied that another transaction has updated the data
value, and then the update that is proposed by T_i will be meaningless. Hence it should be rejected. T_i
should be rolled back. This is how we can use the timestamps associated with data items and
with the transactions to fix the ordering of execution of different transactions.

Transaction support in SQL

• A single SQL statement is always considered to be atomic, hence can be treated as a transaction.
• If multiple statements are to be executed as an atomic unit, we can do the following.

EXEC SQL WHENEVER SQLERROR GOTO UNDO;


EXEC SQL SET TRANSACTION;

EXEC SQL INSERT INTO EMP.....;


EXEC SQL UPDATE EMP......;
…….

EXEC SQL COMMIT;


GO TO END;
UNDO: EXEC SQL ROLLBACK;
END: …….;

Video Transcript:
Observe these statements: EXEC SQL WHENEVER SQLERROR GOTO UNDO. Undo is the label; it is available
here. This is the end label. EXEC SQL SET TRANSACTION, that is, this particular SQL statement defines that
whatever SQL statements follow this set transaction statement will be part of the
transaction. Then we can have any number of data manipulation commands. Then EXEC SQL
COMMIT. That is once all the statements are done, we can commit it. If there is a problem in
between while executing the SQL statements, this commit will not be executed. It executes go to. I repeat.
If there is no problem, all statements are executed perfectly, then SQL commit is executed, and it will go to
end. End means, it is completed, committed and then complete the transaction. If there is any issue in
between, while executing the SQL statements, then it is according to this statement, it will go to undo
label. Then the control comes here. It is going to execute EXEC SQL ROLLBACK. That is, all the
modifications done by that failed transaction will be rolled back. This is how SQL supports the transaction
model. This is a very broad view of how SQL can support transactions. Different implementations
of SQL can have slightly varying syntax for supporting transactions.
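For instance, a rough equivalent of the embedded-SQL skeleton above can be written with Python's built-in sqlite3 module; this is only an illustration of commit/rollback behaviour, not the syntax shown in the slide, and the table and values are made up.

# Sketch: multiple statements executed as one atomic unit (commit or rollback).
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE EMP (eid INTEGER PRIMARY KEY, name TEXT, salary REAL)')
try:
    conn.execute("INSERT INTO EMP VALUES (1, 'Kiran', 50000)")
    conn.execute("UPDATE EMP SET salary = salary * 1.1 WHERE eid = 1")
    conn.commit()        # corresponds to EXEC SQL COMMIT
except sqlite3.Error:
    conn.rollback()      # corresponds to the UNDO label: EXEC SQL ROLLBACK
finally:
    conn.close()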
Week-10

Database Recovery

Introduction

• A transaction is a program unit that accesses


and updates several data items.
• Read ( ) and Write ( ) are the basic operations.

Failures and Inconsistency Problem

• A transaction can fail due to system crashes or transaction errors.


• As a result of failure of a transaction, the state of the system will not reflect the state of the real world
that the database is supposed to capture.
• We call that state as inconsistent state.

Need for Recovery

• When such a failure happens, we must make sure that the database is restored to its previous consistent
state, which existed before the start of the transaction (which has failed).
• This process is known as recovery process.
• The most popular recovery schemes are log-based recovery schemes.

Log-Based Recovery Schemes

• If a transaction T performed multiple database modifications, several output operations may be required,
and a failure may occur after some of these modifications have been made but before all of them are
made.
• In order to restore the database to the recent consistent state, we must first write the information describing the
modifications to the system log without modifying the database itself.
• This helps us to remove the modifications done by a failed transaction.
• Removing/undoing the modifications done by a failed transaction is known as rollback.

System Log

• Database System Log:


• Each log record describes a single database write operation and contains the following details.
• Transaction name/id
• Data item name
• Old value
• New value

Types of Log Records

• <Ti start>: indicates transaction Ti started.


• <Ti, Xj, V1, V2>: transaction Ti has performed a write operation on data item Xj, which had value V1 before the
write and will have value V2 after the write.
• <Ti commit >: transaction Ti commits.
• <Ti abort >: transaction Ti aborted.
• With these log records we have the ability to undo or redo a modification that has already been output to
the DB.

Summary

• When a failure occurs, it is essential to take the DB back to its previous consistent state.


• To do this, we must remove/undo the modifications done by a failed transaction, which is known as
rollback.
Deferred Modification Technique

• This technique ensures atomicity by recording all database modifications (updates) in the log.
• But deferring (postponing) the actual updates to the database until the transaction commits.
• As no data item is written before the commit record of the transaction, we need only the new values.
• Hence, we perform only the redo operation.
• The redo operation on Ti sets the value of all data items updated by transaction Ti to the new values.
• All new values will be found in the log records.

Deferred Modification Protocol

• Deferred Modification Protocol


• A Transaction cannot change the database on disk until it reaches its commit point.
• A transaction does not reach its commit point until all its updates are recorded in the log and the log is force-
written to disk (permanent storage).

Working Principle

• Redoing is needed when we have all modifications in the log, and have doubts about
successful writing to the DB.
• The log will not contain old values because undo is not required; we only do redo.

• On failure, a transaction needs to be redone if and only if the log contains both its
<start> and <commit> records. Otherwise, we don't have to do anything.

Example

Action: T1 and T2 redo; T3 no action needed.

Immediate Modification Technique

• In this technique, database modifications may be output to the database while the transaction is still in the active
state, i.e., before commit, after the log entry is done and the log entries are written to disk.
• If such is the case, on failure, an undo operation is needed for incomplete transactions, and a redo may be
required for committed transactions.

Working Principle

On failure, redo is applied to a transaction if the log contains both its
<start> and <commit> records. If the log contains only its <start> record,
we need to apply undo.
Example

Action: T1 and T2 redo; T3 undo.


We redo the committed transactions to make sure their updates are written to
the disk.

Need for Checkpointing

• In case of failure, the log needs to be searched to determine the transactions that need to be redone or
undone.
• But this searching is time consuming, and most of the time the algorithm will redo transactions which have
already written their updates to the DB; redoing them is a waste of time.
• In order to reduce these types of overheads checkpointing is helpful.

Actions Needed During the Checkpointing

• Output all log records currently in main memory onto stable storage.
• Output all modified buffer blocks to the disk.
• Output log record <checkpoint> on to stable storage.
• During the recovery process, redo/undo operations will be considered only for the transactions that occur
after, or were active just before, the latest <checkpoint> record in the log.
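A minimal Python sketch of this decision for the immediate modification scheme follows. The log is represented as a list of tuples such as ('start', 'T1'), ('commit', 'T1') and ('checkpoint',); this representation is an assumption made only for illustration.

# Sketch: deciding redo/undo sets from the log, immediate modification + checkpoint.
def recovery_actions(log):
    redo, undo = set(), set()
    for record in log:
        kind = record[0]
        if kind == 'checkpoint':
            # Transactions committed before the checkpoint need no further redo.
            redo.clear()
        elif kind == 'start':
            undo.add(record[1])
        elif kind == 'commit':
            undo.discard(record[1])
            redo.add(record[1])
    return redo, undo

log = [('start', 'T1'), ('commit', 'T1'), ('start', 'T2'), ('checkpoint',),
       ('start', 'T4'), ('commit', 'T4'), ('start', 'T5')]
print(recovery_actions(log))   # redo = {'T4'}, undo = {'T2', 'T5'}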

SQL Query Execution and Optimization

Steps in SQL Query Execution

1. Scanning: Identifying the language tokens


2. Parsing: Syntax checking
3. Validation: Checking whether the attributes and relations mentioned in the query are valid
4. Generate query tree: Representation of the query
5. Devise an execution strategy: To retrieve data from the files
6. Query optimisation: Choosing a suitable strategy and generating an execution plan
7. Code generation: Generating code for the plan
8. Execution

Workflow

Code can be executed directly (interpreted), or stored and executed later.
SQL Query - Relational Algebra

• An SQL query is first translated into an equivalent relational algebra expression - represented as a query
tree data structure.
• SQL queries are decomposed into query blocks.
• A query block forms the basic unit that can be translated into algebraic operators and optimised.
• A query block contains a single SELECT-FROM-WHERE clause with optional GROUP BY and HAVING
clauses.
• Hence, nested queries within a query are identified as separate query blocks.

Sort-Merge Strategy for Sorting

• Sorting is one of the primary algorithms used in query processing.


• The typical external sorting algorithm uses a sort-merge strategy.
• This starts by sorting small subfiles of the main file, called runs, and then merges the sorted runs,
creating larger sorted subfiles that are merged in turn.
• This algorithm requires buffer space where the actual sorting and merging of runs is performed.

Phases of Sort-merge

1. Sorting phase: In this phase, runs (subfiles) of the file that can fit in the available buffer space are read
into main memory, sorted using an internal sorting algorithm, and written back to the disk as
temporary sorted subfiles (or runs).

2. Merging phase: In this phase, the sorted runs are merged during one or more passes. In each pass, one
buffer block is needed to hold one block from each of the runs being merged, and one block is needed for
containing one block of the merge result.
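A toy Python sketch of the two phases follows; it sorts an in-memory list in fixed-size runs and merges them with heapq.merge, whereas a real external sort works on disk blocks, so this is illustrative only.

# Sketch: sort phase (create sorted runs) and merge phase (merge the runs).
import heapq

def external_sort(records, run_size):
    # Sorting phase: split into runs that "fit in the buffer" and sort each run.
    runs = [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]
    # Merging phase: merge all sorted runs into one sorted output.
    return list(heapq.merge(*runs))

print(external_sort([7, 3, 9, 1, 8, 2, 6], run_size=3))  # [1, 2, 3, 6, 7, 8, 9]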

Implementing JOIN

If relation R is joined with S on R.A=S.B.


1. Nested-loop join (brute forcing)
For every record t in R (outer loop), retrieve every record from S (inner loop) and test if they satisfy the join
condition.

2. Single-loop join (using an access structure to retrieve matching records)


If there exists an index on one of the attributes used in joining, say B of S, then for each record in R,
retrieve all matching records of S using the access path on B with the record's R.A value.

3. Sort-merge join
If records in R and S are physically sorted by value of join attributes A and B, both files are scanned
concurrently in order of join attributes, matching the records that have the same values for A and B.
ff
fi
fi
fi
ff
fi
fi
fi
fi
fi
fi
fi
ff
4. Hash join
The records of R and S are hashed to the same hash file using the same hash function on the join
attributes A of R and B of S as hash keys.

Partitioning: First, the file with fewer records (say R) is hashed into buckets.

Probing: A single pass through the other file S hashes each record to probe the appropriate bucket, and
combine it with the matching records of R in that bucket.
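A minimal in-memory Python sketch of the partition/probe idea follows; representing R and S as lists of dictionaries is an assumption made for illustration.

# Sketch: hash join of R and S on R.A = S.B.
from collections import defaultdict

def hash_join(R, S, a='A', b='B'):
    # Partitioning: hash the smaller relation (say R) into buckets on attribute A.
    buckets = defaultdict(list)
    for r in R:
        buckets[r[a]].append(r)
    # Probing: one pass over S; combine each record with matching R records.
    return [{**r, **s} for s in S for r in buckets.get(s[b], [])]

R = [{'A': 1, 'x': 'r1'}, {'A': 2, 'x': 'r2'}]
S = [{'B': 2, 'y': 's1'}, {'B': 3, 'y': 's2'}]
print(hash_join(R, S))   # [{'A': 2, 'x': 'r2', 'B': 2, 'y': 's1'}]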

Other Algorithms

We also have algorithms to implement:


• SELECT.
• PROJECT.
• Set operations.
• Implementing Aggregate operations and GROUP BY.
• Implementing Outer Joins.

Query Tree

• A Query Tree is a tree data structure that corresponds to a relational algebra expression.
• It represents the input relations of the query as leaf nodes of the tree and represents the relational
algebra query as internal nodes.
• The execution using the query tree consists of executing an internal node operation whenever its
operands are available and then replacing the internal node by the relation that results from executing
the operation.
• The execution terminates when the root node is executed and produces the result relation for the query.

Example

Heuristic Optimisation of Query Trees

• A query parser will generate a standard initial query tree that corresponds to the SQL query, without
doing any optimisation.
• These initial trees are inefficient if executed directly.
• Now, it is the job of the heuristic query optimiser to transform this initial query tree into a final query tree
that is efficient to execute.

Equivalence Rules for RA Expressions

• The optimiser must include rules for equivalence among relational algebra expressions that can be
applied to the initial tree.
Video Transcript:
For example, if we consider the first relational algebraic expression joining EMPLOYEE table with
DEPARTMENT on Dno from EMPLOYEE is equal to Dnum from DEPARTMENT. From the joined tuples,
selecting Dlocation=DELHI and from the result projecting on Ename and salary. This is one possible
approach. An alternative expression is: first, I select only those tuples from the DEPARTMENT table where
the condition Dlocation=DELHI is satisfied.

Then it is obvious that very few tuples or records from DEPARTMENT can satisfy that condition, and as
a result of this SELECT operation, we have only a few records from DEPARTMENT. Then join that
small number of records from DEPARTMENT with EMPLOYEE on Dno=Dnum, then pick up Ename and
salary. If we carefully observe, both of these expressions will produce the same result. The difference is, if
I join the EMPLOYEE table with DEPARTMENT without performing this SELECT operation on the
DEPARTMENT, then the joined table or the JOIN result will contain more tuples and
require more space. If I perform this SELECT operation on the DEPARTMENT table before I go for the join
operation, then I have very few tuples left for joining with the EMPLOYEE table. That will be more
effective and more efficient.

Hence, it is obvious that there can be multiple possible relational algebraic expressions for retrieving the
same result. The optimizer must include rules for such equivalences among relational algebra expressions
that we will use during the optimization. Heuristic query optimization then utilizes these equivalences of
relational algebraic expressions to transform the initial tree into the final optimized query tree.

Optimisation Heuristics

• Break up conjunctive SELECT operations (this helps in moving SELECT operations down into different
branches).
• Move each SELECT as far down the query tree as possible.
• More restrictive SELECT conditions need to be executed first.
• Combine a CARTESIAN PRODUCT with a subsequent SELECT in the tree into a JOIN, if the condition
represents a JOIN.
• Break up PROJECT operations and move them as far down as possible in different branches.

Illustration

Video Transcript:
Let us have a look at this example.

DEPARTMENT EMPLOYEE cartesian product. Then there is a


SELECT operation. Dno from EMPLOYEE is equal to Dnumber
from DEPARTMENT, and finally, projecting on Fname and
salary. What I can do according to the heuristics, what we have
seen in the previous slide, I can push the SELECT operation as
down as possible. That means I push this SELECT
operation. That is a condition that is applied on
DEPARTMENT table down in the branch of the tree. That is,
before performing this cartesian product or whatever it is, I
perform the SELECT operation on DEPARTMENT table to pick
up only those tuples where Dname= Accounts. The result of
this is I will minimize the number of tuples I am considering
for the next subsequent operation from DEPARTMENT
table. Then to perform JOIN operation between DEPARTMENT
and EMPLOYEE, I require only DEPARTMENT number. Hence I
perform project on DEPARTMENT number from DEPARTMENT table. I will
apply it as early as possible, that is immediately after the SELECT operation
on DEPARTMENT table. And similarly to join EMPLOYEE and DEPARTMENT,
I require Dnumber, and in the nal result, I require only Fname and salary. So
what we can do is we can project on only those attributes that are essential
for performing JOIN operation and that are essential for projecting the
result. So I perform this PROJECT operation on employee as early as
possible in the query execution. Then if we carefully observe a
cartesian product is followed by a SELECT condition. Now that can be replaced by a JOIN operation
according to the heuristics. And finally, we project on Fname and salary. Of course, in this example, we
can see that even this assertion may not be required. We can avoid that. This is how heuristics can be
applied to the initial query tree to obtain the final optimized query tree. If we compare this final tree with
the initial tree, it is very obvious that we have performed the SELECT and PROJECT operation on
DEPARTMENT table as early as possible. We have performed the PROJECT operation on EMPLOYEE
table at the earliest and then we have replaced cartesian product followed by SELECT condition with JOIN
operation. And this is more optimized execution.
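The benefit of pushing a SELECT down can also be seen numerically. The following Python sketch uses made-up row counts to compare the size of the intermediate result when the selection on DEPARTMENT is applied after the join versus before it.

# Sketch: why pushing a selection below a join shrinks intermediate results.
departments = [{'Dnumber': d, 'Dlocation': 'DELHI' if d == 5 else 'MUMBAI'}
               for d in range(1, 101)]                    # 100 departments
employees   = [{'Dno': (e % 100) + 1, 'Ename': f'E{e}'}   # 10,000 employees
               for e in range(10_000)]

# Join first, then select: the join produces one row per employee.
joined = [{**e, **d} for e in employees for d in departments
          if e['Dno'] == d['Dnumber']]
after_join = [row for row in joined if row['Dlocation'] == 'DELHI']

# Select first, then join: only DELHI departments enter the join.
delhi = [d for d in departments if d['Dlocation'] == 'DELHI']
pushed = [{**e, **d} for e in employees for d in delhi
          if e['Dno'] == d['Dnumber']]

print(len(joined), len(after_join), len(pushed))   # 10000 100 100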

Cost Components

• Access cost to secondary storage: This is the cost of searching for, reading, and writing data blocks on
the disk.
• Storage cost: The cost of storing intermediate files generated.
• Computation cost: In-memory searching, sorting, and merging.
• Memory usage cost: The number of memory buffers needed.
• Communication cost: The cost of shipping the query to the database site and shipping the result back to
the place where the query originated.

Summary

• The sequence of operations of any relational algebraic expression can be represented as a tree known
as query tree.
• Initial trees are not efficient.
• Transformations are possible based on the equivalence rules.
• Optimisation heuristics guide us through the transformation to obtain optimised query trees which are
efficient.

Database Security

Types of Security

• Database security is a broad area that addresses many issues, including the following:
• Legal and ethical issues.
• Policy issues at the governmental, institutional, or corporate level.
• System-related issues.
• Security policy specific to the organisation.

Threats to Database

• Threats to databases result in the loss or degradation of some or all of the following commonly accepted
security goals.
• Loss of integrity: Database should be protected from improper modification.
• Loss of availability: Making database objects available to users or programs to which they have a
legitimate right.
• Loss of confidentiality: Protection of data from unauthorised disclosure.

Control Measures

Four main control measures that are used to provide security of data in databases are:
• Access control
• Inference control
• Flow control
• Data encryption

Video Transcript:
Four main control measures that are used to provide security of data in a database are as follows, access
control, inference control, flow control, and data encryption. Access control is about providing privileges to
various applications and users to access the data. Inference control means to protect the statistical
databases so that the individual's data should not be accessed or queried. Flow control refers to
protecting the database by controlling how information flows from one object to another object, so that
access to the data by unauthorized persons does not take place. The fourth one, data encryption, means encrypting
the data in the database and the data that is being transmitted through the network.

Information Security and Privacy

• Security: Refers to many aspects of protecting the DB from unauthorised use including authentication of
users, encryption, access control and intrusion detection.
• Privacy: It is the ability of individuals to control the terms under which their personal information is
acquired and used. (Policies).

Types of Access Control Mechanisms

Two popular access control mechanisms are:


• Discretionary Access control
• Mandatory Access control

• Discretionary Access control mechanism:


• Used to grant privileges to users, including the capability to access specific data files, records, or fields in a
specific mode (read, insert, delete, or update).

• Mandatory Security Mechanisms:


• Used to enforce multilevel security by classifying the data and users into various security classes or
levels, and implementing the appropriate security policy of the organisation.
• Ex: permit users to see only the data items classified at the user's own classification level (or lower).
• An extension to this is a role-based security model which enforces policies based on roles.

Discretionary Access Control

• Discretionary access control model is based on granting and revoking of privileges.


• There are two levels for assigning privileges:
• The account level.
• The relation level.

• Account level
• CREATE SCHEMA, CREATE TABLE, CREATE VIEW, ALTER, MODIFY, SELECT etc.

• Relational level
• Select privilege on R; modify privilege on R, reference privilege on R
• Ex.
• GRANT SELECT ON V(relation) TO B(user);
• REVOKE SELECT ON EMP(relation) FROM A(user);

Mandatory Access Control

• In many applications, an additional security policy is needed that classifies data and users based on
security classes; this approach is known as mandatory access control.
• Typical security classes:
• Top secret (TS) - Highest
• Secret (S)
• Confidential (C)
• Unclassified (U) - Lowest
fi
fi
fi
fi
fl
fi
fi
fi
fi
fi
Bell-LaPadula Model

• In Bell-LaPadula model, we have:


• Subject (user, account, program).
• Object (relation, tuple, column, view, operation).

• According to this:
• A subject S is not allowed read access to an object O unless class(S) ≥ class (O). This is known as
simple security property.
• A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star
property.
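These two properties ("no read up, no write down") can be expressed directly. The Python sketch below encodes the classes as numbers purely for illustration.

# Sketch: Bell-LaPadula checks with TS > S > C > U encoded as numbers.
LEVEL = {'U': 0, 'C': 1, 'S': 2, 'TS': 3}

def can_read(subject_class, object_class):
    # Simple security property: class(S) >= class(O)  ("no read up")
    return LEVEL[subject_class] >= LEVEL[object_class]

def can_write(subject_class, object_class):
    # Star property: class(S) <= class(O)  ("no write down")
    return LEVEL[subject_class] <= LEVEL[object_class]

print(can_read('S', 'C'))    # True:  a Secret subject may read a Confidential object
print(can_write('S', 'C'))   # False: a Secret subject may not write down to Confidential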

Statistical Database Security

• Statistical databases are mainly used to produce statistics about various populations.
• The database may contain confidential data about the individuals.
• Individuals' data should not be disclosed.
• However, users are permitted to retrieve statistical information about the populations.
• Ex: average, sum, count, max, min etc.

Possible Threats

• Sometimes it is possible to infer the values of individual tuples from a sequence of statistical queries.
• Ex. Assume that employee data is in: EMP(eid, name, city, age, salary, quali, dept).
• If we wish to find the salary of 'Kiran', and we know that he is 42 years of age and lives in Pune, we can do
the following:
• SELECT COUNT(*) FROM EMP WHERE name='Kiran' and age=42;
• If the result is 1. We can execute-
• SELECT AVG(salary) FROM EMP WHERE name='Kiran' and age=42.
• Average salary will be Kiran’s salary.

Flow Control

• Flow control regulates the flow of information among accessible objects.

• A flow between object X and object Y occurs when a program reads from X and writes it to Y.
• Flow controls check that information contained in some object does not flow explicitly or implicitly into
less protected objects.
• Hence, a user cannot indirectly obtain from Y data which he cannot get from X directly.

Flow Control Policy

• A flow policy specifies the channels along which information is allowed to move.


• A covert channel allows a transfer of information that violates the security policy.
• It allows information to pass from a higher classification level to a lower classification level through
improper means.

Encryption

• Encryption is the conversion of data into a form, called ciphertext, that cannot be easily understood by
unauthorised persons.
• It enhances security and privacy when access controls are bypassed.
• It also helps to protect data under transmission.
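As an illustration, symmetric encryption of a stored value might look as follows using the third-party Python 'cryptography' package; this is a sketch, not a recommendation of any particular scheme or key-management approach, and the data value is made up.

# Sketch: symmetric encryption of a data value before storage/transmission.
# Requires the third-party 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # the key must be kept secret and managed safely
f = Fernet(key)

cipher_text = f.encrypt(b"salary=50000")   # unreadable without the key
plain_text  = f.decrypt(cipher_text)       # restores the original bytes
print(cipher_text != b"salary=50000", plain_text)   # True b'salary=50000'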

Database Security Challenges

• Data Quality: To assess and attest the quality of data.


• Intellectual Property Rights: Digital watermarking to enable provable ownership of data.
• Database Survivability: Database systems must continue to work with minimal capabilities even when
they are attacked.
fl
fl
fi
fi
fl
fi
fi
fl
fi
Problem solving

Problem 1

• Consider the following Log records for the transactions: T1, T2, T3, T4, and T5.

• For the above sequence of log records inserted in the same order, suggest the recovery actions on
system crash for each of the transactions, if the immediate modification technique is used.

Solution to Problem - 1

• Actions to be taken, with the immediate modification scheme:


• T1: Started and committed. Hence apply REDO.
• T2: Started but not committed. Hence apply UNDO.
• T3: Started and committed. Hence apply REDO.
• T4: Started and committed. Hence apply REDO.
• T5: Started but not committed. Hence apply UNDO.

Problem 2

• Consider the following Log records for the transactions: T1, T2, T3, T4, and T5.

• For the above sequence of log records inserted in the same order, suggest the recovery actions on
system crash for each of the transactions, if the deferred modification technique is used.

Solution to Problem - 2

• Actions to be taken, with the deferred modification scheme:


• T1: Started and committed. Hence apply REDO.
• T2: Started but not committed. Hence NO action needed.
• T3: Started and committed. Hence apply REDO.
• T4: Started and committed. Hence apply REDO.
• T5: Started but not committed. NO action needed.

Problem - 3

• Consider the following Log records for the transactions: T1, T2, T3, T4, and T5.

• For the above sequence of log records inserted in the same order, including checkpointing, suggest the
recovery actions on system crash for each of the transactions.
• The recovery strategy used is the immediate modification technique.
Solution to Problem 3

• Actions to be taken, with the immediate modification scheme:


• T1: Started and committed before CP. Hence NO action.
• T2: Started but not committed. Hence UNDO.
• T3: Started and committed before CP. Hence NO action.
• T4: Started and committed after CP. Hence apply REDO.
• T5: Started but not committed. Hence UNDO.

Video Transcript:
According to the system log entries: T1 started, T1 modified A (old and new values are
given), T1 committed, T2 started, T3 started, T2 modified B, T3 modified
B, T3 committed, and now it is time for checkpointing. At the time of
checkpointing, all transactions which are under execution are suspended. That is only T2, because T1
and T3 are already committed. T2 is suspended, and all updates done by the committed
transactions, that is T1 and T3, are permanently written to the database. Then a checkpoint record
is inserted into the log, and then the suspended transactions are resumed. Then it will proceed
further. After checkpointing, T4 started, T4 modified, T2 modified, T4 committed, T5 started, T5
modified, and then the system crash occurred. Since the system crash occurred at this point of time, we need to
look at, and take appropriate action for, only those transactions which were suspended during the
recent checkpointing, that is T2, and the transactions which started after the recent checkpoint, that is T4
and then T5. Since the transactions T1 and T3 started and committed before the recent
checkpoint, we will just ignore T1 and T3. And we need to take appropriate action according to the
recovery scheme for T2, T4 and T5, which are the transactions either suspended during the recent
checkpointing or started after the recent checkpointing.

Summary

• The immediate modification scheme requires both UNDO and REDO.

• The deferred modification scheme needs only REDO.
• When checkpointing is in place, we apply REDO and/or UNDO with reference to the recent CP. Hence it will
be more efficient.

Problem Solving on Query Optimisation

• For the following SQL query, give:


• Relational algebraic expression.
• And the Optimised Query tree.

SELECT S.sid, S.age, C.cname, C.ceo, P.salary


FROM Student S, Company C, Placement P
WHERE S.sid=P.sid and C.cid=P.cid and
S.cgpa>8.0 and C.city='DELHI';

Solution

Video Transcript:
This is the SQL query given and this is one
possible relational algebraic expression and where
first we join company and placement based on the
company id in company table is equal to company
id in placement table. The result is then joined with
student table where the student ID coming from
placement is equal to student ID in student
table. From this result we apply the select
condition that is where the city of the company is
equal to Delhi and the CGPA is greater than eight. Then we apply project operation to retrieve the
required attributes into the result that is SID, age, cname, CEO and salary from the respective tables. This
is the initial tree. If we carefully observe and it is not optimized because according to the heuristics we
need to apply the select operations as early as possible. That is we need to push down the select and
project operations as down as possible in the query tree to minimize the size of the result set. At the same
time to minimize the attributes that will appear in the next result set.

Optimised Query Tree

Video Transcript:
When we apply heuristics on the company table, first we
will apply select operation to pick up only those tuples
where the city is equals to Delhi, there may be any
number of attributes in company table, but we require CID
which is going to help us in joining, company name which
is required in the result and the CEO name which is
required in the result. Similarly, from placement we require
CID student id and salary which are essential for either
join operation at subsequent level or to be displayed in
the result. Then these result of this operation and this
operation is joined based on C.Cid is equals to
P.cid. Again, from the student table we apply the select operation as early as possible. We pick up only
those tuples where the CGPA is greater than eight and immediately we will pick up only those columns
which are essential either for performing the join operation at the next higher level or to be displayed in the
result. So we will pick up only the SID and age. Then we apply join on this result and on this result based
on P.Sid is equals to S.Sid. From the join result we pick up these attributes to be displayed as part of the
results and if we carefully observe, the initial query tree is transformed into this final tree which is more
optimal, which is more e ective for execution.
