Block 1
1.0 INTRODUCTION
Today, most of your online activities require interaction with a database
system running as the backend of an application, such as purchasing from
supermarkets or e-commerce websites, depositing and/or withdrawing money from a bank,
booking a hotel, airline or railway reservation, accessing a computerised library,
ordering a magazine subscription from a publisher, or using your smartphone apps to
purchase goods. In all the above cases a database is accessed. Most of these backend
database systems may be called Traditional Database Applications. In these types of
databases, the information stored and accessed is textual or numeric. However, with
advances in technology in the past decades, different newer database models have
been developed, which will be discussed in Block 4 of this course. In this unit, we will
be introducing the architecture and structure of a traditional Database Management
System.
1.1 OBJECTIVES
After going through this unit, you should be able to:
The Database Management System Concepts
1.2 NEED FOR A DATABASE MANAGEMENT SYSTEM
A database is an organised, persistent collection of data of an organisation. The
database management system manages the database of an enterprise. Prior to the use of
database systems, file-based systems were popularly used. In order to appreciate
the strengths of database management systems, you may first list the problems
associated with file-based systems, which are discussed next.
• Fee paid by the students of the Computer Science Department per annum.
• The students who are willing to avail the pickup bus facility.
• Number of students taught by a faculty in a specific academic period.
• The number of students who have passed the programme this year in comparison
to earlier years.
• How many students of a specific department have registered for courses other
than their own department?
Please refer to Figure 1. The answers to these questions cannot be computed by
simple statements; they require extensive file processing. Thus, each of these
queries will require a substantial amount of time in the file-based system.
Basic Concepts
• Data isolation: The file system stores data in separate files, which may
belong to different applications. These files are not accessible to other
applications and are difficult to share, especially when an application needs to
use more than one file. Also, as the number of files can be very large in such
systems, it would be difficult to search for the relevant data in these
files.
• Data Duplication: As stated earlier, a file system has different files for
different applications, which may have overlapping data requirements. This will
result in duplication of data, which can result in inconsistent data when
duplicate data is updated. In addition, data duplication can also result in waste
of storage. An example of data duplication is shown in Figure 1, where the
address of several students may be stored in two different files, viz. student
folder and bus list file.
• Inconsistent Data: The data in a file system can become inconsistent when the
same data is held in more than one file and an update reaches only some of the
copies. For example, if a student changes residence and the change is recorded
only in his/her file and not in the bus list, the two files will disagree.
Concurrent modification of the data by more than one person and the entry of
wrong data are other reasons for inconsistencies.
• Data dependence: In the file systems, you need to clearly define the storage
organisation of data files and the structure of the records in the application code.
This means that it is extremely difficult to make changes to the existing
structure, as any change in structure would require change in all the programs
using that structure of the data. The programmer would have to identify all the
affected programs, modify them and retest them. This characteristic of the File
Based system is called program data dependence.
• Fixed Queries: File-based systems are very much dependent on application
programs. Any query or report needed by the organisation would require the
application programmer to write a new program. As the type and number of
queries or reports are expected to increase with time, producing different types
of queries or reports would be difficult to implement in file-based systems.
Since the applications of file-based systems are designed to answer specific queries,
new queries cannot be answered without developing a new application.
This entire process is time consuming and complex.
File-based systems require a large number of applications and are therefore difficult to
maintain. Further, there is no common provision for security; each application would
require its own. Data recovery from failures is also inadequate or non-existent.
Data Independence
In the file-based system, the descriptions of data and the logic for accessing the data are
built into each application program, making the program dependent on the data. A
change in the structure of data may require alterations to programs. Database
management systems separate the data descriptions from the application programs. Hence,
the programs are not affected by changes in the data descriptions. This is called Data
Independence, where details of data are not exposed.
DBMS provides an abstract view and hides details. For example, in Figure 2, you can
observe that the interface or window to data provided by DBMS to a user may still be
the same although the internal structure of the data is changed.
Improved Integrity
Data Integrity refers to the validity and consistency of data, that is, the
data should be accurate and consistent. This is implemented by enforcing checks or
constraints on the data while it is being entered or manipulated in a database. Data of a
database is not allowed to violate these constraints or rules. Constraints may apply to
data items within a record or to relationships between records. For example, a constraint
on the age of an employee may state that it can be between 18 and 70 years only. While
entering or modifying the age of an employee, the DBMS should enforce this
constraint. However, please note that an error is still possible if you enter the age of a
person as 55 instead of 25, since both values satisfy the constraint. Such errors would
require different ways of checking.
Database systems support many other types of constraints, which will be discussed in
later units.
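The age constraint above can be tried out with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in DBMS; the table name employee, its columns and the sample names are hypothetical and are not part of this unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " age INTEGER NOT NULL CHECK (age BETWEEN 18 AND 70))"  # the integrity constraint
)

def try_insert(emp_id, name, age):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO employee VALUES (?, ?, ?)", (emp_id, name, age))
        return True
    except sqlite3.IntegrityError:
        return False

accepted_valid = try_insert(1, "Asha", 25)             # within 18..70
rejected_out_of_range = not try_insert(2, "Ravi", 17)  # violates the CHECK
accepted_but_wrong = try_insert(3, "Meera", 55)        # meant 25, still passes the range check
```

The third insert is accepted although the value is wrong, which is the kind of error that a range constraint alone cannot detect.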
Representing relationship among data
Data of a database system is the integrated data of an organisation, which includes a large
number of related data objects. DBMS should maintain and preserve these
relationships so that related data can be accessed easily.
Improved Security
The data of an organisation is vital and confidential. DBMS allows users to access only
the information that they are authorised to access. For example, the critical
information of an employee of an organisation should not be accessible to any other
employee of that organisation. Hence, data of the database should be protected from
unauthorised users. This is implemented by the Database Administrator (DBA), who
provides the users with controlled privileges on the data and the type of operations. To
enforce security, DBMS has a security and authorisation subsystem. Only authorised
users may use the specific data of a database. Further, the operations performed by
these users on data can be restricted to data retrieval, insertion, update, deletion etc. or
any combination of these operations. For example, the Branch Manager of a
company may have access to all data and be allowed to modify the incentive data of
employees, whereas the Sales Assistant may not have access to salary details.
Improved Backup and Recovery
A file-based system may fail to provide measures to protect data from system failures,
as taking backups periodically is the responsibility of the user. However, most
DBMSs are designed in such a manner that they can recover from different types of
hardware failures. In addition, DBMS also supports data backup. In general, a DBMS
has a subsystem to support such provisions.
Support for concurrent Transactions
A transaction, in the context of a database system, is an atomic operation that is either
completed fully or not at all. A database should allow multiple transactions to be
carried out at the same time. For example, in a bank several money withdrawal and
money transfer operations may be carried out at the same time. Most of the
commercial DBMSs ensure that all these transactions do not interfere with each other.
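The "completed fully or not at all" property of a transaction can be sketched as follows. This is a minimal illustration using Python's sqlite3 module; the account table, the account numbers and the balances are hypothetical. It shows atomicity of a single money transfer only and does not demonstrate how a DBMS isolates simultaneous transactions from each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " acc_no TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 500), ("B", 100)])
conn.commit()

def transfer(src, dst, amount):
    """Move amount from src to dst as one transaction; return True on success."""
    try:
        with conn:  # commits if both updates succeed, rolls back both otherwise
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE acc_no = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE acc_no = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    """Return the current balances as a dictionary keyed by account number."""
    return dict(conn.execute("SELECT acc_no, balance FROM account"))
```

If transfer("A", "B", 1000) is attempted, the credit to B is executed first, the debit from A then violates the balance check, and the whole transaction is rolled back, so neither balance changes.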
2) What are the advantages of a DBMS?
3) Compare and contrast the traditional File based system with Database approach.
The Conceptual Level
The purpose of the conceptual level is to define the structure, relationships and
constraints of a database system. Therefore, the conceptual level includes the structure
of the data objects, relationships among those objects and the constraints on the data
objects and access control of objects. A standard language called Data Definition
Language (DDL) is often used in DBMSs to define the conceptual level or schema of
a database system. The conceptual schema can be defined by the database
administrator or the system administrator, as shown in Figure 3.
1.3.2 Mappings between the three Levels and their relationship with Data
Independence
The three-level architecture defines a single database system across all the three
levels. Therefore, the different levels must map to each other. This mapping leads to the
concept of data independence, the lack of which was one of the major weaknesses of file
systems. This concept is explained next.
The first mapping is between the conceptual level and the external level. The external
level is derived from the conceptual level. It is a part of the conceptual level; however,
please note that these parts must be related, else the database will lose database
integrity. The advantage of this mapping is that an external user only needs to see the
external level; any change in the conceptual level will be hidden from the user. For
example, at the conceptual level, you may keep information about the name of a
person using data items like title, firstname and lastname. At the external level,
you may just map it to a single data item, name. Thus, there exists a mapping, which
will map:
name = title || firstname || lastname ( || is a concatenation operation)
Suppose, at a later point you decide to add additional data item middlename in the
conceptual level, then you just need to change your mapping and not the programs
which are based on the external level. The mapping would be changed to:
name = title || firstname || middlename || lastname
Thus, the conceptual to external mapping may isolate the external users from the
changes in the conceptual level. These changes can be hidden in the conceptual to
external mapping. This is also referred to as logical data independence.
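The name mapping can be sketched with a view, which plays the role of the conceptual to external mapping. This is an illustrative sketch using Python's sqlite3 module; the table person, the view person_ext, the sample names and the blank separators between name parts are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (title TEXT, firstname TEXT, lastname TEXT)")
conn.execute("INSERT INTO person VALUES ('Dr', 'Ravi', 'Mehta')")
# External level: a single data item "name" derived from the conceptual items.
conn.execute(
    "CREATE VIEW person_ext AS "
    "SELECT title || ' ' || firstname || ' ' || lastname AS name FROM person"
)

def external_names():
    """The application program: it knows only the external data item name."""
    return [row[0] for row in conn.execute("SELECT name FROM person_ext")]

before = external_names()

# Conceptual-level change: add middlename, then redefine only the mapping.
conn.execute("ALTER TABLE person ADD COLUMN middlename TEXT DEFAULT ''")
conn.execute("DROP VIEW person_ext")
conn.execute(
    "CREATE VIEW person_ext AS "
    "SELECT title || ' ' || firstname || ' ' || "
    "CASE WHEN middlename = '' THEN '' ELSE middlename || ' ' END || "
    "lastname AS name FROM person"
)
conn.execute(
    "INSERT INTO person (title, firstname, middlename, lastname) "
    "VALUES ('Ms', 'Asha', 'Rani', 'Verma')"
)

after = external_names()  # external_names() itself was not modified
```

Only the view definition changed; the function that stands for the application program still issues the same query against name.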
The second mapping is between the physical level and the conceptual level. This
mapping insulates the conceptual level from changes in file organisation, indexes etc.;
such changes can be made without changing the conceptual level. In general, such
changes may be performed to
enhance the performance of a database management system. For example, instead of
organising the student file on enrolment number, which is the primary key, you may
organise the student file on Programme + Student name. This may improve the access
time of data for many important queries to the database. The conceptual level, in this
case, still remains unchanged. This is also referred to as physical data independence.
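A rough sketch of physical data independence is given below, again assuming Python's sqlite3 module. Adding an index on programme and name stands in for the change of file organisation described above; the table, the enrolment numbers and the programme codes are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (enrolment_no INTEGER PRIMARY KEY, programme TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(101, "MCA", "Ajay"), (102, "BCA", "Kanu"), (103, "MCA", "Shikha")],
)

# A query written against the conceptual level only.
QUERY = "SELECT enrolment_no, name FROM student WHERE programme = ? ORDER BY name"

before = conn.execute(QUERY, ("MCA",)).fetchall()

# Physical-level change only: a new access path on (programme, name).
conn.execute("CREATE INDEX idx_programme_name ON student (programme, name)")

after = conn.execute(QUERY, ("MCA",)).fetchall()
```

The query text and its result are the same before and after the index is created; only the way the DBMS may locate the rows has changed.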
software failure etc. The database manager ensures that data of various
transactions is not corrupted even after a failure.
• Concurrency control: One of the major uses of database systems is in
transaction management. In such systems, multiple transactions access and
modify the database simultaneously. The database manager ensures that
data remains consistent even in the presence of concurrent transactions.
Transaction management is discussed in Block 3 in detail.
To perform the roles or functions as discussed above, the database manager has the
following software components. These components are also shown in Figure 5.
• Authorisation control is used to verify the credentials of a user to ascertain if
s/he is an authorised person to access or modify the data.
• Command Processor converts the DDL/DML or other programming language
commands to an executable sequence of code.
• Integrity checker checks if the data that is being inserted in a database or is
being modified follows all the specified constraints on the database.
• Query Optimizer tries to optimise the processing time of different queries.
• Transaction Manager is one of the important components, which checks if the
user has the right to perform a transaction.
• Scheduler component manages the consistency of data, while concurrent
transactions are being processed by a database. The scheduler ensures an
ordering of transaction execution, even though they may be conflicting with
each other.
• Buffer Manager optimises the process of transfer of data between the main
memory and the disk, where the database is stored. The buffer manager is also
termed as the cache manager.
1.4.7 Data files, Indices and Data Dictionary
The data is stored in the data files. Data files can be standard files or encrypted files.
These files, in general, are not accessible directly to any system user. The indices are
stored in the index files. Indices provide fast access to data items. For example, a
book database may be organised in the order of accession number of books yet may be
indexed on Author name and Book titles for allowing faster access for searches on
these attributes.
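The idea of an index on the book database can be sketched in plain Python. This is only a conceptual illustration with hypothetical book records; a real DBMS keeps its indices in index files using structures chosen by the system.

```python
from collections import defaultdict

# Data file: records kept in accession-number order (hypothetical sample data).
books = [
    (1001, "A. Author", "First Title"),
    (1002, "B. Writer", "Second Title"),
    (1003, "A. Author", "Third Title"),
]

def build_index(records, field_pos):
    """Map each value of one field to the positions of the matching records."""
    index = defaultdict(list)
    for pos, record in enumerate(records):
        index[record[field_pos]].append(pos)
    return index

# Index on the author name (field position 1).
author_index = build_index(books, 1)

def find_by_author(author):
    """Look up positions in the index, then fetch only those records."""
    return [books[pos] for pos in author_index.get(author, [])]
```

A search on author name consults the index and touches only the matching records, instead of scanning the data file in accession-number order.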
The back end consists of the servers, which are responsible for managing the physical
database and providing the necessary support and mappings for the internal,
conceptual, and external levels. Other functions of a DBMS, such as security, integrity
and access control, are also the responsibility of the back end.
The front end is any application that runs on the DBMS. These may be
applications provided by the DBMS vendor, the user, or a third party. The user
interacts with the front end. A front end may be a user-friendly interface provided by an
application developed for database system access and modifications. Such
applications may be developed by the vendors themselves, or by software developers
or third parties using an Application Programming Interface (API).
In this context, many different types of interfaces and utilities are offered in support of
commercial database management systems. Some of these are:
• Interfaces:
• Command line Interface: This interface existed since the start of DBMS.
You can use this interface to write DDL, DML and other permitted
commands. These interfaces allow a very rich set of commands and are useful
for expert users like DBA.
• Graphical User Interface: Such interfaces are developed to allow users to
interact through database applications using menu driven interfaces. Such
interfaces have become more useful over the last decade.
• Utilities for Backup/Restore: The purpose of these utilities is to ensure that the
current state of a database is duly backed up on secondary storage, so that it can
be recovered or restored in the case of any failure or even disaster. The backup
is done periodically.
• Utilities for Load/Unload of data: The life of a database system is quite long. The
size of a database system keeps increasing if the old data is not deleted. Also,
hardware wears down over a period. Therefore, a database may be required to
move from one machine to another machine. Sometimes a change in the DBMS
software or its version also necessitates movement of data. The Load and
Unload utilities are used for the purposes above.
• Utilities for Reporting or Analysis: Reports form an important component of
database systems, especially management information systems. The reports
and analysis of data can be used at various levels of decision making in an
organisation. These utilities analyse and report data in various visual forms for
decision making.
Model Type and Examples

Model Type: Conceptual Model (uses entities, their attributes and relationships among
entities)
Examples:
Entity-Relationship Model: This model identifies the physical or logical entities and
their attributes. In addition, this model also identifies the relationships among these
entities. It is mainly represented in graphical form using E-R diagrams. This is very
useful in database design. The entity relationship diagram is discussed in this Block.

Model Type: Record based Logical Models
Examples:
Network Model: The network model, as the name suggests, represents data about entities
using a set of records. The relationships among these data records are represented
using links. A simple network model example is explained in Figure 6. It shows a sample
diagram for such a system. This model is a very good model as far as the conceptual
framework is concerned but is nowadays not used in database management systems.
Relational Model: It models data of both entities and relationships as tables. The
relationship tables are related to entity tables using a set of constraints. This model
is based on a sound mathematical theory of relations. This is one of the widely used
data models and will be discussed in more detail in the subsequent units of this course.

Model Type: Object-based Models (use objects as key data representation components)
Examples:
Object-Oriented Model: An object in the object-oriented model contains the data of an
entity. In addition, the object also allows a set of defined operations on the data
stored in an object. Further, the relationships among the objects are implemented by
creating links among these objects and by implementing message passing. Object models
are useful for databases where data interrelationships are complex, for example,
Computer Assisted Design based components. These are explained in Block 4.

Model Type: Semi-structured Model (uses user defined tags)
Examples:
Semi-structured data Model: The relational model is typically called a structured
model, whereas the semi-structured model uses user defined tags, which only loosely
enforce integrity constraints. However, these models are very flexible in comparison to
relational models. These models are popular in web-based data management, as they are
very light database systems.

Model Type: Graph-based Models
Examples:
Graph data Model: A graph is an important data structure consisting of nodes and arcs.
A node can represent an entity while the links can be used to represent a
relationship.

Figure 6 shows the records of students and the result of the students in various courses
taken by them. All the records of the students are shown as a single file. The student
records can contain one or many pointers to the course file. Please notice in Figure 6
that the person named Ajay, whose enrolment number is 21026, has obtained 50 marks
in the course having course code 101. Further, he has obtained 19 marks in the course
having course code 204. Thus, each record on the course is linked to one record of the
student.
Figure 7 shows the data of Figure 6 in a hierarchical data model. Both these models
are of historical nature. The relational model, object-oriented model and other
models are explained in the later blocks of this course.
1.7 SUMMARY
This unit provides an introduction to the database system and the database
management system. It first establishes the need for a database management system
by comparing the database system approach to the file system approach. The file
system has several limitations due to data redundancy; the database
approach, however, integrates and shares the data among several applications. The unit
also discusses the basic advantages of the database approach, namely sharing of
data, data independence, data integrity and security enforcement, and
transaction management for concurrent users.
The unit also explains the three-level architecture of database systems, which
allows a database to be defined in terms of schemas at different levels for
different types of users. Such a design allows access of relevant data to
relevant users. In addition, the unit explains the physical architecture of DBMS
and describes various components of DBMS. Next, the unit discusses the
database system architecture for the commercial database and several data
models. It also briefly presents the recent trends in DBMSs. You are advised to
go through the further readings for more details on these topics.
1.8 SOLUTIONS/ANSWERS
3)
File Based System    Database Approach
Cheaper              Costly
Data dependent       Data independent
Data redundancy      Controlled data redundancy
Inconsistent Data    Consistent Data
Fixed Queries        Unforeseen queries can be answered
Relational and E-R Models
UNIT 2 RELATIONAL DATABASE
2.0 INTRODUCTION
In the first unit of this block, you have been provided with the details of the Database
Management System, its advantages, structure, etc. This unit provides
you with information about the relational model. The relational model is a widely used model
for DBMS implementation. Most of the commercial DBMS products available in the
industry are relational at the core.
The relational model is based on the theory of relations and was first proposed by E.F.
Codd. In this unit we will discuss the terminology, operators and operations used in
the relational model.
2.1 OBJECTIVES
After going through this unit, you should be able to:
PERSON_ID  NAME           AGE  ADDRESS
1          Sanjay Prasad  35   B-4, Modi Nagar, UP
2          Sharada Gupta  30   Pocket 2, Mayur Vihar, Delhi
3          Vibhu Datt     36   C-2, Saket, New Delhi
Figure 1: A Sample Person Relation
Following are some of the advantages of the relational model:
• Ease of use: The simple tabular representation of the database helps the user
define and query the database conveniently. For example, you can easily find
out the age of the person whose first name is “Vibhu”.
• Flexibility: Since the database is a collection of tables, new data can be added
and deleted easily. Also, manipulation of data from various tables can be done
easily using various basic operations. For example, you can add a telephone
number field in the table in Figure 1.
• Accuracy: In relational databases, relational algebraic operations are used to
manipulate data values in a database. These are mathematical operations and
ensure accuracy (and less ambiguity) as compared to other models. These
operations are discussed in more detail in Section 2.4.
2.2.1 Tuple, Attribute, Domain and Relation
Before we discuss the relational model in more detail, let us first define some very
basic terms used in this model.
Tuple: Each row in a table represents a record or information of a single object,
which is also termed as a tuple. For example, Figure 1 has three records or tuples. A
record/tuple consists of a number of attributes, which are defined next.
Attribute: A relation consists of a number of columns. Each of these columns,
which holds a separate data value, is termed as an attribute. The column name in a
relation is generally related to the meaning of the data items of that column. For example,
Figure 2 represents a relation PERSON. The columns PERSON_ID, NAME, AGE,
ADDRESS and TELEPHONE are the attributes of the relation PERSON and each
row in the relation represents a separate tuple (record).
Relation Name: PERSON
PERSON_ID  NAME           AGE  ADDRESS                       TELEPHONE
1          Sanjay Prasad  35   B-4, Modi Nagar, UP           011-25347527
2          Sharada Gupta  30   Pocket 2, Mayur Vihar, Delhi  023-12245678
3          Vibhu Datt     36   C-2, Saket, New Delhi         033-1601138
Figure 2: An extended PERSON relation
The relation of Figure 2 consists of 5 attributes, therefore, each tuple in this relation is
called a 5-tuple. Thus, if a relation has n attributes, then each record in that relation
would be termed as n-tuple.
Domain: A domain is a set of permissible values that can be accepted by a specific
attribute. For example, in Figure 2, if you assume that there may be a maximum of
100 persons in the relation, then you may assign PERSON_ID to a domain of integer
values, which should be in the range from 1 to 100 only. Once you have assigned this
domain, then you will not be able to assign any value below 1 and above 100 to
PERSON_ID. The domain of attribute AGE can be integer values between 0 and 150.
The domain can be defined by assigning a type or a format or a range to an attribute.
For example, a domain for a number 501 to 999 can be specified by having a 3-digit
number format having a range of values between 501 and 999. Domains need not be
contiguous numbers. For example, the enrolment number of IGNOU has the last digit
as the check digit, thus the enrolment numbers are non-continuous.
Relation: Each table is a relation. A relation is defined using two basic aspects, viz.
schema and an instance.
Since the PERSON relation contains four attributes, this relation is of degree 4.
A relation state r of the relation schema R (A1, A2, ..., An), also denoted by r(R) is a
set of n-tuples
Example 1:
RELATION SCHEMA for STUDENT:
STUDENT (RollNo: string, name: string, login: string, age: integer)
RELATION INSTANCE
STUDENT
      ROLLNO  NAME    LOGIN             AGE
t1    3467    Shikha  [email protected]  21
t2    4677    Kanu    [email protected]  20
Where t1 = (3467, Shikha, [email protected], 21). For this relation instance, the
number of tuples m = 2 and each tuple contains n = 4 values.
Example 2:
RELATION SCHEMA for PERSON:
PERSON (PERSON_ID: integer, NAME: string, AGE: integer, ADDRESS: string)
RELATION INSTANCE
In this instance, the number of tuples m = 3 and each tuple has n = 4 values.
PERSON
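The schema and instance of Example 2 can be sketched in Python as follows. The sketch assumes that the instance consists of the three tuples of the PERSON relation shown in Figure 1; the names PERSON_SCHEMA, person and conforms are introduced only for this illustration.

```python
# Schema: attribute names with their declared types (as in Example 2).
PERSON_SCHEMA = (("PERSON_ID", int), ("NAME", str), ("AGE", int), ("ADDRESS", str))

# Instance: a set of tuples, so duplicates and ordering carry no meaning.
person = {
    (1, "Sanjay Prasad", 35, "B-4, Modi Nagar, UP"),
    (2, "Sharada Gupta", 30, "Pocket 2, Mayur Vihar, Delhi"),
    (3, "Vibhu Datt", 36, "C-2, Saket, New Delhi"),
}

def conforms(schema, instance):
    """Check that every tuple has one value of the declared type per attribute."""
    return all(
        len(t) == len(schema)
        and all(isinstance(value, typ) for value, (_, typ) in zip(t, schema))
        for t in instance
    )

degree = len(PERSON_SCHEMA)   # n, the number of attributes
cardinality = len(person)     # m, the number of tuples
```

Here degree corresponds to n = 4 and cardinality to m = 3, matching the values stated for this instance.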
2.2.2 Super Keys, Candidate Keys and Primary Keys for the
Relations
As discussed in the previous section, the ordering of tuples in a relation does not matter and all
tuples in a relation are unique. However, can you uniquely identify a tuple in a
relation? To answer this question, let us discuss the concept of keys in relations.
Super Keys
A super key is an attribute or set of attributes used to identify the records uniquely in
a relation.
For Example, in the Relation PERSON described earlier PERSON_ID is a super key
since PERSON_ID is unique for each tuple/record. Similarly (PERSON_ID, AGE)
and (PERSON_ID, NAME) are also super keys of the relation PERSON since their
combination is also unique for each tuple/record.
Candidate keys
Super keys of a relation may contain extra attributes. A candidate key is a minimal
super key, which means that a candidate key does not contain any extraneous
attribute. An attribute is called extraneous if, even after removing it from the key, the
remaining attributes still have the properties of a super key. A relation may have several
candidate keys. A candidate key may contain one or many attributes of a relation.
The following properties must be satisfied by a candidate key:
• In any instance of a relation, the value of the candidate key attribute(s) should be
unique.
• You cannot put a ‘Null’ value in any attribute that is a part of the candidate key.
This rule is also termed the entity integrity rule. Thus, a candidate key is unique and
not null.
• A candidate key should have a minimal set of attributes.
• The value of a candidate key must be stable, which means it should not change
frequently or its value change should not be outside the control of the system.
A relation can have more than one candidate key and one of them can be chosen as a
primary key.
For example, in the relation PERSON, the two possible candidate keys are
PERSON_ID and NAME (assuming unique names exist in the table). PERSON_ID
may be chosen as the primary key.
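The definitions of super key and candidate key can be explored with a small sketch over the PERSON instance of Figure 1. Please note the assumption: the functions below test uniqueness in one given instance only, whereas a key must hold for every valid instance of the relation, so an attribute that merely happens to be unique in the sample data is also reported.

```python
from itertools import combinations

ATTRS = ("PERSON_ID", "NAME", "AGE", "ADDRESS")
PERSON = [
    (1, "Sanjay Prasad", 35, "B-4, Modi Nagar, UP"),
    (2, "Sharada Gupta", 30, "Pocket 2, Mayur Vihar, Delhi"),
    (3, "Vibhu Datt", 36, "C-2, Saket, New Delhi"),
]

def is_super_key(attrs, rows=PERSON):
    """True if the projection on attrs has no repeated value in this instance."""
    positions = [ATTRS.index(a) for a in attrs]
    projected = [tuple(row[p] for p in positions) for row in rows]
    return len(set(projected)) == len(projected)

def candidate_keys(rows=PERSON):
    """Minimal super keys: no attribute can be removed without losing uniqueness."""
    found = []
    for size in range(1, len(ATTRS) + 1):          # smallest attribute sets first
        for attrs in combinations(ATTRS, size):
            contains_smaller_key = any(set(k) <= set(attrs) for k in found)
            if is_super_key(attrs, rows) and not contains_smaller_key:
                found.append(attrs)
    return found
```

For the three sample tuples, every single attribute (including AGE) is unique, so each is reported as a candidate key. This is why the choice of PERSON_ID as the primary key rests on the meaning of the data and not on one instance.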
Answer the following questions for the relational instance s of the following
Supplier relation S:
S
SNO  SNAME  CITY
S1   Smita  Delhi
S2   Jim    Pune
1. What are the different attributes of relation S and how many tuples
does S have?
2. What are the domains of the attributes of S?
3. On sorting this relation on the field CITY, will the relation change?
Will the order of tuples in the relation change?
4. List the super keys, all the possible candidate keys, and the primary key
of the relation.
Thus, domain constraint specifies the possible set of values that you want to put in an
attribute of a relation. The values that appear in each attribute/column must be drawn
from the domain associated with that column.
For example, consider the relation:
STUDENT
ROLLNO  NAME    LOGIN             AGE
3467    Shikha  [email protected]  21
4677    Kanu    [email protected]  20
In the relation above, AGE of the relation STUDENT always belongs to the integer
domain within a specified range (say 0, representing just born, to 150) and not to
strings or any other domain. Within a domain, non-atomic values should be avoided.
This sometimes cannot be checked by domain constraints. For example, a database
which has area code and phone number as two different fields will store a phone
number as:
Area_code  Phone
11         29534466
A non-atomic value for the phone in this case can be 1129534466; however, the
Phone field will still accept this value, as it satisfies the domain of the field.
2.3.2 Key Constraint
This constraint states that the key attribute value in each tuple must be unique, i.e., no
two tuples can contain the same value for the key attribute. This is because the value
of the primary key is used to identify a unique tuple in a relation.
Example 3: If A is the key attribute in the following relation R, then A1, A2 and A3
must be unique.
A B
A1 B1
A3 B2
A2 B2
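The key constraint of Example 3 can be checked with a short sketch, assuming Python's sqlite3 module as the DBMS. The helper insert_r and the extra rows being inserted are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT PRIMARY KEY, B TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?)", [("A1", "B1"), ("A3", "B2"), ("A2", "B2")]
)
conn.commit()

def insert_r(a, b):
    """Return None on success, or the DBMS error message on rejection."""
    try:
        with conn:
            conn.execute("INSERT INTO R VALUES (?, ?)", (a, b))
        return None
    except sqlite3.IntegrityError as exc:
        return str(exc)

duplicate_key_error = insert_r("A1", "B3")  # A1 already identifies a tuple
repeated_b_error = insert_r("A4", "B2")     # B is not a key, so B2 may repeat
```

The first insertion is rejected because the value A1 is already present in the key attribute, while the second succeeds because the constraint applies only to A.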
A#    B   C
Null  B1  C1
A1    B2  C2
Null  B2  C3
A2    B3  C3
Null  B1  C5
Note: The ‘#’ in the heading row indicates that A is the Primary key of R.
In the instance r of relation R above, the primary key has Null values in the tuple
t1, tuple t3 and tuple t5. As per the entity integrity constraint, Null value in primary
key is not permitted. Thus, relation instance r is an invalid instance.
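The invalid instance above can be replayed against a DBMS to see which tuples are refused. This is a sketch using Python's sqlite3 module. NOT NULL is written explicitly because SQLite, as a documented quirk, accepts NULL in a PRIMARY KEY column that is not an integer key unless NOT NULL is declared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT PRIMARY KEY NOT NULL, B TEXT, C TEXT)")

# The five tuples of the instance r, with None standing for Null.
rows = [
    (None, "B1", "C1"),
    ("A1", "B2", "C2"),
    (None, "B2", "C3"),
    ("A2", "B3", "C3"),
    (None, "B1", "C5"),
]

accepted, rejected = [], []
for row in rows:
    try:
        with conn:
            conn.execute("INSERT INTO R VALUES (?, ?, ?)", row)
        accepted.append(row)
    except sqlite3.IntegrityError:  # raised for the Null primary key values
        rejected.append(row)
```

Only the tuples with A1 and A2 are stored; the tuples t1, t3 and t5 are rejected, in line with the entity integrity constraint.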
Referential integrity constraint
For defining the referential integrity constraint, first we explain the concept of a
foreign key and foreign key constraint.
Consider an attribute set A of a relation R, which references an attribute set B in a
relation S. Then A will be referred to as a foreign key in R, provided it fulfills the following
conditions:
1. B is the Primary key of S.
2. A and B are defined over the same domains. A is called the foreign key in
relation R, which references B in relation S.
3. For every value x of attribute set A, in any instance r of R, there exists a tuple
in any instance s of S, where the value of attribute set B is x. Please note that
there will be only one such tuple having the value x, as B is the Primary key
of S. This is called foreign key constraint.
4. Please note that the relation R is a referencing relation and S is called
referenced relation.
Instance r of R
A#  B   C^
A1  B1  C1
A2  B2  C2
A3  B3  C3
A4  B4  C3
A5  B1  C5

Instance s of S
E   C#
E1  C1
E2  C3
E3  C5
E2  C2
Notes:
• To add new records in a database system, you use the insert operation. For example,
in the instance r of R of example 6, you may like to add a new record <’A5’, 25,
’C6’>. Addition of a new record may cause constraint violations, which are
explained below:
• Default option: - Insertion can be rejected and the reason for rejection can also
be explained to the user by DBMS.
• Ask the user to correct the data, resubmit, and give the reason for rejecting the
insertion.
Example 7:
Consider the Relation PERSON of Example 2:
PERSON
PERSON_ID NAME AGE ADDRESS
1 Sanjay Prasad 35 B-4,Modi Nagar, UP
2 Sharad Gupta 30 Pocket 2, Mayur Vihar, Delhi
3 Vibhu Datt 36 C-2, Saket, New Delhi
Note: In the first three cases of constraint violation above, the DBMS will reject the insertion.
Example 8:
Let instance r of relation R be:
A# B C^
A1 B1 C1
A2 B3 C3
A3 B4 C3
A4 B1 C5
C# D
C1 D1
C3 D2
C5 D3
Note:
1) ‘#’ identifies the Primary key of a relation.
2) ‘^’ identifies the Foreign key of a relation.
1 Consider the instances of the relations Suppliers (S), Parts (P), proJect (J) and SPJ
given below.
S P
SNO SNAME CITY PNO PNAME COLOUR CITY
S1 Smita Delhi P1 Nut Red Delhi
S2 Jim Pune P2 Bolt Blue Pune
S3 Ballav Pune P3 Pen White Mumbai
J SPJ
JNO JNAME CITY SNO PNO JNO QUANTITY
J1 Sorter Pune S1 P1 J1 200
J2 Display Mumbai S2 P2 J2 700
Using the instance of relations as given above, answer the following questions:
1 List a domain constraint for S.
………………………………………………………………………
………………………………………………………………………
2 List the Primary keys of all the relations and primary key constraint for
the SPJ relation.
………………………………………………………………………
………………………………………………………………………
3 List all the Foreign keys and Foreign key constraints.
………………………………………………………………………
………………………………………………………………………
4 What would be the constraint violations if the following operations are
performed:
(a) Insert <Null, ‘Jack’, ‘Mumbai’> into S
(b) Insert <‘S2’, Null, ‘J3’, 200> into SPJ
(c) Insert <‘P2’, ‘Pencil’, ‘Grey’, ‘Kolkata’> into P
(d) Insert <‘J3’, ‘Monitor’, ‘Jaipur’> into J
………………………………………………………………………
………………………………………………………………………
………………………………………………………………………
• Relational Operations
These can be unary operations
1. SELECTION
2. PROJECTION
Or binary operations
1. CARTESIAN PRODUCT
2. JOIN
• Basic Set Operations
1. UNION
2. INTERSECTION
3. SET DIFFERENCE
• Assignment Operation, rename and other operations
In this section, we will discuss relational operations and basic set operations. You may
refer to the other operations from the further readings.
Selection Operation:
Selection is a unary operator that is used to choose only those tuples from a relation
that fulfill a given criterion. It is represented using the mathematical symbol σ, as given
below:
σ <Selection condition> (Relation)
Example 13:
Consider the relation PERSON. If you want to display details of persons having age
less than or equal to 30, then the selection operation will be used as follows:
σ AGE <= 30 (PERSON)
Note:
2) You can apply more than one condition by using logical connectors.
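A minimal Python sketch of the selection operation on the PERSON relation of Example 7 is given below. The relation is modelled as a list of dictionaries, and the function name select is an assumption made for this illustration. The second call shows two conditions combined with a logical connector.

```python
# PERSON relation of Example 7, one dictionary per tuple.
PERSON = [
    {"PERSON_ID": 1, "NAME": "Sanjay Prasad", "AGE": 35,
     "ADDRESS": "B-4,Modi Nagar, UP"},
    {"PERSON_ID": 2, "NAME": "Sharad Gupta", "AGE": 30,
     "ADDRESS": "Pocket 2, Mayur Vihar, Delhi"},
    {"PERSON_ID": 3, "NAME": "Vibhu Datt", "AGE": 36,
     "ADDRESS": "C-2, Saket, New Delhi"},
]

def select(relation, condition):
    """Selection: keep only the tuples for which condition(tuple) is true."""
    return [row for row in relation if condition(row)]

# Selection with the condition AGE <= 30.
aged_30_or_less = select(PERSON, lambda row: row["AGE"] <= 30)

# Two conditions joined by the logical connector AND.
older_in_delhi = select(
    PERSON, lambda row: row["AGE"] > 30 and "Delhi" in row["ADDRESS"]
)

print([row["NAME"] for row in aged_30_or_less])  # ['Sharad Gupta']
print([row["NAME"] for row in older_in_delhi])   # ['Vibhu Datt']
```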
Projection operation
The projection operation is used to select specific attributes from amongst all the
attributes of records. This is denoted as π.
Example 14:
Consider the relation PERSON. If you want to display only the names of persons, then
the projection operation will be used as follows:
π NAME (PERSON)
NAME
Sanjay Prasad
Sharad Gupta
Vibhu Datt
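The projection of Example 14 can be sketched in Python as follows. The function name project is hypothetical. Since a relation is a set of tuples, the sketch also drops duplicate rows that may appear once some attributes are removed; this duplicate elimination is standard relational algebra behaviour rather than something spelt out in the example.

```python
# PERSON relation of Example 7, one dictionary per tuple.
PERSON = [
    {"PERSON_ID": 1, "NAME": "Sanjay Prasad", "AGE": 35,
     "ADDRESS": "B-4,Modi Nagar, UP"},
    {"PERSON_ID": 2, "NAME": "Sharad Gupta", "AGE": 30,
     "ADDRESS": "Pocket 2, Mayur Vihar, Delhi"},
    {"PERSON_ID": 3, "NAME": "Vibhu Datt", "AGE": 36,
     "ADDRESS": "C-2, Saket, New Delhi"},
]

def project(relation, attributes):
    """Projection: keep only the listed attributes and drop duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        projected = tuple(row[a] for a in attributes)
        if projected not in seen:
            seen.add(projected)
            result.append(dict(zip(attributes, projected)))
    return result

names = project(PERSON, ["NAME"])
print([row["NAME"] for row in names])
# ['Sanjay Prasad', 'Sharad Gupta', 'Vibhu Datt']
```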
Example 12: Consider the following two relational instances of relations R1 and R2.
What would be the cartesian product of the relations?
R1
A B
A1 B1
A1 B2
A2 B2
A2 B3
R2
C
C1
C2
R1 × R2
A B C
A1 B1 C1
A1 B1 C2
A1 B2 C1
A1 B2 C2
A2 B2 C1
A2 B2 C2
A2 B3 C1
A2 B3 C2
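The cartesian product of Example 12 can be reproduced with the standard library function itertools.product. This is a minimal sketch: tuples are modelled as Python tuples, and the function name cartesian_product is an assumption for illustration.

```python
from itertools import product

# Instances of R1 (attributes A, B) and R2 (attribute C) from Example 12.
R1 = [("A1", "B1"), ("A1", "B2"), ("A2", "B2"), ("A2", "B3")]
R2 = [("C1",), ("C2",)]

def cartesian_product(r1, r2):
    """Pair every tuple of r1 with every tuple of r2."""
    return [t1 + t2 for t1, t2 in product(r1, r2)]

result = cartesian_product(R1, R2)
for row in result:
    print(*row)

# The result has len(R1) * len(R2) tuples, each of degree 2 + 1.
print(len(result))  # 8
```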
JOIN is a binary operator. It combines two relations based on a given join condition.
A JOIN operation is represented by mathematical symbol ⋈. But how does JOIN
operation combine two relations? It would require that there should be at least one
attribute in each of the relations, which have the same domain. Such attributes are
called domain compatible attributes.
Syntax:
R1 ⋈<join condition> R2 is used to combine related tuples from two relations R1 and R2
into a single tuple.
<join condition> is of the form:
<condition>AND<condition>AND…………..AND<condition>.
• Degree of Relation:
Degree (R1⋈<join condition>R2) <= Degree (R1) + Degree (R2).
a) Theta join
When each condition is of the form A θ B, where A is an attribute of R1 and B is
an attribute of R2; both A and B have the same domain; and θ is one of the
comparison operators {≤, ≥, ≠, =, <, >}, the join is called a theta join.
b) Equijoin
Equijoin is a restricted form of theta join in which each condition uses the equality
operator (=) only.
c) Natural join
Natural join is defined as a specialised type of join, when the joining attributes in
the two joining relations have the identical name, in addition to the identical
domain and the condition for the join operation is equality (=) condition. In such
cases only one joining attribute is kept in the result.
The following example explains the Natural join operation.
Example 15:
Consider the instance of the following relations. The primary key of STUDENT
relation is ROLLNO and foreign key is COURSE_ID, which references the primary
key COURSE_ID of COURSE relation.
STUDENT
ROLLNO NAME ADDRESS COURSE_ID
100 Kanupriya 234, Saraswati Vihar. CS1
101 Rajni Bala 120, Vasant Kunj CS2
102 Arpita Gupta 75, SRM Apartments. CS4
COURSE
COURSE_ID COURSE_NAME DURATION
CS1 MCA 3yrs
CS2 BCA 3yrs
CS3 M.Sc. 2yrs
CS4 B.Sc. 3yrs
CS5 MBA 2yrs
Display the name and other details of all the students along with their course details.
Solution: You may observe that the course details are existing in the COURSE
relation and student names and other details are in the STUDENT relation. Therefore,
you need to join these two relations to extract information. The join attributes in this
case are COURSE_ID in STUDENT and COURSE_ID in the COURSE relation and
equality operator is to be used; therefore, you use natural join operation, which can be
represented as:
STUDENT ⋈ COURSE
ROLLNO NAME ADDRESS COURSE_ID COURSE_NAME DURATION
100 Kanupriya 234, Saraswati Vihar. CS1 MCA 3yrs
101 Rajni Bala 120, Vasant Kunj CS2 BCA 3yrs
102 Arpita Gupta 75, SRM Apartments. CS4 B.Sc. 3yrs
There are other types of joins like outer joins. You must refer to further reading for
more details on those operations. They are also explained in later blocks of this
course.
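The natural join of Example 15 can be sketched in Python with a simple nested loop. The function name natural_join is hypothetical, and the nested loop is chosen for clarity only; a real DBMS would normally use more efficient join algorithms.

```python
# STUDENT and COURSE instances from Example 15.
STUDENT = [
    {"ROLLNO": 100, "NAME": "Kanupriya",
     "ADDRESS": "234, Saraswati Vihar.", "COURSE_ID": "CS1"},
    {"ROLLNO": 101, "NAME": "Rajni Bala",
     "ADDRESS": "120, Vasant Kunj", "COURSE_ID": "CS2"},
    {"ROLLNO": 102, "NAME": "Arpita Gupta",
     "ADDRESS": "75, SRM Apartments.", "COURSE_ID": "CS4"},
]
COURSE = [
    {"COURSE_ID": "CS1", "COURSE_NAME": "MCA", "DURATION": "3yrs"},
    {"COURSE_ID": "CS2", "COURSE_NAME": "BCA", "DURATION": "3yrs"},
    {"COURSE_ID": "CS3", "COURSE_NAME": "M.Sc.", "DURATION": "2yrs"},
    {"COURSE_ID": "CS4", "COURSE_NAME": "B.Sc.", "DURATION": "3yrs"},
    {"COURSE_ID": "CS5", "COURSE_NAME": "MBA", "DURATION": "2yrs"},
]

def natural_join(r1, r2):
    """Join on all attributes with identical names, keeping one copy of each."""
    if not r1 or not r2:
        return []
    common = [a for a in r1[0] if a in r2[0]]
    result = []
    for t1 in r1:
        for t2 in r2:
            if all(t1[a] == t2[a] for a in common):
                # Merging the dictionaries keeps a single COURSE_ID column.
                result.append({**t1, **t2})
    return result

student_course = natural_join(STUDENT, COURSE)
for row in student_course:
    print(row["ROLLNO"], row["NAME"], row["COURSE_ID"], row["COURSE_NAME"])
```

The result has six attributes, which is less than Degree (STUDENT) + Degree (COURSE) = 7, because the joining attribute COURSE_ID is kept only once.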
Example 9:
R1 R2
A B X Y
A1 B1 A1 B1
A2 B2 A7 B7
A3 B3 A2 B2
A4 B4 A4 B4
R3 = R1 ∪ R2 is
R3
A B
A1 B1
A2 B2
A3 B3
A4 B4
A7 B7
Intersection
If R1 and R2 are two union compatible relations, then the result of
R3 = R1 ∩ R2 is the relation that includes all tuples that are in both the
relations.
In other words, R3 will have tuples such that R3 = {t | t ∈ R1 ∧ t ∈ R2}.
Example 10:
R1 R2
A B X Y
A1 B1 A1 B1
A2 B2 A7 B7
A3 B3 A2 B2
A4 B4 A4 B4
R3 = R1 ∩ R2 is
A B
A1 B1
A2 B2
A4 B4
Note: 1) Intersection is a commutative operation, i.e.,
R1 ∩ R2 = R2 ∩ R1.
2) Intersection is an associative operation, i.e.,
R1 ∩ (R2 ∩ R3) = (R1 ∩ R2) ∩ R3
Set Difference
If R1 and R2 are two union compatible relations, then the result of R3 = R1 − R2 is the
relation that includes only those tuples that are in R1 but not in R2. In other words, R3
will have tuples such that R3 = {t | t ∈ R1 ∧ t ∉ R2}.
Example 11:
R1 R2
X Y A B
A1 B1 A1 B1
A7 B7 A2 B2
A2 B2 A3 B3
A4 B4 A4 B4
R1-R2 =
A B
A7 B7
R2-R1=
A B
A3 B3
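Since union compatible relations are sets of tuples, the three set operations map directly onto Python's built-in set type. The sketch below uses the instances of Example 11; results are sorted only to make the printed output stable.

```python
# R1 and R2 as given in Example 11 (union compatible: two attributes each).
R1 = {("A1", "B1"), ("A7", "B7"), ("A2", "B2"), ("A4", "B4")}
R2 = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")}

union = R1 | R2          # tuples in R1 or R2, duplicates removed
intersection = R1 & R2   # tuples in both R1 and R2
r1_minus_r2 = R1 - R2    # tuples in R1 but not in R2
r2_minus_r1 = R2 - R1    # tuples in R2 but not in R1

print(sorted(union))
print(sorted(intersection))
print(sorted(r1_minus_r2))  # [('A7', 'B7')]
print(sorted(r2_minus_r1))  # [('A3', 'B3')]

# Union and intersection are commutative; set difference is not.
print(R1 | R2 == R2 | R1, R1 & R2 == R2 & R1, R1 - R2 == R2 - R1)
```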
2 Primitive operations are union, difference, product, selection and projection. The
A ∩ B can be computed using ……….
4) Consider the relational instances of the relations Suppliers, Parts, proJect and
SPJ relations given below. (Underline represents a key attribute. The SPJ relation
has three Foreign keys: SNO, PNO and JNO.)
S P
SNO SNAME CITY PNO PNAME COLOUR CITY
S1 Smita Delhi P1 Nut Red Delhi
S2 Jim Pune P2 Bolt Blue Pune
S3 Ballav Pune P3 Part1 White Mumbai
S4 Seema Delhi P4 Part2 Blue Delhi
S5 Salim Agra P5 Camera Brown Pune
P6 Part3 Grey Delhi
J SPJ
JNO JNAME CITY SNO PNO JNO QUANTITY
J1 Sorter Pune S1 P1 J1 200
J2 Display Bombay S1 P1 J4 700
J3 OCR Agra S2 P3 J2 400
J4 Console Agra S2 P2 J7 200
J5 RAID Delhi S2 P3 J3 500
J6 EDP Udaipur S3 P3 J5 400
J7 Tape Delhi S3 P4 J3 500
S3 P5 J3 600
S3 P6 J4 800
S4 P6 J2 900
S4 P6 J1 100
S4 P5 J7 200
S5 P5 J5 300
S5 P4 J6 400
Using the instances of the relations above, state which of the following operations and
constraints would be valid:
5) Find the name of projects in the relations above, to which supplier S1 has
supplied using relational algebra.
…………………………………………………………………………………………….
…………………………………………………………………………………………….
…………………………………………………………………………………….........…
…………………………………………………………………………………………….
2.5 SUMMARY
This unit is an attempt to provide a detailed viewpoint of database design. The topics
covered in this unit include the relational model including the representation of
relations, operations such as set type operators and relational operators on relations.
The E-R model explained in this unit covers the basic aspects of E-R modeling. E-R
modeling is quite common to database design, therefore, you must attempt as many
problems as possible from the further reading. The E-R diagram has also been
extended. However, that is beyond the scope of this unit. You may refer to further
readings for more details on E-R diagrams.
1. Supplier Number (SNO), Supplier Name (SNAME) and City of the location of the
supplier (CITY). The present instance s of the relation S has 2 tuples.
2. Domain of SNO is the codes that are being assigned to suppliers. The present coding
uses the first character as S followed by the sequence number of a supplier. The
domain of SNAME is strings of characters of some defined length and the domain
of CITY is the set of possible cities in India.
3. The relation will not change; the order of tuples will change.
4. The super key of the relation could be (SNO, SNAME, CITY) or (SNO, SNAME)
or (SNO, CITY) or assuming every supplier has unique name (SNAME, CITY) or
SNO or assuming every supplier has unique name SNAME.
The possible candidate keys are SNO or assuming every supplier has unique name
SNAME
Primary key may be selected as SNO and alternate key would be SNAME
1. relational algebra
2. A − (A − B)
3. (d) Join
4.
i. Accepted
ii. Rejected (candidate key uniqueness violation)
iii. Rejected (violates RESTRICTED update rule, as SPJ contains tuples
having value S5 in SNO)
iv. Accepted (supplier S3 and all shipments for supplier S3 in the relation
SPJ would be deleted, as the rule is CASCADE)
v. Rejected (violates RESTRICTED delete rule, as SPJ contains tuples
having value P2 in PNO)
vi. Accepted (project J4 and all shipments for project J4 from the relation
SPJ are deleted)
vii. Accepted
viii. Rejected (primary/candidate key uniqueness violation as tuple S5-P5-J7
already exists in relation SPJ)
ix. Rejected (referential integrity violation as there exists no tuple for J8 in
relation J)
x. Accepted
xi. Rejected (referential integrity violation as there exists no tuple for P7 in
relation P)
xii. Rejected (referential integrity violation – the default project number jjj
does not exist in relation J).
5) The answer to this query will require the use of the relational algebraic
operations. This can be found by performing selection of supplies made by S1
in SPJ, then taking projection of resultant on JNO and joining the resultant to J
relation. Let us show steps:
First find out the supplies made by supplier S1 by selecting those tuples from
SPJ where SNO is S1. The relational operation is:
SPJT = σ <SNO = ‘S1’> (SPJ)
The resulting temporary relation SPJT will be:
SNO PNO JNO QUANTITY
S1 P1 J1 200
S1 P1 J4 700
Now, take the projection of SPJT on JNO:
SPJT2 = π JNO (SPJT)
The resulting temporary relation SPJT2 will be:
JNO
J1
J4
Next, take the natural JOIN of this table with J:
RESULT = SPJT2 ⋈ J
The resulting relation RESULT will be:
JNO JNAME CITY
J1 Sorter Pune
J4 Console Agra
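The selection, projection and join steps of this answer can be traced with a short Python sketch over the SPJ and J instances of question 4. The variable names spjt, spjt2 and result simply mirror the names used in the answer; representing the relations as lists of tuples is an assumption made for illustration.

```python
# SPJ (SNO, PNO, JNO, QUANTITY) and J (JNO, JNAME, CITY) from question 4.
SPJ = [
    ("S1", "P1", "J1", 200), ("S1", "P1", "J4", 700),
    ("S2", "P3", "J2", 400), ("S2", "P2", "J7", 200),
    ("S2", "P3", "J3", 500), ("S3", "P3", "J5", 400),
    ("S3", "P4", "J3", 500), ("S3", "P5", "J3", 600),
    ("S3", "P6", "J4", 800), ("S4", "P6", "J2", 900),
    ("S4", "P6", "J1", 100), ("S4", "P5", "J7", 200),
    ("S5", "P5", "J5", 300), ("S5", "P4", "J6", 400),
]
J = [
    ("J1", "Sorter", "Pune"), ("J2", "Display", "Bombay"),
    ("J3", "OCR", "Agra"), ("J4", "Console", "Agra"),
    ("J5", "RAID", "Delhi"), ("J6", "EDP", "Udaipur"),
    ("J7", "Tape", "Delhi"),
]

# Step 1: selection of the tuples of SPJ where SNO is S1.
spjt = [t for t in SPJ if t[0] == "S1"]

# Step 2: projection on JNO (a set removes duplicate project numbers).
spjt2 = {t[2] for t in spjt}

# Step 3: natural join with J on JNO.
result = [j for j in J if j[0] in spjt2]

project_names = [j[1] for j in result]
print(project_names)  # ['Sorter', 'Console']
```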
UNIT 3 ENTITY RELATIONSHIP MODEL
3.0 Introduction
3.1 Objective
3.2 Entity Relationship (E-R) Model
3.2.1 Entities
3.2.2 Attributes
3.2.3 Relationships
3.2.4 E-R diagram Basics
3.2.5 More about Relationships
3.2.6 Extended E-R Features
3.2.7 Defining Relationship for College Database
3.3 An Example
3.4 Conversion of E-R Diagram to Relational Database
3.5 Enhanced E-R Model
3.6 Converting E-R and EER Diagram to Relations
3.7 Summary
3.8 Solution/Answers
3.0 INTRODUCTION
In the previous unit of this block, you have gone through the concept of relational
database management systems and one of the important languages for relational
databases, namely relational algebra. This unit explains an analysis model of
the database system, known as the Entity-Relationship (E-R) model. The E-R model is a
widely used model for analysis of data requirements of an organisation. The E-R
model is primarily a semantic model and is very useful in creating a raw database
design that can be further normalised. With the availability of object-oriented
technologies, the E-R model has been extended to include object-oriented features.
This unit also discusses these E-R extensions. We will also discuss the conversion of
E-R diagrams to tables in this unit.
3.1 OBJECTIVES
3.2.1 Entities
First let us answer the question: What are entities?
• An entity is an object of concern, which is used to represent the things in the real
world, e.g., car, table, book, etc.
• An entity need not be a physical entity, it can also represent a concept in the real
world, e.g., project, loan, etc.
• It represents a class of things, not any one instance, e.g., ‘STUDENT’ entity has
instances of ‘Ramesh’ and ‘Mohan’.
Entity Set or Entity Type: A collection of a similar kind of entity is called an
Entity Set or entity type.
For the COLLEGE database, the objects of concern are Student, Faculty, Course
and Department. The collections of all the students’ entities form an entity set
STUDENT. Similarly, collection of all the courses form an entity set COURSE.
You may please note that entity sets need not be disjoint. For example – an entity,
say Mohan, may be part of the entity set STUDENT, the entity set FACULTY and
the entity set PERSON.
Entity identifier - key attributes: An entity set usually has one or more attributes,
which attains unique value for every distinct entity in a given entity set. Such an
attribute or set of attributes is/are called key attribute(s) and its values can be used to
identify each entity uniquely in the given entity set.
Strong entity set: An entity set which contains at least one key attribute is a Strong
entity set. For example, a Student entity set would contain at least one key attribute
Enrolment number, which is unique for every student, thus, the entity set Student is
a strong entity set.
Weak entity set: Entity sets that do not contain any key attribute, and hence cannot
be identified independently, are called weak entity sets. A weak entity cannot be
identified uniquely by its own attributes; therefore, it is recognised in conjunction with the
primary key attributes of another strong entity set on which its existence depends
(called the owner entity set).
Generally, the primary key of the owner entity set is attached to the weak entity set,
which has its own identifying attributes, called discriminator attributes. These two together
form the primary key of the weak entity set. The following restrictions must hold for
the above:
• The owner entity set and the weak entity set must participate in one to many
relationship set. This relationship set is called the identifying relationship
set of the weak entity set.
• The weak entity set must have total participation in the identifying
relationship.
One of the most common examples of a weak entity set is the entity set
Dependent and the related strong entity set Employee in an organisation. The
Dependent entity set is used to list all the dependents of each employee of an
organisation. The attributes of the Dependent entity set are: Dependent name, birth
date, gender and relationship with the employee. Each Employee entity is said to
own the Dependent entities that are related to it. However, please note that a
‘Dependent’ entity does not exist on its own; it is dependent on the Employee entity.
In other words, you can say that in case an employee leaves the organisation, all
dependents related to him/her also leave along with this employee. Thus, a
‘Dependent’ entity has no significance without the entity ‘Employee’, and therefore it is a
weak entity.
3.2.2 Attributes
Let us first define - What is an attribute?
For example, a Student entity set may consist of attributes - Roll no, student’s name,
age, address, course, etc. An entity will have a value for each of its attributes. For
example, for a particular student, the following values can be assigned:
Domains: Each simple attribute of an entity type contains a possible set of values
that can be attached to it. This is called the domain of an attribute. An attribute
cannot contain a value outside this domain.
EXAMPLE- for STUDENT entity Age has a specific domain, integer values say
from 15 to 90.
Types of attributes
Attributes attached to an entity can be of various types. They are explained below:
Simple: An attribute that cannot be further divided into smaller parts and represents
the basic meaning is called a simple attribute. For example: Each of the attributes -
‘FirstName’, ‘LastName’, age of PERSON entity set are simple attributes.
Composite: Attributes that can be further divided into smaller units and each smaller
unit contains specific meaning. For example, the attribute NAME of a FACULTY
entity can be subdivided into First name, Last name and Middle name.
Single valued: Attributes having a single value for a particular entity. For Example,
Age is a single valued attribute of a STUDENT entity.
Multivalued: An attribute that has more than one value for a particular entity is called
a multivalued attribute. Different entities may have different numbers of values for
these kinds of attributes. For multivalued attributes, you must also specify the
minimum and maximum number of values that can be attached. For example, the
phone number for a PERSON entity is a multivalued attribute.
Stored and derived: Attributes that are directly stored in the database are called
stored attributes. For example, ‘Birth Date’ attribute of a PERSON entity can be a
stored attribute. However, there are certain attributes, whose value can be computed
from the value of the stored attribute. For example, in the PERSON entity, the
attribute ‘Birth Date’ can be used to compute the attribute Age of a person on a
specific day. Thus, ‘Birth Date’ is a stored attribute, whereas Age may be a derived
attribute for this entity.
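The difference between a stored and a derived attribute can be sketched in Python. The class Person and the method age_on are hypothetical names used only for this illustration: the birth date is stored, while the age is computed from it for a given day.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Person:
    first_name: str   # simple, stored attribute
    last_name: str    # simple, stored attribute
    birth_date: date  # stored attribute

    def age_on(self, day: date) -> int:
        """Derived attribute: completed years between birth_date and day."""
        had_birthday = (day.month, day.day) >= (
            self.birth_date.month, self.birth_date.day
        )
        return day.year - self.birth_date.year - (0 if had_birthday else 1)

# Hypothetical sample entity; the values are made up for this sketch.
person = Person("Mohan", "Kumar", date(2000, 6, 15))
print(person.age_on(date(2024, 6, 14)))  # 23, birthday not yet reached
print(person.age_on(date(2024, 6, 15)))  # 24
```

Because Age changes with the day on which it is asked for, storing only ‘Birth Date’ avoids having to update every PERSON entity over time.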
3.2.3 Relationships
First, let us define the term relationships, i.e. What Are Relationships?
[Figure: one-to-one relationship: College (1) headed By (1) Principal]
Similarly, you can define the relationship between University and Vice-Chancellor.
[Figure: one-to-many relationship: Department (1) appoints (N) Faculty]
For example, in the diagram above, several faculty members may be appointed in
one department, however, a specific faculty member will be appointed in precisely
one department.
[Figure: many-to-one relationship: Course (M) Taught By (1) Instructor]
Many-to-many: Entities in entity set A and entity set B are associated with any
number of entities from each other.
For example, consider that a course can be taught jointly by many faculty members
and each faculty member can teach several courses, then many-to-many relationship
holds, as shown below:
[Figure: many-to-many relationship: Course (M) TaughtBy (N) Faculty]
Another example is shown in the diagram given below. The relationship cardinality is
M : N. This implies that an Author entity can be correlated to many Book entities,
which are written by him/her. Further, a Book entity can also be correlated with
several Author entities who have written the Book.
[Figure: many-to-many relationship: Book (M) WrittenBy (N) Author]
[Figure: generalisation/specialisation hierarchy: Account (attribute Balance) with is-a sub-classes Savings (attribute Interest) and Current (attribute Charges)]
Aggregation: One limitation of E-R diagrams is that they do not allow the
representation of relationships among relationships. In such a case, the relationship
along with its entities is promoted (aggregated) to form an aggregate entity, which
can be used for expressing the required relationships. A detailed discussion on
aggregation is beyond the scope of this unit; you can refer to the further readings for
more detail.
3.3 AN EXAMPLE
Let us explain it with the help of an example application. We will describe here an
example database application of a COLLEGE database and use it for illustrating
various E-R modeling concepts.
Problem Statement
For example, for the entities Student, Faculty, Course and Department, which are strong
entities, relations as shown in Figure 3.4 would be created.
II) For each weak entity type W in the E-R Diagram, you create another relation R
that contains all simple attributes of W. Further, you add the key attribute(s) of the
owner entity set (say KP) of W in R. The primary key to this relation R is – <KP +
Discriminator attribute of W> and foreign key is KP, which references the owner
entity of W.
For example, conversion of weak entity Guardian into relation is shown in Figure
3.5. Please note that the owner entity of the Guardian entity is the strong entity
Student, whose key is RollNo. Therefore, the key to Guardian relation is
RollNo+Name. The Foreign key in Guardian relation is RollNo, which refers to
Student relation.
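The mapping of the weak entity Guardian can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: only the attributes RollNo and Name mentioned above are modelled, the sample names are made up, the helper try_insert_guardian is hypothetical, and the ON DELETE CASCADE rule is an assumption chosen to reflect the existence dependency of a weak entity on its owner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Owner (strong) entity: key attribute RollNo.
conn.execute(
    "CREATE TABLE Student (RollNo INTEGER PRIMARY KEY, Name TEXT NOT NULL)"
)

# Weak entity: primary key = owner key (RollNo) + discriminator (Name).
conn.execute("""
    CREATE TABLE Guardian (
        RollNo INTEGER NOT NULL,
        Name   TEXT NOT NULL,
        PRIMARY KEY (RollNo, Name),
        FOREIGN KEY (RollNo) REFERENCES Student (RollNo) ON DELETE CASCADE
    )
""")

conn.execute("INSERT INTO Student VALUES (1, 'Ramesh')")
conn.execute("INSERT INTO Guardian VALUES (1, 'Suresh')")

def try_insert_guardian(roll_no, name):
    """Return True if the guardian tuple is accepted, False if rejected."""
    try:
        conn.execute("INSERT INTO Guardian VALUES (?, ?)", (roll_no, name))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, try_insert_guardian(2, 'Meena') is rejected because no student with RollNo 2 exists, and a second (1, 'Suresh') is rejected because the combination RollNo+Name must be unique.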
N-ary Relationships
There are several cases for creating relations for n-ary relationships. A very general
case is presented here. For each n-ary relationship set R where n > 2, you create a
new table S to represent R. You should include the primary keys of all the
participating entity sets as the foreign key attributes in S. You should also include any
simple attributes of the n-ary relationship set (or simple components of composite
attributes) as attributes of S. The primary key of S is usually a combination of all the
foreign keys that reference the relations representing the participating entity sets.
Figure 3.8 is a special case of n-ary relationship, i.e. a binary relationship.
Multivalued attributes:
For each multivalued attribute ‘A’ of an entity set E, you can create a new relation R
that includes an attribute corresponding to A, together with the primary key
attribute(s) of the relation that represents E. The
primary key of R is then a combination of A and the primary key of the relation of E.
For example, if a Student entity had RollNo, Name and PhoneNumber attributes,
where phone number is a multivalued attribute, then you will create two tables for
this entity as given below:
Student (RollNo, Name)
Phone (RollNo, PhoneNumber)
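The split into the two tables Student and Phone can be sketched in Python as follows. The sample students and phone numbers are made up, and the function name split_multivalued is hypothetical.

```python
# Student entities with a multivalued PhoneNumber attribute (sample data).
students = [
    {"RollNo": 1, "Name": "Ramesh", "PhoneNumber": ["111", "222"]},
    {"RollNo": 2, "Name": "Mohan", "PhoneNumber": ["333"]},
]

def split_multivalued(entities, key, multivalued):
    """Return (base relation, relation of (key, value) pairs)."""
    base = [
        {name: value for name, value in entity.items() if name != multivalued}
        for entity in entities
    ]
    # The primary key of the new relation is the pair (key, value),
    # so a set is used to rule out duplicate pairs.
    pairs = sorted(
        {(entity[key], value) for entity in entities
         for value in entity[multivalued]}
    )
    return base, pairs

student_relation, phone_relation = split_multivalued(
    students, "RollNo", "PhoneNumber"
)
print(student_relation)  # Student (RollNo, Name)
print(phone_relation)    # Phone (RollNo, PhoneNumber)
```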
Converting Generalisation / Specialisation hierarchy to tables:
A simple rule for conversion may be to decompose all the specialised entities into
relations in case they are disjoint. For example, for the E-R diagram of Figure 3.1,
you can create the two tables as:
Saving_account (account-no, holder-name, branch, balance, interest).
Current_account (account-no, holder-name, branch, balance, charges).
The other way might be to create tables that are overlapping (not disjoint). For
example, assuming that the E-R diagram of Figure 3.2 contains overlapping
sub-classes, you would create the following three relations:
The first relation would be for the higher level entity:
account (account-no, holder-name, branch, balance)
The specialisation entities will contain the Primary key of the generalised
entity and all the attributes of the entity itself, as shown below:
saving (account-no, interest)
current (account-no, charges)
Thus, the information about a single account can be found in all the three
relations.
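How the information about one account is spread over the three relations can be sketched in Python. The account numbers, names and amounts below are made-up sample data, and the function name account_details is hypothetical.

```python
# account (account-no, holder-name, branch, balance), keyed by account-no.
account = {
    "AC1": {"holder_name": "Ramesh", "branch": "Delhi", "balance": 5000.0},
    "AC2": {"holder_name": "Mohan", "branch": "Pune", "balance": 12000.0},
}
# saving (account-no, interest) and current (account-no, charges).
saving = {"AC1": {"interest": 4.0}}
current = {"AC2": {"charges": 250.0}}

def account_details(account_no):
    """Collect the attributes of one account from all three relations."""
    details = dict(account[account_no])
    # With overlapping sub-classes an account may appear in both
    # specialised relations, so both are consulted.
    details.update(saving.get(account_no, {}))
    details.update(current.get(account_no, {}))
    return details

print(account_details("AC1"))
print(account_details("AC2"))
```

The lookup on account-no in each specialised relation plays the role of a join on the primary key of the generalised entity.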
Check Your Progress 1
………………………………………………………………………….………
…………………………………………………………………….……………
……………………………………………………………….…………………
…………………………………………………………….
………………………………………………………………………….………
…………………………………………………………………….……………
……………………………………………………………….…………………
…………………………………………………………….
3) A supplier, located in only one-city, supplies various parts for the projects of
different companies located in various cities. You can name this database as
“supplier-and-parts”. Draw the E-R diagram for the supplier-and-parts.
……………………………………………….………………………………………
…………………………………….…………………………………………………
………………………….…………………………….………………………………
…………………………………………….
4) Convert the E-R diagram created for question 2 above into a relational database.
…………………………………………………………………………
…………………………………………………………………………
3.5 ENHANCED E-R MODEL
Enhanced E-R models can help in the design of relational and object-relational
database systems. In addition to E-R modeling concepts, the Enhanced E-R model
includes:
Figure 3.10: EER diagram showing more than one specialization from one super class
In Figure 3.10, the letter ‘d’ in the circle indicates that all these subclasses are disjoint
in nature, i.e. a vehicle entity can be part of only one of
the subclasses. Please also notice that in Figure 3.10, common attributes, like vehicle
number, owner name etc., are attributes of the super class, whereas attributes like
mileage of car, stock of scooter and capacity of truck are the attributes of the
sub-classes. Please note that an entity will appear twice in the EER diagram,
once in the sub-class and once in the super class (Please refer to Figure 3.11).
Figure 3.11: Sharing of members of the super class vehicle and its sub-classes
When every entity in the super class must be a member of some subclass in the
specialisation, it is called total specialisation. But if an entity does not necessarily
need to belong to any of the subclasses, it is called partial specialisation. Total
specialisation is represented by a double line. Please note that in specialisation and
generalisation, the deletion of an entity from a super class implies its automatic deletion
from the sub-classes to which it belongs; similarly, insertion of an entity in a super class
results in insertion of the entity in all the sub-classes for which the attributes of this entity
fulfill the constraints of attribute-defined specialisation. In case of total
specialisation, insertion of an entity in a super class implies compulsory insertion in
at least one of the sub-classes of the specialisation.
Union: In some cases, a single class has a similar relationship with more than one
class. For example, the sub class ‘Car’ may be owned by two different types of
owners: Individual or Organisation. Both these types of owners are different classes;
thus, such a situation can be modeled with the help of a Union (Refer to Figure
3.13).
Figure 3.13: Union of classes
In the next section, we discuss how these extended features can be converted to
relations.
So let us now discuss the process of converting the EER diagram into relations. In
case of disjoint constraints with total participation, it is advisable to create separate
relations for the sub-classes. But the only problem in such a case will be to
implement the referential integrity constraints suitably.
For example, this EER diagram can be converted into relations as:
Car (Number, owner, mileage)
Scooter (Number, owner, stock)
Truck (Number, owner, capacity)
Please note that the referential integrity constraint in this case would require a
relationship with three relations and is therefore more complex to implement.
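Why a reference to a vehicle becomes more complex with three separate relations can be seen in the following Python sketch. The vehicle numbers, owners and values are made-up sample data, and the function name find_vehicle is hypothetical: any reference to a vehicle number has to be checked against Car, Scooter and Truck together.

```python
# Car (Number, owner, mileage), Scooter (Number, owner, stock) and
# Truck (Number, owner, capacity), each keyed by Number (sample data).
car = {"V1": ("Asha", 18.5)}
scooter = {"V2": ("Ravi", 40)}
truck = {"V3": ("Kiran", 9000)}

def find_vehicle(number):
    """Return the name of the relation holding the vehicle, or None.

    A disjoint specialisation means a number may occur in one relation only.
    """
    relations = (("Car", car), ("Scooter", scooter), ("Truck", truck))
    matches = [name for name, relation in relations if number in relation]
    if len(matches) > 1:
        raise ValueError(f"disjointness violated for {number}: {matches}")
    return matches[0] if matches else None

print(find_vehicle("V2"))  # Scooter
print(find_vehicle("V9"))  # None: a reference to V9 would be dangling
```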
If, in the EER diagram of Figure 3.12, there is NO total participation of the Vehicle
super class in the sub-classes, then there will be some vehicles that are not a Car,
Scooter or Truck, so how can you represent these? In addition, in case of
overlapping constraints, some tuples may get represented in more than one relation.
Thus, in such cases, it is ideal to create one relation for the super class and other
relations for the sub-classes, each having the primary key and any other attributes of that
sub-class. For example, with NO total participation the following relations would
be created for the EER diagram of Figure 3.12:
Finally, in the case of union, since it represents dissimilar classes, you may create
separate relations. For example, both Individual and Organisation will be modeled as
separate relations.
‘Type’ can be regular or visiting faculty. Visiting faculty members can teach
only one programme. Make a suitable EER diagram for this and convert the
EER diagram to table.
…………………………………………………………………………………
…………………………………………………………………………………
…………………………………………………………………
3.7 SUMMARY
This unit presents the concept of E-R model and EER models. Both these models
are represented with the help of E-R diagram and EER diagram. These diagrams are
very powerful tools to represent the need of data in a database system and can be
used for the design of a good database system. The E-R model explained in this unit
covers the basic aspects of E-R modeling. The unit defines the concept of entities,
attributes and relationships. Further, it defines different types of entities like strong
and weak entities; different types of attributes such as simple, composite, derived
etc.; the cardinality and participation constraints in a relationship. These concepts
are very useful and you should attempt solving related problems from the further
readings. Concepts of EER diagrams including generalisation, specialisation, union
etc. have also been explained in this unit. You may refer to further readings for
more details on E-R and EER diagrams.
1.
Let us show the step-by-step process of development of Entity-Relationship Diagram
for the Client Application system. The first two important entities of the system are
Client and Application. Each of these terms represents a noun, thus, they are eligible
to be the entities in the database. But are they the correct entity sets? Client and
Application both are independent of anything else and the company plans to keep
track of its clients and the applications being developed for them. Therefore, each of
the entities-Client and Application form an entity set.
But how are these entities related? Are there more entities in the system? Let us first
consider the relationship between these two entities, if any. Obviously, the
relationship among the entities depends on interpretation of written requirements.
Thus, we need to define the terms in more detail.
Let us first define the term Application. Some of the questions that are to be
answered in this regard are: Is keeping track of Accounts an application? Is the
accounting system installed at each client site regarded as a different application?
Can the same application be installed more than once at a particular client site?
Before you answer these questions, do you notice that another entity is in the
offing? The client site seems to be another candidate entity. This is the kind of
thing you need to be sensitive to at this stage of the development of the entity
relationship modeling process. So, let us first deal with the relationship between
Client and Site before coming back to Application. Just a word of advice: “It is often
easier to tackle what seems likely to prove simple before trying to resolve the
apparently complex.”
Each Client can have many sites, but each site belongs to one and only one client.
Now the question arises what entity type is Site? You cannot have a site without a
client. If any site exists without a client, then who would pay the company? This is a
good example of an existential dependency and a one-to-many relationship. Thus, Site
is a weak entity. This is illustrated in the part E-R diagram given below:
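[Figure: partial E-R diagram showing the entities Client, Site and Application, with Client related to Site through the relationship Has]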
Let us now relate the entity Application to other entities. Please note that several
applications developed by the company can be installed at several client sites. Thus,
there exists a many-to-many relationship between the entities Site and Application:
[Figure: many-to-many relationship: Application (M) Is_installed (N) Site]
[Figure: Application (1) of (M) Installation (M) at (1) Site]
In the present design, there cannot be an Installation until you can specify the Client,
Site and Application. But since Site is existentially dependent on Client or in other
words, Site is subordinate to Client, the Installation can be identified by (Client) Site
(it means Client or Site) and Application. You do not want to allow more than one
record for the same site and application.
But what if we also sell multiple copies of packages for a site? In such a case, you
need to keep track of each individual copy (license number) at each site. In that
case, you need another entity named Package (with license number). You may even
need separate entities for Application and Package. This will depend on what
attributes you want to keep in each of these entities. Thus, with these requirements,
the E-R diagram may be extended as shown below:
A final proposed E-R diagram for the problem is given below. Please keep thinking
and refining your reasoning. Please note that knowing and thinking about a system is
essential for making good E-R diagrams. (Please note that in this E-R Diagram, Site
and License are modeled as weak entities, though you can decide to change it.)
The following table lists the probable entities identified so far, together with their
superclass, if any, primary keys, and any foreign keys.
Entity       | Superclass        | Primary key                        | Foreign keys
Client       | -                 | Client ID                          |
Application  | -                 | Application ID                     |
Package      | -                 | Package ID                         |
Installation | Site, Application | Client ID, Site No, Application ID | Client ID, Site No, Application ID
License      | Package           | Package ID, Copy No                | Package ID
In the E-R diagram given above, EMPLOYEE is an entity that works for a
department, i.e., the entity DEPARTMENT. Thus, WORKS_FOR is a many-to-one
relationship, as many employees work for one department. Only one employee (i.e., the
Manager) manages a department; thus, MANAGES is a one-to-one relationship.
The attribute Emp_Id is the primary key for the entity EMPLOYEE, thus Emp_Id is
unique and NOT NULL. The candidate keys for the entity DEPARTMENT are
Dept_name and Dept_Id. Along with other attributes, NumberOfEmployees is the
derived attribute on the entity DEPARTMENT, which is the number of employees
working for a department. Both the entities EMPLOYEE and DEPARTMENT
participate totally in the relationship WORKS_FOR, as at least one employee works
for the department, similarly an employee must work in a department.
The entity EMPLOYEE and the entity PROJECT are related through the many-to-many
relationship WORKS_ON, as many employees can work on one or more
projects simultaneously. The entity DEPARTMENT and the entity PROJECT have a
relationship CONTROLS. Since one department controls many projects,
CONTROLS is a 1:N relationship. The entity EMPLOYEE participates totally in
the relationship WORKS_ON, as each employee works on at least one project. A
project should also have at least one employee; therefore, its participation is also
total in WORKS_ON.
The employees can have many dependents, but the entity DEPENDENTS cannot
exist without the existence of the entity EMPLOYEE, thus, DEPENDENT is a weak
entity. You can very well see the primary keys for all the entities. The underlined
attributes in the ellipses represent the primary keys.
4. Let us first make a simple relation for the E-R diagram in the answer to question
2:
EMPLOYEE(EMP_ID, Fname, Mname, Lname, DOB, Address, Salary, Gender)
DEPARTMENT(Dept_ID, Dept_name, Location)
DEPENDENT(EMP_ID, Name, Gender, Relationship, Birthdate)
Foreign Key: EMP_ID references EMPLOYEE
PROJECT(ProjNO, Proj_name, Location)
WORKS_FOR: due to this relationship the Primary key of 1 side (Dept_ID) will
be added to the EMPLOYEE relation.
SUPERVISION: this relationship is on the same entity, therefore, an attribute
Supervisor_ID whose domain is EMP_ID will be added to the EMPLOYEE relation.
MANAGES: It is a 1 : 1 relationship; you can choose the Department side as there
will be fewer records. It also has an attribute StartDate. Therefore,
MANAGER_ID whose domain is EMP_ID and the StartDate attribute would be
added to DEPARTMENT.
CONTROLS: It is a 1 : N relationship, so the Primary key of 1 side (Dept_ID)
will be added to the PROJECT relation.
DEPENDENT_OF: This relation is already included in DEPENDENT relation.
WORKS_ON: It is a many-to-many (M : N) relationship with total participation on
both sides, therefore, a separate table will be created for WORKS_ON with the
primary keys of both the participating entities and the attribute of this relationship
(Hours).
Thus, the final relations would be:
EMPLOYEE (EMP_ID, Fname, Mname, Lname, DOB, Address, Salary,
Gender, Dept_ID, Supervisor_ID)
Foreign Key: Dept_ID references DEPARTMENT.
Domain Constraint: Domain of Supervisor_ID is EMP_ID.
DEPARTMENT (Dept_ID, Dept_name, Location, MANAGER_ID, StartDate)
Foreign Key: MANAGER_ID references EMP_ID of EMPLOYEE.
DEPENDENT (EMP_ID, Name, Gender, Relationship, Birthdate)
Foreign Key: EMP_ID references EMPLOYEE
PROJECT (ProjNO, Proj_name, Location, Dept_ID)
Foreign Key: Dept_ID references DEPARTMENT.
WORKS_ON (EMP_ID, ProjNo, Hours)
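The final relations above can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module. The column types, the NOT NULL choices, the composite key (EMP_ID, Name) for DEPENDENT and the sample rows are assumptions made only for illustration; the answer above fixes just the attribute names and the foreign keys.

```python
import sqlite3

# Column types, NOT NULL choices and the DEPENDENT key (EMP_ID, Name) are
# assumptions; the relations in the text give only names and foreign keys.
SCHEMA = """
CREATE TABLE DEPARTMENT (
    Dept_ID    INTEGER PRIMARY KEY,
    Dept_name  TEXT UNIQUE,
    Location   TEXT,
    MANAGER_ID INTEGER REFERENCES EMPLOYEE(EMP_ID),           -- MANAGES (1:1)
    StartDate  TEXT
);
CREATE TABLE EMPLOYEE (
    EMP_ID  INTEGER PRIMARY KEY,
    Fname TEXT, Mname TEXT, Lname TEXT, DOB TEXT,
    Address TEXT, Salary REAL, Gender TEXT,
    Dept_ID INTEGER NOT NULL REFERENCES DEPARTMENT(Dept_ID),  -- WORKS_FOR
    Supervisor_ID INTEGER REFERENCES EMPLOYEE(EMP_ID)         -- SUPERVISION
);
CREATE TABLE DEPENDENT (
    EMP_ID INTEGER NOT NULL REFERENCES EMPLOYEE(EMP_ID),
    Name   TEXT NOT NULL,
    Gender TEXT, Relationship TEXT, Birthdate TEXT,
    PRIMARY KEY (EMP_ID, Name)     -- weak entity: owner key + partial key
);
CREATE TABLE PROJECT (
    ProjNO INTEGER PRIMARY KEY,
    Proj_name TEXT, Location TEXT,
    Dept_ID INTEGER REFERENCES DEPARTMENT(Dept_ID)            -- CONTROLS (1:N)
);
CREATE TABLE WORKS_ON (
    EMP_ID INTEGER NOT NULL REFERENCES EMPLOYEE(EMP_ID),
    ProjNo INTEGER NOT NULL REFERENCES PROJECT(ProjNO),
    Hours  REAL,
    PRIMARY KEY (EMP_ID, ProjNo)   -- M:N relationship gets its own table
);
"""


def build_company_db():
    """Create an in-memory database holding the five relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
    conn.executescript(SCHEMA)
    return conn


conn = build_company_db()
# Hypothetical sample rows: the department is created first, its manager later.
conn.execute("INSERT INTO DEPARTMENT (Dept_ID, Dept_name) VALUES (1, 'Research')")
conn.execute("INSERT INTO EMPLOYEE (EMP_ID, Fname, Dept_ID) VALUES (10, 'Asha', 1)")
conn.execute("UPDATE DEPARTMENT SET MANAGER_ID = 10 WHERE Dept_ID = 1")
```

Because EMPLOYEE and DEPARTMENT refer to each other, the sketch inserts the department without a manager and fills in MANAGER_ID once the employee exists.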
Check Your Progress 2
1) The EER diagrams are used to model advanced data models requiring
inheritance, specialisation and generalisation.
2) The basic constraints used in EER diagrams are disjointness, overlapping and
unions.
3) For disjointness and union constraints the chances are that you create separate
relations for the subclasses and no relation for super class. For overlapping
constraints, it is advisable to create a relation of super class and the relations
of sub-classes will have only those attributes that are not common to super
class except for the primary key.
4.0 INTRODUCTION
In earlier units, you studied the basic concepts of database management systems,
entity relationship diagram and relational algebra. Databases are used to store
information. Normally, the principal operations you need to perform on database are
those relating to:
• Creation of data
• Retrieving the data using conditions
• Modifying
• Deleting some information, which we are sure is no longer useful or valid.
A relational database structures data as two-dimensional tables, which allows easy processing of
these operations. However, as databases are large, they are
required to be stored on the secondary memory of computers (such as a hard disk or SSD).
Therefore, the secondary storage systems of databases are mainly concerned with the
following issues:
• Storing table or tables as files: A single table can be stored as a file or several
tables can be put together as a cluster of related records called cluster file.
• It seems logical to store all the records of a table contiguously. But how should
such records be ordered? The ordering of records in primary storage does not
matter; however, for secondary storage a specific sequence may be desired.
Such decisions may be taken by the database administrator and may relate to the
performance of the database.
• In the case of analytical queries, particular attributes are stored together. This
is called the column-oriented approach, which is beyond the scope
of this Unit. You may refer to further readings for this approach.
This unit focuses on file organisation in a DBMS, the access methods available and
the system parameters associated with them. File organisation is the way the files are
arranged on the disk, and the access method is how the data can be retrieved based on the
file organisation.
4.1 OBJECTIVES
After going through this unit, you should be able to:
• Substitute an estimate of the missing value
• Trigger a report listing missing values
• In programs, ignore missing data unless the value is significant.
Physical Records
These are the records that are stored in the secondary storage devices. For a database
relation, physical records are the group of fields stored in adjacent memory locations
and retrieved together as a unit. In a paged memory system, a data page is
the amount of data read or written in one I/O operation between a secondary
storage device and the memory. In this context, we define the term
blocking factor as the number of physical records per page.
The issues relating to the Design of the Physical Database Files
Physical File is a file as stored on the disk or SSD. The main issues relating to
physical files are:
• Constructs to link two pieces of data:
   ◦ Sequential storage.
   ◦ Pointers.
• File Organisation: How are the files arranged on the disk?
• Access Method: How can the data be retrieved based on the file Organisation?
Let us see in the next section how the data is stored on hard disks (HDDs) or SSDs.
Fixed and Variable Length Records
There are two basic ways of storing a record on the disks – Fixed Length records and
Variable Length Records. In fixed length records, each attribute is assigned
a fixed amount of space in bytes, just like a fixed length structure in C programming. Thus,
each record will be of the same size. In such cases, the metadata alone (the size and
position of each attribute) is sufficient to identify the different records and their attributes.
As far as variable length records are concerned, it may be noted that each record of a
table may be of different length. This is because in some records, some attribute
values may be 'null'; or some attributes may be of type varchar, which allows variable
number of characters to be stored in an attribute, therefore each record may have a
different length string as the value of this attribute. To store variable length records, it
is necessary to use a character that marks the end of an attribute value. In addition, an
end of record marker will also be needed, as one disk block may store several records.
Therefore, each record is separated from the next by another special character,
the record separator.
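The difference between the two storage formats can be sketched in a few lines of Python. The record layout (a 4-byte id, a 20-byte name and an 8-byte salary), the separator characters and the function names are assumptions chosen only for illustration; a real DBMS uses its own page and record formats.

```python
import struct

# Hypothetical fixed-length layout: 4-byte integer id, 20-byte name,
# 8-byte floating point salary. "<" switches off alignment padding.
FIXED = struct.Struct("<i20sd")


def pack_fixed(emp_id, name, salary):
    """Every record occupies exactly FIXED.size bytes; short names are NUL-padded."""
    return FIXED.pack(emp_id, name.encode("utf-8"), salary)


def unpack_fixed(buf, n):
    """Record n starts at byte n * FIXED.size, so no separators are needed."""
    emp_id, raw, salary = FIXED.unpack_from(buf, n * FIXED.size)
    return emp_id, raw.rstrip(b"\x00").decode("utf-8"), salary


# Variable-length records: special characters mark the end of each
# attribute value and of each record (assumed separators).
FIELD_SEP = "\x1f"    # ASCII unit separator
RECORD_SEP = "\x1e"   # ASCII record separator


def pack_variable(records):
    """Join attribute values and records with the separator characters."""
    return RECORD_SEP.join(FIELD_SEP.join(fields) for fields in records)


def unpack_variable(block):
    """Split a block back into records and attribute values."""
    return [rec.split(FIELD_SEP) for rec in block.split(RECORD_SEP)]
```

In the fixed-length case the position of any record is computed from its record number, whereas in the variable-length case the block must be scanned for separators. An empty string between two field separators can stand for a null value.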
The next section discusses different types of file organisation that can be used to store
the files on the disks.
• New records can be inserted in any empty space that can accommodate them.
• When old records are deleted, the occupied space becomes empty and available
for any new insertion.
• If updated records grow, they may need to be relocated (moved) to a new empty
space. Thus, this file organisation keeps a list of empty spaces.
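The role of the list of empty spaces can be illustrated with a small sketch. It assumes, for simplicity, that every record fits into one equal-sized slot; the class and method names are hypothetical.

```python
class HeapFile:
    """Unordered file: a list of record slots plus a list of empty slots."""

    def __init__(self):
        self.slots = []   # slot number -> record (None when empty)
        self.free = []    # slot numbers available for reuse

    def insert(self, record):
        """Place the record in any empty slot, or append at the end of the file."""
        if self.free:
            slot = self.free.pop()
            self.slots[slot] = record
        else:
            slot = len(self.slots)
            self.slots.append(record)
        return slot

    def delete(self, slot):
        """Deleting a record makes its slot available for new insertions."""
        self.slots[slot] = None
        self.free.append(slot)
```

With variable-sized records, the free list would also have to store the size of each empty space so that an insertion can pick one that is large enough.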
Updating a sequential file usually creates a new file so that the record sequence on the
primary key is maintained. The update operation first copies all the records that precede
the record to be updated into the new file, then writes the updated record,
followed by the remaining records. Thus, the method of updating a sequential file
automatically creates a backup copy. However, such update operations are very time
consuming.
Addition of records to a sequential file is also handled in a similar manner to the
update operation. If a record is to be inserted after the last record of the file, it can be
performed very easily. However, if a record is required to be inserted in between two
records, then such insertion would require shifting down all the subsequent records in
the file by one record space. In case of deletion of a record, all the subsequent records
need to be shifted up by one record space.
Sequential file organisation is most suitable, if all the records of a file are to be
processed in a sequence. For example, processing the monthly payroll of all the
employees of an organisation, will require processing of all the employees records
sequentially. However, a single update is expensive as a new file must be created,
therefore, to reduce the cost per update, all update requests are stored in a single
update file, which is sorted in the order of the sequential file ordering key. The file
containing the updates is sometimes referred to as a transaction file and is used to
update the sequential file in a single processing cycle.
This process is called the batch mode of updating. In this mode, each record of the
master sequential file is checked for one or more possible updates by comparing with
the update information of the transaction file. The records are written to a new master
file in a sequential manner. A record that requires multiple updates is written only
when all the updates have been performed on the record. A record that is to be deleted
is not written to the new master file. Thus, a new updated master file will be created
from the transaction file and the old master file.
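The batch mode of updating described above is essentially a merge of two sorted files. A minimal sketch follows; the record format (key, value), the operation codes "I", "U" and "D" and the function name are assumptions made for illustration.

```python
def batch_update(master, transactions):
    """Merge a sorted master file with a sorted transaction file.

    master:       list of (key, value), sorted on key
    transactions: list of (key, op, value), sorted on key,
                  op is "I" (insert), "U" (update) or "D" (delete)
    Returns the new master file; the old one is left untouched.
    """
    new_master = []
    m = t = 0
    while m < len(master) or t < len(transactions):
        # Pick the smaller of the two current keys.
        if t == len(transactions) or (
            m < len(master) and master[m][0] <= transactions[t][0]
        ):
            key, value = master[m]
            exists = True
            m += 1
        else:
            key, value, exists = transactions[t][0], None, False
        # Apply every pending transaction for this key before writing it.
        while t < len(transactions) and transactions[t][0] == key:
            _, op, new_value = transactions[t]
            if op == "D":
                exists = False
            else:                       # "I" or "U"
                value, exists = new_value, True
            t += 1
        if exists:                      # deleted records are not written
            new_master.append((key, value))
    return new_master
```

Both files are read only once, from beginning to end, which is why all the update requests are first sorted on the ordering key of the sequential file. In this sketch an update for a key that is absent from the master file is treated as an insertion.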
Thus, update, insertion and deletion of records in a sequential file require a new file
creation. Can we reduce creation of this new file? Yes, it can be done easily, if the
original sequential file is created with holes, which are empty record spaces, as shown
in the Figure 4.3. Thus, reorganisation of file on addition and update operation can be
restricted to only one block, which is read into/ written from the main memory as a
single unit. Thus, holes increase the performance of sequential file insertion and
deletion. This organisation also supports a concept of overflow area, which can store
the spilled over records if a block is full. This technique is also used in index
sequential file organisation. A detailed discussion on it can be found in the further
readings.
Figure 4.3: A sequential file with empty spaces for record insertions
4.4.4 Hashed File Organisation
Hashing is the most common form of purely random access to a file or database. It is
also used as an optimisation technique to access columns that do not have an index.
Hashing involves the use of a hash function. Input to a hash function is the value of
the attribute or set of attributes of a record that are to be used for file organisation and
the output is the block address or page address, where that record can be found. Figure
4.4 shows a file using hashing file organisation. The hash function used for this file is
key mod 4. Notice that records with different key values can be placed in a Block
based on the hashing function. To search the location of a record, you can apply the
hashing function on the key value and then search the hashed block. For example, if
you are searching for the location of key 29, then the record can be found in 29 mod 4
= Block 1, read this Block in the main memory and do a linear search on key value to
locate the record in the main memory. The most popular form of hashing is division
hashing with chained overflow. You can refer to further readings for more details on
this file organisation.
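The search procedure described above can be sketched as follows. The hash function key mod 4 is the one used in the text; the list of keys and the function names are assumptions for illustration, and overflow handling is left out.

```python
NUM_BLOCKS = 4


def hash_block(key):
    """Hash function from the text: key mod 4 gives the block number."""
    return key % NUM_BLOCKS


def build_hashed_file(keys):
    """Place every key in the block selected by the hash function."""
    blocks = [[] for _ in range(NUM_BLOCKS)]
    for key in keys:
        blocks[hash_block(key)].append(key)
    return blocks


def search(blocks, key):
    """Read only the hashed block, then search it linearly in memory."""
    block_no = hash_block(key)
    return block_no, key in blocks[block_no]


# Hypothetical set of keys, including the key 29 used in the text.
blocks = build_hashed_file([21, 34, 25, 28, 31, 36, 10, 29])
```

Here `search(blocks, 29)` inspects Block 1 only, whatever the size of the file. A search for all keys between 25 and 31, however, cannot use the hash function and has to read every block.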
Advantages of Hashed File Organisation
1. Insertion or search on hash-key is fast.
2. Best if equality search is needed on hash-key.
Disadvantages of Hashed File Organisation
1. It is a complex file Organisation method.
2. Search on attributes other than the hash-key is slow.
3. It suffers from disk space overhead.
4. Unbalanced buckets degrade performance.
5. Range search is slow.
[Figure 4.4: A hashed file with the hash function (Key Value) mod 4. Each hash key has a block
pointer: Block 0 holds keys 28 and 36; Block 1 holds keys 21 and 25; Block 2 holds keys 34 and 10;
Block 3 holds key 31.]
2) What are Direct-Access systems? What can be the various strategies to achieve
this?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
3) What is file organisation and what are the essential factors that are to be
considered for a file organisation?
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
……………………………………………………………………………………
4.5 INDEXES
One of the terms used during the file organisation is the term index. In this section, let
us define this term in more detail.
Every printed book that you read, in general, has an index of keywords at the end.
Notice that this index is a sorted list of keywords (index values) and page numbers
(address) where the keyword can be found. In databases also an index is defined in a
similar way, as the <index value, address> pair.
The basic advantage of having sorted index pages at the end of the book is that you
can locate the description about a desired keyword in the book. You could have used
the topic and subtopic listed in the table of contents, but it is not necessary that the
given keyword can be found there; also, they are not in any sorted sequence. If a
keyword is not listed in index and table of contents, then you need to search each page
of the book to find the required keyword, which is very cumbersome. Thus, an index
at the back of the book helps in locating the required keyword references very easily
in the book.
The same is true for databases that have a very large number of records. A database
index allows fast search on the index value in database records. It will be difficult to
locate an attribute value in a large database, if the index on that attribute is not
provided. In such a case the value is to be searched record-by-record in the entire
database, which is cumbersome and time consuming. It is important to note that for a
large database all the records cannot be kept in the main memory at a time, thus, data
needs to be transferred from the secondary storage device, which is more time
consuming.
An index entry is a pair consisting of an index value and a list of pointers to the disk
blocks for the records that have that index value. An index contains such information
for every stored value of the index attribute. An index file is very small compared to a
data file that stores a relation. Also index entries are ordered, so that an index can be
searched using an efficient search method like binary search. In case an index file is
very large, you can create a multi-level index, that is index on index. Multi-level
indexes are defined later in this section.
There are many types of indexes that are categorised as:
A primary index is defined on the attributes in the order of which the file is stored.
This field is called the ordering field. A primary index can be on the primary or
candidate key of a file. If an index is on the ordering attributes, which are not
candidate key attributes, then several records may be related to one ordering field
value. This is called clustering index. It is to be noted that there can be only one
physical ordering attribute or set of attributes for a file. Thus, a file can have either the
primary index or clustering index, not both. Secondary indexes are defined on the
non-ordering fields. Thus, there can be several secondary indexes in a file, but only
one primary or clustering index.
Primary index
Primary index is a file that contains a sorted sequence of index records having two
columns: the ordering key field; and a block address for that key field in the data file.
The ordering key field for this index can be the primary key of the data file. Primary
index contains one index entry of the ordering key field for each Block of data. An
entry in the primary index file contains either the key value of the first record or the
key value of the last record, which are stored in that data block; and a pointer to that
data block.
Let us discuss the primary index with the help of an example. Let us assume a student
database in which one block stores only four student records. Figure 4.5
shows a sample of this data file. The sample file is ordered on the attribute enrolment
number.
Figure 4.5: A Student file stored in the order of student enrolment numbers
The primary index on this file would be on the ordering field – enrolment number.
The primary index on this file is shown in Figure 4.6. Please note the following points
in Figure 4.6.
• An index entry is defined as the attribute value, pointer to the block where that
record is stored. The pointer physically is represented as the binary address of
the block.
• Since there are four student records in a block, which of the key values should be stored as
the index value? We have used the first key value stored in the block as the
index key value. This is also called the anchor value. All the records stored in
the given block have an ordering attribute value that is the same as or more than this
anchor value.
• The primary index may be smaller in size, as it contains one index entry for
each storage data block. Also notice that not all the records need to have an
entry in the index file. This type of index is called a non-dense index. Thus, the
primary index is non-dense index.
• To locate the record of a student whose enrolment number is 2238422, you need
to find two consecutive index entries such that index value 1 ≤ 2238422 <
index value 2. In Figure 4.6, you can find the third and fourth index values
as 2238412 and 2258015 respectively, satisfying the above property. Thus,
the required student record must be found in Block 3.
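The search on anchor values can be sketched with Python's bisect module. The third and fourth anchor values are the ones quoted above; the first two anchor values and the function name are assumptions, as the complete index of Figure 4.6 is not reproduced here.

```python
from bisect import bisect_right

# Anchor values of the primary index, one per data block. The last two
# are taken from the text; the first two are made up for illustration.
ANCHORS = [2211001, 2225010, 2238412, 2258015]


def find_block(anchors, key):
    """Return the block number (counted from 1) that may hold the key.

    The block is the one whose anchor is the largest value <= key.
    """
    pos = bisect_right(anchors, key)
    if pos == 0:
        return None       # key is smaller than the first anchor value
    return pos
```

For the enrolment number 2238422, `find_block(ANCHORS, 2238422)` selects Block 3. The index only identifies the block; the record itself is then searched within that block in the main memory.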
Figure 4.6: A Student file and the Primary Index on Enrolment Number
But does primary index enhance efficiency of searching? Let us explain this with the
help of an example (Please note we will define savings in terms of the number of
block transfers, as that is the most time-consuming operation during searching).
Example 1: An ordered student file (ordering field is enrolment number) has 20,000
records stored on a disk having the Block size as 1 K. Assume that each student record
is of 100 bytes, the ordering field is of 8 bytes, and block pointer is also of 8 bytes,
find how many block accesses on average may be saved on using primary index.
Answer:
Number of accesses without using Primary Index:
Number of records in the file = 20000
Block size = 1024 bytes
Record size = 100 bytes
Number of records per block = integer value of [1024 / 100] = 10
Number of disk blocks required by the file
= ceiling of [Number of records / records per block]
= [20000/10] = 2000
Assuming a block level binary search, it would require log₂(2000)
= about 11 block accesses.
Number of accesses with Primary Index:
Size of an index entry = 8+8 = 16 bytes
Number of index entries that can be stored per block
= integer value of [1024 / 16] = 64
Number of index entries = number of disk blocks = 2000
Number of index blocks = ceiling of [2000/ 64] = 32
Number of index block transfers to find the value in index blocks = log₂(32) = 5
One block transfer will be required to get the data records using the index
pointer after the required index value has been located.
So total number of block transfers with primary index = 5 + 1 = 6.
Thus, the Primary index would save 11 – 6 = 5 block transfers for the given size of
data and index.
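The arithmetic of Example 1 can be repeated for other file sizes with a short function. This is only a sketch of the calculation shown above; the function and parameter names are assumptions.

```python
import math


def primary_index_accesses(num_records, block_size, record_size,
                           key_size, pointer_size):
    """Return (block accesses without index, block accesses with primary index)."""
    records_per_block = block_size // record_size              # blocking factor
    data_blocks = math.ceil(num_records / records_per_block)
    without_index = math.ceil(math.log2(data_blocks))          # binary search on data blocks

    entries_per_block = block_size // (key_size + pointer_size)
    index_blocks = math.ceil(data_blocks / entries_per_block)  # one entry per data block
    with_index = math.ceil(math.log2(index_blocks)) + 1        # +1 to read the data block
    return without_index, with_index


without_index, with_index = primary_index_accesses(20000, 1024, 100, 8, 8)
```

For the values of Example 1, the function gives 11 accesses without the index and 6 with it, that is, a saving of 5 block transfers.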
Is there any disadvantage of using a primary index? Yes, a primary index requires the
data file to be ordered, which causes problems during insertion and deletion of records in
the file. This problem can be taken care of by selecting a suitable file organisation that
allows logical ordering only.
Clustering Indexes
It may be a good idea to keep records of the students in the order of the programme
they have registered, as most of the data file accesses may require programme wise
student data. A file can be ordered and physically stored on non-key attributes; an
index that is created on such non-key attributes would have multiple records pointed
to by a single index entry. Such an index is called a clustering index. Figure 4.7 and
Figure 4.8 show the clustering indexes in the same file organised in different ways.
92
File Organisation in DBMS
Please note that in Figure 4.7, the data file can have a single block in which data of
students of multiple programmes are stored. You can improve upon this organisation
by allowing only one Programme data in one block. Such an organisation and its
clustering index is shown in the Figure 4.8:
Figure 4.8: Clustering index with separate blocks for each clustering attribute value
• The names in the data file are unique and thus are being assumed as the
alternate key. Each name therefore is appearing as the secondary index entry.
• The pointers are block pointers, thus are pointing to the beginning of the block
and not a record. For simplicity, we have not shown all the pointers in Figure
4.9.
• This type of secondary index file is dense index as it contains one entry for each
record/distinct value.
• The secondary index is larger than the Primary index as we cannot use block
anchor values here as the secondary index attributes are not the ordering
attribute of the data file.
• To search a value in a data file using name, first the index file is (binary)
searched to determine the block, where the record having the desired key value
can be found. Then this block is transferred to the main memory where the
desired record is searched and accessed.
• A secondary index file usually has a larger number of index entries than a
primary index. However, the secondary index improves the search time by a
greater proportion than a primary index does. The reason is that, even if a
primary index does not exist, you can still perform a binary search on the
blocks of data records, as the records are ordered in the sequence of the primary
index value. However, if a secondary index does not exist, then you may need to
search the records sequentially. This fact is demonstrated with the help of
Example 2.
Thus, the Secondary index would save about 1990 block transfers for the given case.
This is a huge saving compared to a primary index. Please also compare the size of the
secondary index to the primary index.
Let us now see an example of a secondary index that is on an attribute that is not an
alternate key.
A secondary index that needs to be created on a field that is not a candidate key can be
implemented in several ways. We have shown here the way in which a block of
pointer records is kept for implementing such an index. This method allows the index
entries to be of fixed length. It also allows only a single entry for each value of the
indexing attribute. In addition, the level of indirection allows multiple index pointers to
be stored in a single block of data. The algorithms for searching the index, and for inserting
and deleting values in the index, are very simple in such a scheme. Thus, this is
the most popular scheme for implementing such secondary indexes.
Sparse and Dense Indexes
As discussed earlier, an index is defined as the ordered <index value, address> pair.
These indexes are, in principle, the same as the index at the back of a
book. The key ideas of the indexes are:
• They are sorted on the order of the index value (ascending or descending) as per
the choice of the creator.
• The indexes are logically separate files (just like separate index pages of the
book).
• An index is primarily created for fast access to information.
• The primary index is the index on the ordering field of the data file, whereas a
secondary index is the index on any other field, thus, is more useful.
But what are sparse and dense indexes?
A dense index contains one index entry for every value of the indexing attributes,
whereas a sparse index also called non-dense index contains few index entries out of
the available indexing attribute values. For example, the primary index on enrolment
number is sparse, while secondary index on student name is dense.
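The contrast between the two kinds of index can be seen by building both on the same small file. The student data, the block layout and the function names below are assumptions made only for this sketch.

```python
# Hypothetical data file: each block holds (enrolment number, name)
# records, and the file is ordered on enrolment number.
BLOCKS = [
    [(2211001, "Meera"), (2211002, "Arjun")],
    [(2225010, "Kavya"), (2225011, "Bharat")],
    [(2238412, "Zoya"), (2238422, "Dev")],
]


def build_sparse_index(blocks):
    """Primary (sparse) index: one <anchor value, block number> entry per block."""
    return [(block[0][0], n) for n, block in enumerate(blocks)]


def build_dense_index(blocks, field):
    """Secondary (dense) index: one sorted <value, block number> entry per record."""
    return sorted(
        (record[field], n) for n, block in enumerate(blocks) for record in block
    )


sparse = build_sparse_index(BLOCKS)          # index on enrolment number
dense = build_dense_index(BLOCKS, field=1)   # index on student name
```

The sparse index has three entries (one per block), whereas the dense index has six (one per record), because the names are not in the order in which the file is stored.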
Multilevel Indexing Scheme
For small files, the indexing scheme keeps the address of the file block in each index
entry. Such indices would be small and can be processed efficiently in the main
memory. However, for a large file the size of the index can also be very large. In such
a case, you can create indexes at several levels, with the last level pointing to the data
records. Figure 4.11 shows this scheme.
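A multilevel index can be sketched as a list of levels, each level being an index on the first entries of the blocks of the level below it. The function names, the fan-out of 64 entries per index block and the key values are assumptions; the numbers mirror Example 1 (2000 data blocks, 64 index entries per block).

```python
from bisect import bisect_right


def build_levels(keys, fanout):
    """Return the index levels, lowest level first.

    Each level is a list of index blocks; each block holds up to
    `fanout` key values. A higher level is built on the first key of
    every block of the level below, until a single block remains.
    """
    levels, current = [], list(keys)
    while True:
        blocks = [current[i:i + fanout] for i in range(0, len(current), fanout)]
        levels.append(blocks)
        if len(blocks) <= 1:
            return levels
        current = [block[0] for block in blocks]


def lookup(levels, key, fanout):
    """Return (index blocks visited from the top level down, data block number)."""
    path, block_no = [], 0
    for blocks in reversed(levels):
        path.append(block_no)
        block = blocks[block_no]
        pos = max(bisect_right(block, key) - 1, 0)   # last entry <= key
        block_no = block_no * fanout + pos
    return path, block_no


# Hypothetical anchors: data block b starts at key 10 * b (10 records per block).
levels = build_levels(range(0, 20000, 10), fanout=64)
path, data_block = lookup(levels, 1305, fanout=64)
```

With these numbers the index has two levels, so a search reads one index block per level and then one data block, that is, 3 block transfers instead of the 6 needed with a binary search on a single-level index.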
Please note in Figure 4.12 that a key value is associated with a pointer to a record. A
record consists of the key value and other information fields. Please note that a node
on BST stores the <key value, address> pair.
Now, let us examine the suitability of BST as a data structure to implement indexes. A
BST as a data structure is suitable for an index if the complete index is contained in
the primary memory. However, indexes are quite large in nature and require a
combination of primary and secondary storage. Therefore, you can use B-Tree data
structure to implement the index.
The question that needs to be answered here is: what should be the order of B-Tree for
an index? The suggested order is from 80-200 depending on various index structures
and block size.
B-tree is a data structure, which was proposed by R. Bayer and E. McCreight of Boeing
Scientific Research Labs in 1970. The B-Tree and its variants are secondary storage
structures and have been found to be very useful for implementing indexes. A B-tree of
order N has the following properties:
• A node of a B-tree of order N can have children/paths in the range ceiling of
[N/2] to N. However, the root node of the tree can have 2 to N children/paths.
• Each node can have one fewer key than the number of children/paths, but a
maximum of N-1 keys can be stored in a node.
• The keys are normally arranged in a node in an increasing order.
• If a new key is inserted into a full node of order N (i.e. it already contains N-1
keys), then on addition of this new key value, the node would have N+1 paths
(N keys). This node is split into two nodes and the median key value is moved
to the parent of this node. In case the node that is being split is the root node,
then it is split into two nodes and a new root node is created by using the
median key of the node being split.
• B-tree does not allow any empty sub-tree, therefore, all the leaves of B-tree are
at the same level. Therefore, a B-tree is a completely balanced tree.
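The limits on a node and the splitting rule listed above can be checked with a small sketch. It covers only a single node, not a complete B-tree; the function names are assumptions.

```python
import math


def node_limits(order, is_root=False):
    """Limits on children/paths and keys for a node of a B-tree of the given order."""
    min_children = 2 if is_root else math.ceil(order / 2)
    return {
        "min_children": min_children,
        "max_children": order,
        "min_keys": min_children - 1,   # one fewer key than children
        "max_keys": order - 1,
    }


def split_overfull(keys):
    """Split a node that has received one key too many (N keys for order N).

    Returns (left node keys, median key moved to the parent, right node keys).
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]


limits = node_limits(5)
left, median, right = split_overfull([10, 20, 30, 40, 50])
```

For order 5, a full node holds 4 keys. On insertion of a fifth key, the node is split into two nodes of 2 keys each and the median key 30 moves to the parent, so both new nodes still satisfy the minimum of 2 keys.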
A B-Tree index is shown in Figure 4.13. The B-Tree has a very useful variant called
B+Tree, which has all the key values at the leaf level also, in addition to the higher
level. For example, the key value 1010 in Figure 12 will also exist at leaf level. In
addition, these lowest level leaves are linked through pointers. Thus, the B+tree is a
very useful structure for index-sequential organisation. You can refer to further
readings for more details on these topics.
Till now we have discussed file organisations having the single access key. But is it
possible to have file organisations that allow access of records on more than one key
field? This section discusses the two file organisations that allow multiple access
paths, with each path having a different key. These are called multi-key file
Organisations. These file organisations, in general, are part of a real database
management system. Two of the commonest techniques for this Organisation are:
• Multi-list file Organisation
• Inverted file Organisation
Let us discuss these techniques in more detail. But first let us discuss the need for the
Multiple access paths.
4.7.1 Multiple Access Paths
In practice, most of the online information systems require the support of multi-key
files. For example, consider a banking database application having many kinds of
users such as:
• Teller
• Loan officers
• Branch manager
• Account holders
All these users access the bank data, but each in a different way. Let us assume a
sample data format for the Account relation in a bank as:
Account Relation:
Account Number | Account Holder Name | Branch Code | Account type | Balance | Permissible Loan Limit
A teller may access the record above to check the balance at the time of withdrawal.
S/he needs to access the account based on branch code and account number. A loan
approver may be interested in finding the potential customer by accessing the records
in decreasing order of permissible loan limits. A branch manager may need to find the
top ten most preferred customers in each category of account, so s/he may access the
database in the order of account type and balance. The account holder may be
interested in her/his own record. Thus, all these applications are trying to refer to the
same data but using different key values. Thus, all the applications as above require
the database file to be accessed in different format and order.
Multiple indexes can be used to access a data file through multiple access paths. In
such a scheme only one copy of the data is kept; each additional access path is
provided by an additional index. Let us discuss two important approaches, viz. multi-list file
organisation and inverted file organisation.
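The idea of one data copy with several access paths can be illustrated with a minimal Python sketch. The account values, field names and variable names below are hypothetical and chosen only for illustration; each "index" here is simply a structure holding record positions, not the records themselves.

```python
# Hypothetical account records; this list is the single copy of the data.
accounts = [
    {"acc_no": 1001, "holder": "Asha", "branch": "B01",
     "type": "Savings", "balance": 52000, "loan_limit": 200000},
    {"acc_no": 1002, "holder": "Ravi", "branch": "B02",
     "type": "Current", "balance": 91000, "loan_limit": 500000},
    {"acc_no": 1003, "holder": "Meena", "branch": "B01",
     "type": "Savings", "balance": 75000, "loan_limit": 150000},
]

# Access path 1 (teller): (branch code, account number) -> record position.
by_branch_acc = {(r["branch"], r["acc_no"]): i for i, r in enumerate(accounts)}

# Access path 2 (loan officer): positions in decreasing order of loan limit.
by_loan_limit = sorted(range(len(accounts)),
                       key=lambda i: -accounts[i]["loan_limit"])

# Access path 3 (branch manager): positions ordered by account type,
# and within each type by balance, highest first.
by_type_balance = sorted(range(len(accounts)),
                         key=lambda i: (accounts[i]["type"], -accounts[i]["balance"]))

teller_record = accounts[by_branch_acc[("B01", 1003)]]
loan_order = [accounts[i]["acc_no"] for i in by_loan_limit]
manager_order = [accounts[i]["acc_no"] for i in by_type_balance]
```

All three structures point into the same `accounts` list, so a change to a balance is made in one place only, although any index built on the changed field would need to be updated.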
4.7.2 Multi-list file Organisation
This file organisation, as the name suggests, consists of multiple lists or indexes. The
records in each list are linked from the index value. The linking of records, in general,
is done in the sorted sequence of the key attribute to facilitate searching, insertion and
deletion operations. The following example explains the multi-list file organisation.
Sample data of employees of an organisation is given in Figure 4.14. Assume that
Empid is the key attribute. You can create multiple index lists using this data.
Assumed Record Number | Employee id (Empid) | Employee Name | Job Title | Highest Qualification | Gender (Female - F / Male - M) | City of posting | Married - M / Single - S | Salary per month
A | 795 | Praveen | Engineer | B. Tech. | M | Dehradun | S | 16,200/-
B | 495 | Rohini | Manager | B. Tech. | F | Dehradun | M | 19,000/-
C | 905 | Rishika | Manager | MCA | F | Jaipur | S | 17,100/-
D | 705 | Gaurav | Engineer | B. Tech. | M | Jaipur | M | 13,200/-
E | 595 | Dipti | Manager | MCA | F | Jaipur | S | 14,100/-
The primary link order (in the order of primary key Empid) would be:
B(495), E(595), D(705), A(795), C(905)
The primary index for this file would be:
>= 500 but < 700
>= 700 but < 900
>= 900 but < 1100
The index file for the example data as per Figure 4.14 is shown in Figure 4.15.
Figure 4.15: Linking together all the records in the same index value.
Please note that in Figure 4.15, the records that fall in the same index value
range of Empid are linked together. Each of these lists is shorter than a single
list of all the records, which improves search performance.
This file can be supported by many more indexes that will enhance the search
performance on various fields, thus, creating a multi-list file organisation. Figure 4.16
shows various indexes and lists corresponding to those indexes. For simplicity we
have just shown the links and not the complete record. Please note that nodes in the
original file are assumed to be in the order of Empid.
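A multi-list on one attribute can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `build_multilist` and `follow`, the dictionary layout, and the choice to store the list length alongside the first pointer in each index entry are hypothetical, while the record values are taken from Figure 4.14. Records with the same attribute value are chained together in Empid order.

```python
# Records from Figure 4.14, keyed by the assumed record number.
records = {
    "A": {"Empid": 795, "Job": "Engineer", "City": "Dehradun"},
    "B": {"Empid": 495, "Job": "Manager", "City": "Dehradun"},
    "C": {"Empid": 905, "Job": "Manager", "City": "Jaipur"},
    "D": {"Empid": 705, "Job": "Engineer", "City": "Jaipur"},
    "E": {"Empid": 595, "Job": "Manager", "City": "Jaipur"},
}

def build_multilist(records, attribute):
    """Return (index, next_ptr) for one attribute.

    index maps each attribute value to (first record, list length);
    next_ptr maps each record to the next record having the same value.
    """
    index, next_ptr, last_seen = {}, {}, {}
    # Visit records in Empid order so that every chain is sorted by Empid.
    for rec_no in sorted(records, key=lambda r: records[r]["Empid"]):
        value = records[rec_no][attribute]
        next_ptr[rec_no] = None
        if value in last_seen:
            next_ptr[last_seen[value]] = rec_no   # extend the existing chain
            first, length = index[value]
            index[value] = (first, length + 1)
        else:
            index[value] = (rec_no, 1)            # start a new chain
        last_seen[value] = rec_no
    return index, next_ptr

def follow(index, next_ptr, value):
    """Walk the chain for one index value and return the record numbers."""
    rec_no, _ = index.get(value, (None, 0))
    chain = []
    while rec_no is not None:
        chain.append(rec_no)
        rec_no = next_ptr[rec_no]
    return chain

job_index, job_next = build_multilist(records, "Job")
city_index, city_next = build_multilist(records, "City")
managers = follow(job_index, job_next, "Manager")
```

Calling `build_multilist` once per attribute gives one index and one set of links per attribute over the same records, which is the essence of the multi-list organisation.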
Let us assume that the inverted file organisation for the data shown contains a dense
index. Figure 4.17 shows how the data can be represented using inverted file
organisation.
Please note the following points for the inverted file organisation:
• The index entries are of variable length, as the number of records with the
same key value varies from one key value to another. Thus, maintenance of the
index is more complex than that of multi-list file organisation.
• Queries that involve Boolean expressions require accesses only to those
records that satisfy the query, in addition to the block accesses needed for the
indices. For example, the query about Female, MCA employees can be solved
using the Gender and Qualification indices. You just need to take the intersection
of the record numbers on the two indices. (Please refer to Figure 4.17). Thus, any
complex query requiring a Boolean expression can be handled easily with
the help of indices.
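The Boolean query described above can be sketched in Python using sets of record numbers. This is a minimal illustration, assuming a dense inverted index held in memory as a dictionary of sets; the variable names are hypothetical and the data values are those of Figure 4.14.

```python
from collections import defaultdict

# (record number, gender, highest qualification) taken from Figure 4.14.
rows = [
    ("A", "M", "B. Tech."),
    ("B", "F", "B. Tech."),
    ("C", "F", "MCA"),
    ("D", "M", "B. Tech."),
    ("E", "F", "MCA"),
]

# Each inverted index maps a key value to the set of records holding it.
gender_index = defaultdict(set)
qualification_index = defaultdict(set)
for rec_no, gender, qualification in rows:
    gender_index[gender].add(rec_no)
    qualification_index[qualification].add(rec_no)

# Female AND MCA: intersect the two record-number sets.
female_mca = sorted(gender_index["F"] & qualification_index["MCA"])

# Male OR MCA: take the union of the two record-number sets.
male_or_mca = sorted(gender_index["M"] | qualification_index["MCA"])
```

Only the indices are consulted to evaluate the expression; the data file is then read just for the records in the resulting set, which here are C and E for the Female, MCA query.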
4.8 IMPORTANCE OF FILE ORGANISATION IN DATABASES
Implementing a database efficiently requires several design tradeoffs. One of
the most important is the choice of file organisation. For example, if an
application requires only sequential batch processing, then the use of indexing
techniques would be pointless and wasteful.
An inappropriate file organisation has several important consequences for
a database. The wrong file organisation will result in:
• much larger processing time for retrieving or modifying the required record.
• undue disk access that could stress the hardware.
There could also be many undesirable consequences at the user level, such
as making some applications impractical.
Check Your Progress 3
1) What difference does it make whether indexes are implemented using a binary
search tree or a B-tree?
………………………………………………………………………….…
………………………………………………………………………….…
2) What are the advantages of using a B+ tree index over a B-tree index if the
supported file organisation is also required to access the records sequentially?
………………………………………………………………………….…
………………………………………………………………………….…
…………………………………………………………………………….
3) What is the need for a multi-list organisation? What is the advantage of storing
the number of records in an index entry?
………………………………………………………………………….…
………………………………………………………………………….…
…………………………………………………………………………….
4.9 SUMMARY
In this unit, we discussed physical database design issues, in particular
file organisation and file access methods. The unit also discussed
different types of file organisation, giving their advantages and disadvantages.
An index is an important component of a database system, as one of the key
requirements of a DBMS is efficient access to data. This unit explained various types of
indexes that may exist in database systems. Some of these are: primary index,
clustering index and secondary index. A secondary index results in better search
performance but adds the task of index updating. This unit also discussed two
multi-key file organisations, viz. multi-list and inverted file organisations. These are
very useful for improving query performance.
4.10 SOLUTIONS / ANSWERS
Check Your Progress 1
1)
Operation | Comments
File Creation | It will be efficient if transaction records are ordered by record key.
Record Location | As it follows the sequential approach it is inefficient. On an average, half of the records in the file must be processed to locate a record.
Record Creation | It will require you to browse through all the records to check if such a record already exists or not. Thus, the entire file must be read and written. Efficiency improves if a group of records are created together. This operation could be combined with deletion and modification transactions to achieve greater efficiency.
Record Deletion | The entire file must be read and written. Efficiency improves with greater number of deletions. This operation could be combined with addition and modification transactions to achieve greater efficiency.
Record Modification | Very efficient if the number of records to be modified is high and the records in the transaction file are ordered by the record key.
2) Direct-access systems do not search the entire file; rather they move directly to
the record that is to be accessed. To achieve this, several strategies
such as relative addressing, hashing and indexing can be used.
Size of an index entry = 4 + 8 = 12 bytes
Number of index entries that can be stored per block
= integer value of [1024 / 12] = 85
Number of index entries = number of disk blocks of file = 8000
Number of index blocks = ceiling of [8000 / 85] = 95
Number of index block transfers to find the value in index blocks
= ceiling of [log₂ 95] = 7
One block transfer will be required to get the data records using
the index pointer after the required index value has been
located. So total number of block transfers with secondary
index = 7 + 1 = 8
Thus, the Primary index would save about 5 block transfers for the given case.
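The arithmetic of the secondary index calculation above can be checked with a short Python sketch. The variable names are illustrative only; the figures (1024-byte blocks, 12-byte index entries, 8000 index entries) are those used in the calculation. Note that 8000 / 85 is about 94.12, so a strict ceiling gives 95 index blocks; the binary search cost is 7 block transfers whether 94 or 95 blocks are assumed.

```python
import math

block_size = 1024        # bytes per disk block
entry_size = 4 + 8       # bytes per index entry, as in the calculation above
index_entries = 8000     # number of index entries

# Whole index entries that fit in one block.
entries_per_block = block_size // entry_size

# Blocks needed to hold the complete index.
index_blocks = math.ceil(index_entries / entries_per_block)

# Binary search over the index blocks.
index_transfers = math.ceil(math.log2(index_blocks))

# One more transfer fetches the data block through the index pointer.
total_transfers = index_transfers + 1
```

Changing `block_size` or `entry_size` shows how sensitive the number of index blocks is to the blocking factor, while the logarithm keeps the search cost growing slowly.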