DBMS Notes
Introduction to Data
Introduction:
In a computerized information system, data is the basic resource of the organization, so proper organization and management of data is required for the organization to run smoothly. Database management systems deal with how data is stored and managed in a computerized information system. Every organization requires accurate and reliable data for better decision making, for ensuring the privacy of data and for controlling data efficiently.
Examples include a deposit into or withdrawal from a bank account, a hotel, airline or railway reservation, or the purchase of items from a supermarket; in all these cases a database is accessed.
What is data?
Data are known facts or figures that have implicit meaning. Data can also be defined as the representation of facts, concepts or instructions in a formal manner suitable for understanding and processing. Data can be represented using alphabets (A-Z, a-z), digits (0-9) and special characters (+, -, #, $, etc.).
e.g.: 25, "ajit", etc.
Information:
Information is data that has been processed and organized into a form that is meaningful to its users.
File:
File is a collection of related data stored in secondary memory.
3) Data isolation:
Because data are scattered in various files, and the files may be in different formats, writing new application programs to retrieve the appropriate data is difficult.
4) Integrity problems:
Developers enforce data validation in the system by adding appropriate code in the various application programs. However, when new constraints are added, it is difficult to change the programs to enforce them.
5) Atomicity:
It is difficult to ensure atomicity in a file processing system when a transaction fails due to power failure, networking problems, etc. (Atomicity: either all operations of the transaction are reflected properly in the database or none are.)
6) Concurrent access:
In a file processing system it is not possible to access the same file for transactions at the same time.
7) Security problems:
There is no security provided in a file processing system to protect the data from unauthorized user access.
Database:
A database is an organized collection of related data of an organization, stored in a formatted way, which is shared by multiple users.
For example, consider the roll no, name and address of a student stored in a student file. It is a collection of related data with an implicit meaning. Data in a database is persistent, integrated and shared.
Persistent:
Data is persistent: it remains in the database until it is removed due to an explicit request from the user.
Integrated:
A database can be a collection of data from different files; when the redundancy among those files is removed, the database is said to contain integrated data.
Sharing Data:
The data stored in the database can be shared by multiple users simultaneously without affecting the correctness of the data.
Why Database:
In order to overcome the limitations of a file system, a new approach was required; hence the database approach emerged. A database is a persistent collection of logically related data. The initial attempts were to provide a centralized collection of data. A database has a self-describing nature: it contains not only the data but also a description of that data, supporting the sharing and integration of an organization's data in a single database.
A small database can be handled manually, but a large database with multiple users is difficult to maintain by hand. In that case a computerized database is useful.
The advantages of a database system over traditional, paper-based methods of record keeping are:
⚫ Compactness: no need for large amounts of paper files.
⚫ Speed: the machine can retrieve and modify data faster than a human being can.
⚫ Less drudgery: much of the maintenance of files by hand is eliminated.
⚫ Accuracy: accurate, up-to-date information is fetched as per the requirement of the user at any time.
Function of DBMS:
1. Defining the database schema: the DBMS must provide a facility for defining the database structure; it also specifies access rights for authorized users.
2. Manipulation of the database: the DBMS must provide functions such as insertion of records into the database, updating of data, deletion of data and retrieval of data.
3. Sharing of the database: the DBMS must share data items among multiple users while maintaining the consistency of the data.
4. Protection of the database: it must protect the database against unauthorized users.
5. Database recovery: if for any reason the system fails, the DBMS must facilitate database recovery.
Advantages of DBMS:
Reduction of redundancies:
Centralized control of data by the DBA avoids unnecessary duplication of data, effectively reduces the total amount of data storage required, and eliminates the inconsistencies that tend to be present in redundant data files.
Sharing of Data:
A database allows the sharing of data under its control by any number of application programs or
users.
Data Integrity:
Data integrity means that the data contained in the database is both accurate and consistent. Therefore, data values being entered for storage can be checked to ensure that they fall within a specified range and are of the correct format.
Data Security:
The DBA, who has the ultimate responsibility for the data in the DBMS, can ensure that proper access procedures are followed, including proper authentication to access the database system and additional checks before permitting access to sensitive data.
Conflict Resolution:
The DBA resolves conflicts among the requirements of various users and applications. The DBA chooses the best file structure and access method to get optimal performance for the applications.
Data Independence:
Data independence is usually considered from two points of view: physical data independence and logical data independence.
Physical data independence allows changes in the physical storage devices or the organization of the files to be made without requiring changes in the conceptual view or any of the external views, and hence in the application programs using the database.
Logical Data Independence indicates that the conceptual schema can be changed without affecting
the existing external schema or any application program.
Disadvantage of DBMS:
1. DBMS software and hardware (including networking installation) costs are high.
2. There is processing overhead in the DBMS for the implementation of security, integrity and sharing of the data.
3. Centralized database control
4. Setup of the database system requires more knowledge, money, skills, and time.
5. The complexity of the database may result in poor performance.
Data Item:
The data item, also called a field in data processing, is the smallest unit of data that has meaning to its users.
Eg: "e101", "sumit"
Subschema:
A subschema is a schema derived from an existing schema as per user requirements. There may be more than one subschema created for a single conceptual schema.
External Level :
The external level is the highest level of database abstraction. At this level there will be many views defined for different users' requirements. A view describes only a subset of the database. Any number of user views may exist for a given global schema (conceptual schema).
For example, each student has a different view of the timetable: the view of a student of BTech (CSE) is different from the view of a student of BTech (ECE). Thus this level of abstraction is concerned with different categories of users.
Each external view is described by means of a schema called a subschema.
Conceptual Level :
At this level of database abstraction all the database entities and the relationships among them are
included. One conceptual view represents the entire database. This conceptual view is defined by the
conceptual schema.
The conceptual schema hides the details of physical storage structures and concentrates on describing entities, data types, relationships, user operations and constraints.
It describes all the records and relationships included in the conceptual view. There is only one conceptual schema per database. It includes features that specify the checks needed to maintain data consistency and integrity.
Internal level :
It is the lowest level of abstraction closest to the physical storage method used. It indicates how the
data will be stored and describes the data structures and access methods to be used by the database.
The internal view is expressed by internal schema.
Database Users :
Naive Users :
Users who need not be aware of the presence of the database system or any other system supporting their usage are considered naïve users. A user of an automatic teller machine falls into this category.
Online Users :
These are users who may communicate with the database directly via an online terminal or indirectly via a user interface and application programs. These users are aware of the database system and also know the data manipulation language.
Application Programmers:
Professional programmers who are responsible for developing application programs or user interfaces utilized by the naïve and online users fall into this category.
Database Administrator:
A person who has central control over the system is called the database administrator (DBA).
The functions of the DBA are:
1. Creation and modification of the conceptual schema definition
2. Implementation of the storage structure and access methods
3. Schema and physical organization modifications
4. Granting of authorization for data access
5. Integrity constraint specification
6. Executing immediate recovery procedures in case of failures
7. Ensuring physical security of the database
Database language :
DDL Compiler:
The DDL compiler converts the data definition statements into a set of tables. These tables contain information concerning the database and are in a form that can be used by other components of the DBMS.
File Manager:
File manager manages the allocation of space on disk storage and the data structure used to represent
information stored on disk.
Database Manager:
A database manager is a program module which provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.
1. Interaction with the file manager: the data is stored on disk using the file system provided by the operating system. The database manager translates the various DML statements into low-level file system commands, so the database manager is responsible for the actual storing, retrieving and updating of data in the database.
2. Integrity enforcement: the data values stored in the database must satisfy certain constraints (e.g. the age of a person cannot be less than zero). These constraints are specified by the DBA. The database manager checks the constraints and, if they are satisfied, stores the data in the database.
3. Security enforcement: the database manager enforces security measures to protect the database from unauthorized users.
4. Backup and recovery: the database manager detects failures that occur due to different causes (like disk failure, power failure, deadlock, software errors) and restores the database to its original state.
5. Concurrency control: when several users access the same database file simultaneously, there is a possibility of data inconsistency. It is the responsibility of the database manager to control the problems that occur due to concurrent transactions.
Query Processor:
The query processor interprets an online user's query and converts it into an efficient series of operations in a form capable of being sent to the database manager for execution. The query processor uses the data dictionary to find the details of the data files and, using this information, it creates a query plan/access plan to execute the query.
Data Dictionary:
The data dictionary is a table which contains information about database objects. It contains information like:
1. external, conceptual and internal database description
2. description of entities, attributes as well as meaning of data elements
3. synonyms, authorization and security codes
4. database authorization
DBMS STRUCTURE:
[Figure: DBMS structure showing the database manager, the file manager, the data files and the data dictionary.]
Data Model:
A data model describes the structure of a database. It is a collection of conceptual tools for describing data, data relationships and consistency constraints. There are various types of data models, such as:
1. Object based logical model
2. Record based logical model
3. Physical model
Basic Concepts:
The E-R data model employs three basic notions: entity sets, relationship sets and attributes.
Entity Sets:
An entity is a "thing" or "object" in the real world that is distinguishable from all other objects. For example, each person in an enterprise is an entity. An entity has a set of properties, and the values of some set of properties may uniquely identify an entity. For example, BOOK is an entity and its properties (called attributes) are bookcode, booktitle, price, etc.
An entity set is a set of entities of the same type that share the same properties, or attributes. The set of all books in a library, for example, forms the entity set BOOK.
Attributes:
An entity is represented by a set of attributes. Attributes are descriptive properties possessed by each member of an entity set.
For example, CUSTOMER is an entity and its attributes are customerid, custname, custaddress, etc.
An attribute, as used in the E-R model, can be characterized by the following attribute types.
c) Derived Attribute:
The values of this type of attribute can be derived from the values of existing attributes, e.g. age, which can be derived as currentdate – birthdate, and experience_in_years, which can be calculated as currentdate – joindate.
Relationship Sets:
A relationship is an association among several entities. A relationship set is a set of relationships of the same type. Formally, it is a mathematical relation on n >= 2 entity sets. If E1, E2, …, En are entity sets, then a relationship set R is a subset of
{ (e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En }
where (e1, e2, …, en) is a relationship.
Mapping Cardinalities:
Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. Mapping cardinalities are most useful in describing binary relationship sets, although they can contribute to the description of relationship sets that involve more than two entity sets. For a binary relationship set R between entity sets A and B, the mapping cardinality must be one of the following.
1. One to One:
An entity in A is associated with at most one entity in B, and an entity in B is associated with at
most one entity in A.
Eg: the relationship between college and principal:
College (1) ---- has ---- (1) Principal
2. One to Many:
An entity in A is associated with any number of entities in B. An entity in B is associated with at most one entity in A.
Eg: the relationship between department and faculty:
Department (1) ---- works in ---- (M) Faculty
3. Many to One:
An entity in A is associated with at most one entity in B. An entity in B is associated with any number of entities in A.
Eg:
Emp (M) ---- works in ---- (1) Department
4. Many to Many:
Entities in A and B are associated with any number of entities of each other.
Eg:
Customer (M) ---- deposits ---- (N) Account
Recursive Relationships:
When the same entity type participates more than once in a relationship type in different roles, the relationship type is called a recursive relationship.
Participation Constraints:
A participation constraint specifies whether the existence of an entity depends on its being related to another entity via the relationship. There are two types of participation constraints:
a) Total: when all the entities from an entity set participate in a relationship type, the participation is called total. For example, the participation of the entity set student in the relationship set 'opts' is said to be total because every student enrolled must opt for a course.
b) Partial: when it is not necessary for all the entities from an entity set to participate in a relationship type, the participation is called partial. For example, the participation of the entity set student in 'represents' is partial, since not every student in a class is a class representative.
Weak Entity:
Entity types that do not contain any key attribute, and hence cannot be identified independently, are called weak entity types. A weak entity can be identified uniquely only by considering some of its attributes in conjunction with the primary key attributes of another entity, which is called the identifying owner entity.
Generally a partial key is attached to a weak entity type and is used for the unique identification of weak entities related to a particular owner entity. The following restrictions must hold:
• The owner entity set and the weak entity set must participate in a one-to-many relationship set. This relationship set is called the identifying relationship set of the weak entity set.
• The weak entity set must have total participation in the identifying relationship.
Example:
Consider the entity type Dependent related to the Employee entity, which is used to keep track of the dependents of each employee. The attributes of Dependent are: name, birthdate, sex and relationship. Each employee entity is said to own the dependent entities that are related to it. However, note that the Dependent entity does not exist on its own; it is dependent on the Employee entity.
Keys:
Super Key:
A super key is a set of one or more attributes that, taken collectively, allow us to identify an entity in the entity set uniquely. For example: customer-id, (cname, customer-id), (cname, telno).
Candidate Key:
In a relation R, a candidate key for R is a subset of the set of attributes of R which has the following properties:
1. Uniqueness: no two distinct tuples in R have the same values for the candidate key.
2. Irreducibility: no proper subset of the candidate key has the uniqueness property.
Eg: (cname, telno)
Primary Key:
The primary key is the candidate key that is chosen by the database designer as the principal means of identifying entities within an entity set. The remaining candidate keys, if any, are called alternate keys.
[Figure: E-R diagram notation showing the symbols for entity, weak entity, attribute, composite attribute and relationship.]
Abstraction is the simplification mechanism used to hide superfluous details of a set of objects. It allows one to concentrate on the properties that are of interest to the application. There are two main abstraction mechanisms used to model information: generalization/specialization and aggregation.
[Figure: Generalization and specialization of the EMPLOYEE entity (empno, name, dob) through 'Is-a' relationships: FULL_TIME_EMPLOYEE and PART_TIME_EMPLOYEE are generalized into EMPLOYEE, while EMPLOYEE is specialized into Faculty, Staff and Teaching based on attributes such as degree.]
EMPLOYEE(empno, name, dob)
FULL_TIME_EMPLOYEE(empno, salary)
PART_TIME_EMPLOYEE(empno, type)
FACULTY(empno, degree, interest)
STAFF(empno, hour-rate)
TEACHING(empno, stipend)
Aggregation:
Aggregation is the process of compiling information on an object, thereby abstracting a higher-level object. The entity person is derived by aggregating the characteristics name, address and ssn. Another form of aggregation is abstracting a relationship between objects and viewing the relationship as an object.
[Figure: Aggregation - the 'works-on' relationship among Employee, Branch and Job is abstracted into a higher-level object that participates in a 'manages' relationship with Manager.]
ER- Diagram For College Database
Conversion of ER-Diagram to Relational Database
2. For each weak entity type W in the ER diagram, we create another relation R that contains all simple attributes of W. If E is the owner entity of W, then the key attribute of E is also included in R; this key attribute of E is set as a foreign key attribute of R. The combination of the primary key attribute of the owner entity type and the partial key of the weak entity type forms the key of the weak entity type.
• One-to-Many Relationship:
For each 1:N relationship type R involving two entities E1 and E2, we identify the entity type (say E1) at the N side of the relationship type R and include the primary key of the entity on the other side of the relationship (say E2) as a foreign key attribute in the table of E1. We also include all simple attributes (or simple components of composite attributes) of R, if any, in the table of E1.
For example:
Consider the 'works in' relationship between DEPARTMENT and FACULTY. For this relationship, choose the entity at the N side, i.e. FACULTY, and add the primary key attribute of the other entity DEPARTMENT, i.e. DNO, as a foreign key attribute in FACULTY.
• Many-to-Many Relationship:
For each M:N relationship type R, we create a new table (say S) to represent R. We include the primary key attributes of both the participating entity types as foreign key attributes in S. Any simple attributes of the M:N relationship type (or simple components of composite attributes) are also included as attributes of S.
For example:
The M:N relationship taught-by between the entities COURSE and FACULTY should be represented as a new table. The structure of the table will include the primary key of COURSE and the primary key of FACULTY:
TAUGHT-BY(ID (primary key of the FACULTY table), course-id (primary key of the COURSE table))
• N-ary Relationship:
For each n-ary relationship type R, where n > 2, we create a new table S to represent R. We include as foreign key attributes in S the primary keys of the relations that represent the participating entity types. We also include any simple attributes of the n-ary relationship type (or simple components of composite attributes) as attributes of S. The primary key of S is usually a combination of all the foreign keys that reference the relations representing the participating entity types.
[Figure: Ternary relationship 'loan-sanction' among Customer, Loan and Employee.]
• Multi-Valued Attributes:
For each multivalued attribute A, we create a new relation R that includes an attribute corresponding to A, plus the primary key attribute K of the relation that represents the entity type or relationship type that has A as an attribute. The primary key of R is then the combination of A and K.
For example, if a STUDENT entity has rollno, name and phone number, where phone number is a multivalued attribute, then we create a table PHONE(rollno, phoneno) whose primary key is the combination of both attributes. In the STUDENT table we need not store the phone number; it can simply be (rollno, name). (A small sketch of this split follows.)
PHONE(rollno, phoneno)
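As a small illustration of this rule, here is a hedged Python sketch (the phone numbers are hypothetical, not from the notes) that splits the multivalued phone attribute of STUDENT into a separate PHONE relation keyed on (rollno, phoneno):

# Hypothetical STUDENT data in which 'phones' is a multivalued attribute.
students = [
    {"rollno": 101, "name": "Sujit", "phones": ["9000000001", "9000000002"]},
    {"rollno": 102, "name": "Kunal", "phones": ["9000000003"]},
]

# STUDENT keeps only the single-valued attributes (rollno, name).
STUDENT = [(s["rollno"], s["name"]) for s in students]

# PHONE gets one row per phone number; (rollno, phoneno) is its primary key.
PHONE = [(s["rollno"], p) for s in students for p in s["phones"]]

print(STUDENT)   # [(101, 'Sujit'), (102, 'Kunal')]
print(PHONE)     # [(101, '9000000001'), (101, '9000000002'), (102, '9000000003')]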
[Figure: Generalization/specialization example - the ACCOUNT entity (account_no, name, branch) is specialized through an 'Is-a' relationship into SAVING (interest) and CURRENT (charges) accounts.]
Hierarchical Model:
• A hierarchical database consists of a collection of records which are connected to one
another through links.
• A record is a collection of fields, each of which contains only one data value.
• A link is an association between precisely two records.
• The hierarchical model differs from the network model in that the records are organized as
collections of trees rather than as arbitrary graphs.
Tree-Structure Diagrams:
• The schema for a hierarchical database consists of
o boxes, which correspond to record types
o lines, which correspond to links
• Record types are organized in the form of a rooted tree.
o No cycles in the underlying graph.
o Relationships formed in the graph must be such that only
one-to-many or one-to-one relationships exist between a parent and a child.
Single Relationships:
▪ Example of E-R diagram with two entity sets, customer and account, related through a
binary, one-to-many relationship depositor.
▪ Corresponding tree-structure diagram has
o the record type customer with three fields: customer-name, customer-street, and
customer-city.
o the record type account with two fields: account-number and balance
o the link depositor, with an arrow pointing to customer
▪ If the relationship depositor is one to one, then the link depositor has two arrows.
▪ Must consider the type of queries expected and the degree to which the database schema fits
the given E-R diagram.
▪ In all versions of this transformation, the underlying database tree (or trees) will have
replicated records.
▪ Create two tree-structure diagrams, T1, with the root customer, and T2, with the root
account.
▪ In T1, create depositor, a many-to-one link from account to customer.
▪ In T2, create account-customer, a many-to-one link from customer to account.
Virtual Records:
• For many-to-many relationships, record replication is necessary to preserve the tree-
structure organization of the database.
• Data inconsistency may result when updating takes place
• Waste of space is unavoidable
• Virtual record — contains no data value, only a logical pointer to a particular physical
record.
• When a record is to be replicated in several database trees, a single copy of that record is
kept in one of the trees and all other records are replaced with a virtual record.
• Let R be a record type that is replicated in T1, T2, . . ., Tn. Create a new virtual record type
virtual-R and replace R in each of the n – 1 trees with a record of type virtual-R.
• Eliminate data replication in the following diagram ; create virtual-customer and virtual-
account.
• Replace account with virtual-account in the first tree, and replace customer with virtual-
customer in the second tree.
• Add a dashed line from virtual-customer to customer, and from virtual-account to account,
to specify the association between a virtual record and its corresponding physical record.
Network Model:
▪ Data are represented by collections of records.
o similar to an entity in the E-R model
o Records and their fields are represented as record type
▪ type customer = record
      customer-name: string;
      customer-street: string;
      customer-city: string;
  end
▪ type account = record
      account-number: integer;
      balance: integer;
  end
▪ Relationships among data are represented by links
o similar to a restricted (binary) form of an E-R relationship
o restrictions on links depend on whether the relationship is many-to-many, many-to-
one, or one-to-one.
Data-Structure Diagrams:
▪ Schema representing the design of a network database.
▪ A data-structure diagram consists of two basic components:
o Boxes, which correspond to record types.
o Lines, which correspond to links.
▪ Specifies the overall logical structure of the database.
Since a link cannot contain any data values, an E-R relationship with attributes is represented with a new record type and links.
To represent an E-R relationship of degree 3 or higher, connect the participating record types
through a new record type that is linked directly to each of the original record types.
1. Replace entity sets account, customer, and branch with record types account, customer, and
branch, respectively.
2. Create a new record type Rlink (referred to as a dummy record type).
3. Create the following many-to-one links:
o CustRlink from Rlink record type to customer record type
o AcctRlnk from Rlink record type to account record type
o BrncRlnk from Rlink record type to branch record type
DBTG Sets:
o The structure consisting of two record types that are linked together is referred to in the
DBTG model as a DBTG set
o In each DBTG set, one record type is designated as the owner, and the other is designated as
the member, of the set.
o Each DBTG set can have any number of set occurrences (actual instances of linked records).
o Since many-to-many links are disallowed, each set occurrence has precisely one owner, and
has zero or more member records.
o No member record of a set can participate in more than one occurrence of the set at any
point.
o A member record can participate simultaneously in several set occurrences of different
DBTG sets.
RELATIONAL MODEL
The relational model is a simple model in which a database is represented as a collection of "relations", where each relation is represented by a two-dimensional table.
The relational model was proposed by E. F. Codd of IBM in 1970. The basic concept in the relational model is that of a relation.
Properties:
o It is column homogeneous. In other words, in any given column of a table, all items are of
the same kind.
o Each item is a simple number or a character string; that is, a table must be in first normal form.
o All rows of a table are distinct.
o The ordering of rows within a table is immaterial.
o The columns of a table are assigned distinct names, and the ordering of these columns is immaterial.
Tuple:
Each row in a table represents a record and is called a tuple. A table containing 'n' attributes in a record is called an n-tuple.
Attributes:
The name of each column in a table is used to interpret its meaning and is called an attribute. Each table is called a relation. In the above table, account_number, branch name and balance are the attributes.
Domain:
A domain is the set of values that can be given to an attribute. So every attribute in a table has a specific domain. Values of these attributes cannot be assigned outside their domains.
Relation:
A relation consists of
o Relational schema
o Relation instance
Relational Schema:
A relational schema specifies the relation's name, its attributes and the domain of each attribute. If R is the name of a relation and A1, A2, …, An is the list of attributes representing R, then R(A1, A2, …, An) is called a relational schema. Each attribute Ai in this relational schema takes values from some specific domain called domain(Ai).
Example:
PERSON (PERSON_ID:INTEGER, NAME:STRING, AGE:INTEGER, ADDRESS:STRING)
The total number of attributes in a relation denotes the degree of the relation. Since the PERSON relation schema contains four attributes, this relation is of degree 4.
Relation Instance:
A relation instance, denoted by r, is a collection of tuples for a given relational schema at a specific point of time.
A relation state r of the relation schema R(A1, A2, …, An), also denoted by r(R), is a set of n-tuples
r = {t1, t2, …, tm}
where each n-tuple t is an ordered list of n values
t = <v1, v2, …, vn>
and each vi belongs to domain(Ai) or is a null value.
The relation schema is also called the 'intension' and the relation state is also called the 'extension'.
Eg: Relation schema for Student
STUDENT(rollno:string, name:string, city:string, age:integer)
Relation instance:
Student:
Rollno Name City Age
101 Sujit Bam 23
102 kunal bbsr 22
Keys:
Super key:
A super key is an attribute or a set of attributes used to identify the records uniquely in a relation. For example: customer-id, (cname, customer-id), (cname, telno).
Candidate key:
Super keys of a relation can contain extra attributes. Candidate keys are minimal super keys, i.e. such a key contains no extraneous attribute. An attribute is called extraneous if, even after removing it from the key, the remaining attributes still have the properties of a key (i.e. they still identify every tuple of the table).
In a relation R, a candidate key for R is a subset of the set of attributes of R which has the following properties:
• Uniqueness: no two distinct tuples in R have the same values for the candidate key.
• Irreducibility: no proper subset of the candidate key has the uniqueness property.
• A candidate key's values must exist; they cannot be null.
• The values of a candidate key must be stable; its value cannot change outside the control of the system.
Eg: (cname,telno)
Primary key:
The primary key is the candidate key that is chosen by the database designer as the principal means of identifying entities within an entity set. The remaining candidate keys, if any, are called alternate keys.
CONSTRAINTS
RELATIONAL CONSTRAINTS:
There are three types of constraints on relational database that include
o DOMAIN CONSTRAINTS
o KEY CONSTRAINTS
o INTEGRITY CONSTRAINTS
DOMAIN CONSTRAINTS:
A domain constraint specifies that each attribute in a relation takes an atomic value from the corresponding domain. The data types associated with commercial RDBMS domains include:
o Standard numeric data types for integer
o Real numbers
o Characters
o Fixed length strings and variable length strings
Thus, domain constraints specify the conditions that we want to put on each instance of the relation: the values that appear in each column must be drawn from the domain associated with that column.
Rollno Name City Age
101 Sujit Bam 23
102 kunal bbsr 22
Key Constraints:
This constraint states that the key attribute value in each tuple must be unique, i.e. no two tuples contain the same value for the key attribute (null values may be allowed for key attributes other than the primary key).
Eg: in Emp(empcode, name, address), empcode must be unique.
Integrity CONSTRAINTS:
There are two types of integrity constraints:
o Entity Integrity Constraints
o Referential Integrity constraints
CODD'S RULES
Let us discuss the whole process with an example. Let us consider the following two relations as the
example tables for our discussion;
Query Tree:
A query tree is a tree data structure used to represent a relational algebra expression; it is built when the query is parsed.
Query Optimization:
A single query can be executed through different algorithms or re-written in different forms and structures. Hence the question of query optimization arises: which of these forms or pathways is the most optimal? The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.
There are broadly two ways a query can be optimized:
1. Analyze and transform equivalent relational expressions: try to minimize the tuple and column counts of the intermediate and final query results (discussed below).
2. Using different algorithms for each operation: These underlying algorithms determine how
tuples are accessed from the data structures they are stored in, indexing, hashing, data retrieval
and hence influence the number of disk and block accesses (discussed in query processing).
Analyze and transform equivalent relational expressions
Here, we shall talk about generating minimal equivalent expressions. To analyze equivalent expressions, a set of equivalence rules is listed below. These generate equivalent expressions for a query written in relational algebra. To optimize a query, we can replace it by an equivalent expression whenever an equivalence rule applies.
1. Conjunctive selection operations can be written as a sequence of individual selections:
σθ1∧θ2(E) = σθ1(σθ2(E)). This is called a sigma-cascade.
Explanation: applying the combined condition θ1 ∧ θ2 at once is expensive. Instead, filter out the tuples satisfying condition θ2 (inner selection) and then apply condition θ1 (outer selection) to the resulting, fewer tuples. This leaves us with fewer tuples to process the second time, and it can be extended to two or more conjoined selections. Since we are breaking a single condition into a series of selections, it is called a "cascade".
2. Selection is commutative: σθ1(σθ2(E)) = σθ2(σθ1(E)).
Explanation: selection is commutative in nature; it does not matter whether we apply σθ1 first or σθ2 first. In practice, it is better and more optimal to apply first the selection that yields the fewer number of tuples, since this saves time on the outer selection.
3. In a series of projections, all but the outermost projection can be omitted: πL1(πL2(E)) = πL1(E), where L1 ⊆ L2. This is called a pi-cascade.
Explanation: a cascade or a series of projections is redundant, because in the end we keep only those columns which are specified in the last, outermost projection. Hence, it is better to collapse all the projections into just one, i.e. the outermost projection.
4. Selections on Cartesian products can be re-written as theta joins.
▪ Equivalence 1: σθ(E1 × E2) = E1 ⋈θ E2
Explanation: the cross product operation is known to be very expensive, because it matches each tuple of E1 (m tuples in total) with each tuple of E2 (n tuples in total), yielding m*n entries. If we apply a selection operation after that, we would have to scan through all m*n entries to find the tuples that satisfy the condition θ. Instead, it is more optimal to use the theta join, a join specifically designed to select only those entries in the cross product which satisfy the theta condition, without evaluating the entire cross product first.
▪ Equivalence 2: σθ1(E1 ⋈θ2 E2) = E1 ⋈θ1∧θ2 E2
Explanation: a theta join radically decreases the number of resulting tuples, so if we fold both conditions, i.e. θ1 and θ2, into the theta join itself, we have fewer scans to do. On the other hand, keeping the condition θ1 outside unnecessarily increases the number of tuples to scan.
5. Theta joins are commutative: E1 ⋈θ E2 = E2 ⋈θ E1.
Explanation: theta joins are commutative, but the query processing time depends to some extent on which table is used as the outer loop and which one is used as the inner loop during the join process (based on the indexing structures and blocks).
6. Join operations are associative.
▪ Natural join: (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
Explanation: natural joins are both commutative and associative, so one should join first the two tables that yield the smaller number of entries, and then apply the other join.
▪ Theta join: (E1 ⋈θ1 E2) ⋈θ2∧θ3 E3 = E1 ⋈θ1∧θ3 (E2 ⋈θ2 E3)
Explanation: theta joins are associative in the above manner, where θ2 involves attributes from only E2 and E3.
7. The selection operation can be distributed over a theta join.
▪ Equivalence 1: σθ1(E1 ⋈θ E2) = (σθ1(E1)) ⋈θ E2, when θ1 involves only attributes of E1.
Explanation: applying a selection after the theta join forces all the tuples returned by the join to be examined after the join. If the selection contains attributes from only E1, it is better to apply the selection to E1 first (resulting in fewer tuples) and then join with E2.
▪ Equivalence 2: σθ1∧θ2(E1 ⋈θ E2) = (σθ1(E1)) ⋈θ (σθ2(E2)), when θ1 involves only attributes of E1 and θ2 involves only attributes of E2.
Materialization
• Materialized evaluation: evaluate one operation at a time, starting at the lowest level, and use intermediate results materialized into temporary relations to evaluate the next-level operations.
• E.g., in the expression tree referred to below, compute and store the result of the innermost selection, then compute and store its join with customer, and finally compute the projection on customer-name.
Pipelining
• Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next.
🟊 E.g., in the previous expression tree, don't store the result of the selection; instead, pass its tuples directly to the join. Similarly, don't store the result of the join; pass its tuples directly to the projection.
• Much cheaper than materialization: no need to store a temporary relation to disk.
• Pipelining may not always be possible – e.g., sort, hash-join.
• For pipelining to be effective, use evaluation algorithms that generate output tuples even as
tuples are received for inputs to the operation.
• Pipelines can be executed in two ways: demand driven and producer driven
• In demand driven or lazy evaluation
🟊 system repeatedly requests next tuple from top level operation
🟊 Each operation requests next tuple from children operations as required, in order to
output its next tuple
🟊 In between calls, operation has to maintain “state” so it knows what to return next
🟊 Each operation is implemented as an iterator implementing the following operations (a small sketch follows this list)
- open()
▪ E.g. file scan: initialize file scan, store pointer to beginning of file as
state
▪ E.g. merge join: sort the relations and store pointers to the beginning of the sorted relations as state
- next()
▪ E.g. for file scan: Output next tuple, and advance and store file
pointer
▪ E.g. for merge join: continue with merge from earlier state till
next output tuple is found. Save pointers as iterator state.
- close()
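To make the open()/next()/close() protocol concrete, here is a minimal demand-driven sketch in Python (an illustration only; the operator classes, predicate and sample rows below are assumptions, not part of the notes). Each operator pulls the next tuple from its child only when its own next() is called, so no intermediate relation is materialized.

class FileScan:
    """Leaf operator: iterates over an in-memory list standing in for a data file."""
    def __init__(self, rows):
        self.rows = rows
    def open(self):
        self.pos = 0                      # state: pointer to the "beginning of the file"
    def next(self):
        if self.pos >= len(self.rows):
            return None                   # end of input
        row = self.rows[self.pos]
        self.pos += 1                     # advance and remember the file pointer
        return row
    def close(self):
        pass

class Select:
    """Selection operator: requests tuples from its child and returns only matching ones."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()
    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row                # end of input, or a tuple satisfying the predicate
    def close(self):
        self.child.close()

# Pipeline: scan hypothetical account tuples and filter balance > 1200 on demand.
account = [("A-101", "Sadar", 500), ("A-102", "Sadar", 2000), ("A-103", "Agra", 1500)]
plan = Select(FileScan(account), lambda t: t[2] > 1200)
plan.open()
while (t := plan.next()) is not None:
    print(t)                              # ('A-102', 'Sadar', 2000) then ('A-103', 'Agra', 1500)
plan.close()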
πTitle(BOOK) =
Title
"DBMS"
"COMPILER"
"OS"
Selection
➢ Selects rows that satisfy selection condition.
➢ No duplicates in result
➢ Schema of result identical to schema of (only) input relation.
➢ Result relation can be the input for another relational algebra operation! (Operator
composition.)
Example: For the example given above:
σAcc-no>300(BOOK) =
Acc-No   Title        Author
400      "COMPILER"   "Ullman"
500      "OS"         "Sudarshan"
σTitle=”DBMS”(BOOK) = the tuples of BOOK whose Title is “DBMS”. (A small sketch of both operations follows.)
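The selection and projection operators shown above can be mimicked over in-memory tuples. In the rough sketch below the BOOK rows follow the results given above, except that the Acc-no and Author of the "DBMS" book are placeholders, since the notes do not show them.

# BOOK relation as a list of dicts (the DBMS row's Acc-no/Author are made-up placeholders).
BOOK = [
    {"Acc-no": 100, "Title": "DBMS",     "Author": "unknown"},
    {"Acc-no": 400, "Title": "COMPILER", "Author": "Ullman"},
    {"Acc-no": 500, "Title": "OS",       "Author": "Sudarshan"},
]

def select(relation, condition):
    """Selection: keep whole tuples satisfying the condition; the schema is unchanged."""
    return [t for t in relation if condition(t)]

def project(relation, attrs):
    """Projection: keep only the named columns and drop duplicate tuples."""
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            result.append(dict(zip(attrs, row)))
    return result

print(select(BOOK, lambda t: t["Acc-no"] > 300))   # the COMPILER and OS tuples
print(project(BOOK, ["Title"]))                    # titles DBMS, COMPILER, OS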
➢ All of these operations take two input relations, which must be union-compatible:
o Same number of fields.
o Corresponding fields have the same type.
➢ What is the schema of result?
Consider:
Borrower:
Cust-name   Loan-no
Ram         L-13
Shyam       L-30
Suleman     L-42

Depositor:
Cust-name    Acc-no
Suleman      A-100
Radheshyam   A-300
Ram          A-401
List of customers who are either borrowers or depositors at the bank = πCust-name(Borrower) ∪ πCust-name(Depositor) =
Cust-name
Ram
Shyam
Suleman
Radheshyam

Customers who are both borrowers and depositors = πCust-name(Borrower) ∩ πCust-name(Depositor) =
Cust-name
Ram
Suleman

Customers who are borrowers but not depositors = πCust-name(Borrower) − πCust-name(Depositor) =
Cust-name
Shyam
Cartesian-Product or Cross-Product (S1 × R1)
• Each row of S1 is paired with each row of R1.
• Result schema has one field per field of S1 and R1, with field names 'inherited' if possible.
• Consider the borrower and loan tables as follows:
Borrower:
Cust-name   Loan-no
Ram         L-13
Shyam       L-30
Suleman     L-42

Loan:
Loan-no   Amount
L-13      1000
L-30      20000
L-42      40000
The rename operation can be used to rename the fields to avoid confusion when two field names are the same in the two participating tables.
For example, the statement ρLoan-borrower(Cust-name, Loan-No-1, Loan-No-2, Amount)(Borrower × Loan) results in a new table named Loan-borrower, which has four fields renamed as Cust-name, Loan-No-1, Loan-No-2 and Amount, and whose rows contain the same data as the cross product of Borrower and Loan.
Loan-borrower:
Cust-name   Loan-No-1   Loan-No-2   Amount
Ram L-13 L-13 1000
Ram L-13 L-30 20000
Ram L-13 L-42 40000
Shyam L-30 L-13 1000
Shyam L-30 L-30 20000
Shyam L-30 L-42 40000
Suleman L-42 L-13 1000
Suleman L-42 L-30 20000
Suleman L-42 L-42 40000
Rename Operation:
It can be used in two ways:
• ρx(E): return the result of expression E under the name x.
• ρx(A1, A2, …, An)(E): return the result of expression E under the name x, with the attributes renamed to A1, A2, …, An.
• Its benefit can be understood from the solution of the query "Find the largest account balance in the bank".
It can be solved by following steps:
• Find the relation of those balances which are not the largest.
• Consider the Cartesian product of Account with itself, i.e. Account × Account.
• Compare the balances of the first Account table with the balances of the second Account table in the product.
• For that, we should rename one of the Account tables by some other name to avoid confusion.
This can be done by the following operation:
ΠAccount.balance(σAccount.balance < d.balance(Account × ρd(Account)))
• So the above relation contains the balances which are not largest.
• Subtract this relation from the relation containing all the balances, i.e. Πbalance(Account).
So the final expression for solving the above query is:
Πbalance(Account) − ΠAccount.balance(σAccount.balance < d.balance(Account × ρd(Account)))
Additional Operations
Natural Join (⋈)
• Forms the Cartesian product of its two arguments, then performs a selection forcing equality on those attributes that appear in both relations, and removes the duplicate column.
• For example, consider the Borrower and Loan relations: the natural join between them automatically performs a selection on the table returned by Borrower × Loan, forcing equality on the attribute that appears in both Borrower and Loan, i.e. Loan-no, and keeps only one of the columns named Loan-no.
• That means Borrower ⋈ Loan = σBorrower.Loan-no = Loan.Loan-no(Borrower × Loan), with the duplicate Loan-no column removed.
• The table returned from this will be as follows. Eliminating the rows of Borrower × Loan (shown above) that do not satisfy the selection criterion Borrower.Loan-no = Loan.Loan-no, and keeping the Loan-no column only once, gives (a small sketch of the same join follows):

Cust-name   Loan-no   Amount
Ram         L-13      1000
Shyam       L-30      20000
Suleman     L-42      40000
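The same result can be reproduced with a rough Python sketch of the definition above: take the cross product, keep the pairs whose loan numbers are equal, and keep the Loan-no column only once. The rows are the Borrower and Loan tables shown earlier.

Borrower = [("Ram", "L-13"), ("Shyam", "L-30"), ("Suleman", "L-42")]
Loan = [("L-13", 1000), ("L-30", 20000), ("L-42", 40000)]

# Natural join on Loan-no: cross product + equality selection + single Loan-no column.
natural_join = [
    (cust, b_loan, amount)
    for (cust, b_loan) in Borrower
    for (l_loan, amount) in Loan
    if b_loan == l_loan
]
print(natural_join)
# [('Ram', 'L-13', 1000), ('Shyam', 'L-30', 20000), ('Suleman', 'L-42', 40000)]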
Division Operation:
• Denoted by ÷, it is used for queries that include the phrase "for all".
• For example, "Find the customers who have an account in all branches located in the city Agra". This query can be solved by the following expression:
ΠCustomer-name, branch-name(Depositor ⋈ Account) ÷ Πbranch-name(σBranch-city="Agra"(Branch))
• The division operation can be specified using only the basic operations, as follows (a small sketch follows the formula):
Let r(R) and s(S) be relations on schemas R and S, with S ⊆ R. Then:
r ÷ s = ΠR-S(r) − ΠR-S((ΠR-S(r) × s) − ΠR-S,S(r))
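Here is a minimal sketch of the division operator that follows the basic-operation formula above. It assumes r has exactly two columns (one R−S attribute and one S attribute) and s has one column; the depositor/branch rows used to exercise it are hypothetical.

def divide(r, s):
    """r: set of (x, y) pairs over (R-S, S); s: set of (y,) tuples over S.
    Returns the x values that are paired with every y in s, i.e. r ÷ s."""
    r_minus_s = {(x,) for (x, y) in r}                        # Π_{R-S}(r)
    all_pairs = {(x, y) for (x,) in r_minus_s for (y,) in s}  # Π_{R-S}(r) × s
    missing = {(x,) for (x, y) in (all_pairs - r)}            # x values lacking some y
    return r_minus_s - missing

# Which customers hold an account in every Agra branch? (hypothetical data)
holds_account_in = {("Ram", "Agra-1"), ("Ram", "Agra-2"), ("Shyam", "Agra-1")}
agra_branches = {("Agra-1",), ("Agra-2",)}
print(divide(holds_account_in, agra_branches))   # {('Ram',)}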
select customer-name
from Customer
where customer-street like “%Main%”
Set Operations
• union, intersect and except operations are set operations available in SQL.
• Relations participating in any of the set operation must be compatible; i.e. they must have
the same set of attributes.
• Union Operation:
o Find all customers having a loan, an account, or both at the bank
(select customer-name from Depositor )
union
(select customer-name from Borrower )
It will automatically eliminate duplicates.
o If we want to retain duplicates union all can be used
(select customer-name from Depositor )
union all
(select customer-name from Borrower )
• Intersect Operation
o Find all customers who have both an account and a loan at the bank
(select customer-name from Depositor )
intersect
(select customer-name from Borrower )
o If we want to retain all the duplicates:
(select customer-name from Depositor )
intersect all
(select customer-name from Borrower )
• Except Operation
o Find all customers who have an account but no loan at the bank
(select customer-name from Depositor )
except
(select customer-name from Borrower )
o If we want to retain the duplicates:
(select customer-name from Depositor )
except all
(select customer-name from Borrower )
Aggregate Functions
• Aggregate functions are those functions which take a collection of values as input and return
a single value.
• SQL offers 5 built in aggregate functions-
o Average: avg
o Minimum:min
o Maximum:max
o Total: sum
o Count:count
• The input to sum and avg must be a collection of numbers, but the other functions can operate on collections of non-numeric data types as well.
• Find the average account balance at the Sadar branch
select avg(balance)
from Account
where branch-name= “Sadar”
The result will be a table which contains a single cell (one row and one column) holding the numerical value corresponding to the average balance of all accounts at the Sadar branch.
• The group by clause is used to form groups; tuples with the same value on all attributes in the group by clause are placed in one group.
• Find the average account balance at each branch
select branch-name, avg(balance)
from Account
group by branch-name
• By default the aggregate functions include duplicates.
• The distinct keyword is used to eliminate duplicates in an aggregate function:
• Find the number of depositors for each branch
select branch-name, count(distinct customer-name)
from Depositor, Account
where Depositor.account-number = Account.account-number
group by branch-name
• The having clause is used to state a condition that applies to groups rather than to tuples.
• Find the average account balance at each branch where average account balance is more
than Rs. 1200
select branch-name, avg(balance)
from Account
group by branch-name
having avg(balance) > 1200
• Count the number of tuples in Customer table
select count(*)
from Customer
• SQL doesn’t allow distinct with count(*)
• When where and having are both present in a statement, where is applied before having.
Nested Sub queries
A subquery is a select-from-where expression that is nested within another query.
Set Membership
The in and not in connectives are used for this type of subquery.
“Find all customers who have both a loan and an account at the bank”, this query can be written
using nested subquery form as follows
select distinct customer-name
from Borrower
where customer-name in(select customer-name
from Depositor )
• Select the names of customers who have a loan at the bank, and whose names are neither
“Smith” nor “Jones”
select distinct customer-name
from Borrower
where customer-name not in(“Smith”, “Jones”)
Set Comparison
Find the names of all branches that have assets greater than those of at least one branch located in
Mathura
select branch-name
from Branch
where assets > some (select assets
from Branch
where branch-city = "Mathura" )
1. Apart from > some, the other comparisons could be < some, <= some, >= some, = some and <> some.
2. Find the names of all branches that have assets greater than that of each branch located in
Mathura
select branch-name
from Branch
where assets > all (select assets
from Branch
where branch-city = “Mathura” )
▪ Apart from > all, the other comparisons could be < all, <= all, >= all, = all and <> all.
Views
In SQL create view command is used to define a view as follows:
create view v as <query expression>
where <query expression> is any legal query expression and v is the view name.
➢ Consider the view consisting of branch names and the names of customers who have either an account or a loan at that branch. This can be defined as follows:
create view All-customer as
(select branch-name, customer-name
from Depositor, Account
where Depositor.account-number = Account.account-number)
union
(select branch-name, customer-name
from Borrower, Loan
where Borrower.loan-number = Loan.loan-number)
➢ The attribute names may be specified explicitly within a set of round brackets after the name of the view.
➢ The view name may be used as a relation in subsequent queries. Using the view All-customer:
Find all customers of Sadar branch
select customer-name
from All-customer
where branch-name= “Sadar”
➢ A create view clause creates a view definition in the database, which stays until the command drop view view-name is executed.
Modification of Database
Deletion
❖ In SQL we can delete only whole tuples, not the values of particular attributes. The command is as follows:
delete from r where P
which deletes from relation r all tuples for which predicate P is true.
Insertion
In SQL we either specify a tuple to be inserted or write a query whose result is a set of tuples to be inserted. Examples are as follows:
Insert an account with account number A-9732 at the Sadar branch having a balance of Rs 1200:
insert into Account
values(“Sadar”, “A-9732”, 1200)
The values are specified in the order in which the corresponding attributes are listed in the relation schema.
SQL also allows the attributes to be specified as part of the insert statement:
insert into Account(account-number, branch-name, balance)
values(“A-9732”, “Sadar”, 1200)
insert into Account(branch-name, account-number, balance)
values(“Sadar”, “A-9732”, 1200)
Provide, for all loan customers of the Sadar branch, a new Rs 200 savings account for each loan account they have, where the loan-number serves as the account number for these new accounts:
insert into Account
select branch-name, loan-number, 200
from Loan
where branch-name = “Sadar”
Updates
The update statement is used to change a value in a tuple without changing all the values in the tuple.
Suppose that annual interest payments are being made, and all balances are to be increased by 5 percent:
update Account
set balance = balance * 1.05
Suppose that accounts with balances over Rs 10000 receive 6 percent interest, whereas all others receive 5 percent:
update Account
set balance = balance * 1.06
where balance > 10000
update Account
set balance = balance * 1.05
where balance <= 10000
Data Definition Language
Data Types in SQL
char(n): fixed length character string, length n.
varchar(n): variable length character string, maximum length n.
int: an integer.
smallint: a small integer.
numeric(p,d): fixed point number, p digits( plus a sign), and d of the p digits are
to right of the decimal point.
real, double precision: floating point and double precision numbers.
float(n): a floating point number, precision at least n digits.
date: calendar date; four digits for the year, two for the month and two for the day of the month.
time: time of day in hours, minutes and seconds.
Domains can be defined as:
create domain person-name char(20)
The domain name person-name can then be used to define the type of an attribute just like a built-in domain.
Schema Definition in SQL
The create table command is used to define relations:
create table r (A1 D1, A2 D2, …, An Dn,
<integrity constraint1>,
…,
<integrity constraintk>)
where r is the relation name, each Ai is the name of an attribute, and Di is the domain type of the values of Ai. Several types of integrity constraints can be defined in SQL.
Integrity Constraints
• Integrity Constraints guard against accidental damage to the database.
• Integrity constraints are predicates pertaining to the database.
• Domain Constraints:
• Predicates defined on the domains are Domain constraints.
• The simplest domain constraints are defined by defining standard data types for the attributes, like integer, double, float, etc.
• We can also define domains using the create domain clause, and we can define constraints on such domains as follows:
create domain hourly-wage numeric(5,2)
constraint wage-value-test check(value >= 4.00)
• So we can use hourly-wage as the data type for any attribute, and the DBMS will automatically allow only values greater than or equal to 4.00 for it.
• Other examples for defining Domain constraints are as follows:
create domain account-number char(10)
constraint account-number-null-test check(value not null)
create domain account-type char(10)
constraint account-type-test
check (value in ( “Checking”, “Saving”))
By using the latter of the two domains above, the DBMS will allow only the values "Checking" and "Saving" for any attribute of type account-type.
• Referential Integrity:
• Foreign key: suppose two tables R and S are related to each other, K1 and K2 are the primary keys of the two relations, and K1 is also one of the attributes of S. If we want every row in S to have a corresponding row in R, then we define K1 in S as a foreign key. Example: in our original library database we had a table for the relation BORROWEDBY, containing two fields Card No. and Acc. No. Every row of the BORROWEDBY relation must have a corresponding row in the USER table having the same Card No. and a row in the BOOK table having the same Acc. No. We therefore define Card No. and Acc. No. in the BORROWEDBY relation as foreign keys.
• In other words, every row of the BORROWEDBY relation must refer to some row in the BOOK table and also to some row in the USER table.
• Such a referential requirement from one table to another table is called referential integrity (a small sketch follows this list).
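As an illustration of what the DBMS has to enforce here, the sketch below uses hypothetical rows and column positions (not the notes' actual tables) and reports the BORROWEDBY rows whose Card No. or Acc. No. has no matching row in USER or BOOK.

# Hypothetical rows: USER(card_no, name), BOOK(acc_no, title), BORROWEDBY(card_no, acc_no).
USER = [(1, "Asha"), (2, "Bikash")]
BOOK = [(100, "DBMS"), (200, "OS")]
BORROWEDBY = [(1, 100), (2, 300)]        # (2, 300) refers to a non-existent book

def referential_violations(child, fk_pos, parent, pk_pos):
    """Return the child rows whose foreign-key value has no matching parent key."""
    parent_keys = {row[pk_pos] for row in parent}
    return [row for row in child if row[fk_pos] not in parent_keys]

print(referential_violations(BORROWEDBY, 0, USER, 0))   # []         -> all Card No. values exist
print(referential_violations(BORROWEDBY, 1, BOOK, 0))   # [(2, 300)] -> dangling Acc. No.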
Selections in QBE
QBE uses skeleton tables to represent table name and fieldnames like:
Table name Field1 Field2 …..
For selection, P operator along with variable name/constant name is used to display one or more
fields.
Example 1:
Consider the relation: student (name, roll, marks)
The following query can be represented as:
SQL: select name from student where marks>50;
student name roll marks
P.X >50
Here X is a constant; alternatively we can use _X as a variable.
Example 2:
For the relation given above
The following query can be represented as:
SQL: select * from student where marks>50 and marks <80;
student name roll marks
P. _X
CONDITION
_X>50 _X<80
Insertions in QBE:
Uses operator I. on the table.
Example: Consider the following query on Student table
SQL: insert into student values (‘abc’,10,60);
Deletions in QBE:
Uses operator D. on the table.
Example: Consider the following query on Student table
SQL: delete from student where marks=0;
Updation in QBE:
Uses operator U. on the table.
Example: Consider the following query on Student table
SQL: update student set mark=50 where roll=40;
Database design is a process in which you create a logical data model for a database, which stores the data of a company. It is performed after the initial database study phase in the database life cycle. You use the normalization technique to create the logical data model for a database and to eliminate data redundancy.
Normalization also allows you to organize data efficiently in a database and to reduce anomalies during data operations. Various normal forms, such as first, second and third, can be applied to create a logical data model for a database. The second and third normal forms are based on partial dependency and transitive dependency. A partial dependency occurs when a non-key column is uniquely identified by only a part of a composite primary key. A transitive dependency occurs when a non-key column is uniquely identified by values in another non-key column of the table.
3. Choice of a DBMS
The choice of a DBMS is governed by a number of factors: some technical, others economic, and still others concerned with the politics of the organization.
The economic and organizational factors that affect the choice of the DBMS are:
software cost, maintenance cost, hardware cost, database creation and conversion cost, personnel cost, training cost and operating cost.
We want to keep the semantics of the relation attributes clear. The information in a tuple should represent exactly one fact or one entity. Hidden or buried entities are what we want to discover and eliminate.
If nulls are likely (non-applicable attributes), then consider decomposing the relation into two or more relations that hold only the non-null valued tuples.
Too much decomposition of relations into smaller ones may also lose information or generate erroneous information.
• Be sure that relations can be logically joined using natural join and that the result does not generate relationships that do not exist.
Functional Dependencies
FDs are constraints on well-formed relations and represent a formalism on the infrastructure of a relation. For a functional dependency X → Y:
• X is a determinant
• X determines Y
• Y is functionally dependent on X
• X → Y is trivial if Y ⊆ X
A key constraint is a special kind of functional dependency: all attributes of the relation occur on the right-hand side of the FD. For example, if stuId is the key of
NewStudent(stuId, lastName, major, credits, status, socSecNo)
then the corresponding FD is stuId → stuId, lastName, major, credits, status, socSecNo.
Other examples of FDs are:
ZipCode → AddressCity
ArtistName → BirthYear
Author, Title → PublDate
Trivial FDs (those of the form X → Y with Y ⊆ X, e.g. stuId, lastName → stuId) always hold and do not contribute to the evaluation of normalization.
FD Axioms
FD manipulations:
The closure of F, denoted by F+, is the set of all functional dependencies logically implied by F.
The closure of F can be found by using a collection of rules called Armstrong axioms.
Reflexivity rule: If A is a set of attributes and B is subset or equal to A, then A→B holds.
Augmentation rule: If A→B holds and C is a set of attributes, then CA→CB holds
Transitivity rule: If A→B holds and B→C holds, then A→C holds.
Union rule: If A→B holds and A→C then A→BC holds
Decomposition rule: If A→BC holds, then A→B holds and A→C holds.
Pseudo transitivity rule: If A→B holds and BC→D holds, then AC→D holds.
Suppose we are given a relation schema R=(A,B,C,G,H,I) and the set of functional dependencies
{A→B,A→C,CG→H,CG→I,B→H}
We list several members of F+ here:
1. A→H, since A→B and B→H hold, we apply the transitivity rule.
2. CG→HI. Since CG→H and CG→I , the union rule implies that CG→HI
3. AG→I, since A→C and CG→I, the pseudo transitivity rule implies that AG→I holds
Algorithm to compute F+:
To compute the closure of a set of functional dependencies F:
F+ = F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F+
for each pair of functional dependencies f1 and f2 in F+
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency to F+
until F+ does not change any further
This procedure is expensive in practice, since the closure F+ can be very large.
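In practice we rarely list all of F+ explicitly; instead we compute the closure of a set of attributes under F and use it to test whether a particular FD is implied. The following is a minimal sketch of that attribute-closure computation in Python (the helper name attribute_closure is our own; the FDs are the ones from the example above):

# Compute the closure X+ of a set of attributes X under a set of FDs.
# Each FD is a pair (lhs, rhs) of attribute sets.
def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:                      # repeat until no new attributes can be added
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs          # lhs is already derivable, so rhs is too
                changed = True
    return closure

# FDs from the example: A->B, A->C, CG->H, CG->I, B->H
fds = [({"A"}, {"B"}), ({"A"}, {"C"}), ({"C", "G"}, {"H"}),
       ({"C", "G"}, {"I"}), ({"B"}, {"H"})]

print(attribute_closure({"A"}, fds))        # {'A', 'B', 'C', 'H'} -> so A->H holds
print(attribute_closure({"A", "G"}, fds))   # includes I           -> so AG->I holds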
LOSS LESS DECOMPOSITION
A decomposition of a relation scheme R<S,F> into the relation schemes Ri (1 ≤ i ≤ n) is said to be a
lossless-join decomposition, or simply lossless, if for every relation R that satisfies the FDs in F, the
natural join of the projections of R gives back the original relation R, i.e.,
R = πR1(R) ⋈ πR2(R) ⋈ … ⋈ πRn(R)
If R is a proper subset of πR1(R) ⋈ πR2(R) ⋈ … ⋈ πRn(R), i.e. the join contains extra (spurious) tuples,
then the decomposition is called lossy.
DEPENDENCY PRESERVATION:
Given a relation scheme R<S,F>, where F is the associated set of functional dependencies on the
attributes in S, suppose R is decomposed into the relation schemes R1, R2, …, Rn with the FDs F1, F2, …, Fn.
This decomposition of R is dependency preserving if the closure of F' (where F' = F1 ∪ F2 ∪ … ∪ Fn)
is equal to the closure of F, i.e. F'+ = F+.
Example:
Let R(A,B,C) and F={A→B}. Then the decomposition of R into R1(A,B) and R2(A,C) is lossless,
because the FD A→B is contained in R1 and the common attribute A is a key of R1.
Example:
Let R(A,B,C) and F={A→B}. Then the decomposition of R into R1(A,B) and R2(B,C) is not
lossless, because the common attribute B does not functionally determine either A or C, i.e. it is not
a key of R1 or R2.
Example:
Let R(A,B,C,D) and F={A→B, A→C, C→D}. Consider the decomposition of R into R1(A,B,C) with
the FDs F1={A→B, A→C} and R2(C,D) with the FD F2={C→D}. In this decomposition all the
original FDs can be logically derived from F1 and F2, hence the decomposition is dependency
preserving. The common attribute C forms a key of R2, so the decomposition is also lossless.
Example:
Let R(A,B,C,D) and F={A→B, A→C, A→D}. The decomposition of R into R1(A,B,D) with
the FDs F1={A→B, A→D} and R2(B,C) with the FDs F2={ } is lossy, because the common attribute B
is not a candidate key of either R1 or R2.
In addition, the FD A→C is not implied by any FD of R1 or R2. Thus the decomposition is not
dependency preserving.
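For a decomposition into exactly two schemes R1 and R2 there is a simple test: the decomposition is lossless if the common attributes R1 ∩ R2 functionally determine all of R1 or all of R2. A minimal sketch of this test (function names are our own), applied to the first two examples above:

def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure

# Binary lossless-join test: R1 ∩ R2 must be a key of R1 or of R2.
def is_lossless(r1, r2, fds):
    common = r1 & r2
    closure = attribute_closure(common, fds)
    return r1 <= closure or r2 <= closure

fds = [({"A"}, {"B"})]
print(is_lossless({"A", "B"}, {"A", "C"}, fds))   # True  - common attribute A is a key of R1
print(is_lossless({"A", "B"}, {"B", "C"}, fds))   # False - B determines neither A nor C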
Partial dependency:
Given a relation R, a set of functional dependencies F defined on the attributes of R, and a candidate key K:
if X is a proper subset of K and F |= X→A, then A is said to be partially dependent on K.
Normalization
While designing a database out of an entity–relationship model, the main problem existing in that
“raw” database is redundancy. Redundancy is storing the same data item in more than one place.
Redundancy creates several problems, like the following:
1. Extra storage space: storing the same data in many places takes large amount of disk space.
2. Entering same data more than once during data insertion.
3. Deleting data from more than one place during deletion.
4. Modifying data in more than one place.
5. Anomalies may occur in the database if insertion, deletion, modification, etc. are not done
properly. This creates inconsistency and unreliability in the database.
To solve this problem, the “raw” database needs to be normalized. This is a step by step process of
removing different kinds of redundancy and anomaly at each step. At each step a specific rule is
followed to remove specific kind of impurity in order to give the database a slim and clean look.
If a table contains non-atomic values in each row, it is said to be in UNF (un-normalized form). An atomic value is
something that cannot be further decomposed. A non-atomic value, as the name suggests, can be
further decomposed and simplified. Consider the following table:
In the sample table above, there are multiple occurrences of rows under each key Emp-Id. Although
considered to be the primary key, Emp-Id cannot uniquely identify any
single row. Further, each primary key value points to a variable-length group of rows (3 for E01, 2 for E02 and 4
for E03).
As you can see now, each row contains a unique combination of values. Unlike in UNF, this relation
contains only atomic values, i.e. the rows cannot be further decomposed, so the relation is now in
1NF.
A relation is said to be in 2NF if it is already in 1NF and each and every attribute fully depends on
the primary key of the relation. Speaking inversely, if a table has some attributes which are not
dependent on the whole primary key of that table, then it is not in 2NF.
Let us explain. Emp-Id is the primary key of the above relation. Emp-Name, Month, Sales and Bank-
Name all depend upon Emp-Id. But the attribute Bank-Name also depends on Bank-Id, which is not the
primary key of the table. So the table is in 1NF, but not in 2NF. If this portion is moved into
another related relation, the design would come to 2NF.
Bank-Id Bank-Name
B01 SBI
B02 UTI
After moving that portion into another relation, we store a smaller amount of data in two relations
without any loss of information. There is also a significant reduction in redundancy.
A relation is said to be in 3NF, if it is already in 2NF and there exists no transitive dependency in
that relation. Speaking inversely, if a table contains transitive dependency, then it is not in 3NF, and
the table must be split to bring it into 3NF.
From a given set of dependencies we can often derive new ones, and such derived dependencies hold well in most situations. For example, if we have
Roll → Marks
And
Marks → Grade
Then we may safely derive
Roll → Grade.
This third dependency was not originally specified, but we have derived it.
The derived dependency is called a transitive dependency when such a dependency becomes
improbable, i.e. it does not reflect a real-world relationship. For example, we have been given
Roll → City
And
City → STDCode
If we try to derive Roll → STDCode it becomes a transitive dependency, because obviously the
STDCode of a city cannot depend on the roll number issued by a school or college. In such a case the
relation should be broken into two, each containing one of these two dependencies:
Roll → City
And
City → STD code
Boyce-Codd Normal Form (BCNF)
A relation is said to be in BCNF if it is already in 3NF and the left-hand side of every dependency
is a candidate key. A relation which is in 3NF is almost always in BCNF. However, there could be
situations when a 3NF relation is not in BCNF, if the following conditions are found true.
The relation diagram for the above relation is given as the following:
The given relation is in 3NF. Observe, however, that the names of Dept. and Head of Dept. are
duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted. We lose the information that
Rao is the Head of Department of Chemistry.
The normalization of the relation is done by creating a new relation for Dept. and Head of Dept. and
deleting Head of Dept. from the given relation. The normalized relations are shown in the following.
Professor Code Department Percent Time
P1 Physics 50
P1 Mathematics 50
P2 Chemistry 25
P2 Physics 75
P3 Mathematics 100
When attributes in a relation have multi-valued dependency, further Normalization to 4NF and 5NF
are required. Let us first find out what multi-valued dependency is.
A multi-valued dependency is a typical kind of dependency in which each and every attribute
within a relation depends upon the other, yet none of them is a unique primary key.
We will illustrate this with an example. Consider a vendor supplying many items to many projects
in an organization. The following are the assumptions:
A multi valued dependency exists here because all the attributes depend upon the other and yet none
of them is a primary key having unique value.
1. If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a
blank for item code has to be introduced.
2. The information about item I1 is stored twice for vendor V3.
Observe that the relation given is in 3NF and also in BCNF. It still has the problem mentioned above.
The problem is reduced by expressing this relation as two relations in the Fourth Normal Form (4NF).
A relation is in 4NF if it has no more than one independent multi valued dependency or one
independent multi valued dependency with a functional dependency.
The table can be expressed as the two 4NF relations given below. The facts that vendors are
capable of supplying certain items and that they are assigned to supply for some projects are
independently specified in the 4NF relations.
Vendor-Supply
Vendor Code Item Code
V1 I1
V1 I2
V2 I2
V2 I3
V3 I1
Vendor-Project
Vendor Code Project No.
V1 P1
V1 P3
V2 P1
V3 P2
These relations still have a problem. While defining the 4NF we mentioned that all the attributes
depend upon each other. While creating the two tables in the 4NF, although we have preserved the
dependencies between Vendor Code and Item Code in the first table and between Vendor Code and
Project No. in the second table, we have lost the relationship between Item Code and Project No. If there were a
primary key then this loss of dependency would not have occurred. In order to revive this relationship
we must add a new table like the following. Please note that during the entire process of
normalization, this is the only step where a new table is created by joining two attributes, rather than
splitting them into separate tables.
Query processing includes translation of high-level queries into low-level expressions that can be used at the
physical level of the file system, query optimization and actual execution of the query to get the result. It is a
three-step process that consists of parsing and translation, optimization and execution of the query submitted by
the user.
When a query is first submitted (via an applications program), it must be scanned and parsed to
determine if the query consists of appropriate syntax.
Scanning is the process of converting the query text into a tokenized representation.
The tokenized representation is more compact and is suitable for processing by the parser.
This representation may be in a tree form.
The Parser checks the tokenized representation for correct syntax.
In this stage, checks are made to determine if columns and tables identified in the query exist in the
database and if the query has been formed correctly with the appropriate keywords and structure.
If the query passes the parsing checks, then it is passed on to the Query Optimizer.
For any given query, there may be a number of different ways to execute it.
Each operation in the query (SELECT, JOIN, etc.) can be implemented using one or more different
Access Routines.
For example, an access routine that employs an index to retrieve some rows would be more efficient
than an access routine that performs a full table scan.
The goal of the query optimizer is to find a reasonably efficient strategy for executing the query (not
quite what the name implies) using the access routines.
Optimization typically takes one of two forms: Heuristic Optimization or Cost Based Optimization
In Heuristic Optimization, the query execution is refined based on heuristic rules for reordering the
individual operations.
With Cost Based Optimization, the overall cost of executing the query is systematically reduced by
estimating the costs of executing several different execution plans.
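As a rough illustration of cost-based optimization, the sketch below compares an estimated cost for a full table scan with one for an index lookup and picks the cheaper access routine. The cost model and all numbers are invented for illustration; real optimizers use far more detailed statistics:

# Hypothetical cost model: costs are counted in disk-block reads (invented numbers).
def full_scan_cost(num_blocks):
    return num_blocks                       # read every block of the table

def index_lookup_cost(index_height, matching_rows):
    # traverse the index, then (pessimistically, for an unclustered index)
    # read one data block per matching row
    return index_height + matching_rows

def choose_access_routine(num_blocks, index_height, matching_rows):
    scan = full_scan_cost(num_blocks)
    index = index_lookup_cost(index_height, matching_rows)
    return ("index lookup", index) if index < scan else ("full scan", scan)

# A selective predicate (few matching rows) favours the index ...
print(choose_access_routine(num_blocks=10000, index_height=3, matching_rows=50))
# ... while a non-selective predicate favours the full table scan.
print(choose_access_routine(num_blocks=10000, index_height=3, matching_rows=900000))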
Once the query optimizer has determined the execution plan (the specific ordering of access routines),
the code generator writes out the actual access routines to be executed.
With an interactive session, the query code is interpreted and passed directly to the runtime database
processor for execution.
It is also possible to compile the access routines and store them for later execution.
At this point, the query has been scanned, parsed, planned and (possibly) compiled. The
runtime database processor then executes the access routines against the database. The
results are returned to the application that made the query in the first place.
Any runtime errors are also returned.
Query Optimization
The goal is to enable the system to achieve (or improve) acceptable performance by choosing a better (if not the
best) strategy during the processing of a query. This is one of the great strengths of the relational database.
1. A good automatic optimizer will have a wealth of information available to it that human
programmers typically do not have.
2. An automatic optimizer can easily reprocess the original relational request when the
organization of the database is changed. For a human programmer, reorganization would
involve rewriting the program.
3. The optimizer is a program, and therefore is capable of considering literally hundreds of
different implementation strategies for a given request, which is much more than a human
programmer can.
4. The optimizer is available to a wide range of users, in an efficient and cost-effective manner.
*A subset (say C) of a set of queries (say Q) is said to be a set of canonical forms for Q if and only if
every query in Q is equivalent to just one query in C.
During this step, some optimization is already achieved by transforming the internal representation
to a better canonical form.
Possible improvements
a. Doing the restrictions (selects) before the join (illustrated in the sketch after this list).
b. Reduce the amount of comparisons by converting a restriction condition to an equivalent
condition in conjunctive normal form- that is, a condition consisting of a set of restrictions that
are ANDed together, where each restriction in turn consists of a set of simple comparisons
connected only by OR's.
c. A sequence of restrictions (selects) against the same relation can be combined into a single restriction.
d. In a sequence of projections, all but the last can be ignored.
e. A restriction of a projection is equivalent to a projection of a restriction.
f. Others
3. Choose candidate low-level procedures by evaluating the transformed query.
*Access path selection: Consider the query expression as a series of basic operations (join,
restriction, etc.); the optimizer then chooses from a set of pre-defined, low-level
implementation procedures. These procedures may involve the use of primary keys, foreign
keys or indexes and other information about the database.
4. Generate query plans and choose the cheapest, by first constructing a set of candidate query plans and then
choosing the best plan. Picking the best plan can be achieved by assigning a cost to each
plan. The cost is computed according to the number of disk I/O's involved.
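Improvement (a) above, performing restrictions before the join, can be illustrated with a minimal sketch over toy in-memory relations (the relation and attribute names are invented; this is not a real optimizer):

# Toy relations as lists of dictionaries.
student = [{"roll": r, "name": f"s{r}", "dept_id": r % 3} for r in range(1, 1001)]
dept    = [{"dept_id": d, "dept_name": f"d{d}"} for d in range(3)]

def join(left, right, key):
    # simple nested-loop natural join on one attribute
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def restrict(rel, pred):
    return [t for t in rel if pred(t)]

pred = lambda t: t["roll"] <= 10          # the query's restriction

# Plan 1: join first, then restrict -> the join examines 1000 x 3 pairs.
plan1 = restrict(join(student, dept, "dept_id"), pred)

# Plan 2: restrict first, then join -> the join examines only 10 x 3 pairs.
plan2 = join(restrict(student, pred), dept, "dept_id")

print(len(plan1) == len(plan2))           # True: same answer, far less work in plan 2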
Transaction
A transaction passes through the following states during its execution:
• Active: the initial state, the transaction stays in this state while it is executing.
• Partially committed: after the final statement has been executed.
• Failed: when the normal execution can no longer proceed.
• Aborted: after the transaction has been rolled back and the database has been restored to its
state prior to the start of the transaction.
• Committed: after successful completion.
1. The lost update problem: A second transaction writes a second value of a data-item (datum)
on top of a first value written by a first concurrent transaction, and the first value is lost to
other transactions running concurrently which need, by their precedence, to read the first
value. The transactions that have read the wrong value end with incorrect results.
2. The dirty read problem: Transactions read a value written by a transaction that has been later
aborted. This value disappears from the database upon abort, and should not have been read
by any transaction ("dirty read"). The reading transactions end with incorrect results.
3. The incorrect summary problem: While one transaction takes a summary over the values of
all the instances of a repeated data-item, a second transaction updates some instances of that
data-item. The resulting summary does not reflect a correct result for any (usually needed
for correctness) precedence order between the two transactions (if one is executed before the
other), but rather some random result, depending on the timing of the updates, and whether
certain update results have been included in the summary or not.
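A minimal sketch of the lost update problem: two interleaved transactions increment the same balance without any locking. The schedule is hard-coded for illustration:

balance = 100                      # shared data item

# Both transactions read the same initial value ...
t1_local = balance                 # T1: read(balance)
t2_local = balance                 # T2: read(balance)

# ... then each applies its own update to its stale local copy.
t1_local += 50                     # T1 wants to add 50
t2_local += 20                     # T2 wants to add 20

balance = t1_local                 # T1: write(balance) -> 150
balance = t2_local                 # T2: write(balance) -> 120, T1's update is lost

print(balance)                     # 120 instead of the correct 170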
Two basic types of locks are used:
1. Binary Lock
2. Share/Exclusive (Read/Write) Lock
→Binary Lock
We represent the current state( or value) of the lock associated with data item X as LOCK(X).
→A transaction T must issue the lock(X) operation before any read(X) or write(X) operations in T.
→A transaction T must issue the unlock(X) operation after all read(X) and write(X) operations in T.
→If a transaction T already holds the lock on item X, then T will not issue a lock(X) operation.
→If a transaction T does not hold the lock on item X, then T will not issue an unlock(X) operation.
→Shared/Read and Exclusive/Write Locks
The binary lock is too restrictive, because at most one transaction can hold the lock on a given
item whether that transaction is reading or writing. To improve on this we have shared and exclusive
locks, in which more than one transaction can access the same item for reading purposes, i.e. read
operations on the same item by different transactions are not conflicting.
In this type of locking, the system supports two kinds of locks:
Shared Locks
If a transaction Ti has locked the data item A in shared mode, then a request from another
transaction Tj on A for a shared lock can be granted, while a request for an exclusive lock must wait.
Exclusive Locks
If a transaction Ti has locked a data item A in exclusive mode, then any request from another
transaction Tj on A (shared or exclusive) must wait until Ti releases the lock.
Each record in the system's lock table contains the following fields:
1. Data_item_name
2. LOCK
3. Number of Records and
4. Locking_transaction(s)
Again to save space, items that are not in the lock table are considered to be unlocked. The system
maintains only those records for the items that are currently locked in the lock table.
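A minimal sketch of such a lock table with shared/exclusive compatibility (the record layout loosely follows the fields listed above; function names are our own and this is not a real DBMS lock manager):

# lock_table maps a data item name to its lock record.
# Items that do not appear in the table are considered unlocked.
lock_table = {}

def request_lock(item, txn, mode):            # mode is "S" (shared) or "X" (exclusive)
    rec = lock_table.get(item)
    if rec is None:                           # item is unlocked: grant immediately
        lock_table[item] = {"LOCK": mode, "locking_transactions": {txn}}
        return True
    if mode == "S" and rec["LOCK"] == "S":    # shared locks are compatible with each other
        rec["locking_transactions"].add(txn)
        return True
    return False                              # any other combination conflicts: txn must wait

def unlock(item, txn):
    rec = lock_table.get(item)
    if rec and txn in rec["locking_transactions"]:
        rec["locking_transactions"].discard(txn)
        if not rec["locking_transactions"]:   # last holder released: remove the record
            del lock_table[item]

print(request_lock("X", "T1", "S"))   # True  - first shared lock
print(request_lock("X", "T2", "S"))   # True  - shared locks can coexist
print(request_lock("X", "T3", "X"))   # False - exclusive conflicts with existing shared locks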
Two problems can arise with locking:
1) deadlock
2) starvation
Deadlock:
It is an indefinite wait situation in which a set of transactions wait for each other for an unknown
amount of time. It obeys the following conditions:
→No Preemption
→Mutual Exclusion
→Circular Wait
Example:
For example, assume a set of transactions {T0, T1, T2, ...,Tn}. T0 needs a resource X to complete its
task. Resource X is held by T1, and T1 is waiting for a resource Y, which is held by T2. T2 is waiting
for resource Z, which is held by T0. Thus, all the processes wait for each other to release resources.
In this situation, none of the processes can finish their task. This situation is known as a deadlock.
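Deadlocks of this kind are usually detected with a wait-for graph: each transaction is a node, and an edge Ti → Tj means Ti is waiting for an item held by Tj; a cycle means deadlock. A minimal sketch (the graph below encodes the T0/T1/T2 example, and the function name is our own):

# wait_for[t] is the set of transactions that t is waiting for.
wait_for = {"T0": {"T1"}, "T1": {"T2"}, "T2": {"T0"}}

def has_deadlock(graph):
    # depth-first search for a cycle in the wait-for graph
    visiting, done = set(), set()

    def dfs(node):
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting:               # back edge -> cycle -> deadlock
                return True
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(t) for t in graph if t not in done)

print(has_deadlock(wait_for))                 # True: T0 -> T1 -> T2 -> T0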
Starvation:
It is a situation in which a transaction holds a lock on the database by unfair means and all other
transactions are left waiting indefinitely.
Starvation | Deadlock
Starvation happens if the same transaction is always chosen as the victim. | A deadlock is a condition in which two or more transactions are waiting for each other.
It occurs if the waiting scheme for locked items is unfair, giving priority to some transactions over others. | A situation where two or more transactions are unable to proceed because each is waiting for one of the others to do something.
Starvation is also known as livelock. | Deadlock is also known as circular waiting.
Avoidance: switch priorities so that every transaction has a chance to get a high priority; use FIFO order among competing requests. | Avoidance: acquire locks in a predefined order; acquire all locks at once before starting.
It means that a transaction goes into a state where it never makes progress. | It is a situation where transactions keep waiting for each other.
Timestamp Ordering
The timestamp-ordering protocol ensures serializability among transactions in their conflicting read
and write operations. This is the responsibility of the protocol system that the conflicting pair of
tasks should be executed according to the timestamp values of the transactions.
Avoiding deadlock:
A major disadvantage of locking is deadlock which can be avoided using timestamp ordering as
follows:
• Wait/Die
• Wound/Wait
Here is the table representation of resource allocation for each algorithm. Both of these algorithms
take process age into consideration while determining the best possible way of resource allocation
for deadlock avoidance.
Situation | Wait/Die | Wound/Wait
Older process needs a resource held by younger process | Older process waits | Younger process dies
Younger process needs a resource held by older process | Younger process dies | Younger process waits
Wait-Die Scheme
In this scheme, if a transaction requests to lock a resource (data item), which is already held with a
conflicting lock by another transaction, then one of the two possibilities may occur −
• If TS(Ti) < TS(Tj) − that is Ti, which is requesting a conflicting lock, is older than Tj − then
Ti is allowed to wait until the data-item is available.
• If TS(Ti) > TS(Tj) − that is, Ti is younger than Tj − then Ti dies. Ti is restarted later with a
random delay but with the same timestamp.
This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme
In this scheme, if a transaction requests to lock a resource (data item) which is already held with a
conflicting lock by another transaction, one of two possibilities may occur −
• If TS(Ti) < TS(Tj), then Ti forces Tj to be rolled back − that is Ti wounds Tj. Tj is restarted
later with a random delay but with the same timestamp.
• If TS(Ti) > TS(Tj), then Ti is forced to wait until the resource is available.
This scheme allows the younger transaction to wait; but when an older transaction requests an item
held by a younger one, the older transaction forces the younger one to abort and release the item.
In both the cases, the transaction that enters the system at a later stage is aborted.
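A minimal sketch of the two decisions (timestamps are plain integers, smaller means older; this only shows the choice made on a lock conflict, not a full scheduler):

# Ti (timestamp ts_i) requests an item that Tj (timestamp ts_j) currently holds.

def wait_die(ts_i, ts_j):
    # older requester waits, younger requester dies (is rolled back)
    return "Ti waits" if ts_i < ts_j else "Ti dies (restart later with the same timestamp)"

def wound_wait(ts_i, ts_j):
    # older requester wounds (rolls back) the younger holder, younger requester waits
    return "Tj is rolled back (wounded)" if ts_i < ts_j else "Ti waits"

print(wait_die(5, 9))     # Ti older   -> Ti waits
print(wait_die(9, 5))     # Ti younger -> Ti dies
print(wound_wait(5, 9))   # Ti older   -> Tj is rolled back
print(wound_wait(9, 5))   # Ti younger -> Ti waits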
Recovery
Types of Failure
Failures may be of the following types:
A System Crash: A hardware, software or network error causes the transaction to fail.
Transaction or System Error: Some operation in the transaction may cause the failure, or the user may interrupt the transaction.
Local Errors or Exceptions: Conditions occur during the transaction that result in transaction cancellation.
Concurrency Control Enforcement: Several transactions may be in deadlock, so a transaction may be aborted to be restarted later.
Disk Failure: A read/write error on the physical disk.
Physical Problems: Any of a range of physical problems, such as power failure, mounting of a wrong disk or tape by the operator, wiring problems, etc.
Catastrophe Situations: Large-scale threats to the system and the data, for example fire, cyclone, security breaches, etc.
Transaction errors, system errors, system crashes, concurrency problems and local errors or
exceptions are the more common causes of system failure. The system must be able to recover from
such failures without loss of data.
Log-Based Recovery
COMMIT
Signals the successful end of a transaction
• Any changes made by the transaction should be saved
• These changes are now visible to other transactions
ROLLBACK
Signals the unsuccessful end of a transaction
• Any changes made by the transaction should be undone
• It is now as if the transaction never existed
Log-Based Recovery
The most widely used structure for recording database modifications is the log. The log is a sequence
of log records and maintains a history of all update activities in the database. There are several types
of log records.
An update log record describes a single database write:
• Transaction identifier.
• Data-item identifier.
• Old value.
• New value.
Other special log records exist to record significant events during transaction processing, such as the
start of a transaction and the commit or abort of a transaction. We denote the various types of log
records as:
• <Ti start>.Transaction Ti has started.
• <Ti, Xj, V1, V2> Transaction Ti has performed a write on data item Xj. Xj had value V1
before write, and will have value V2 after the write.
• < Ti commit> Transaction Ti has committed.
• < Ti abort> Transaction Ti has aborted.
Whenever a transaction performs a write, it is essential that the log record for that write be created
before the database is modified. Once a log record exists, we can output the corresponding modification
to the database whenever it is convenient. We also have the ability to undo a modification that has
already been output to the database, by using the old-value field in the log records.
For log records to be useful for recovery from system and disk failures, the log must reside on stable
storage. However, since the log contains a complete record of all database activity, the volume of
data stored in the log may become unreasonably large.
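A minimal sketch of how the old values in update log records support undo when a transaction aborts (an in-memory toy assuming write-ahead logging as described above, not a real recovery manager):

database = {"X": 100, "Y": 200}
log = []                                         # sequence of log records

def write(txn, item, new_value):
    # Write-ahead logging: append <Ti, Xj, old, new> before modifying the database.
    log.append((txn, item, database[item], new_value))
    database[item] = new_value

def undo(txn):
    # Scan the log backwards and restore the old value of every write made by txn.
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] == txn:      # an update record <Ti, Xj, old, new>
            _, item, old, _ = rec
            database[item] = old
    log.append((txn, "abort"))                   # record <Ti abort>

log.append(("T1", "start"))                      # <T1 start>
write("T1", "X", 150)
write("T1", "Y", 50)
undo("T1")                                       # T1 fails before committing

print(database)                                  # {'X': 100, 'Y': 200} - changes undone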
Conflict Serializability
Two instructions of two different transactions may want to access the same data item in order to
perform a read/write operation. Conflict Serializability deals with detecting whether the
instructions are conflicting in any way, and specifying the order in which these two instructions
will be executed in case there is any conflict. A conflict arises if at least one (or both) of the
instructions is a write operation. The following rules are important in Conflict Serializability:
1. If two instructions of the two concurrent transactions are both for read operation, then they
are not in conflict, and can be allowed to take place in any order.
2. If one of the instructions wants to perform a read operation and the other instruction wants
to perform a write operation, then they are in conflict, hence their ordering is important. If
the read instruction is performed first, then it reads the old value of the data item and, after
the reading is over, the new value of the data item is written. If the write instruction is
performed first, then it updates the data item with the new value and the read instruction
reads the newly updated value.
3. If both the instructions are write operations, then they are in conflict but can be allowed
to take place in any order, because the transactions do not read the value updated by each
other. However, the value that persists in the data item after the schedule is over is the one
written by the instruction that performed the last write.
It may happen that we want to execute the same set of transactions in a different schedule on
another day. Keeping in mind these rules, we may sometimes alter parts of one schedule (S1) to
create another schedule (S2) by swapping only the non-conflicting parts of the first schedule. The
conflicting parts cannot be swapped in this way because the ordering of the conflicting instructions
is important and cannot be changed in any other schedule that is derived from the first. If these
two schedules are made of the same set of transactions, then both S1 and S2 would yield the same
result if the conflict resolution rules are maintained while creating the new schedule. In that case
the schedule S1 and S2 would be called Conflict Equivalent.
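A minimal sketch of the conflict test described by these rules (an operation is represented as a transaction, an action and a data item; the representation is our own):

# An operation is (transaction, action, item), e.g. ("T1", "read", "A").
def in_conflict(op1, op2):
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return (t1 != t2                      # different transactions
            and x1 == x2                  # same data item
            and ("write" in (a1, a2)))    # at least one of them is a write

print(in_conflict(("T1", "read",  "A"), ("T2", "read",  "A")))   # False: read/read
print(in_conflict(("T1", "read",  "A"), ("T2", "write", "A")))   # True : read/write
print(in_conflict(("T1", "write", "A"), ("T2", "write", "A")))   # True : write/write
print(in_conflict(("T1", "write", "A"), ("T2", "write", "B")))   # False: different items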
View Serializability:
This is another type of serializability that can be derived by creating another schedule out of an
existing schedule, involving the same set of transactions. These two schedules would be called
View Serializable if the following rules are followed while creating the second schedule out of the
first. Let us consider that the transactions T1 and T2 are being serialized to create two different
schedules S1 and S2, which we want to be View Equivalent, and both T1 and T2 want to access
the same data item.
1. If in S1, T1 reads the initial value of the data item, then in S2 also, T1 should read the
initial value of that same data item.
2. If in S1, T1 writes a value in the data item which is read by T2, then in S2 also, T1
should write the value in the data item before T2 reads it.
3. If in S1, T1 performs the final write operation on that data item, then in S2 also, T1
should perform the final write operation on that data item.
Except in these three cases, any alteration can be possible while creating S2 by modifying S1.
• Attributes - Attributes are data which define the characteristics of an object. This data
may be simple, such as integers, strings, and real numbers, or it may be a reference to a
complex object.
• Methods - Methods define the behavior of an object and are what were formerly called
procedures or functions.
• Objects don't require assembly and disassembly saving coding time and execution time to
assemble or disassemble objects.
• Reduced paging
• Easier navigation
• Better concurrency control - A hierarchy of objects may be locked.
• Data model is based on the real world.
• Works well for distributed architectures.
• Less code required when applications are object oriented.
• Each object has a unique ID and is defined as a subclass of a base class, using inheritance to
determine attributes.
• Virtual memory mapping is used for object storage and management.
Data Warehouse
The star schema architecture is the simplest data warehouse schema. It is called a star schema
because the diagram resembles a star, with points radiating from a center. The center of the star
consists of a fact table and the points of the star are the dimension tables. Usually the fact tables in a
star schema are in third normal form (3NF), whereas dimension tables are de-normalized. Despite
the fact that the star schema is the simplest architecture, it is the most commonly used nowadays and is
recommended by Oracle.
→Fact Tables
A fact table typically has two types of columns: foreign keys to dimension tables and measures,
i.e. columns that contain numeric facts. A fact table can contain facts at a detail or an aggregated level.
→Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a
dimension has no hierarchies and levels, it is called a flat dimension or a list. The primary keys of
each of the dimension tables are part of the composite primary key of the fact table. Dimensional
attributes help to describe the dimensional values; they are normally descriptive, textual values.
Dimension tables are generally much smaller in size than the fact table.
Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times and channels.
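A minimal sketch of the star layout over toy in-memory tables (all table and column names are invented): the fact table holds foreign keys and a numeric measure, and a typical star query joins it to a dimension and aggregates.

# Dimension tables: small, descriptive, keyed by their own primary key.
product_dim = {1: {"name": "Pen"}, 2: {"name": "Book"}}
region_dim  = {10: {"city": "Pune"}, 20: {"city": "Delhi"}}

# Fact table: foreign keys to the dimensions plus a numeric measure (sales amount).
sales_fact = [
    {"product_id": 1, "region_id": 10, "amount": 500},
    {"product_id": 2, "region_id": 10, "amount": 300},
    {"product_id": 1, "region_id": 20, "amount": 700},
]

# Star query: total sales per product name (join fact -> product dimension, then aggregate).
totals = {}
for row in sales_fact:
    name = product_dim[row["product_id"]]["name"]
    totals[name] = totals.get(name, 0) + row["amount"]

print(totals)        # {'Pen': 1200, 'Book': 300}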
→While large-scale information technology has been evolving separate transaction and
analytical systems, data mining provides the link between the two. Data mining software
analyzes relationships and patterns in stored transaction data based on open-ended user
queries. Several types of analytical software are available: statistical, machine learning, and
neural networks. Generally, any of four types of relationships are sought:
• Classes: Stored data is used to locate data in predetermined groups. For example, a
restaurant chain could mine customer purchase data to determine when customers visit and
what they typically order. This information could be used to increase traffic by having daily
specials.
• Clusters: Data items are grouped according to logical relationships or consumer
preferences. For example, data can be mined to identify market segments or consumer
affinities.
• Associations: Data can be mined to identify associations. The beer-diaper example is an
example of associative mining.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example,
an outdoor equipment retailer could predict the likelihood of a backpack being purchased
based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.
• Association
Association is one of the best-known data mining techniques. In association, a pattern is
discovered based on a relationship between items in the same transaction. That is the
reason why the association technique is also known as the relation technique. The association
technique is used in market basket analysis to identify a set of products that customers
frequently purchase together.
Retailers use the association technique to research customers' buying habits. Based on
historical sales data, retailers might find out that customers often buy crisps when they buy
beer, and can therefore put beer and crisps next to each other to save time for the
customer and increase sales.
• Classification
Classification is a classic data mining technique based on machine learning. Basically,
classification is used to classify each item in a set of data into one of a predefined set of
classes or groups. The classification method makes use of mathematical techniques such as
decision trees, linear programming, neural networks and statistics. In classification, we
develop the software that can learn how to classify the data items into groups. For example,
we can apply classification in the application that “given all records of employees who left
the company, predict who will probably leave the company in a future period.” In this case,
we divide the records of employees into two groups named “leave” and “stay”, and
then we can ask our data mining software to classify the employees into these groups.
• Clustering
Clustering is a data mining technique that automatically groups objects with similar
characteristics into meaningful or useful clusters. The clustering technique
defines the classes and puts objects in each class, while in the classification technique,
objects are assigned to predefined classes. To make the concept clearer, we can take book
management in the library as an example. In a library, there is a wide range of books on
various topics available. The challenge is how to keep those books in a way that readers can
take several books on a particular topic without hassle. By using the clustering technique,
we can keep books that have some kinds of similarities in one cluster or one shelf and label
it with a meaningful name. If readers want to grab books in that topic, they would only have
to go to that shelf instead of looking for the entire library.
• Prediction
Prediction, as its name implies, is a data mining technique that discovers the
relationship between independent variables and the relationship between dependent and
independent variables. For instance, the prediction analysis technique can be used in sales
to predict future profit: if we consider sales an independent variable, profit could
be a dependent variable. Then, based on the historical sales and profit data, we can draw a
fitted regression curve that is used for profit prediction.
• Sequential Patterns
Sequential pattern analysis is a data mining technique that seeks to discover or
identify similar patterns, regular events or trends in transaction data over a business period.
In sales, with historical transaction data, businesses can identify a set of items that
customers buy together at different times in a year. Businesses can then use this information to
recommend those items with better deals, based on customers' purchasing frequency in the
past.
• Decision trees
A decision tree is one of the most commonly used data mining techniques because its
model is easy for users to understand. In the decision tree technique, the root of the decision tree
is a simple question or condition that has multiple answers. Each answer then leads to a set
of questions or conditions that help us narrow down the data so that we can make the final
decision based on it. For example, we can use the following decision tree to determine whether
or not to play tennis:
Starting at the root node, if the outlook is overcast then we should definitely play tennis. If it
is rainy, we should only play tennis if the wind is weak. And if it is sunny, then we
should play tennis only if the humidity is normal.
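The same tree can be written directly as nested conditions; the sketch below simply encodes the three rules stated above (the attribute values are illustrative strings):

def play_tennis(outlook, wind=None, humidity=None):
    # Root node: outlook
    if outlook == "overcast":
        return True                      # overcast -> always play
    if outlook == "rainy":
        return wind == "weak"            # rainy    -> play only if the wind is weak
    if outlook == "sunny":
        return humidity == "normal"      # sunny    -> play only if the humidity is normal
    return False

print(play_tennis("overcast"))                      # True
print(play_tennis("rainy", wind="strong"))          # False
print(play_tennis("sunny", humidity="normal"))      # True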
Parallel Database
A parallel database system performs parallel operations, such as loading data, building indexes
and evaluating queries.
Parallel databases can be roughly divided into two groups,
a) Multiprocessor architecture:
→Shared memory architecture
Where multiple processors share the main memory space.
Distributed Database:
→A centralized distributed database management system (DDBMS) manages the database as if it
were all stored on the same computer. The DDBMS synchronizes all the data periodically and, in
cases where multiple users must access the same data, ensures that updates and deletes performed
on the data at one location will be automatically reflected in the data stored elsewhere.
→The users and administrators of a distributed system should, with proper implementation, interact
with the system as if the system were centralized. This transparency allows for the functionality
desired in such a structured system without special programming requirements, allowing for any
number of local and/or remote tables to be accessed at a given time across the network.
→Data distribution transparency requires that the user of the database should not have to know how
the data is fragmented (fragmentation transparency), know where the data they access is actually
located (location transparency), or be aware of whether multiple copies of the data exist (replication
transparency).
→Heterogeneity transparency requires that the user should not be aware of the fact that they are
using a different DBMS if they access data from a remote site. The user should be able to use the
same language that they would normally use at their regular access point and the DDBMS should
handle query language translation if needed.
→Transaction transparency requires that the DDBMS guarantee that concurrent transactions do not
interfere with each other (concurrency transparency) and that it must also handle database recovery
(recovery transparency).
→Performance transparency mandates that the DDBMS should have a comparable level of
performance to a centralized DBMS. Query optimizers can be used to speed up response time.
Advantages of DDBMS's
Disadvantages of DDBMS
- Increased cost,
- Integrity control more difficult,
- Lack of standards,
- Database design more complex.
- Complexity of management and control. Applications must recognize data location and they must
be able to stitch together data from various sites.
- Technologically difficult: Data integrity, transaction management, concurrency control, security,
backup, recovery, query optimization, access path selection are all issues that must be addressed and
resolved
- Security lapses are more likely when data reside in multiple locations.
- Lack of standards due to the absence of communication protocols can make the processing and
distribution of data difficult.
- Increased storage and infrastructure requirements because multiple copies of data are required at
various separate locations which would require more disk space.
- Increased costs due to the higher complexity of training.
- Requires duplicate infrastructure (personnel, software and licensing, physical
location/environment) and these can sometimes offset any operational savings.