CS8492 Database Management Systems – Question Bank
REGULATIONS - 2017
SYLLABUS
OBJECTIVES:
To learn the fundamentals of data models and to represent a database system using ER diagrams.
To study SQL and relational database design.
To understand the internal storage structures using different file and indexing techniques, which will help in physical DB design.
To understand the fundamental concepts of transaction processing, concurrency control techniques and recovery procedures.
To have an introductory knowledge about storage and query processing techniques.
Unit –1
RELATIONAL DATABASES
PART-A
1. What is the purpose of Database Management System? (NOV 2014)
Real world entity
Relation-based tables
Isolation of data and application
Less redundancy
Consistency
Query Language
ACID Properties
Multiuser and concurrent Access
Multiple Views
Security
2. What is Data Definition Language? Give an example. (NOV 2016, APR 2018)
Data Definition Language (DDL) – specifies constructs for schema definition, relation
definition, integrity constraints, views and schema modification.
Data Definition Language (DDL) is a standard for commands that define the different
structures in a database. DDL statements create, modify, and remove database objects such
as tables, indexes, and users. Common DDL statements are CREATE, ALTER, and DROP.
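As a quick sketch of these three commands (table and column names are illustrative):
-- define a new relation
CREATE TABLE student (
    reg_no INT PRIMARY KEY,
    name   VARCHAR(50) NOT NULL
);
-- modify its schema
ALTER TABLE student ADD dept VARCHAR(20);
-- remove it, together with its data
DROP TABLE student;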
3. Write the characteristics that distinguish the database approach from the file-based approach? (MAY 2015, NOV 2019)
Database approach: A database system contains not only the database itself but also the descriptions of data structure and constraints (meta-data).
File-based approach: Data definition is a part of the application programs.
Database approach: The data structure is stored in the system catalog, not in the programs.
File-based approach: The structure of the data files is defined in the application programs, so if a user wants to change the structure of a file, all the programs that access that file might need to be changed as well.
Database approach: A multiuser database system allows multiple users to access the database at the same time, with concurrency control strategies and integrity checks for concurrent access.
File-based approach: The same data is managed more than once; this leads to redundancy and inconsistency.
5. Differentiate File Processing System with Database Management System. (NOV 2016)
1. A database management system coordinates both the physical and the logical access to the
data, whereas a file-processing system coordinates only the physical access.
2. A database management system reduces the amount of data duplication by ensuring that a
physical piece of data is available to all programs authorized to have access to it, whereas data
written by one program in a file-processing system may not be readable by another program.
3. A database management system is designed to allow flexible access to data (i.e., queries),
whereas a file-processing system is designed to allow predetermined access to data (i.e.,
compiled programs).
4. A database management system is designed to coordinate multiple users accessing the same
data at the same time. A file-processing system is usually designed to allow one or more
programs to access different data files at the same time. In a file-processing system, a file can
be accessed by two programs concurrently only if both programs have read-only access to the
file.
7. A DBMS provides backup and recovery; when data is lost in a file system, it cannot be recovered.
8. What is a data model? List the types of data model used. (April/May 2011, MAY 2019)
A database model is the theoretical foundation of a database and fundamentally determines in which manner data can be stored, organized, and manipulated in a database system. It thereby defines the infrastructure offered by a particular database system. The most popular example of a database model is the relational model.
Types of data model used:
Entity-relationship model
Relational model
In a One-to-One relationship in SQL Server, for example, a person can have only one passport.
A One-to-Many relationship is defined as a relationship between two tables where a row from one table can have multiple matching rows in another table. This relationship can be created using a primary key-foreign key relationship.
In a One-to-Many relationship in SQL Server, for example, a book can have multiple authors.
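A small sketch of such a primary key-foreign key pair (table and column names are illustrative):
CREATE TABLE book (
    isbn  VARCHAR(13) PRIMARY KEY,
    title VARCHAR(100)
);
CREATE TABLE author (
    author_id INT PRIMARY KEY,
    name      VARCHAR(50),
    isbn      VARCHAR(13) REFERENCES book(isbn)  -- many author rows may reference one book
);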
11. What do you mean by simple and composite attributes? (DEC 2013)
Simple attributes are atomic values, which cannot be divided further. For example, a student's
phone number is an atomic value of 10 digits.
A composite attribute consists of a group of values from more than one domain. For example, the
Address attribute consists of several domains such as house number, street number, city, country,
etc.
12. Define query? (DEC 2013)
A query is a statement requesting the retrieval of information. The portion of DML
that involves information retrieval is called a query language.
13. Define database management system? List various applications of DBMS? (MAY 2019)
Database management system (DBMS) is a collection of interrelated data and a set of
programs to access those data.
a) Banking
b) Airlines
c) Universities
d) Credit card transactions
e) Telecommunication
f) Finance
g) Sales
h) Manufacturing
i) Human resources
14. Differentiate between static and dynamic SQL? (NOV 2016, NOV 2015, NOV 2014) (or)
What is static SQL and how is it different from dynamic SQL? (NOV 2017) (or)
Describe static SQL and dynamic SQL in detail? (APR 2019)
Static (Embedded) SQL:
How the database will be accessed is predetermined in the embedded SQL statement.
It is more swift and efficient, but less flexible.
SQL statements are compiled at compile time.
Dynamic (Interactive) SQL:
How the database will be accessed is determined at run time.
It is less swift and efficient, but more flexible.
SQL statements are compiled at run time.
16. Explain “query optimization”. State the need for query optimization? (JUN 2016, MAY 2015)
Query optimization refers to the process of finding the lowest-cost method of evaluating a given query.
Query optimization is a crucial and difficult part of overall query processing. Its main objective is to minimize the cost of evaluating a query: it generates an optimal evaluation plan (the one with lowest cost) for the query.
17. Why does SQL allow duplicate tuples in a table or in a query result? (NOV 2015)
By default, a query result and a table allow duplicate records. SQL allows duplicate tuples in a table or in a query result to avoid unnecessary checks during SELECT, INSERT, UPDATE or DELETE.
SELECT returns data according to the WHERE filter (or all data, if it is omitted). It simply checks the WHERE conditions for each row. If the table has duplicates and they comply with the WHERE clause, those rows are selected. We can add DISTINCT to avoid duplicates.
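A quick illustration (table and column names are illustrative):
-- may return the same city repeatedly, one row per matching tuple
SELECT customer_city FROM customer;
-- returns each distinct city once
SELECT DISTINCT customer_city FROM customer;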
18. Give a brief description of DCL commands? (NOV 2014)
Commands:
i) GRANT: gives a privilege to a user.
ii) REVOKE: takes back privileges granted to a user.
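A minimal sketch (user and table names are illustrative):
GRANT SELECT, INSERT ON employee TO clerk1;
REVOKE INSERT ON employee FROM clerk1;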
19.What are Primary key constraints? (JUN 2013)
The PRIMARY KEY constraint uniquely identifies each record in a database table. Primary
keys must contain UNIQUE values, and cannot contain NULL values. A table can have
only one primary key, which may consist of single or multiple fields.
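A small sketch of declaring a primary key (names are illustrative):
CREATE TABLE persons (
    person_id INT NOT NULL,
    last_name VARCHAR(40),
    PRIMARY KEY (person_id)
);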
Data integrity means making sure the data is correct and not corrupt. Data security means making sure that only the people who should have access to the data can access it, and keeping straight who can read the data and who can write the data.
Pure integrity of data refers to the property which determines that data, once stored, has not been altered in an unauthorised way, either by a person or by the malfunctioning of hardware.
22. Which operators are called unary operators and why are they called so? (NOV 2013)
The unary operators are select (σ), project (π) and rename (ρ). They are called unary because they operate on a single relation.
31. What are aggregate functions? List the aggregate functions supported by SQL? (NOV 2018)
Aggregate functions are functions that take a collection of values as input and return a single value.
The aggregate functions supported by SQL are:
Average: avg
Minimum: min
Maximum: max
Total: sum
Count: count
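For instance, assuming an account table with a balance column:
SELECT AVG(balance), MIN(balance), MAX(balance), SUM(balance), COUNT(*)
FROM account;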
32. What is the use of the group by clause?
The group by clause is used to apply aggregate functions to sets of tuples. The attributes given in the group by clause are used to form groups: tuples with the same value on all attributes in the group by clause are placed in one group.
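For example, assuming the same account table also has a branch_name column, the query below forms one group per branch and totals each group's balances:
SELECT branch_name, SUM(balance)
FROM account
GROUP BY branch_name;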
41. Mention the two forms of integrity constraints in the ER model?
Key declarations
Form of a relationship
A domain is a set of values that may be assigned to an attribute. All values that appear in a column of a relation must be taken from the same domain.
46. What is the need for triggers? List the requirements needed to design a trigger.
Triggers are useful mechanisms for alerting humans or for starting certain tasks automatically when certain conditions are met. To design a trigger, we must specify when the trigger is to be executed (the triggering event and condition) and the actions to be performed when it executes.
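A minimal trigger sketch (trigger syntax varies across systems; the table and column names here are illustrative):
CREATE TRIGGER overdraft_alert
AFTER UPDATE ON account
FOR EACH ROW
WHEN (NEW.balance < 0)
BEGIN
    -- record the event so a human can be alerted
    INSERT INTO alert_log VALUES (NEW.account_no, 'balance below zero');
END;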
49. Write a SQL statement to find the names and loan numbers of all customers who have a loan
at XYZ branch? (NOV 2018)
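The answer is not present in the source; a standard solution, assuming the textbook schema loan(loan_number, branch_name, amount) and borrower(customer_name, loan_number):
SELECT customer_name, borrower.loan_number
FROM borrower, loan
WHERE borrower.loan_number = loan.loan_number
  AND branch_name = 'XYZ';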
50. What are the disadvantages of file processing system?
The disadvantages of file processing systems are
a) Data redundancy and inconsistency
b) Difficulty in accessing data
c) Data isolation
d) Integrity problems
e) Atomicity problems
f) Concurrent access anomalies
54. Define the terms 1) physical schema 2) logical schema 3) conceptual schema
Physical schema: The physical schema describes the database design at the physical level, which is the lowest level of abstraction, describing how the data are actually stored.
Logical schema: The logical schema describes the database design at the logical level, which describes what data are stored in the database and what relationships exist among the data.
Conceptual schema: The schemas at the view level are called subschemas; they describe different views of the database.
56. What is a storage manager? What are the components of the storage manager?
A storage manager is a program module that provides the interface between the low-level data stored in a database and the application programs and queries submitted to the system.
The storage manager components include:
a) Authorization and integrity manager
b) Transaction manager
c) File manager
d) Buffer manager
57. What is the purpose of the storage manager? List the data structures implemented by the storage manager.
The storage manager is responsible for the following:
a) Interaction with the file manager
b) Translation of DML commands into low-level file system commands
c) Storing, retrieving and updating data in the database
The relational model uses a collection of tables to represent both data and the relationships among those data. The relational model is an example of a record-based model.
A relation is a subset of a Cartesian product of a list of domains.
A tuple variable is a variable whose domain is the set of all tuples.
For each attribute there is a set of permitted values, called the domain of that attribute.
62. What is a primary key, candidate key, super key and foreign key? (APR 2018)
A primary key is chosen by the database designer as the principal means of identifying an entity in the entity set.
A candidate key is a minimal super key: a super key for which no proper subset is also a super key.
A super key is a set of one or more attributes that collectively allows us to identify uniquely an entity in the entity set.
A relation schema r1 derived from an ER schema may include among its attributes the primary key of another relation schema r2. This attribute is called a foreign key from r1 referencing r2.
The project operation is a unary operation that returns its argument relation with certain attributes left out. Projection is denoted by pi (π).
64. Write short notes on tuple relational calculus and domain relational calculus
The tuple relational calculus is a non-procedural query language. It describes the desired information without giving a specific procedure for obtaining that information.
A query or expression can be expressed in tuple relational calculus as {t | P(t)}.
The domain relational calculus uses domain variables that take on values from an attribute's domain, rather than values for an entire tuple.
66. List the disadvantages of relational database system
Repetition of data
Inability to represent certain information.
Candidate key: A super key such that no proper subset is a super key within the relation
Primary key: The candidate key that is selected to identify tuples uniquely within the relation,
the candidate keys which are not selected as Primary Key are called "Alternate keys"
PART - B
1. Explain select, project and Cartesian product operations in relational algebra with an
example? (NOV 2016, APR 2018)
Select operation (σ)
Notation − σp(r)
where σ stands for the selection predicate and r stands for the relation; p is a propositional logic formula which may use connectives like and, or, and not, and relational operators like =, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − selects tuples from Books where subject is 'database'.
Project operation (∏)
The project operation returns its argument relation with only the specified attributes, eliminating duplicate rows.
Notation − ∏A1, A2, ..., An(r)
For example −
∏subject, author(Books)
Output − selects and projects the columns named subject and author from Books.
Cartesian product (Χ)
Notation − r Χ s
r Χ s = { q t | q ∈ r and t ∈ s }
(Figure: three-schema architecture – external views for users, conceptual or logical level, physical level.)
Instance – the actual content of the database at a particular point in time is called the
instance of the schema
Data Independence
The ability to change a schema at one level without affecting the higher-level schemas is called data independence. There are two types of data independence:
o Physical Data Independence – is the ability to modify the physical or internal
schema without changing the logical or external schemas.
o Logical Data Independence – is the ability to modify logical or conceptual schema
without changing the external schemas or application programs.
3. With the help of a neat block diagram explain the basic architecture of a database
management system? (NOV 2015) ( or)
Briefly explain about Database System Architecture? (JUN 2016) (or)
Explain the overall architecture of the database system in detail? (MAY 2017) (OR)
State and explain the architecture of DBMS? (NOV 2017, Nov 2019)
The functional components of a database system can be broadly divided into two
Categories namely,
Storage manager components
Query Processor components
DATABASE SYSTEM ARCHITECTURE
Naïve Users are unsophisticated users who interact with the system by invoking one of the
application programs that have been written previously. The typical user interface for naïve
users is a forms interface where the user can fill in appropriate fields of the form. Naïve users
may also simply read reports generated from the database.
Sophisticated Users interact with the system without writing programs. Instead they form
requests in a database query language. They submit such queries to query processor, whose
function is to break down DML statements into instructions that the storage manager
understands.
Specialized Users are sophisticated users who write specialized database applications that do
not fit into the traditional data processing framework.
Database Administrator: The person who has central control over the system is called the database administrator (DBA). The functions of the DBA include:
Schema Definition: the DBA creates the original database schema by executing a set
of data definition statements in the DDL.
Storage Structure and access-method definition: the DBA is also responsible for
defining the storage structure and access methods.
Schema and physical organization modification: the DBA carries out changes to
the schema and physical organization to reflect the changing needs of the
organization, or to alter the physical organization to improve performance.
4. What are the advantages of having a centralized control of data? Illustrate your answer
with suitable example? (NOV 2015)
A centralised database (sometimes abbreviated CDB) is a database that is located, stored, and
maintained in a single location. This location is most often a central computer or database system,
for example a desktop or server CPU, or a mainframe computer. All of the information stored on the CDB is accessible from a large number of different points, which in turn creates a significant
amount of both advantages and disadvantages.
A centralized database consists of a single data server into which all data are stored and from which
all data are retrieved. All the data reside at a single location and all applications must retrieve all
data from that location.
The centralized approach consists of a central server into which all forecast data are stored. At some
predefined time, software on this central server requests data from each of the local data servers
scattered throughout the country. These data are received, processed and stored, possibly at lower
spatial and temporal resolution than the data from which it was derived.
Typical Examples:
In most cases, a centralised database would be used by an organisation (e.g. a business company) or an institution (e.g. a university). Users access a centralised database through a computer network which gives them access to the central CPU, which in turn maintains the database itself.
Early DBMSs were centralized: all DBMS functionality, application program execution and user interface processing were carried out on one machine.
User management system.
Central Documents management system.
ADVANTAGES
Centralised databases hold a substantial amount of advantages against other types of databases.
Some of them are listed below:
Data integrity is maximised and data redundancy is minimised,[6] as the single storing place
of all the data also implies that a given set of data only has one primary record. This aids in the
maintaining of data as accurate and as consistent as possible and enhances data reliability.
Generally greater data security, as the single data storage location implies only one possible place from which the database can be attacked and sets of data can be stolen or tampered with.
Better data preservation than other types of databases due to an often-included fault-tolerant setup.
Easier for the end-user to use, due to the simplicity of having a single database design.
Generally easier data portability and database administration.
More cost effective than other types of database systems as labour, power supply and
maintenance costs are all minimised.
Data kept in the same location is easier to change, re-organise, mirror, or analyse.
All the information can be accessed at the same time from the same location.[7]
Updates to any given set of data are immediately received by every end-user.
Decreased Risk: With Centralized data management, all edits and manipulation to core data are
housed and stored centrally. This model allows for staunch controls, detailed audit trails, and enables
business users to access consistent data.
Data Consistency: When data feeds are managed in a central repository, an organization can achieve
consistent data management and distribution throughout its global offices and internal systems.
Data Quality: A data-centric approach enables the establishment of a data standard across an
enterprise, allowing organizations to make better business assessments.
Operational Efficiency: When one business unit controls an organization's data centrally, the resources previously devoted to data management can be redirected back to core business needs.
Single Point of Entry: Introducing a single point of entry for data allows changes from data vendors to be implemented once, rather than in multiple instances.
Cost Saving: With data management centralized, costs attributed to vendor relationships are better
controlled, minimizing any redundancy in market data contracts and their associated costs.
Factors for Adoption of the Centralized Approach:
Data can be organized at a single point: by introducing a single point of entry for data, the database administrator can implement the data only once instead of in multiple sites.
Data consistency can be achieved by introducing a data-centric approach.
The centralized database approach is suitable for the establishment of data standards across an enterprise.
The centralized database approach is suitable for better security.
The centralized approach is a good one for quick, efficient searching.
For controlled access to the database repository.
DATA MODELS
A Data Model is a collection of tools for describing
Data
Data relationships
Data semantics and
Data constraints
The various Data Models are
Relational model
Entity-Relationship data model (mainly for database design)
Object-based data models (Object-oriented and Object-relational)
Semi-structured data model (XML)
Relational Model
The relational model uses a collection of tables to represent both data and relationships among the data. Each table has multiple columns, and each column has a unique name. The below table, called the customer table, shows, for example, that the customer identified by customer-id 100 is named John and lives at 12 Anna St. in Chennai, and it also shows his account number.
The relational model is an example of a record-based model. The record-based models are
so named because the database is structured in fixed-format records of several types. Each table
contains records of particular type. Each record type defines a fixed number of fields, or
attributes. The columns of the table correspond to the attributes of the record type.
The relational model is at a lower level abstraction than the E-R model. Database designs are
often carried out in the E-R model, and then translated to the relational model.
Object-based data models
Object-based data models are categorized into the object-oriented data model and the object-relational data model. The object-oriented data model can be seen as extending the E-R model with notions of encapsulation, methods or functions and object identity. The object-relational model extends the relational data model by including object orientation and constructs to deal with added data types.
Semi-structured data model (XML)
Extensible Markup Language (XML) is defined by the WWW Consortium (W3C). It was originally intended as a document markup language and not a database language. Its ability to specify new tags and to create nested tag structures made XML a great way to exchange data, not just documents. XML has become the basis for all new-generation data interchange formats. A wide variety of tools are available for parsing, browsing and querying XML documents/data.
Hierarchical Model:
The hierarchical database model is one of the oldest models, dating from the 1950s. The hierarchical model assumes that a tree structure is the most frequently occurring relationship. The hierarchical model organizes data elements as tabular rows, one for each instance of an entity. Consider a company's organizational structure. At the top we have a general manager (GM). Under him there are several deputy general managers (DGMs). Each DGM takes care of a couple of departments, and each department has a manager and many employees. When represented in the hierarchical model, there are separate rows representing the GM, each DGM, each department, each manager and each employee. The row position implies a relationship to other rows: a given employee belongs to the department that is the closest above it in the list, the department belongs to the manager immediately above it, and so on.
The hierarchical model represents relationships in a linearized tree. It is possible to locate the set of employees working for, say, Manager X by first locating Manager X and then including every employee in the list after X and before the next occurrence of a manager or the end of the list.
Network Model:
The network model replaces the hierarchical tree with a graph, thus, allowing more general
connections among the nodes. The main difference of the network model from the hierarchical
model is its ability to handle many-to-many relationships. In other words, it allows a record to have
more than one parent. Suppose an employee works for two departments. The strict hierarchical
arrangement is not possible here and the tree becomes a more generalized graph-a network. Logical
proximity fails because it is not possible to place a data item simultaneously in two locations in the
list. Although it is possible to handle such situations in a hierarchical model, it becomes more
complicated and difficult to comprehend. The network model was evolved to specifically handle
non-hierarchical relationships.
In network database terminology, a relationship is a set. Each set is made of at least two
types of records: an owner record and a member record.
6.Explain in detail about E. F. Codd‘s Twelve Rules for Relational Databases?
Codd’s twelve rules call for a language that can be used to define, manipulate, and query the
data in the database, expressed as a string of characters. Some references to the twelve rules
include a thirteenth rule – or rule zero:
1. Information Rule: All information in the database should be represented in one and only one
way – as values in a table
2. Guaranteed Access Rule: Each and every datum (atomic value) is guaranteed to be logically
accessible by resorting to a combination of table name, primary key value, and column name.
3. Systematic Treatment of Null Values: Null values (distinct from empty character string or a
string of blank characters and distinct from zero or any other number) are supported in the fully
relational DBMS for representing missing information in a systematic way, independent of data
type.
6. View Updating Rule: All views that are theoretically updateable are also updateable by the
system.
7. High-Level Insert, Update, and Delete: The capability of handling a base relation or a derived
relation as a single operand applies not only to the retrieval of data, but also to the insertion, update,
and deletion of data.
8. Physical Data Independence: Application programs and terminal activities remain logically
unimpaired whenever any changes are made in either storage representation or access methods.
9. Logical Data Independence: Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
10. Integrity Independence: Integrity constraints specific to a particular relational database must
be definable in the relational data sublanguage and storable in the catalog, not in the application
programs.
11. Distribution Independence: The data manipulation sublanguage of a relational DBMS must
enable application programs and terminal activities to remain logically unimpaired whether and
whenever data are physically centralized or distributed.
Relational Algebra
Relational algebra is a procedural query language, which takes instances of relations as input and
yields instances of relations as output. It uses operators to perform queries. An operator can be
either unary or binary. They accept relations as their input and yield relations as their output.
Relational algebra is performed recursively on a relation and intermediate results are also
considered relations. 2
3
The fundamental operations of relational algebra are as follows −
Select
Project
Union
Set difference
Cartesian product
Rename
We will discuss all these operations in the following sections.
Select operation (σ)
Notation − σp(r)
where σ stands for the selection predicate and r stands for the relation; p is a propositional logic formula which may use connectives like and, or, and not, and relational operators like =, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − selects tuples from Books where subject is 'database'.
Union operation (∪)
Notation − r ∪ s
where r and s are either database relations or relation result sets (temporary relations).
Set difference (−)
Notation − r − s
Finds all the tuples that are present in r but not in s.
Cartesian product (Χ)
Notation − r Χ s
r Χ s = { q t | q ∈ r and t ∈ s }
Rename operation (ρ)
Notation − ρ x (E)
where the result of expression E is saved with the name x.
Additional operations include:
Set intersection
Assignment
Natural join
Relational Calculus
In contrast to Relational Algebra, Relational Calculus is a non-procedural query language, that is, it
tells what to do but never explains how to do it.
Notation − {T | Condition}
For example − {T | T ∈ loan ∧ T[amount] > 1200}, which returns all loan tuples whose amount is greater than 1200.
TRC can be quantified: we can use the existential (∃) and universal (∀) quantifiers.
The domain relational calculus notation is
{<a1, a2, ..., an> | P(a1, a2, ..., an)}
where a1, a2, ..., an are attributes and P stands for formulae built from inner attributes.
For example − {<l, b, a> | <l, b, a> ∈ loan ∧ a > 1200}
Just like TRC, DRC can also be written using existential and universal quantifiers. DRC also
involves relational operators.
The expression power of Tuple Relation Calculus and Domain Relation Calculus is equivalent to
Relational Algebra.
JOIN Operation
– The sequence of Cartesian product followed by select is used so commonly to identify and select related tuples from two relations that a special operation, called JOIN, is provided. It is denoted by ⋈.
– This operation is very important for any relational database with more than a single relation, because it allows us to process relationships among relations.
– The general form of a join operation on two relations R(A1, A2, . . ., An) and S(B1, B2, . . ., Bm) is:
R ⋈<join condition> S
where R and S can be any relations that result from general relational algebra expressions.
Example: Suppose that we want to retrieve the name of the manager of each department. To get the manager's name, we need to combine each DEPARTMENT tuple with the EMPLOYEE tuple whose SSN value matches the MGRSSN value in the department tuple. We do this by using the join operation:
DEPT_MGR ← DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
EQUIJOIN Operation
The most common use of join involves join conditions with equality comparisons only. Such
a join, where the only comparison operator used is =, is called an EQUIJOIN. In the result of an
EQUIJOIN we always have one or more pairs of attributes (whose names need not be identical) that
have identical values in every tuple. The JOIN seen in the previous example was EQUIJOIN.
The set of operations including select σ, project π, union ∪, set difference −, and Cartesian product × is called a complete set because any other relational algebra expression can be expressed by a combination of these five operations.
For example:
R ∩ S = (R ∪ S) − ((R − S) ∪ (S − R))
R ⋈<join condition> S = σ<join condition>(R × S)
8.Consider the given relation schema . ( MAY 2017, NOV 2018 )
Employee(empno,name,office,age)
Books(isbn,title,authors,publisher)
Loan(empno,isbn,date)
(a) Find the name of all employees who have borrowed a book published by McGraw-Hill.
(b) Find the name of all employees who have borrowed all book published by McGraw-
Hill.
(c) Find the names of employees who have borrowed more than five different books
published by McGraw-Hill.
(d) For each publisher, find the name of employees who have borrowed more than five
books of that publisher.
a. select name from employee e, books b, loan l where e.empno = l.empno and l.isbn = b.isbn and b.publisher = 'McGraw-Hill'
b. select name from employee e join loan l on e.empno = l.empno join (select isbn from books where publisher = 'McGraw-Hill') x on l.isbn = x.isbn group by e.empno, name having count(*) = (select count(*) from books where publisher = 'McGraw-Hill')
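Parts (c) and (d) are not answered in the source; plausible solutions in the same style:
c. select name from employee e, books b, loan l where e.empno = l.empno and l.isbn = b.isbn and b.publisher = 'McGraw-Hill' group by e.empno, name having count(distinct b.isbn) > 5
d. select b.publisher, e.name from employee e, books b, loan l where e.empno = l.empno and l.isbn = b.isbn group by b.publisher, e.empno, e.name having count(distinct b.isbn) > 5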
The DDL gets as input some instructions and generates some output. The output of the DDL is placed in the data dictionary, which contains metadata, that is, data about data. The data dictionary is considered to be a special type of table, which can only be accessed and updated by the database system itself. A database system consults the data dictionary before reading or modifying actual data.
ii) Data Manipulation Language:
A data manipulation language (DML) is a language that enables users to access or
manipulate data organized by the appropriate data model. The types of access are:
o Retrieval of information stored in the database
o Insertion of new information into the database
o Deletion of information from the database
o Modification of information stored in the database
There are basically two types:
Procedural DMLs require a user to specify what data are needed and how to get those data.
Declarative DMLs (also referred to as nonprocedural DMLs) require a user to specify what data are needed without specifying how to get those data.
A query is a statement requesting the retrieval of information. The portion of a DML that
involves information retrieval is called a query language.
Embedded SQL:
An embedded SQL program must be processed by a special preprocessor prior to compilation.
The preprocessor replaces embedded SQL requests with host-language declarations and procedure
calls that allow run-time execution of the database accesses. Then, the resulting program is compiled
by the host language compiler. To identify embedded SQL requests to the preprocessor, we use the
EXEC SQL statement. It has the form
EXEC SQL <embedded SQL statement> END-EXEC
The exact syntax for embedded SQL requests depends on the language in which SQL is
embedded. For instance, a semicolon is used instead of END-EXEC when SQL is embedded in C.
The java embedding of SQL called (SQLJ) uses the syntax
#SQL{<embedded SQL statement>}
The statement SQL INCLUDE is placed in the program to identify the place where the
preprocessor should insert the special variables used for communication between the program and
the database system. Variables of the host language can be used within embedded SQL statements,
but they must be preceded by a colon (:) to distinguish them from SQL variables.
Before executing any SQL statements, the program must first connect to the database. This is
done using
EXEC SQL connect to server user user-name END-EXEC
The group by clause is used to apply aggregate functions to sets of tuples. The attributes given in the group by clause are used to form groups: tuples with the same value on all attributes in the group by clause are placed in one group.
Aggregate Functions
– A type of request that cannot be expressed in the basic relational algebra is to specify
mathematical aggregate functions on collections of values from the database.
– Examples of such functions include retrieving the average or total salary of all
employees or the total number of employee tuples. These functions are used in simple
statistical queries that summarize information from the database tuples.
– Common functions applied to collections of numeric values include SUM,
AVERAGE, MAXIMUM, and MINIMUM. The COUNT function is used for
counting tuples or values.
Example:
1. Consider the following SQL query on the EMPLOYEE relation:
SELECT Lname, Fname
FROM EMPLOYEE
WHERE Salary > ( SELECT MAX (Salary)
FROM EMPLOYEE
WHERE Dno=5 );
This query retrieves the names of employees (from any department in the company) who
earn a salary that is greater than the highest salary in department 5. The query includes a nested
subquery and hence would be decomposed into two blocks.
2. Find the names of all employees who have borrowed all books published by McGraw-Hill.
select name from employee e join loan l on e.empno = l.empno join (select isbn from books where publisher = 'McGraw-Hill') x on l.isbn = x.isbn group by e.empno, name having count(*) = (select count(*) from books where publisher = 'McGraw-Hill')
11.Describe the six clauses in the syntax of an SQL query, and show what type of constructs can be
specified in each of the six clauses. Which of the six clauses are required and which are optional?
(NOV 2015)
There are six clauses that can be used in an SQL statement. These six clauses are SELECT,
FROM, WHERE, GROUP BY, HAVING, and ORDER BY. Clauses must be coded in a
specific sequence.
1. SELECT column name(s)
2. FROM tables or views
3. WHERE conditions or predicates are met
4. GROUP BY subsets of rows
5. HAVING a common condition as a group
6. ORDER BY a sorting method
Column name(s) are more correctly referred to as elements, because the SELECT statement displays both columns that exist in the table and columns that may be generated by SQL as a result of a query.
Example query
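The example query itself did not survive in the source; reconstructed from the description below, with the table and column names (sales, perkey, dollars) inferred from the prose:
SELECT perkey, SUM(dollars)
FROM sales
WHERE perkey < 50
GROUP BY perkey
HAVING SUM(dollars) > 8000
ORDER BY perkey;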
The SELECT clause is where you list the columns you're interested in. The SELECT displays
what you put here.
In the example, the perkey column, as well as the sum of the dollar column, are selected.
The FROM clause indicates the table you're getting your information from. You can list more
than one table. The number of tables you could list is specific to your operating system. In
the example, both columns are selected from the Sales table.
SELECT and FROM are required; the rest of these clauses are optional and serve to filter or
limit, aggregate or combine, and control the sort.
The WHERE clause is where you indicate a condition. This helps you filter unwanted data
from the results. WHERE gives you a subset of the rows in a table. In the example, only
rows with a perkey value of less than 50 are selected.
GROUP BY allows you to group your data to achieve more meaningful results. Instead of
getting a total sum of the dollar sales for all the rows selected, you can break down sales
by perkey to get the daily sales total. This is done in the example by indicating GROUP
BY perkey.
HAVING puts a condition on your groups. In the example, only those days that have a total
dollar amount greater than 8,000 are returned.
ORDER BY orders your result rows. You can choose to order results by ASC (ascending) or
DESC (descending) order. The default is ASC.
The SELECT statement is the most common usage of the data manipulation language (DML). The other DML statements are UPDATE, INSERT, and DELETE; the other two components of SQL are the data definition language and the data control language.
Purpose:
By omitting data and adding functions, joins, etc., to a view, it allows us to present exactly the data we want to the user.
SELECT title FROM Movie WHERE studioName = 'Paramount' AND year = '1979';
Example:
Movie (title, year, length, inColor, studioName, producerC#)
MovieExec (name, address, cert#, netWorth)
CREATE VIEW MovieProd AS SELECT title, name FROM Movie, MovieExec WHERE producerC# = cert#;
SELECT name FROM MovieProd WHERE title = 'Gone With the Wind';
SELECT name FROM Movie, MovieExec WHERE producerC# = cert# AND title = 'The War Of the World';
DATE – fixed-length date/time in dd-mm-yy form
13.Explain about Data Definition Language? (JUN 2016)
Data Definition Language:
SQL, however, uses a collection of imperative verbs whose effect is to modify the schema of the database by adding, changing, or deleting definitions of tables or other objects.
These statements can be freely mixed with other SQL statements, so the DDL is not truly a
separate language.
CREATE statements
Create - To make a new database, table, index, or stored procedure.
A commonly used CREATE command is the CREATE TABLE command. The typical usage is:
CREATE TABLE [table name] ( [column definitions] ) [table parameters]
Column definitions and table parameters are RDBMS-specific functionality. For example, the command to create a table named employees with a few sample columns would be:
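The example itself is missing here; a representative version (the column definitions are illustrative):
CREATE TABLE employees (
    id          INTEGER      PRIMARY KEY,
    first_name  VARCHAR(50)  NULL,
    last_name   VARCHAR(75)  NOT NULL,
    dateofbirth DATE         NULL
);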
DROP statements
A DROP statement in SQL removes an object from a relational database management system (RDBMS). The types of objects that can be dropped depend on which RDBMS is being used, but most support the dropping of tables, users, and databases. Some systems (such as PostgreSQL) allow DROP and other DDL commands to occur inside of a transaction and thus be rolled back.
For example, the command to drop a table named employees would be:
DROP TABLE employees;
The DROP statement is distinct from the DELETE and TRUNCATE statements, in that
DELETE and TRUNCATE do not remove the table itself. For example, a DELETE statement
might delete some (or all) data from a table while leaving the table itself in the database,
whereas a DROP statement would remove the entire table from the database.
ALTER statements
For example, the command to add (then remove) a column named bubbles for an existing table
named sink would be:
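The example commands are missing from the source; a representative version (the column type is an assumption):
ALTER TABLE sink ADD bubbles INTEGER;
ALTER TABLE sink DROP COLUMN bubbles;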
Rename statement
Finally, another kind of DDL sentence in SQL is one used to define referential integrity
relationships, usually implemented as primary key and foreign key tags in some columns of
the tables. These two statements can be included inside a CREATE TABLE or an ALTER
TABLE sentence.
The fetch statement requires one host-language variable for each attribute of the result
relation. In the example query cn holds the customer-name and cc holds the customer-city.
To obtain all tuples of the result, the program must contain a loop to iterate over tuples.
When the program executes an open statement on a cursor, the cursor is set to point to the first tuple
of the result. Each time it executes a fetch statement, the cursor is updated to point to the next tuple
of the result. When no further tuples remain to be processed, the variable called SQLSTATE in the
SQL communication area (SQLCA) gets set to ‘02000’ to indicate no more data is available.
The close statement causes the database system to delete the temporary relation that holds
the result of the query.
EXEC SQL close c END-EXEC
Updates through Cursors
Embedded SQL can update tuples fetched by cursor by declaring that the cursor is for update
as shown below:
declare c cursor for
select * from account where branch-name = 'Perryridge'
for update
We then iterate through the tuples by performing fetch operations on the cursor, and after
fetching each tuple the following code is executed.
update account set balance = balance + 100 where current of c
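Putting these pieces together, a short sketch of a cursor loop in a C host program (host-variable and table names are illustrative, not from the source):
/* fragment of a C host program; assumes #include <stdio.h> and <string.h> */
EXEC SQL BEGIN DECLARE SECTION;
    char SQLSTATE[6];   /* set to '02000' when no more data remains */
    char cname[21];     /* host variable for customer-name */
EXEC SQL END DECLARE SECTION;

EXEC SQL declare c cursor for
    select customer_name from depositor;
EXEC SQL open c;
for (;;) {
    EXEC SQL fetch c into :cname;
    if (strcmp(SQLSTATE, "02000") == 0) break;  /* no more tuples */
    printf("%s\n", cname);
}
EXEC SQL close c;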
INSERT INTO EMPLOYEE VALUES ('Richard', 'K', 'Marini', '653298653', '1962-12-30', '98 Oak Forest, Katy, TX', 'M', 37000, '653298653', 4);
Example:
DELETE FROM EMPLOYEE WHERE Lname = 'Brown';
For example, to change the location and controlling department number of project number 10 to 'Bellaire' and 5, respectively:
UPDATE PROJECT SET Plocation = 'Bellaire', Dnum = 5 WHERE Pnumber = 10;
16. Explain the various ways in which the Select statement can be used?
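The answer is not present in the source; a brief sketch of common SELECT forms (table and column names are illustrative):
-- all columns, all rows
SELECT * FROM employee;
-- chosen columns with a row filter
SELECT name, salary FROM employee WHERE dept = 'Sales';
-- aggregation with grouping
SELECT dept, AVG(salary) FROM employee GROUP BY dept;
-- join of two tables
SELECT e.name, d.dept_name FROM employee e, department d WHERE e.dept_id = d.dept_id;
-- nested subquery
SELECT name FROM employee WHERE salary > (SELECT AVG(salary) FROM employee);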
UNIT – 2
DATABASE DESIGN
PART - A
1. Why is 4NF more desirable than BCNF? (NOV 2014)
Because 4NF minimizes redundancy as well as making storage management easier. Redundancy is reduced as we normalize further, and this avoids consistency problems.
The ER model is a graphical representation of real-world objects with their attributes and relationships. It makes the system easily understandable. This model is considered a top-down approach for designing a requirement.
6. Define Boyce-Codd normal form? Why is BCNF stricter than 3NF? (NOV 2019)
A relation schema R is in BCNF with respect to a set F of functional dependencies if, for all functional dependencies in F of the form α → β, either α → β is trivial or α is a super key for R.
BCNF is stricter than 3NF because every BCNF relation is also in 3NF, but not every 3NF relation is in BCNF. In BCNF every attribute is non-transitively dependent on each candidate key, whereas there is no such requirement in 3NF. Hence BCNF is stricter than 3NF.
Functional dependencies are used to test relations to see whether they are legal under a given set of functional dependencies, and to specify constraints on the set of legal relations.
13. What are the desirable properties of decomposition? (MAY 2017, Nov 2017)
Lossless join and dependency preserving are the two desirable properties of decomposition.
17. Define the terms Entity set and Relationship set? (MAY 2019)
Entity set: The set of all entities of the same type is termed as an entity set.
Relationship set: The set of all relationships of the same type is termed as a relationship set.
Stored attributes: The attributes stored in a data base are called stored attributes.
Derived attributes: The attributes that are derived from the stored attributes are called derived
attributes.
Key attribute: An entity type usually has an attribute whose values are distinct for each individual entity in the collection. Such an attribute is called a key attribute.
Value set: Each simple attribute of an entity type is associated with a value set
that specifies the set of values that may be assigned to that attribute for each individual
entity.
Entity type: An entity type defines a collection of entities that have the same attributes.
Entity set: The set of all entities of the same type is termed as an entity set.
• Total: The participation of an entity set E in a relationship set R is said to be total if every
entity in E participates in at least one relationship in R.
• Partial: if only some entities in E participate in relationships in R, the participation of entity
set E in relationship R is said to be partial.
25. What is the significance of "participation role name" in the description of relationship types? (NOV 2019)
The participation role is the part that an entity plays in a relationship. Role names are important in the depiction of a relationship type when the same entity type participates more than once in the relationship type in different roles; role names are necessary in recursive relationships.
PART - B
1. Construct an E-R diagram for a car insurance company whose customers own one or
more cars each. Each car has associated with it zero to any number of recorded
accidents. Each insurance policy covers one or more cars, and has one or more
premium payments associated with it. Each payment is for a particular period of time
and has an associated due date, and the date when the payment was received? (NOV
2016, NOV 2018)
appropriate tables for the above ER Diagram :
Car insurance tables:
person (driver-id, name, address)
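The remaining tables did not survive in the source; a plausible completion along the lines of the standard textbook answer:
car (license, model, year)
accident (report-number, date, location)
owns (driver-id, license)
participated (driver-id, license, report-number, damage-amount)
policy (policy-id)
covers (policy-id, license)
premium-payment (policy-id, payment-no, due-date, amount, received-on)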
2.Discuss the correspondence between the ER model construct and the relational model
constructs. Show how each ER model construct can be mapped to the relational model.
Discuss the option for mapping EER model construct? (MAY 2017), (NOV 2019)
The relational model uses the concept of a mathematical relation, which looks somewhat like a table of values, as its basic building block, and has its theoretical basis in set theory and first-order predicate logic.
The relational model represents the database as a collection of relations. Each relation resembles a table of values or, to some extent, a "flat" file of records. When a relation is thought of as a table of values, each row in the table represents a collection of related data values. In the relational model, each row in the table represents a fact that typically corresponds to a real-world entity or relationship. The table name and column names are used to help in interpreting the meaning of the values in each row. In the formal relational model terminology, a row is called a tuple, a column header is called an attribute, and the table is called a relation. The data type describing the types of values that can appear in each column is represented by a domain of possible values.
ER Model:
The first stage of information system design uses these models during the requirements
analysis to describe information needs or the type of information that is to be stored in a
database. In the case of the design of an information system that is based on a database,
the conceptual data model is, at a later stage (usually called logical design), mapped to a
logical data model, such as the relational model; this in turn is mapped to a physical model
during physical design. We create a relational schema from an entity-relationship(ER)
schema.
Sometimes, both of these phases are referred to as "physical design". Key elements of the ER model are entities, attributes, identifiers and relationships.
Mapping of regular entity types:
For each regular entity type E in the ER schema, create a relation R that includes all the simple attributes of E. Include only the simple component attributes of a composite attribute. Choose one of the key attributes of E as the primary key for R. If the chosen key of E is composite, the set of simple attributes that form it will together form the primary key of R.
If multiple keys were identified for E during the conceptual design, the information describing the attributes that form each additional key is kept in order to specify secondary (unique) keys of relation R. Knowledge about keys is also kept for indexing purposes and other types of analyses.
The relations that are created from the mapping of entity types are sometimes called entity relations, because each tuple represents an entity instance.
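As a hedged illustration of this mapping rule, an entity type EMPLOYEE with key attribute Ssn and composite attribute Name(Fname, Lname) might map to the following relation (all names are illustrative):
CREATE TABLE employee (
    ssn    CHAR(9) PRIMARY KEY,   -- chosen key attribute of E
    fname  VARCHAR(30),           -- simple components of the composite attribute Name
    lname  VARCHAR(30),
    salary DECIMAL(10, 2)         -- a simple attribute of E
);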
A functional dependency (FD) is a constraint between two sets of attributes from the database. For any two tuples t1 and t2 in r, if t1[X] = t2[X], we must also have t1[Y] = t2[Y].
Notation: X → Y
Functional dependencies allow us to express constraints that cannot be expressed using super keys. Consider a schema with attributes loan-number and branch-name: we expect the FD loan-number → branch-name to hold, since each loan is issued by a single branch.
o test relations to see if they are legal under a given set of functional dependencies.
o We say that F holds on R if all legal relations on R satisfy the set of functional dependencies F.
Note: A specific instance of a relation schema may satisfy a functional dependency even if the
functional dependency does not hold on all legal instances.
Example:
Employee (SSN, Name, JobType), e.g. the tuple (214-45-2398, Lance Smith, Engineer).
Note: Name is functionally dependent on SSN because an employee's name can be uniquely determined from their SSN. Name does not determine SSN, because more than one employee can have the same name.
Keys
Whereas a key is a set of attributes that uniquely identifies an entire tuple, a functional dependency allows us to express constraints that uniquely identify the values of certain attributes.
However, a candidate key is always a determinant, but a determinant doesn't need to be a key.
Axioms
Before we can determine the closure of the relation Student, we need a set of rules. Developed by Armstrong in 1974, there are six rules (axioms) from which all possible functional dependencies may be derived: reflexivity, augmentation, transitivity, union, decomposition and pseudotransitivity.
Properties of FDs
Duplicated X-values cause redundancy of the corresponding Y-values.
Given an instance r of a relation R, we can only determine that some FD is not satisfied by r; we cannot determine from r alone whether an FD holds on all legal instances of R.
5. A car rental company maintains a database for all vehicles in its current fleet. For all vehicles, it includes the vehicle identification number, license number, manufacturer, model, date of purchase and color. Special data are included for certain types of vehicles.
Trucks: cargo capacity
Sports cars: horsepower, renter age requirement
Vans: number of passengers
Off-road vehicles: ground clearance, drivetrain (four- or two-wheel drive)
Construct an ER model for the car rental company database. (NOV 2015)
6. State the need for normalization of a database and explain the various normal forms (1st, 2nd, 3rd, BCNF, 4th, 5th and Domain-Key) with suitable examples? (JUN 2015, APR 2018)
(OR)
Exemplify multivalued dependency and fourth normal form (4NF), and join dependency and fifth normal form (5NF)? (NOV 2019)
(OR)
What is Normalization? Explain in detail about all Normal Forms (MAY 2019, NOV 2014)
Normalization of Database
Database normalization is a technique of organizing the data in the database. Normalization is a systematic approach of decomposing tables to eliminate data redundancy and undesirable characteristics like insertion, update and deletion anomalies. It is a multi-step process that puts data into tabular form by removing duplicated data from the relation tables.
Normalization is used mainly for two purposes: eliminating redundant data, and ensuring data dependencies make sense. Without it, the following anomalies arise:
Update Anomaly: To update the address of a student who occurs twice or more in a table, we will have to update the S_Address column in all those rows, else the data will become inconsistent.
Insertion Anomaly: Suppose for a new admission we have a student id (S_id), name and address of a student, but if the student has not opted for any subjects yet, then we have to insert NULL there, leading to an insertion anomaly.
Deletion Anomaly: If (S_id) 401 has only one subject and temporarily drops it, when we delete that row, the entire student record will be deleted along with it.
Normalization Rule
Normalization rules are divided into the following normal forms.
First Normal Form (1NF)
Original Student table:
Student Age Subject
Adam 15 Biology, Maths
Alex 14 Maths
Stuart 17 Maths
In First Normal Form, any row must not have a column in which more than one value is saved, for example separated with commas. Rather than that, we must separate such data into multiple rows.
Student table following 1NF will be:
Student Age Subject
Adam 15 Biology
Adam 15 Maths
Alex 14 Maths
Stuart 17 Maths
Using the First Normal Form, data redundancy increases, as there will be many columns with the same data in multiple rows, but each row as a whole will be unique.
Second Normal Form (2NF)
New Student table for 2NF:
Student Age
Adam 15
Alex 14
Stuart 17
In the Student table the candidate key will be the Student column, because all other columns, i.e. Age, are dependent on it.
New Subject table introduced for 2NF:
Student Subject
Adam Biology
Adam Maths
Alex Maths
Stuart Maths
In the Subject table the candidate key will be the {Student, Subject} columns. Now, both the above tables qualify for Second Normal Form and will never suffer from update anomalies. Although there are a few complex cases in which a table in Second Normal Form suffers update anomalies, to handle those scenarios Third Normal Form is there.
Third Normal Form (3NF)
In this table Student_id is the primary key, but street, city and state depend upon Zip. The dependency between Zip and the other fields is called a transitive dependency. Hence to apply 3NF, we need to move street, city and state to a new table, with Zip as the primary key.
New Student_Detail table: (Student_id, Student_name, DOB, Zip)
Address table: (Zip, Street, City, State)
Boyce and Codd Normal Form (BCNF)
Boyce and Codd Normal Form is a higher version of the Third Normal Form. This form deals with a certain type of anomaly that is not handled by 3NF. A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF. For a table to be in BCNF, the following conditions must be satisfied:
R must be in 3rd Normal Form, and
for each functional dependency (X → Y), X should be a super key.
If we observe the data in the table above it satisfies 3NF. But LECTURER and BOOKS are two
independent entities here. There is no relationship between Lecturer and Books. In the above
example, either Alex or Bosco can teach Mathematics. For Mathematics subject , student can refer
either 'Maths Book1' or 'Maths Book2'. i.e.;
SUBJECT-->BOOKS
This is a multivalued dependency on SUBJECT. If we need to select both the lecturer and the books recommended for any of the subjects, it will show up (lecturer, books) combinations, which imply which lecturer recommends which book. This is not correct.
Now if we want to know the lecturer names and books recommended for any of the subject, we will
fire two independent queries. Hence it removes the multi-valued dependency and confusion around
the data. Thus the table is in 4NF.
Fifth Normal Form (5NF)
A table is in 5NF if it is in 4NF and it cannot be decomposed further without anomalies: when we re-join the decomposed tables by means of candidate keys, we should not lose the original data, and no new record sets should arise. In simple words, joining two or more decomposed tables should not lose records nor create new records.
Consider an example of different Subjects taught by different lecturers and the lecturers taking
classes for different semesters.
Note: Please consider that Semester 1 has Mathematics, Physics and Chemistry and Semester 2 has
only Mathematics in its academic year!!
In the above table, Rose takes both Mathematics and Physics classes for Semester 1, but she does not take a Physics class for Semester 2. In this case, a combination of all these 3 fields is required to identify valid data. Imagine we want to add a new class, Semester 3, but do not know the subject or who will be taking it. We would simply be inserting a new entry with Class as Semester 3 and leaving Lecturer and Subject as NULL. As we discussed above, it is not good to have such entries. Moreover, since all three columns together act as a primary key, we cannot leave the other two columns blank!
Hence we have to decompose the table in such a way that it satisfies all the rules till 4NF and when
join them by using keys, it should yield correct record. Here, we can represent each lecturer's
Subject area and their classes in a better way. We can divide above table into three - (SUBJECT,
LECTURER), (LECTURER, CLASS), (SUBJECT, CLASS)
Now, each of the combinations is in three different tables. If we need to identify who is teaching which subject to which semester, we need to join the keys of each table and get the result.
For example, to find who teaches Physics to Semester 1, we would select Physics and Semester 1 from table 3 above, join with table 1 using Subject to filter out the lecturer names, and then join with table 2 using Lecturer to get the correct lecturer name. That is, we joined the key columns of each table to get the correct data. Hence there is no lost or new data, satisfying the 5NF condition.
7. Draw an E-R diagram for the "Restaurant menu ordering system" that will facilitate food item ordering and services within a restaurant. The entire restaurant scenario is detailed as follows. The customer is able to view the food items menu, call the waiter, place orders and obtain the final bill through the computer kept at their table. The waiters, through their wireless tablet PCs, are able to initialize a table for a customer, control the table functions to assist customers, take orders, send orders to the food preparation staff (chefs) and finalize the customer's bill. The food preparation staff (chefs), with their touch-display interfaces to the system, are able to view orders sent to the kitchen by waiters. During preparation, they are able to let the waiter know the status of each item and can send notifications when items are completed. The system should have full accountability and logging facilities and should support supervisor actions to account for exceptional circumstances such as a meal being refunded or walked out on. (May 2015)
8.Explain first normal form, second normal form, third normal form and BCNF with an
example? (NOV 2016, Nov 2019)
1st Normal Form:
The requirements to satisfy 1NF:
Each table has a primary key: minimal set of attributes which can uniquely identify a
record
The values in each column of a table are atomic (No multi-value attributes allowed).
There are no repeating groups: two columns do not store similar information in the same table
Example 2:
Title: Database System Concepts | Authors: Abraham Silberschatz, Henry F. Korth | ISBN: 0072958863 | Subject: MySQL, Computers | Pages: 1168 | Publisher: McGraw-Hill
Problems with Table 1:
First, this table is not very efficient with storage.
Second, our Subject field contains more than one piece of information. With more than one value in a single field, it would be very difficult to search for all books on a given subject.
Table 2
We now have two rows for a single book, which introduces redundancy (and the ISBN could no longer serve as a unique primary key). The solution is to split the data into separate tables.
Book table:
Each table has a primary key, used for joining tables together when querying the data. A primary key value must be unique within the table (no two books can have the same ISBN number), and a primary key is also an index, which speeds up data retrieval based on the primary key.
Now to define relationships between the tables
• A form is in 2NF if and only if it is in 1NF and has no attributes which require only part of the key to uniquely identify them.
• Where a key has more than one attribute, check that each non-key attribute depends on the whole key and not part of the key.
• For each subset of the key which determines an attribute or group of attributes, create a new form. Move the dependent attributes to the new form.
• Add the part key to the new form, making it the primary key.
Every non-key attribute is fully dependent on each candidate key of the relation.
Second Normal Form (or 2NF) deals with redundancy of data in vertical columns.
Here is a list of attributes in a table that is in First Normal Form:
Department
Project_Name
Employee_Name
Emp_Hire_Date
Project_Manager
Project_Name and Employee_Name together form the candidate key for this table. Emp_Hire_Date and Project_Manager depend on only part of that key: Emp_Hire_Date depends on Employee_Name alone, and Project_Manager depends on Project_Name alone. Therefore, this table does not satisfy the Second Normal Form.
In order to satisfy the Second Normal Form, we need to put the Emp_Hire_Date and Project_Manager to
other tables. We can put the Emp_Hire_Date to the Employee table and put the Project_Manager to the
Project table.
Department table: Project_Name, Employee_Name
Project table: Project_ID, Project_Name, Project_Manager
Employee table: Employee_ID, Employee_Name, Employee_Hire_Date
Now the Department table has only the candidate key left.
An attribute C is transitively dependent on attribute A if there exists an attribute B such that A → B and B → C; then A → C.
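This transitivity can be computed with the standard attribute-closure algorithm. A small sketch (Python; the attribute names are generic, not tied to the invoice example below):

def closure(attrs, fds):
    # repeatedly apply FDs whose left-hand side is already in the closure
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(closure({"A"}, fds))   # {'A', 'B', 'C'}: A transitively determines C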
Here is the Second Normal Form of the invoice table:
It violates the Third Normal Form because there is redundancy in having multiple invoice numbers for the same customer. In this example, Jones has both invoice 1001 and invoice 1003.
To solve the problem, we have to create another table for the Customers.
With the Customer table, there is no transitive relationship between the invoice number and the customer name and address, and there is no redundancy in the customer information.
Boyce-Codd Normal Form:
• If there is only one candidate key then 3NF and BCNF are the same
Conversion to BCNF
Stream has been put into BCNF but we have lost the FD {Student, Course} → {Time}
Decomposition Properties
• Dependency preservation: It is desirable that FDs are preserved when splitting relations
up
Converting to BCNF
1) The determinant, Offering#, becomes part of the key, and the dependent attribute, T_Code, becomes a non-key attribute. The key is now the composite (S_Num, Offering#), and the dependency diagram reflects this.
2) There are problems with this structure, as T_Code is now dependent on only part of the key. This violates the rules for 2NF, so the table needs to be divided, with the partial dependency becoming a new table. The dependencies would then be:
3) The original table is divided into two new tables. Each is in 3NF and in BCNF.
Student Review:
OfferingTeacher:
Offering#    T_Code#
01764        FIT104
01765        PIT305
01789        PIT107
Anomalies:
INSERT: We cannot record the city for a supplier_no without also knowing the supplier_name
DELETE: If we delete the row for a given supplier_name, we lose the information that the supplier_no is
associated with a given city.
9. Design and draw an E-R diagram for university database? (Nov 2019)
10. Explain with suitable example, the constraints of specialization and generalization in ER data
modeling.(Nov 2019).
The ER Model has the power of expressing database entities in a conceptual hierarchical manner. As the
hierarchy goes up, it generalizes the view of entities, and as we go deep in the hierarchy, it gives us the
detail of every entity included.
Going up in this structure is called generalization, where entities are clubbed together to represent a more
generalized view. For example, a particular student named Mira can be generalized along with all the
students. The entity shall be a student, and further, the student is a person. The reverse is
called specialization where a person is a student, and that student is Mira.
Generalization:
As mentioned above, the process of generalizing entities, where the generalized entity contains the properties of all the specialized entities, is called generalization. In generalization, a number of entities are brought together into one generalized entity based on their similar characteristics. For example, pigeon, house sparrow, crow and dove can all be generalized as Birds.
Specialization:
Specialization is the opposite of generalization. In specialization, a group of entities is divided into sub-
groups based on their characteristics. Take a group ‘Person’ for example. A person has name, date of birth,
gender, etc. These properties are common in all persons, human beings. But in a company, persons can be
identified as employee, employer, customer, or vendor, based on what role they play in the company.
Similarly, in a school database, persons can be specialized as teacher, student, or a staff, based on what
role they play in school as entities.
Inheritance:
We use all the above features of the ER model to create classes of objects in object-oriented programming. The details of entities are generally hidden from the user; this process is known as abstraction.
Inheritance is an important feature of Generalization and Specialization. It allows lower-level entities to
inherit the attributes of higher-level entities.
For example, the attributes of a Person class such as name, age, and gender can be
inherited by lower-level entities such as Student or Teacher.
UNIT- 3
TRANSACTIONS
PART-A
1. What is transaction?
Collections of operations that form a single logical unit of work are called transactions.
3. What are the properties of transaction? (Nov’2014, May’2015 & May 2016)
(OR)
What are ACID Properties? (Dec 2017)
Atomicity, Consistency, Isolation and Durability.
Committed - Transaction successfully completed and all write operations made
permanent in the database.
The mechanism for converting an exclusive lock into a shared lock is known as downgrade.
17. What are the two methods for dealing with the deadlock problem?
The two methods for dealing with the deadlock problem are deadlock detection and deadlock recovery.
the database while the transaction is still in the active state. Data modifications written by
active transactions are called uncommitted modifications.
32. Differentiate strict two phase locking protocol and rigorous two phase locking
protocol. (May/June 2016)
In strict two phase locking protocol all exclusive mode locks taken by a
transaction is held until that transaction commits.
Rigorous two phase locking protocol requires that all locks be held until the
transaction commits.
34. What are the timestamps associated with each data item?
• W-timestamp(Q) denotes the largest timestamp of any transaction that executed WRITE(Q) successfully.
• R-timestamp(Q) denotes the largest timestamp of any transaction that executed READ(Q) successfully.
35. What is Serializability? How is it tested? (May 2014, Nov’2014, Nov’2016, &May 2018)
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule.
Precedence graph is used to test the serializability.
The schedule in which the transactions execute one after the other is called a serial schedule. It is consistent in nature. For example, consider two transactions T1 and T2: all operations of T1 are executed first, and after that all operations of T2 are executed.
41. What type of lock is needed for insert and delete operations? (May 2017)
An exclusive (X) lock must be obtained on a data item before it can be inserted or deleted.
42. What is the difference between shared lock and exclusive lock? (May 2018)
A shared (S) lock allows several transactions to read a data item concurrently but not to write it, whereas an exclusive (X) lock gives a single transaction the right to both read and write the data item.
43. What is rigorous two phase locking protocol? (Dec 2013)
This is a stricter two-phase locking protocol: here, all locks (shared as well as exclusive) are held until the transaction commits.
45. List the responsibilities a DBMS has whenever a transaction is submitted to the system for execution.
(Nov 2019)
Begin the transaction. Execute a set of data manipulations and/or queries. If no errors occur then commit
the transaction and end it. If errors occur then roll back the transaction and end it.
46.Brief any two violations that may occur if a transaction executes a lower isolation level than
serializable? (Nov 2019)
Lost updates.
Dirty read (or uncommitted data).
Unrepeatable read (or inconsistent retrievals).
PART-B
1. Briefly explain about Two phase commit and three phase commit protocols. (Nov’ 2014,
May 2015 & May 2016)
(OR)
Explain two phase commit protocol with an example?
Possible Failures
Site Failure
Coordinator Failure
Network Partition
Drawbacks:
higher overheads
Assumptions may not be satisfied in practice.
1. Conflict serializability
2. View serializability
1. Conflict serializability
Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there
exists some item Q accessed by both li and lj, and at least one of these instructions wrote
Q.
1. li = read(Q), lj = read(Q).li and lj don’t conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict
4. li = write(Q), lj = write(Q). They conflict
• If a schedule S can be transformed into a schedule S´ by a series of swaps of non-
conflicting instructions, we say that S and S´ are conflict equivalent.
• We say that a schedule S is conflict serializable if it is conflict equivalent to a serial
schedule
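A hedged sketch of the test (Python; the schedule encoding as (transaction, operation, item) triples is an assumption): build the precedence graph by adding an edge Ti → Tj whenever an action of Ti conflicts with a later action of Tj; the schedule is conflict serializable exactly when the graph is acyclic.

def precedence_edges(schedule):
    edges = set()
    for i, (ti, op1, x) in enumerate(schedule):
        for tj, op2, y in schedule[i + 1:]:
            if ti != tj and x == y and "W" in (op1, op2):
                edges.add((ti, tj))   # conflict: same item, at least one write
    return edges

def conflict_serializable(schedule):
    nodes = {t for t, _, _ in schedule}
    edges = precedence_edges(schedule)
    color = {n: 0 for n in nodes}      # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(u):                        # depth-first cycle detection
        color[u] = 1
        for a, b in edges:
            if a == u and (color[b] == 1 or (color[b] == 0 and dfs(b))):
                return True
        color[u] = 2
        return False
    return not any(color[n] == 0 and dfs(n) for n in nodes)

s = [("T1", "R", "X"), ("T2", "R", "X"), ("T1", "W", "Y"),
     ("T2", "W", "Y"), ("T1", "R", "Y"), ("T2", "R", "Y")]
print(conflict_serializable(s))   # False: T1 -> T2 and T2 -> T1 on Y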
2. View serializability
Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met for each data item Q:
1. If transaction Ti reads the initial value of Q in S, it must also read the initial value of Q in S´.
2. If Ti reads a value of Q written by transaction Tj in S, it must also read the value of Q written by Tj in S´.
3. The transaction (if any) that performs the final write(Q) in S must also perform the final write(Q) in S´.
Active – Reading and writing data items; if something goes wrong during reading or writing, the transaction aborts and moves to Failed.
Partially Committed – All read and write operations are done; the transaction aborts to Failed if a rollback occurs, or moves to Committed when a commit occurs.
• Atomicity: Either all operations of the transaction are properly reflected in the database
or none are.
• Durability: After a transaction completes successfully, the changes it has made to the
database persist, even if there are system failures.
First, a transaction performing read or write operation using I/O devices may not be using
the CPU at a particular point of time. Thus, while one transaction is performing I/O
operations, the CPU can process another transaction. This is possible because CPU and I/O
system in the computer system are capable of operating in parallel. This overlapping of I/O
and CPU activities reduces the amount of time for which the disks and processors are idle
and, thus, increases the throughput of the system (the number of transactions executed in a
given amount of time).
This property of DBMS allows many transactions to access the same database at the same
time without interfering with each other.
The primary goal of concurrency is to ensure the atomicity of the execution of transactions in a multi-user database environment. Concurrency control mechanisms attempt to interleave the READ and WRITE operations of multiple transactions so that the interleaved execution yields results identical to those of some serial schedule execution.
A lost update problem occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database item incorrect.
In other words, if transactions T1 and T2 both read a record and then update it, the effects of the
first update will be overwritten by the second update.
Example:
Consider the situation given in figure that shows operations performed by two transactions,
Transaction- A and Transaction- B with respect to time.
Transaction-A    Time    Transaction-B
----             t0      ----
Read X           t1      ----
----             t2      Read X
Update X         t3      ----
----             t4      Update X
----             t5      ----
A dirty read problem occurs when one transaction updates a database item and then the
transaction fails for some reason. The updated database item is accessed by another transaction
before it is changed back to the original value. In other words, a transaction T1 updates a record,
which is read by the transaction T2.
Then T1 aborts and T2 now has values which have never formed part of the stable database.
Example:
Transaction-A    Time    Transaction-B
----             t0      ----
----             t1      Update X
Read X           t2      ----
----             t3      Rollback
----             t4      ----
Unrepeatable read (or inconsistent retrievals) occurs when a transaction calculates some
summary (aggregate) function over a set of data while other transactions are updating the data.
The problem is that the transaction might read some data before they are changed and other data
after they are changed, thereby yielding inconsistent results.
In an unrepeatable read, the transaction T1 reads a record and then does some other processing
during which the transaction T2 updates the record. Now, if T1 rereads the record, the new value
will be inconsistent with the previous value.
Example:
Consider the situation given in the figure, which shows two transactions operating on three accounts: one transaction reads the account balances one by one to compute a summary while the other updates the accounts and commits at t7; when the first transaction then reads the balance of Acc-3, the value is inconsistent with its earlier reads.
Deadlock
System is deadlocked if there is a set of transactions such that every transaction in the set is
waiting for another transaction in the set.
Avoid deadlock:
Approach1
– Require that each transaction locks all its data items before it begins execution: either all are locked in one step or none are locked.
– Disadvantages
• Hard to predict, before transaction begins, what data item need to be
locked.
• Data item utilization may be very low.
Approach 2
– Assign a unique timestamp to each transaction.
– These timestamps are used only to decide whether a transaction should wait or roll back.
Schemes:
1. Wait-die scheme
2. Wound-wait scheme
Wait-die scheme
– Non-preemptive technique.
– When transaction Ti requests a data item currently held by Tj, Ti is allowed to wait only if it has a timestamp smaller than that of Tj; otherwise, Ti is rolled back (dies).
– An older transaction may wait for a younger one to release a data item. Younger transactions never wait for older ones; they are rolled back instead.
– A transaction may die several times before acquiring a needed data item.
Example.
• Transactions T1, T2, T3 have timestamps 5, 10, 15, respectively.
• If T1 requests a data item held by T2, then T1 will wait.
• If T3 requests a data item held by T2, then T3 will be rolled back.
Wound-wait scheme
- Preemptive technique.
- When transaction Ti requests a data item currently held by Tj, Ti is allowed to wait only if it has a timestamp larger than that of Tj; otherwise, Tj is rolled back.
- The older transaction wounds (forces rollback of) the younger transaction instead of waiting for it. Younger transactions may wait for older ones.
Example
• Transactions T1, T2, T3 have timestamps 5, 10, 15, respectively.
• If T1 requests a data item held by T2, then the data item will be preempted from T2, and T2 will be rolled back.
• If T3 requests a data item held by T2, then T3 will wait.
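Both rules reduce to a single timestamp comparison. A minimal sketch (Python; assuming a smaller timestamp means an older transaction, as in the examples above):

def wait_die(ts_requester, ts_holder):
    # older requester waits; younger requester dies (is rolled back)
    return "wait" if ts_requester < ts_holder else "rollback requester"

def wound_wait(ts_requester, ts_holder):
    # older requester wounds (rolls back) the holder; younger requester waits
    return "rollback holder" if ts_requester < ts_holder else "wait"

# T1, T2, T3 have timestamps 5, 10, 15, as in the examples above.
print(wait_die(5, 10))     # T1 requests an item held by T2 -> wait
print(wait_die(15, 10))    # T3 requests an item held by T2 -> rollback requester
print(wound_wait(5, 10))   # T1 requests an item held by T2 -> rollback holder
print(wound_wait(15, 10))  # T3 requests an item held by T2 -> wait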
The common solution is to roll back one or more transactions to break the deadlock.
Three actions need to be taken:
a. Selection of victim
b. Rollback
c. Starvation
Selection of victim
i. Given a set of deadlocked transactions, we must determine which transaction to roll back to break the deadlock.
ii. Consider the factor of minimum cost.
Rollback
- Once we have decided that a particular transaction must be rolled back, we must determine how far this transaction should be rolled back.
- Total rollback
- Partial rollback
Starvation
Ensure that a transaction can be picked as victim only a finite number of times.
Lock Granularity:
A database is basically represented as a collection of named data items. The size of the data item
chosen as the unit of protection by a concurrency control program is called GRANULARITY.
Locking can take place at the following level:
Database level.
Table level.
Page level.
Row (Tuple) level.
Attributes (fields) level.
Crash Recovery
Though we are living in a highly technologically advanced era where hundreds of satellites monitor the earth and at every second billions of people are connected through information technology, failure is expected, but it is not always acceptable.
A DBMS is a highly complex system with hundreds of transactions being executed every second. The availability of a DBMS depends on its complex architecture and underlying hardware and system software. If it fails or crashes while transactions are being executed, it is expected that the system will follow some sort of algorithm or technique to recover from crashes or failures.
Failure Classification
To see where the problem has occurred, we generalize failures into the following categories:
TRANSACTION FAILURE
Logical errors: where a transaction cannot complete because it has a code error or some internal error condition.
System errors: where the database system itself terminates an active transaction because the DBMS is not able to execute it, or has to stop because of some system condition. For example, in the case of deadlock or resource unavailability, the system aborts an active transaction.
SYSTEM CRASH
Problems external to the system may cause the system to stop abruptly and crash: for example, an interruption in the power supply, or a failure of the underlying hardware or software, including operating system errors.
DISK FAILURE:
In the early days of technology evolution, it was a common problem that hard disk drives or storage drives failed frequently.
Disk failures include the formation of bad sectors, the disk becoming unreachable, disk head crashes, and any other failure that destroys all or part of the disk storage.
Storage Structure
We have already described the storage system. In brief, storage structures can be divided into various categories:
Volatile storage: As the name suggests, this storage does not survive system crashes and is mostly placed very close to the CPU, often embedded on the chipset itself; for example, main memory and cache memory. They are fast but can store only a small amount of information.
Nonvolatile storage: These memories are made to survive system crashes. They are huge in data storage capacity but slower in access. Examples include hard disks, magnetic tapes, flash memory and non-volatile (battery backed up) RAM.
When a system crashes, it may have several transactions being executed and various files opened for them to modify data items. As we know, transactions are made up of various operations which are atomic in nature. But according to the ACID properties of a DBMS, the atomicity of a transaction as a whole must be maintained; that is, either all of its operations are executed or none. When a DBMS recovers from a crash:
It should check the states of all transactions that were being executed.
A transaction may be in the middle of some operation; the DBMS must ensure the atomicity of the transaction in this case.
It should check whether the transaction can be completed now or needs to be rolled back.
No transaction should be allowed to leave the DBMS in an inconsistent state.
There are two types of techniques that can help the DBMS recover while maintaining the atomicity of transactions:
Maintaining logs of each transaction and writing them onto stable storage before actually modifying the database.
Maintaining shadow paging, where the changes are made in volatile memory and the actual database is updated later.
Log-Based Recovery
Deferred database modification: All logs are written onto stable storage, and the database is updated when the transaction commits.
CHECKPOINT
Keeping and maintaining logs in real time and in a real environment may fill all the memory space available in the system. As time passes, the log file may become too big to be handled at all. A checkpoint is a mechanism whereby all the previous logs are removed from the system and stored permanently on the storage disk. A checkpoint declares a point before which the DBMS was in a consistent state and all transactions were committed.
RECOVERY
When a system with concurrent transactions crashes and recovers, it behaves in the following manner:
The recovery system reads the logs backwards from the end to the last Checkpoint.
It maintains two lists, undo-list and redo-list.
If the recovery system sees a log with <Tn, Start> and <Tn, Commit> or just <Tn,
Commit>, it puts the transaction in redo-list.
If the recovery system sees a log with <Tn, Start> but no commit or abort log found, it
puts the transaction in undo-list.
All transactions in the undo-list are undone and their logs are removed. For all transactions in the redo-list, their previous logs are removed, the transactions are redone, and the logs are saved.
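A rough sketch of this backward scan (Python; the tuple encoding of log records and the sample log are assumptions):

log = [("T1", "start"), ("T2", "start"), ("T1", "commit"),
       ("checkpoint",), ("T3", "start"), ("T3", "commit"),
       ("T4", "start")]           # assumed sample log

undo, redo = set(), set()
for rec in reversed(log):         # read backwards to the last checkpoint
    if rec == ("checkpoint",):
        break
    txn, what = rec
    if what == "commit":
        redo.add(txn)             # <Tn, Commit> seen: redo Tn
    elif what == "start" and txn not in redo:
        undo.add(txn)             # <Tn, Start> with no commit: undo Tn
print("redo:", redo)   # {'T3'}
print("undo:", undo)   # {'T4'}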
10. Consider the following schedules. The actions are listed in the order they are schedule,
and prefixed with transaction name.
S1: T1: R(X), T2: R(x), T1: W(Y), T2: W(Y), T1: R(Y), T2: R(Y)
S2:T3: R(X), T1: R(X), T1: W(Y), T2: R (Z), T2: W (Z), T3: R (Z)
For each of the schedules, answer the following questions:
i. What is the precedence graph for the schedule?
ii. Is the schedule conflict-serializable? If so, what are all the conflict equivalent
serial schedules?
iii. Is the schedule view-serializable? If so, what are all the view equivalent serial
schedules? (Apr/May 2015)
1. Serializability (view) cannot be decided, but it is NOT conflict-serializable. It is recoverable and avoids cascading aborts; it is NOT strict.
2. It is serializable, conflict-serializable, and view-serializable regardless of which action (commit or abort) follows. It does NOT avoid cascading aborts and is NOT strict; we cannot decide whether it is recoverable or not, since the abort/commit sequence of these two transactions is not specified.
3. It is the same as 2.
4. Serializability (view) cannot be decided, but it is NOT conflict-serializable. It does NOT avoid cascading aborts and is NOT strict; we cannot decide whether it is recoverable or not, since the abort/commit sequence of these transactions is not specified.
5. It is serializable, conflict-serializable, and view-serializable; it is recoverable and avoids cascading aborts; it is NOT strict.
6. It is NOT serializable, NOT view-serializable, and NOT conflict-serializable; it is recoverable and avoids cascading aborts; it is NOT strict.
7. It belongs to all classes.
8. It is serializable, NOT view-serializable, NOT conflict-serializable; it is NOT recoverable, therefore it does NOT avoid cascading aborts and is NOT strict.
9. It is serializable, view-serializable, and conflict-serializable; it is NOT recoverable, therefore it does NOT avoid cascading aborts and is NOT strict.
10. It belongs to all the above classes.
11. It is NOT serializable, NOT view-serializable, and NOT conflict-serializable; it is recoverable, avoids cascading aborts, and is strict.
12. It is NOT serializable, NOT view-serializable, and NOT conflict-serializable; it is recoverable, but does NOT avoid cascading aborts and is NOT strict.
Lock-based Protocols
Database systems equipped with lock-based protocols use a mechanism by which any
transaction cannot read or write data until it acquires an appropriate lock on it. Locks are of
two kinds –
Binary Locks − A lock on a data item can be in two states; it is either locked or
unlocked.
Shared/exclusive − This type of locking mechanism differentiates the locks based on
their uses. If a lock is acquired on a data item to perform a write operation, it is an
exclusive lock. Allowing more than one transaction to write on the same data item
would lead the database into an inconsistent state. Read locks are shared because no data
value is being changed.
There are four types of lock protocols available −
Two-phase locking has two phases, one is growing, where all the locks are being
acquired by the transaction; and the second phase is shrinking, where the locks held by
the transaction are being released.
To claim an exclusive (write) lock, a transaction must first acquire a shared (read) lock
and then upgrade it to an exclusive lock.
Strict Two-Phase Locking
The first phase of Strict-2PL is the same as in 2PL. After acquiring all the locks in the first phase, the transaction continues to execute normally. But in contrast to 2PL, Strict-2PL does not release a lock immediately after using it: it holds all the locks until the commit point and releases them all at once.
Timestamp-based Protocols
The most commonly used concurrency protocol is the timestamp based protocol. This
protocol uses either system time or logical counter as a timestamp.
Lock-based protocols manage the order between the conflicting pairs among transactions
at the time of execution, whereas timestamp-based protocols start working as soon as a
transaction is created.
Every transaction has a timestamp associated with it, and the ordering is determined by
the age of the transaction. A transaction created at 0002 clock time would be older than
all other transactions that come after it. For example, any transaction 'y' entering the
system at 0004 is two seconds younger and the priority would be given to the older one.
In addition, every data item is given the latest read and write-timestamp. This lets the
system know when the last ‘read and write’ operation was performed on the data item.
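These per-item timestamps drive the standard read/write tests. A minimal sketch (Python; the field names and the rule shown are the usual timestamp-ordering checks, stated here as an assumption rather than this document's exact protocol):

class Item:
    def __init__(self):
        self.r_ts = 0   # largest timestamp of a successful read
        self.w_ts = 0   # largest timestamp of a successful write

def read(ts, item):
    if ts < item.w_ts:               # item already written by a younger txn
        return "rollback"
    item.r_ts = max(item.r_ts, ts)
    return "ok"

def write(ts, item):
    if ts < item.r_ts or ts < item.w_ts:
        return "rollback"
    item.w_ts = ts
    return "ok"

q = Item()
print(write(10, q), read(5, q))   # ok rollback: the reader is too old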
12. Explain about Locking Protocols. (May 2016 & Nov’2016) (or)
Briefly describe two phase locking in concurrency control techniques. (Nov’2016)
(OR)
Explain the two phase locking protocol with an example?(May 2018, Nov 2019)
(OR)
Differentiate strict two phase locking and rigorous two phase locking protocol with an
example? (Dec 2018)
Two-Phase Locking (2PL) is a concurrency control method which divides the execution phase of a transaction into three parts. It ensures conflict-serializable schedules. If all lock operations precede the first unlock operation in the transaction, it is said to follow the Two-Phase Locking Protocol.
1. In Growing Phase, a transaction obtains locks, but may not release any lock.
2. In Shrinking Phase, a transaction may release locks, but may not obtain any lock.
The Conservative Two-Phase Locking Protocol is also called the Static Two-Phase Locking Protocol.
It requires locking of all data items to be accessed before the transaction starts. This protocol is almost free from deadlocks, as all required items are listed in advance.
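A toy sketch of the two-phase rule itself (Python; the class is an illustrative assumption, not a real lock manager): once a transaction releases any lock, it may not acquire another.

class TwoPhaseTxn:
    def __init__(self):
        self.locks = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock after first unlock")
        self.locks.add(item)         # growing phase

    def unlock(self, item):
        self.shrinking = True        # shrinking phase begins
        self.locks.discard(item)

t = TwoPhaseTxn()
t.lock("X"); t.lock("Y")             # growing phase
t.unlock("X")                        # first unlock ends the growing phase
try:
    t.lock("Z")
except RuntimeError as e:
    print(e)                         # 2PL violation: lock after first unlock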
UNIT-4
IMPLEMENTATION TECHNIQUES
PART-A
2. Compare sequential access devices versus random access devices with an example
Sequential access devices must be accessed from the beginning (e.g., tape storage). Access to data is much slower than on random access devices, but sequential devices are cheaper than disk.
4. Draw the storage device hierarchy according to their speed and their cost.
Cache , Main memory ,Flash memory, Magnetic disk ,Optical disk ,Magnetic tapes
23. What are the factors to be taken into account when choosing a RAID level?
o Monetary cost of extra disk storage requirements.
o Performance requirements in terms of number of I/O operations.
o Performance when a disk has failed.
o Performance during rebuild.
27. Distinguish between fixed length records and variable length records?
Fixed length records
Every record has the same fields and field lengths are fixed.
Variable length records
File records are of the same type, but one or more of the fields are of varying size.
28. What are the ways in which variable-length records are represented in database systems? (Nov 2018)
Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields.
Record types that allow repeating fields.
29. Explain the use of variable length records.
They are used for storing multiple record types in a file, for storing records that have varying lengths for one or more fields, and for storing records that allow repeating fields.
30. What is the use of a slotted-page structure and what is the information present in
the header?
The slotted-page structure is used for organizing records within a single block. The header contains the following information:
The number of record entries.
The end of free space in the block.
An array whose entries contain the location and size of each record.
31. What are the two types of blocks in the fixed –length representation? Define them.
• Anchor block: Contains the first record of a chain.
• Overflow block: Contains the records other than those that are the first record of a chain.
51. Differentiate between static hashing and dynamic hashing. (Nov’2014 , Dec 2014, May 2015 &
Dec 2015)
In static hashing, when a search-key value is provided, the hash function always computes the same address. In dynamic hashing, the hash function is made to produce a large number of values, and only a few are used initially.
53. What can be done to reduce the occurrences of bucket overflows in a hash file
organization?
To reduce bucket overflow, the number of buckets is chosen to be (n_r / f_r) * (1 + d), where n_r is the number of records, f_r is the number of records per bucket, and d is a fudge factor (typically around 0.2).
We handle bucket overflow by using
• Overflow chaining(closed hashing)
• Open hashing
56. Give an example of a join that is not a simple equi-join for which partitioned
parallelism can be used. (Nov/Dec 2015)
58. List out the mechanisms to avoid collision during hashing. (Dec’2016)
The various mechanisms to avoid collision during hashing are
Open Hashing
- Separate Chaining
B Tree vs. B+ Tree
Data: In a B tree, the leaf nodes store pointers to records rather than the actual records. In a B+ tree, the leaf nodes store the actual records rather than pointers to records.
Space: B trees waste space; B+ trees do not.
Function of leaf nodes: In a B tree, the leaf nodes are not linked together. In a B+ tree, leaf node data are ordered in a sequential linked list.
Search accessibility: In a B tree, searching is not as easy as in a B+ tree. In a B+ tree, searching becomes easy.
62. What is called query processing? What are the steps involved in query processing? (APR 2018, NOV
2017)
Query processing refers to the range of activities involved in extracting data from a database.
64. What is called a query evaluation plan? How do you measure the cost of query evaluation?
A sequence of primitive operations that can be used to evaluate a query is a query evaluation plan or a query execution plan.
The cost of query evaluation is measured in terms of a number of different resources, including disk accesses, CPU time to execute a query, and, in a distributed database system, the cost of communication.
69. Explain nested loop join? What is meant by block nested loop join?
Nested loop join consists of a pair of nested for loops. r is the outer relation and s is the inner
relation.
Block nested loop join is the variant of the nested loop join where every block of the inner relation is paired with every block of the outer relation. Within each pair of blocks, every tuple in one block is paired with every tuple in the other block to generate all pairs of tuples.
75. What cost components are used most often as the basis for cost functions? (MAY 2017)
This is a type of indexing which is based on sorted index values. The various ordered indices are primary indexing and secondary indexing.
77. What is data fragmentation? State the various fragmentations with an example. (Dec 2017)
• Fragmentation
– The system partitions the relation into several fragment and stores each fragment
at different sites
– Two approaches
• Horizontal fragmentation
• Vertical fragmentation
Horizontal fragmentation
Splits the relation by assigning each tuple of r to one or more fragments. Relation r is partitioned into a number of subsets r1, r2, …, rn, and the original relation can be reconstructed by taking the union of all fragments, that is:
r = r1 ∪ r2 ∪ … ∪ rn
• Vertical fragmentation
– Splits the relation by decomposing the scheme R of the relation; the original relation is reconstructed by taking the natural join of all fragments, that is:
r = r1 ⋈ r2 ⋈ … ⋈ rn
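A small sketch of both reconstructions (Python; the sample relation and the fragmentation predicate are assumptions):

r = [{"id": 1, "name": "A", "city": "X"}, {"id": 2, "name": "B", "city": "Y"}]

# horizontal fragmentation: r = r1 U r2
r1 = [t for t in r if t["city"] == "X"]
r2 = [t for t in r if t["city"] != "X"]
print(sorted(r1 + r2, key=lambda t: t["id"]) == r)   # True

# vertical fragmentation: both fragments keep the key 'id';
# a natural join on 'id' reconstructs r
v1 = [{"id": t["id"], "name": t["name"]} for t in r]
v2 = [{"id": t["id"], "city": t["city"]} for t in r]
joined = [{**a, **b} for a in v1 for b in v2 if a["id"] == b["id"]]
print(sorted(joined, key=lambda t: t["id"]) == r)    # True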
In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of
all bucket addresses B.
Hash function is used to locate records for access, insertion as well as deletion.
Example:
-There are 10 buckets,
-The hash function returns the sum of the binary representations of the characters
modulo 10
– E.g. h(Perryridge) = 5 h(Round Hill) = 3 h(Brighton) = 3
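A sketch of this lookup (Python): summing character codes modulo 10. Note that the bucket numbers depend on the character encoding used, so the values printed need not match the h() values quoted above.

def h(key, buckets=10):
    # sum of the character codes of the key, modulo the bucket count
    return sum(ord(ch) for ch in key) % buckets

for name in ("Perryridge", "Round Hill", "Brighton"):
    print(name, "-> bucket", h(name))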
The structure of the leaf nodes of a B+ tree of order b is as follows: each leaf node is of the form <<K1, D1>, <K2, D2>, …, <K(c-1), D(c-1)>, P_next>. Every leaf node has K1 < K2 < … < K(c-1), with c <= b. Each leaf node holds at least ceil(b/2) values. All leaf nodes are at the same level.
1.What is meant by semantic query optimization? How does it differ from other query optimization
technique? Give example? (MAY 2017)
Semantic query optimization is the process of determining the set of semantic transformations that
results in a semantically equivalent query with a lower execution cost.
Two queries are semantically equivalent if they return the same answer for any database state
satisfying a given set of integrity constraints.
A semantic transformation transforms a given query into a semantically equivalent one.
ODB-QOptimizer determines more specialized classes to be accessed and reduces the number of
factors by applying the Integrity Constraint Rules.
Semantic query optimization is the process of transforming a query issued by a user into a different query
which, because of the semantics of the application, is guaranteed to yield the correct answer for all states of
the database. While this process has been successfully applied in centralised databases, its potential for
distributed and heterogeneous systems is enormous, as there is the potential to eliminate inter-site joins
which are the single biggest cost factor in query processing. Further justification for its use is provided by
the fact that users of heterogeneous databases typically issue queries through high-level languages which
may result in very inefficient queries if mapped directly, without consideration of the semantics of the
system. Even if this is not the case, users cannot be expected to be familiar with the semantics of the
component databases, and may consequently issue queries which are unnecessarily complicated.
A different approach to query optimization, called semantic query optimization, has been suggested. This
technique, which may be used in combination with the techniques discussed previously, uses constraints
specified on the database schema—such as unique attributes and other more complex constraints—in order
to modify one query into another query that is more efficient to execute. We will not discuss this approach
in detail, but we will illustrate it with a simple example. Consider an SQL query that retrieves the names of employees who earn more than their supervisors. Suppose that we had a
constraint on the database schema that stated that no employee can earn more than his or her direct
supervisor. If the semantic query optimizer checks for the existence of this constraint, it does not need to
execute the query at all because it knows that the result of the query will be empty. This may save
considerable time if the constraint checking can be done efficiently. However, searching through many
constraints to find those that are applicable to a given query and that may semantically optimize it can also
be quite time-consuming. With the inclusion of active rules and additional metadata in database systems,
semantic query optimization techniques are being gradually incorporated into the DBMSs.
The term "semantic query optimization" (SQO) denotes a methodology whereby queries
against databases are optimized using semantic information about the database objects being queried. The
result of semantically optimizing a query is another query which is syntactically different to the original,
but semantically equivalent and which may be answered more efficiently than the original. SQO is
distinctly different from the work performed by the conventional SQL optimizer. The SQL optimizer
generates a set of logically equivalent alternative execution paths based ultimately on the rules of relational
algebra. However, only a small proportion of the readily available semantic information is utilised by
current SQL optimizers. Researchers in SQO agree that SQO can be very effective.
However, after some twenty years of research into SQO, there is still no commercial implementation. In
this thesis we argue that we need to quantify the conditions for which SQO is worthwhile. We investigate
what these conditions are and apply this knowledge to relational database management systems (RDBMS)
with static schemas and infrequently updated data. Any semantic query optimizer requires the ability to
reason using the semantic information available, in order to draw conclusions which ultimately facilitate the
recasting of the original query into a form which can be answered more efficiently. This reasoning engine is
currently not part of any commercial RDBMS implementation.
2.Briefly explain about Query Processing? (Apr 2016, Nov 2019) (or)
The query optimizer module has the task of producing a good execution plan, and the code
generator generates the code to execute that plan. The runtime database processor has the task of running
(executing) the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime error
results, an error message is generated by the runtime database processor.
F_MAX Salary(σ Dno=5(EMPLOYEE))
and the outer block into the expression:
π Lname,Fname(σ Salary>c(EMPLOYEE))
The query optimizer would then choose an execution plan for each query block.
3.Discuss about the Join order optimization and Heuristic optimization algorithms? (APR 2015)
The join order is fixed if any join logical files are referenced. The join order is also fixed if the
OPNQRYF JORDER(*FILE) parameter is specified or the query options file (QAQQINI)
FORCE_JOIN_ORDER parameter is *YES.
Otherwise, the following join ordering algorithm is used to determine the order of the tables:
1. Determine an access method for each individual table as candidates for the primary dial.
2. Estimate the number of rows returned for each table based on local row selection.
If the join query with row ordering or group by processing is being processed in one
step, then the table with the ordering or grouping columns is the primary table.
3. Determine an access method, cost, and expected number of rows returned for each join
combination of candidate tables as primary and first secondary tables.
The join order combinations estimated for a four table inner join would be:
1-2 2-1 1-3 3-1 1-4 4-1 2-3 3-2 2-4 4-2 3-4 4-3.
4. Choose the combination with the lowest join cost and number of selected rows or both.
5. Determine the cost, access method, and expected number of rows for each remaining table
joined to the previous secondary table.
6. Select an access method for each table that has the lowest cost for that table.
7. Choose the secondary table with the lowest join cost and number of selected rows or both.
8. Repeat steps 4 through 7 until the lowest cost join order is determined.
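A toy sketch of steps 3 and 4 (Python; the row estimates and the cost model are invented placeholders, not the optimizer's actual formulas):

from itertools import permutations

est_rows = {"T1": 100, "T2": 10, "T3": 1000, "T4": 50}   # assumed estimates

def join_cost(primary, secondary):
    return est_rows[primary] * est_rows[secondary]       # toy cost model

pairs = list(permutations(est_rows, 2))
print(len(pairs))                                # 12 combinations: 1-2, 2-1, ...
print(min(pairs, key=lambda p: join_cost(*p)))   # ('T2', 'T4') under this model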
Heuristic Optimization
4.Give a detailed description about Query Processing and optimization. Explain the cost estimation of
Query Optimization? (Nov 2014) (or)
Explain the catalog information for cost estimation for selection and sorting operation in database?
(NOV 2017) (or)
How does a DBMS represent a relational query evaluation plan? (NOV 2018)
• A given relational algebra expression may have many equivalent expressions.
• Many possible ways to estimate cost, for instance disk accesses, CPU time, or even
Communication overhead in a distributed or parallel system.
Typically disk access is the predominant cost, and is also relatively easy to estimate. Therefore
number of block transfers from disk is used as a measure of the actual cost of evaluation. It is
assumed that all transfers of blocks have the same cost.
• Costs of algorithms depend on the size of the buffer in main memory, as having more memory
reduces need for disk access. Thus memory size should be a parameter while estimating cost;
often use worst case estimates.
• We refer to the cost estimate of algorithm A as E_A. We do not include the cost of writing output to disk.
Selection Operation
File scan – search algorithms that locate and retrieve records that fulfill a selection condition.
Algorithm A1 (linear search):
Scan each file block and test all records to see whether they satisfy the selection condition.
– Cost estimate (number of disk blocks scanned): E_A1 = b_r, where b_r is the number of blocks in the file.
– If the selection is on a key attribute, E_A1 = (b_r / 2) on average (stop on finding the record).
– Linear search can be applied regardless of
1. the selection condition, or
2. the ordering of records in the file, or
3. the availability of indices.
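A worked instance of these estimates (Python; the block count is an assumed figure):

b_r = 400                  # assumed number of blocks in the file
cost_linear = b_r          # E_A1: scan every block
cost_key = b_r / 2         # expected cost when selecting on a key attribute
print(cost_linear, cost_key)   # 400 200.0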
Since indices speed query processing, why might they not be kept on several search keys? List as many reasons as possible. (NOV 2018)
Every index occupies additional disk space, and each insert, delete, or update of a record forces every index on the relation to be updated as well, so keeping indices on many search keys slows down modifications.
• Disadvantage of the pointer structure: space is wasted in all records except the first in a chain.
• The solution is to allow two kinds of blocks in the file:
– Anchor block – contains the first records of chains.
– Overflow block – contains records other than those that are the first records of chains.
6.Define RAID and Briefly Explain RAID techniques. (Nov’2014, May 2015, Nov’2015,
May 2016 & Nov’2016)
(OR)
What is RAID? Briefly discuss about RAID? (May 2019, Nov 2019)
• RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among
all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
• RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant
information to guard against multiple disk failures.
– Better reliability than Level 5 at a higher cost; not used as widely.
Read-write head
– Positioned very close to the platter surface.
– Reads or writes magnetically.
– The surface of the platter is divided into circular tracks.
– Each track is divided into sectors.
– A sector is the smallest unit of data that can be read or written.
Head-disk assemblies
– Multiple disk platters on a single spindle (typically 2 to 4).
– One head per platter, mounted on a common arm.
– Cylinder i consists of the ith track of all the platters.
Disk controller – interfaces between the computer system and the disk drive hardware.
– accepts high-level commands to read or write a sector
– initiates actions such as moving the disk arm to the right track and actually
reading or writing the data
– Ensures successful writing by reading back sector after writing it.
• Optical storage
– non-volatile, data is read optically from a spinning disk using a laser
– CD-ROM and DVD most popular forms
– Write-once, read-many (WORM) optical disks are available (CD-R and DVD-R)
– Multiple write versions also available (CD-RW, DVD-RW)
– Reads and writes are slower than with magnetic disk
• Tape storage
– non-volatile; used mainly for backup, for storage of infrequently used information, and as an off-line medium for transferring information from one system to another.
• Hold large volumes of data and provide high transfer rates
– sequential-access – much slower than disk
• Very slow access time in comparison to magnetic disks and optical disks
– very high capacity (300 GB tapes available)
– storage costs much cheaper than disk, but drives are expensive
MULTIDIMENSIONAL DATABASES:
PARALLEL DATABASES:
In an ordered index, index entries are stored sorted on the search key value.
Primary index: in a sequentially ordered file, the index whose search key specifies the
sequential order of the file.
Secondary index: an index whose search key specifies an order different from the
sequential order of the file.
Dense index — Index record appears for every search-key value in the file.
Sparse Index: contains index records for only some search-key values.
– Applicable when records are sequentially ordered on search-key
To locate a record with search-key value K we:
– Find index record with largest search-key value < K
– Search file sequentially starting at the record to which the index record points
Multilevel index
– If primary index does not fit in memory, access becomes expensive.
– To reduce number of disk accesses to index records, treat primary index kept on
disk as a sequential file and construct a sparse index on it.
– outer index – a sparse index of primary index
– inner index – the primary index file
– If even outer index is too large to fit in main memory, yet another level of index
can be created, and so on.
Secondary Indices
- An index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
- Secondary indices have to be dense.
Hashing:
Static hashing:
A bucket is a unit of storage containing one or more records.
In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function.
Hash function h is a function from the set of all search-key values K to the set of
all bucket addresses B.
Hash function is used to locate records for access, insertion as well as deletion.
Example:
-There are 10 buckets,
-The hash function returns the sum of the binary representations of the characters
modulo 10
– E.g. h(Perryridge) = 5 h(Round Hill) = 3 h(Brighton) = 3
Dynamic hashing:
• Good for a database that grows and shrinks in size
• Allows the hash function to be modified dynamically
• Extendable hashing – one form of dynamic hashing
– The hash function generates values over a large range — typically b-bit integers, with b = 32.
– At any time, use only a prefix of the hash function to index into a table of bucket addresses.
– Let the length of the prefix be i bits, 0 ≤ i ≤ 32.
– Bucket address table size = 2^i. Initially i = 0.
– The value of i grows and shrinks as the size of the database grows and shrinks.
– Multiple entries in the bucket address table may point to the same bucket.
– Thus, the actual number of buckets is < 2^i.
– The number of buckets also changes dynamically due to coalescing and splitting of buckets.
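A sketch of the prefix-indexing step (Python; crc32 stands in for the 32-bit hash function, which is an assumption):

import zlib

def bucket_index(key, i):
    h = zlib.crc32(key.encode()) & 0xFFFFFFFF   # a 32-bit hash value
    return h >> (32 - i) if i > 0 else 0        # use the leading i bits

# With i = 2 the bucket address table has 2**2 = 4 entries.
for k in ("Perryridge", "Round Hill", "Brighton"):
    print(k, "-> table entry", bucket_index(k, 2))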
10.Explain about B+ trees indexing concepts with an example (Nov’2014 & May 2016)
(OR)
Explain the B+ tree indexes on multiple keys with a suitable example? (Dec 2017) (OR)
Describe the structure of B+ tree and give the algorithm for search in the B+ tree with example? (Apr
2019)
A B+-tree is a rooted tree satisfying the following properties:
A B-tree is a tree data structure that keeps data sorted and allows searches, insertions, and
deletions in logarithmic amortized time. Unlike self-balancing binary search trees, it is
optimized for systems that read and write large blocks of data. It is most commonly used
in database and file systems.
• Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates
redundant storage of search keys.
• Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer
field for each search key in a nonleaf node must be included.
• If the root is not a leaf node, it must have at least two children.
• For a tree of order n, each node except the root and leaf nodes must have between ⌈n/2⌉ and n pointers.
• The number of key values contained in a non-leaf node is 1 less than the number of pointers.
• The tree must always be balanced, i.e., every path from the root node to a leaf node must have the same length.
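As a rough illustration of the search algorithm asked for above, the following sketch (Python; the dict-based node layout is an assumption, not the textbook structure) descends from the root to a leaf and scans the leaf for the key:

from bisect import bisect_right

def bptree_search(node, key):
    while "children" in node:                      # internal node: descend
        node = node["children"][bisect_right(node["keys"], key)]
    for k, d in zip(node["keys"], node["data"]):   # leaf node: scan keys
        if k == key:
            return d
    return None

leaf1 = {"keys": [5, 10], "data": ["a", "b"], "next": None}
leaf2 = {"keys": [20, 30], "data": ["c", "d"], "next": None}
leaf1["next"] = leaf2                              # sequential leaf chain
root = {"keys": [20], "children": [leaf1, leaf2]}
print(bptree_search(root, 30))   # d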
A data dictionary contains metadata, i.e., data about the database. The data dictionary is very important as it contains information such as what is in the database, who is allowed to access it, and where the database is physically stored. The users of the database normally don't interact with the data dictionary; it is handled only by the database administrators.
The data dictionary in general contains information about the following:
Names of all the database tables and their schemas.
Details about all the tables in the database, such as their owners, their security constraints, when
they were created etc.
Physical information about the tables such as where they are stored and how.
Table constraints such as primary key attributes, foreign key information etc.
Information about the database views that are visible.
Computation of Joins :
When we want to join two tables, say P and Q, each tuple in P has to be compared with each tuple in Q to
test if the join condition is satisfied. If the condition is satisfied, the corresponding tuples are concatenated,
eliminating duplicate fields and appended to the result relation. Consequently, this is the most expensive
operation.
The common approaches for computing joins are −
Nested-loop Approach
This is the conventional join approach. It can be illustrated through the following pseudocode (Tables P
and Q, with tuples tuple_p and tuple_q and joining attribute a) −
For each tuple_p in P
For each tuple_q in Q
If tuple_p.a = tuple_q.a Then
Concatenate tuple_p and tuple_q and append to Result
End If
Next tuple_q
Next tuple-p
Basically, for each block of the outer table (r), scan the entire inner table (s).
Requires quadratic time, O(n^2).
Improved when a buffer is used.
Example of Nested Loop Join:
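As a runnable stand-in for the example (Python; the sample tuples and block size are assumptions), the sketch below pairs every block of the outer table r with every block of the inner table s:

def blocks(table, size):
    return [table[i:i + size] for i in range(0, len(table), size)]

def block_nested_loop_join(r, s, block_size=2):
    result = []
    for rb in blocks(r, block_size):          # each block of the outer table
        for sb in blocks(s, block_size):      # scan the inner table by block
            for tr in rb:
                for ts in sb:
                    if tr["a"] == ts["a"]:
                        result.append({**tr, **ts})
    return result

r = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}, {"a": 3, "x": "r3"}]
s = [{"a": 2, "y": "s1"}, {"a": 3, "y": "s2"}]
print(block_nested_loop_join(r, s))   # matches on a = 2 and a = 3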
Sort-merge Approach
In this approach, the two tables are individually sorted on the joining attribute, and then the sorted tables are merged. External sorting techniques are adopted since the number of records is very high and cannot be accommodated in memory. Once the individual tables are sorted, one page of each sorted table is brought into memory, the pages are merged based upon the joining attribute, and the joined tuples are written out.
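A sketch of the merge step only (Python; both inputs are assumed to fit in memory, whereas real systems sort externally as noted above):

def sort_merge_join(p, q):
    p = sorted(p, key=lambda t: t["a"])
    q = sorted(q, key=lambda t: t["a"])
    i = j = 0
    out = []
    while i < len(p) and j < len(q):
        if p[i]["a"] < q[j]["a"]:
            i += 1
        elif p[i]["a"] > q[j]["a"]:
            j += 1
        else:                      # equal join keys: emit matching pairs
            k = j
            while k < len(q) and q[k]["a"] == p[i]["a"]:
                out.append({**p[i], **q[k]})
                k += 1
            i += 1
    return out

print(sort_merge_join([{"a": 1}, {"a": 2}], [{"a": 2, "b": 9}]))  # [{'a': 2, 'b': 9}]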
Hash-join Approach
This approach comprises two phases: the partitioning phase and the probing phase. In the partitioning phase, the tables P and Q are broken into two sets of disjoint partitions using a common hash function, which assigns tuples to partitions. In the probing phase, tuples in a partition of P are compared with the tuples of the corresponding partition of Q. If they match, they are written out.
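A sketch of the two phases (Python; the partition count and sample tuples are assumptions):

def hash_join(p, q, n_parts=4):
    parts_p = [[] for _ in range(n_parts)]
    parts_q = [[] for _ in range(n_parts)]
    for t in p:                               # partitioning phase
        parts_p[hash(t["a"]) % n_parts].append(t)
    for t in q:
        parts_q[hash(t["a"]) % n_parts].append(t)
    out = []
    for pp, qq in zip(parts_p, parts_q):      # probing phase: compare
        for tp in pp:                         # matching partitions only
            for tq in qq:
                if tp["a"] == tq["a"]:
                    out.append({**tp, **tq})
    return out

print(hash_join([{"a": 1}, {"a": 2}], [{"a": 2, "b": "q"}]))  # [{'a': 2, 'b': 'q'}]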
UNIT- 5
ADVANCED TOPICS
PART-A
A distributed database management system consists of loosely coupled sites (computers) that share no physical components, and each site is associated with a database system.
2. What are various fragmentations? State Various fragmentations with example?(Dec 2017)
There are two types of fragmentation. They are Horizontal Fragmentation and Vertical
Fragmentation.
Horizontal fragmentation:
Splits the relation by assigning each tuple of r to one or more fragments. Relation r is partitioned into a number of subsets r1, r2, …, rn, and the original relation can be reconstructed by taking the union of all fragments, that is:
r = r1 ∪ r2 ∪ … ∪ rn
Vertical fragmentation:
– Splits the relation by decomposing the scheme R of the relation; the original relation is reconstructed by taking the natural join of all fragments, that is:
r = r1 ⋈ r2 ⋈ … ⋈ rn
Clients want all-or-nothing transactions: a transfer either happens completely or does not happen at all.
4. How does the concept of an object in the object oriented model differ from the concept of an
entity in an entity relationship model? (Dec 2016)
The XML Schema are used to represent the structure of XML document.
The goal or purpose of XML Schema is to define the building blocks of an XML document.
These can be used as an alternative to XML DTD.
6. Define XQuery.
XQuery is a standard language for querying XML data; it retrieves and transforms data from XML documents.
10. What are two approaches to store a relation in the distributed databases? (May 2004)
Replication: System maintains multiple copies of data, stored in different sites, for fast retrieval
and fault tolerance.
Fragmentation: Relation is partitioned into several fragments stored in distinct sites.
19.What is substitutability?
Any method of a class, say A, can equally well be invoked with any object belonging to any subclass B of A. This characteristic leads to code reuse, since the messages, methods, and functions do not have to be written again for objects of class B.
21. What is DAG?
The class-subclass relationship is represented by a directed acyclic graph (DAG). E.g., employees can be temporary or permanent; we may create subclasses temporary and permanent of the class employee.
There is potential ambiguity if the same variable or method can be inherited from more than one superclass. E.g., a student class may have a variable dept identifying a student's department, and the teacher class may correspondingly have a variable dept identifying a teacher's department.
25. What is a value?
A data value is used for identity. This form of identity is used in relational systems. E.g., the primary key value of a tuple identifies the tuple.
26.What is a Name?
A user-supplied name is used for identity. This form of identity is used for files in file
systems. The user gives each file a name that uniquely identifies it, regardless of its contents.
27. What is a Built-in?
A notion of identity is built into the data model or programming language, and no user-supplied identifier is required. This form of identity is used in object-oriented systems.
44.What is threat?
A threat is any situation, event or person that will adversely affect the database security and the smooth and efficient functioning of the organization.
A threat may be caused by a situation or event involving a person, action or circumstance that is likely to bring harm to the organization.
48. List some security violations (or) name any forms of malicious access.
Unauthorized Reading of data
Unauthorized modification of data
Unauthorized destruction of data.
49.List the types of authorization.
Read authorization
Write authorization
Update authorization
Drop authorization
authentication techniques:
Challenge response scheme
Digital Signatures
Non repudiation
60. Write about the four types (Star, Snowflake, Galaxy and Fact Constellation) of Data
warehouse schemas. (May’2015)
1. Star schema: The star schema architecture is the simplest data warehouse schema. It
is called a star schema because the diagram resembles a star, with points radiating from a
center. The center of the star consists of a fact table, and the points of the star are the
dimension tables.
2. Fact constellation: a schema in which a collection of multiple fact tables share
dimension tables; it can be viewed as a collection of stars and is therefore an
improvement over the star schema.
3. Snowflake schema: In computing, a snowflake schema is a logical arrangement of
tables in a multidimensional database such that the entity relationship diagram resembles
a snowflake shape. The snowflake schema is represented by centralized fact tables which
are connected to multiple normalized dimension tables.
4. Galaxy Schema : It is the combination of both star schema and snowflake schema.
63. Can we have more than one constructor in a class? If yes, explain the need for such a
situation. (Nov/Dec 2015)
Yes. A class can have more than one constructor (constructor overloading), e.g., a default
constructor and a parameterized constructor, so that objects can be initialized in different
ways depending on the arguments supplied.
68. How does the concept of an object in object oriented model differ from ER model in
differ from the concept of an entity relationship model. (Nov’2016)
The object-oriented paradigm is based on encapsulation of data and code related to an
object into a single unit, whose contents are not visible to the outside world. The main terms
are i) class ii) object iii) association and iv) attributes.
ERD stands for Entity Relationship Diagram. It works as an important component of a
conceptual data model. ERD is often used to graphically represent the logical structure of a
database. The main terms are i) entity ii) instance of an entity iii) relationship and iv)
attributes.
Database management systems must support atomic transactions. Object-oriented databases are
geared toward engineering and design applications, and nested transactions provide more direct
support for project development in these applications.
70. How are spatial databases more helpful than active databases? (Nov 2018)
A spatial database is a database that is optimized for storing and querying data that represents
objects defined in a geometric space. Most spatial databases allow the representation of simple
geometric objects. Such databases can be useful for websites that wish to identify the locations of
their visitors for customization.
The spatial database has proper ways to manage and store spatial objects and spatial relationships.
Therefore, accessing a spatial database to process a spatial query is more efficient.
72. Compare sequential access devices versus random access devices with an example? (May
2019)
Sequential Access to a data file means that the computer system reads or writes information to the file
sequentially, starting from the beginning of the file and proceeding step by step. Random Access to a file
means that the computer system can read or write information anywhere in the data file. This type of
operation is also called “Direct Access”.
A sequential drive stores files and data in a specific order, while a random-access drive can place
them anywhere.
Example:
The old fashioned tape drive is a sequential drive. Although tape drives are no longer used in modern PCs,
some companies still use them to create durable backup archives.
Meanwhile, a modern disk drive can be programmed to store data either sequentially or through random
access. CDs and DVDs can also use both methods. A music CD, for example, is sequential, but one with a
database on it might use random access in order to fit more data onto the disk.
73. List information types of documents necessary for relevance ranking of documents in IR? (Nov
2019)
Relevance ranking of documents in IR relies on information such as term frequency (tf), document
frequency (df), collection frequency (cf) and inverse document frequency (idf) of the query terms,
together with relevance feedback gathered from users.
The allocation schema is a description of where the data (fragments) are to be located, taking account
of any replication.
PART –B
1.Briefly explain about Two phase commit and three phase commit protocols. (Nov’ 2014,
May 2015 & May 2016)
(OR)
Explain two phase commit protocol with an example?(Nov 2017)
Two-phase commit (2PC) ensures atomicity of a transaction executing at multiple sites: in phase 1
(the prepare phase), the coordinator asks every participating site to vote on commit or abort; in
phase 2 (the decision phase), the coordinator broadcasts commit only if all sites voted to commit,
and abort otherwise. Three-phase commit (3PC) adds an intermediate pre-commit phase, which makes
the protocol non-blocking under site failures.
Possible Failures
Site Failure
Coordinator Failure
Network Partition
Drawbacks:
higher overheads
Assumptions may not be satisfied in practice.
• Consider a relation r; there are two approaches to storing this relation in the distributed DB.
– Replication
– Fragmentation
• Replication
– The system maintains several identical replicas (copies) of the relation at different
sites.
– Full replication: a copy is stored in every site in the system.
• Advantages and disadvantages
– Availability
– Increased parallelism
– Increased overhead on update
• Fragmentation
– The system partitions the relation into several fragments and stores each fragment
at a different site.
– Two approaches
• Horizontal fragmentation
• Vertical fragmentation
Horizontal fragmentation
Splits the relation by assigning each tuple of r to one or more fragments.
Relation r is partitioned into a number of subsets r1, r2, ..., rn, and the original relation
can be reconstructed by taking the union of all fragments, that is
r = r1 U r2 U ... U rn
• Vertical fragmentation
– Splits the relation by decomposing the schema R of the relation; the original
relation is reconstructed by taking the natural join of all fragments, that is
r = r1 ⋈ r2 ⋈ ... ⋈ rn
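As a sketch in SQL (the account(account_number, branch_name, balance) relation is the hypothetical
bank schema used elsewhere in this unit):

-- Horizontal fragmentation: each fragment is a selection on r
CREATE TABLE account_downtown AS
    SELECT * FROM account WHERE branch_name = 'Downtown';
CREATE TABLE account_perryridge AS
    SELECT * FROM account WHERE branch_name = 'Perryridge';
-- Reconstruction: r = r1 U r2
SELECT * FROM account_downtown
UNION
SELECT * FROM account_perryridge;

-- Vertical fragmentation: each fragment is a projection that keeps the key
CREATE TABLE account_branch AS
    SELECT account_number, branch_name FROM account;
CREATE TABLE account_balance AS
    SELECT account_number, balance FROM account;
-- Reconstruction: natural join of the fragments on the shared key
SELECT * FROM account_branch NATURAL JOIN account_balance;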
3. Write short notes on Spatial Databases. (Nov’2014, Nov’2015 & Nov’2016)
Spatial databases:
Meteorological databases:
For weather information; these are three dimensional, since temperatures and other
meteorological information are related to three-dimensional spatial points.
Spatial characteristics:
Two-dimensional geometric concepts such as points, lines, line segments, circles,
polygons and arcs are used to represent the spatial characteristics of objects.
Spatial operations:
Spatial operations are needed to operate on objects.
E.g., to compute the distance between two objects.
Some objects have dynamic spatial characteristics that change over time, e.g., police vehicles,
ambulances or fire trucks.
Categories of spatial queries:
1. Range query
Finds the objects of a particular type that are within a given spatial area or within a particular
distance from a given location.
E.g., find all schools within the Bombay city area.
2. Nearest neighbor query or adjacency:
Finds an object of a particular type that is closest to a given location.
E.g., find the school that is closest to your house.
3.Spatial joins or overlays:
Joins or overlays two types of objects based on some spatial condition, such as the objects
intersecting or overlapping spatially or being within a certain distance of one another.
E.g., find all cities that fall on a major highway, or find all homes that are within two miles of a
lake.
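As a sketch, the first two categories can be written in a spatial SQL dialect such as PostGIS (the
schools table and its geography-typed location column are hypothetical):

-- Range query: all schools within 10 km of a given point
SELECT s.name
FROM schools s
WHERE ST_DWithin(s.location, ST_GeogFromText('POINT(72.87 19.07)'), 10000);

-- Nearest-neighbor query: the school closest to the given point
SELECT s.name
FROM schools s
ORDER BY ST_Distance(s.location, ST_GeogFromText('POINT(72.87 19.07)'))
LIMIT 1;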
Spatial storage structures:
1. R-trees
2. Quad trees
1.R-tree:
R-trees can easily answer queries, such as find all objects in a given area by limiting the
tree search to those subtrees whose rectangles intersect with the area given in the query.
(Diagram: the root node covers the entire space and is split into rectangular subspaces 1 and 2,
each bounding the area of the objects it contains.)
A subspace is a rectangular sub-space containing the needed objects.
2. Quad tree:
A quad tree divides each space or subspace into equally sized quadrants; each subspace is used to
identify the positions of various objects.
4.Compare and contrast object oriented and XML databases? (Dec 2018)
As the name implies, the main feature of object-oriented databases is allowing the definition of
objects, which are different from normal database objects. An object, in an object-oriented database,
is first developed, defined and named; it can then be referenced, or called later, as a unit without
having to go into its complexities. This is very similar to objects used in object-oriented
programming.
A real-life parallel to objects is a car engine. It is composed of several parts: the main cylinder
block, the exhaust system, intake manifold and so on. Each of these is a standalone component; but
when machined and bolted into one object, they are now collectively referred to as an engine.
Similarly, when programming one can define several components, such as a vertical line
intersecting a perpendicular horizontal line while both lines have a graded measurement. This
object can then be collectively labeled a graph. When utilizing the ability to plot components, there
is no need to first define a graph; but rather the instance of the created graph can be called.
XML Databases:
An XML database is a database that stores data in XML format. This type of database is suited for
businesses with data in XML format and for situations where XML storage is a practical way to
archive data, metadata and other digital resources.
This data can be queried, transformed, exported and returned to a calling system. XML databases
are a flavor of document-oriented databases which are in turn a category of NoSQL database.
There are a number of reasons to directly specify data in XML or other document formats such
as JSON; for XML in particular, these include its flexible, nested structure and the self-describing
nature of tagged text.
DTD:
The XML Document Type Declaration, commonly known as DTD, is a way to describe XML language
precisely. DTDs check vocabulary and validity of the structure of XML documents against grammatical
rules of appropriate XML language.
An XML DTD can be either specified inside the document, or it can be kept in a separate document and
then linked separately.
Syntax:
Basic syntax of a DTD is as follows −
<!DOCTYPE element DTD identifier
[
declaration1
declaration2
........
]>
In the above syntax,
The DTD starts with the <!DOCTYPE delimiter.
An element tells the parser to parse the document from the specified root element.
DTD identifier is an identifier for the document type definition, which may be the path to a file on
the system or a URL to a file on the internet. If the DTD points to an external path, it is
called an External Subset.
The square brackets [ ] enclose an optional list of entity declarations called the Internal Subset.
Internal DTD:
A DTD is referred to as an internal DTD if the elements are declared within the XML file. To refer to it
as an internal DTD, the standalone attribute in the XML declaration must be set to yes. This means the
declaration works independently of an external source.
Syntax:
Following is the syntax of internal DTD −
<!DOCTYPE root-element [element-declarations]>
where root-element is the name of root element and element-declarations is where you declare the
elements.
Example:
Following is a simple example of internal DTD −
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
<!DOCTYPE address [
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>
<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
Let us go through the above code −
Start Declaration − Begin the XML declaration with the following statement.
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
DTD − Immediately after the XML header, the document type declaration follows, commonly referred to
as the DOCTYPE −
<!DOCTYPE address [
The DOCTYPE declaration has an exclamation mark (!) at the start of the element name. The DOCTYPE
informs the parser that a DTD is associated with this XML document.
DTD Body − The DOCTYPE declaration is followed by body of the DTD, where you declare elements,
attributes, entities, and notations.
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
Several elements are declared here that make up the vocabulary of the <address> document. <!ELEMENT
name (#PCDATA)> defines the element name to be of type "#PCDATA". Here #PCDATA means parsed
character data.
End Declaration − Finally, the declaration section of the DTD is closed using a closing bracket and a
closing angle bracket (]>). This effectively ends the definition, and thereafter, the XML document follows
immediately.
Rules:
The document type declaration must appear at the start of the document (preceded only by the
XML header) − it is not permitted anywhere else within the document.
Similar to the DOCTYPE declaration, the element declarations must start with an exclamation
mark.
The Name in the document type declaration must match the element type of the root element.
External DTD:
In an external DTD, elements are declared outside the XML file. They are accessed by specifying the
system attribute, which may be either a legal .dtd file or a valid URL. To refer to it as an external
DTD, the standalone attribute in the XML declaration must be set to no. This means the declaration
includes information from an external source.
Syntax:
Following is the syntax for external DTD −
<!DOCTYPE root-element SYSTEM "file-name">
where file-name is the file with .dtd extension.
Example:
The following example shows external DTD usage −
<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
<!DOCTYPE address SYSTEM "address.dtd">
<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
The content of the DTD file address.dtd is as shown −
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
Types
You can refer to an external DTD by using either system identifiers or public identifiers.
System Identifiers
A system identifier enables you to specify the location of an external file containing DTD declarations.
Syntax is as follows −
<!DOCTYPE name SYSTEM "address.dtd" [...]>
As you can see, it contains keyword SYSTEM and a URI reference pointing to the location of the
document.
Public Identifiers:
Public identifiers provide a mechanism to locate DTD resources and are written as follows −
<!DOCTYPE name PUBLIC "-//Beginning XML//DTD Address Example//EN">
As you can see, it begins with the keyword PUBLIC, followed by a specialized identifier. Public identifiers
are used to identify an entry in a catalog. Public identifiers can follow any format; however, a commonly
used format is called Formal Public Identifiers, or FPIs.
XML SCHEMA
XML Schema is commonly known as XML Schema Definition (XSD). It is used to describe and validate
the structure and the content of XML data. XML schema defines the elements, attributes and data types.
Schema element supports Namespaces. It is similar to a database schema that describes the data in a
database.
Syntax:
We need to declare a schema in your XML document as follows −
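(A minimal sketch: every XSD is rooted in the standard xs:schema element, bound to the W3C Schema
namespace.)
<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">
   ...
</xs:schema>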
Example:
The following example shows how to use schema −
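A sketch of a schema for the <address> document used in the DTD examples above:
<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">
   <xs:element name = "address">
      <xs:complexType>
         <xs:sequence>
            <xs:element name = "name" type = "xs:string"/>
            <xs:element name = "company" type = "xs:string"/>
            <xs:element name = "phone" type = "xs:string"/>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>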
XML - Elements are the building blocks of XML document. An element can be defined within an XSD as
follows −
<xs:element name = "x" type = "y"/>
Definition Types
You can define XML schema elements in the following ways −
Simple Type:
Simple type element is used only in the context of the text. Some of the predefined simple types are:
xs:integer, xs:boolean, xs:string, xs:date. For example −
<xs:element name = "phone_number" type = "xs:int" />
Complex Type:
A complex type is a container for other element definitions. This allows you to specify which child
elements an element can contain and to provide some structure within your XML documents. For example
−
<xs:element name = "Address">
<xs:complexType>
<xs:sequence>
<xs:element name = "name" type = "xs:string" />
<xs:element name = "company" type = "xs:string" />
<xs:element name = "phone" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>
In the above example, the Address element consists of child elements. It is a container for
other <xs:element> definitions, which allows one to build a simple hierarchy of elements in the XML document.
Global Types:
With the global type, you can define a single type in your document, which can be used by all other
references. For example, suppose you want to generalize the person and company for different addresses
of the company. In such case, you can define a general type as follows −
<xs:element name = "AddressType">
<xs:complexType>
<xs:sequence>
<xs:element name = "name" type = "xs:string" />
<xs:element name = "company" type = "xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
Now let us use this type in our example as follows −
<xs:element name = "Address1">
<xs:complexType>
<xs:sequence>
<xs:element name = "address" type = "AddressType" />
<xs:element name = "phone1" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>
Tag names are repeated throughout the document, which makes an XML representation verbose. However,
in spite of this disadvantage, an XML representation has significant advantages when it is used for data
exchange.
Just as SQL is the dominant language for querying relational data, XML is becoming the dominant format for data
exchange.
<bank>
<account>
<account-number> A-101 </account-number>
<branch-name> Downtown </branch-name>
<balance> 500 </balance>
</account>
<account>
<account-number> A-102 </account-number>
<branch-name> Perryridge </branch-name>
<balance> 400 </balance>
</account>
<account>
<account-number> A-201 </account-number>
<branch-name> Brighton </branch-name>
<balance> 900 </balance>
</account>
<customer>
<customer-name> Johnson </customer-name>
<customer-street> Alma </customer-street>
<customer-city> Palo Alto </customer-city>
</customer>
<customer>
<customer-name> Hayes </customer-name>
<customer-street> Main </customer-street>
<customer-city> Harrison </customer-city>
</customer>
<depositor>
<account-number> A-101 </account-number>
<customer-name> Johnson </customer-name>
</depositor>
<depositor>
<account-number> A-201 </account-number>
<customer-name> Johnson </customer-name>
</depositor>
<depositor>
<account-number> A-102 </account-number>
<customer-name> Hayes </customer-name>
</depositor>
</bank>
7.Explain the necessary characteristics a system must satisfy to be considered as an object
oriented database management system? (May 2018)
Advantages of OODBMS:
Extensibility: OODBMSs allow new data types to be built from existing types. The ability to factor
out common properties of several classes and form them into a superclass that can be shared with
subclasses can greatly reduce redundancy within a system, and is regarded as one of the main
advantages of object orientation.
According to the Object-Oriented Database System Manifesto, an OODBMS must satisfy two criteria:
it should be a DBMS, and it should be an object-oriented system.
The first criterion translates into five features: persistence, secondary storage management,
concurrency, recovery and an ad hoc query facility.
The second one translates into eight features: complex objects, object identity, encapsulation, types or
classes, inheritance, overriding combined with late binding, extensibility and computational completeness.
8.Write short notes on Multimedia databases. (Nov’2015 , Nov 2017 & May 2018)
MULTIMEDIA CONCEPTS:
Multimedia databases allow users to store and query different types of multimedia
information. This includes:
images: pictures, drawings
video clips: movies, newsreels, home videos
audio clips: songs, phone messages, speeches
documents: books, articles
Content based retrieval:
The multimedia source is retrieved based on its containing certain objects or activities.
Identifying the contents of multimedia sources is a difficult and time-consuming task.
There are two type of approaches:
Automatic analysis
Manual identification
Automatic analysis:-
To identify certain mathematical characteristics of their contents. This approach depends
on the type of multimedia source (image, text, video, audio).
Manual identification :
This approach can be applied to all the different multimedia sources, but it requires a
manual preprocessing phase where a person has to scan each multimedia source to
identify and catalog the object and activities it contains so that they can be used to index
these sources.
Characteristics of the types of multimedia sources:
Image:
An image is typically stored either in raw form as a set of pixel or cell values, or in compressed
form to save space. An image is represented by an m by n grid of cells (pixels).
Types:
Black and white image: each pixel is one bit.
Grayscale or color image: each pixel is multiple bits.
Mathematical transforms:
Used to reduce the number of cells stored but still maintain the main image characteristics.
1. Distance function approach:
Uses a distance function to compare the given image with the stored images and their segments.
Indexes can be created to group together stored images that are close in the distance metric so as
to limit the search space.
2. Transformation approach:
Measures image similarity by the small number of transformations needed to transform one
image's cells to match the other image. Transformations include rotation, translation and scaling.
Video sources:
A video source is represented as a sequence of frames, where each frame is a still image.
A video is divided into video segments: sequences of contiguous frames that include the
same objects/activities.
Each video segment is indexed by the objects and activities it contains; an indexing technique called
frame segment trees has been proposed for video indexing.
The index includes both objects (persons, houses, cars) and activities (driving, talking).
Text/document sources:
A text/document source is basically the full text of some article, book or magazine.
These sources are typically indexed by identifying the keywords that appear in the text
and their relative frequencies.
Techniques have been developed to reduce the number of keywords to those that are
relevant to the collection:
1. Singular value decomposition (SVD), which is based on matrix transformation, can be used to
reduce the number of keywords.
2. Telescoping vector trees, or TV-trees, can be used to group similar documents.
Audio sources:
These include stored recorded messages, such as speeches, class presentations, and even
surveillance recordings of phone messages or conversations by law enforcement.
Discrete transforms can be used to identify the main characteristics of a certain person's
voice. Audio characteristics include loudness, intensity, pitch and clarity.
Object-oriented databases allow referential sharing through the support of object identity
and inheritance.
Complex Objects
Complex objects are built by applying constructors to simpler objects, including sets, lists
and tuples.
Encapsulation
Encapsulation is derived from the notion of Abstract Data Type (ADT). It is motivated by
the need to make a clear distinction between the specification and the implementation of
an operation. It reinforces modularity and provides a form of logical data independence.
Class
A class object is an object which acts as a template.
It specifies:
A structure that is the set of attributes of the instances
A set of operations
A set of methods which implement the operations
Instantiation means generating objects, e.g., the 'new' operation in C++.
Persistence of objects: Two approaches
An implicit characteristic of all objects
An orthogonal characteristic: the object is inserted into a persistent collection of objects
Inheritance
A mechanism of reusability, the most powerful concept of OO programming
Association
Association is a link between entities in an application
In an OODB, associations are represented by means of references between objects, covering
binary and ternary associations; a reverse reference allows the link to be traversed in both
directions.
ADVANTAGES OF OODB
An integrated repository of information that is shared by multiple users, multiple
products, and multiple applications on multiple platforms.
It also solves the following problems:
1. The semantic gap: the conceptual model stays very close to the real world.
XML database
An XML database is a data persistence software system that allows data to be stored in
XML format. These data can then be queried, exported and serialized into the desired
format. XML databases are usually associated with document-oriented databases.
Two major classes of XML database exist:
1. XML-enabled: these may either map XML to traditional database structures
(such as a relational database), accepting XML as input and rendering XML as
output, or more recently support native XML types within the traditional
database. This term implies that the database processes the XML itself (as
opposed to relying on middleware).
2. Native XML (NXD): the internal model of such databases depends on XML and
uses XML documents as the fundamental unit of storage, which are, however, not
necessarily stored in the form of text files.
11.Explain about Distributed Databases and their characteristics, functions and
advantages and disadvantages. (May’2016)
Replicated DBMS: A DDBMS that keeps and controls replicated data, such as relations, in
multiple databases.
Each site on the network is able to process local Transactions (i.e. - access data only in that
single site)
Each site may participate in the execution of Global Transactions (i.e. - access data in several
sites) which requires communication among the sites.
Note 1: The above can be thought of: Local Applications & Global Applications
Note 2: This scheme is transparent to users.
Homogeneous DDBMS: This is the case when the application programs are independent of
how the db is distributed; i.e. if the distribution of the physical data can be altered without
having to make alterations to the application programs. Here, all sites use the same DBMS
product - same schemata and same data dictionaries.
Heterogeneous DDBMS: This is the case when the application programs are dependent on
the physical location of the stored data; i.e. application programs must be altered if data is
moved from one site to another. Here, there are different kinds of DBMSs (i.e. Hierarchical,
Network, Relational, Object., etc.), with different underlying data models.
Characteristics of a DDBMS
The key components are:
• Transaction Manager (TM)
• Data Manager (DM)
• Transaction Coordinator (TC)
NOTE: a Distributed Data Processing System is a system where the application programs run
on distributed computers which are linked together by a data transmission network.
Advantages of DDBMSs
More accurately reflects organizational structure
Shareability and Local Autonomy (enforces global and local policies)
Availability and Reliability (failed central db vs failed node)
Performance (process/data migration and speed)
Economics
Modular growth
Integration (with older systems)
Disadvantages of DDBMSs
Complexity (Replication overhead, etc)
Maintenance Costs (of sites)
Security (Network Security)
Integrity Control (More complex)
Lack of Standards
Lack of Experience and Misconceptions
Database Design more complex
Mobile Databases
Recent advances in portable and wireless technology led to mobile computing, a new
dimension in data communication and processing.
Portable computing devices coupled with wireless communications allow clients to access
data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be resolved before the
capabilities of mobile computing can be fully utilized.
Some of the software problems – which may involve data management, transaction
management, and database recovery – have their origins in distributed database systems.
In mobile computing, the problems are more difficult, mainly:
The limited and intermittent connectivity afforded by wireless communications.
The limited life of the power supply(battery).
The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is a distributed architecture where a number of
computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a
high-speed wired network.
Fixed hosts are general purpose computers configured to manage mobile units.
Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired network.
The current generation of wireless technology has data rates range from the tens to hundreds
of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless
Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of
megabits per second.
Other characteristics that distinguish wireless connectivity options include:
interference, locality of access, range, support for packet switching, and seamless roaming
throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency
spectrum, which may cause interference with other appliances, such as cordless telephones.
Modern wireless networks can transfer data in units called packets, as used in wired
networks, in order to conserve bandwidth.
Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is
circumscribed by wireless network coverage.
To manage it, the entire mobility domain is divided into one or more smaller domains, called cells,
each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while maintaining
information access contiguity.
The communication architecture described earlier is designed to give the mobile unit the
impression that it is attached to a fixed network, emulating a traditional client-server
architecture.
Wireless communications, however, make other architectures possible. One alternative is a
mobile ad-hoc network (MANET)
In a MANET, co-located mobile units do not need to communicate via a fixed network, but
instead, form their own using cost-effective technologies such as Bluetooth.
In a MANET, mobile units are responsible for routing their own data, effectively acting as
base stations as well as clients.
Moreover, they must be robust enough to handle changes in the network topology, such as
the arrival or departure of other mobile units.
MANET applications can be considered as peer-to-peer, meaning that a mobile unit is
simultaneously a client and a server.
Transaction processing and data consistency control become more difficult since there is no
central control in this architecture.
Resource discovery and data routing by mobile units make computing in a MANET even
more complicated.
Sample MANET applications are multi-user games, shared whiteboard, distributed calendars,
and battle information sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:
Communication latency
Intermittent connectivity
Limited battery life
Changing client location
The server may not be able to reach a client
A client may be unreachable because it is dozing – in an energy-conserving state in which
many subsystems are shut down – or because it is out of range of a base station.
In either case, neither client nor server can reach the other, and modifications must be made
to the architecture in order to compensate for this case.
Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache updates intended for the
server.
Mobile computing poses challenges for servers as well as clients.
The latency involved in wireless communication makes scalability a problem.
Since latency due to wireless communications increases the time to service each client
request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible.
A server can simply broadcast data periodically.
Broadcast also reduces the load on the server, as clients do not have to maintain active
connections to it.
Client mobility also poses many data management challenges.
Servers must keep track of client locations in order to efficiently route messages to them.
Client data should be stored in the network location that minimizes the traffic necessary to
access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data from one base to another,
without the client noticing.
Client mobility also allows new applications that are location-based.
WEB DATABASES:
A web database is an online database that can be accessed only through a web form-based
interface.
Web Basics
The Web consists of computers on the Internet connected to each other in a specific way
The Web has a client/server architecture
Web browsers
Also called browsers
Programs used to connect client-side computers to the Internet
Web servers
Run special Web server software
Listener
Component included in Web server software
Monitors for messages sent to it from client browsers
Web page
Usually a file with an .htm or .html extension that contains Hypertext Markup
Language (HTML) tags and text
HTML
Document layout language (not a programming language)
Defines structure and appearance of Web pages
Allows Web pages to embed hypertext links to other Web pages
Communication protocols
Agreements between sender and receiver regarding how data are sent and
interpreted
Internet is built on two network protocols:
Transmission Control Protocol (TCP)
Internet Protocol (IP)
Interaction between browser and server is governed by the HTTP protocol
(request/response transactions).
HTTP is stateless.
Web Address
Packets
Data that can be routed independently through Internet
Domain name
Represents an IP address
A domain names server maintains tables with domain names matched to their IP
addresses
Internet Service Providers (ISPs)
Provide commercial Internet access
Hypertext Transfer Protocol
Communication protocol used on the Web
Web address
Also called Uniform Resource Locator (URL)
Internet URLs
Specify a Web server or domain name
Specify communication protocol as first part of URL
File URL
HTML file stored on user’s hard drive
Database-driven Web site Architecture
Client-side processing
Some processing is done on the client workstation, either to form the request for
the dynamic Web page or to create or display the dynamic Web page.
E.g., JavaScript code to validate user input, which often needs to be “executed” by the browser.
14.Explain about Threats and risks in database security.
(OR)
Present an Overview of Database Security. (May 2018)
Database security concerns the use of a broad range of information security controls to
protect databases (potentially including the data, the database applications or stored
functions, the database systems, the database servers and the associated network links)
against compromises of their confidentiality, integrity and availability. It involves various
types or categories of controls, such as technical, procedural/administrative and
physical. Database security is a specialist topic within the broader realms of computer
security, information security and risk management.
Types of Security
Database security is a broad area that addresses many issues, including the following:
■ Various legal and ethical issues regarding the right to access certain information
■ Policy issues at the governmental, institutional, or corporate level as to what kinds of
information should not be made publicly available—for example, credit ratings and personal
medical records.
■ System-related issues such as the system levels at which various security functions should
be enforced—for example, whether a security function should be handled at the physical
hardware level, the operating system level, or the DBMS level.
■ The need in some organizations to identify multiple security levels and to categorize the
data and users based on these classifications—for example, top secret, secret, confidential,
and unclassified.
Threats to databases
Loss of integrity
Loss of availability
Loss of confidentiality
To protect databases against these types of threats four kinds of countermeasures can be
implemented:
Access control: user accounts, passwords
Inference control: when statistical databases are used
Flow control: to avoid sensitive data reaching unauthorized users
Encryption
Threats
Excessive and Unused Privileges
Privilege Abuse
SQL Injection
Malware
Weak Audit Trail
Storage Media Exposure
Exploitation of Vulnerabilities
Misconfigured Databases
Unmanaged Sensitive Data
Denial of Service
Security risk to Database System:
Security risks to database systems include, for example:
Unauthorized or unintended activity or misuse by authorized database users, database
administrators, or network/systems managers, or by unauthorized users or hackers
(e.g. inappropriate access to sensitive data, metadata or functions within databases, or
inappropriate changes to the database programs, structures or security
configurations);
Malware infections causing incidents such as unauthorized access, leakage or
disclosure of personal or proprietary data, deletion of or damage to the data or
programs, interruption or denial of authorized access to the database, attacks on other
systems and the unanticipated failure of database services;
Overloads, performance constraints and capacity issues resulting in the inability of
authorized users to use databases as intended;
Physical damage to database servers caused by computer room fires or floods,
overheating, lightning, accidental liquid spills, static discharge, electronic
breakdowns/equipment failures and obsolescence;
Design flaws and programming bugs in databases and the associated programs and
systems, creating various security vulnerabilities (e.g. unauthorized privilege
escalation), data loss/corruption, performance degradation etc.;
Data corruption and/or loss caused by the entry of invalid data or commands,
mistakes in database or system administration processes, sabotage/criminal damage.
An object-relational database is a database management system similar to a relational database, but with an
object-oriented database model: objects, classes and inheritance are directly supported in database schemas
as well as within the query language. A popular ORDBMS on the market today is Oracle. The main
features of object-relational database systems are listed below:
Inheritance:
In all ORDBMSs, inheritance of user-defined types is supported to derive new sub-types or sub-classes,
which therefore inherit all attributes and methods from the parent type.
XQuery is a query-based language to retrieve data stored in the form of XML. XQuery is to XML what
SQL is to a database.
XQuery is a functional language that is used to retrieve information stored in XML format. XQuery can be
used on XML documents, relational databases containing data in XML formats, or XML Databases.
XQuery 3.0 is a W3C recommendation from April 8, 2014.
The definition of XQuery as given by its official documentation is as follows −
XQuery is a standardized language for combining documents, databases, Web pages and almost anything
else. It is very widely implemented. It is powerful and easy to learn. XQuery is replacing proprietary
middleware languages and Web Application development languages. XQuery is replacing complex Java or
C++ programs with a few lines of code. XQuery is simpler to work with and easier to maintain than many
other alternatives.
Characteristics:
Functional Language − XQuery is a language for retrieving/querying XML-based data.
Analogous to SQL − XQuery is to XML what SQL is to databases.
XPath based − XQuery uses XPath expressions to navigate through XML documents.
Universally accepted − XQuery is supported by all major databases.
W3C Standard − XQuery is a W3C standard.
Benefits of XQuery:
Using XQuery, both hierarchical and tabular data can be retrieved.
XQuery can be used to query tree and graphical structures.
XQuery can be directly used to query webpages.
XQuery can be directly used to build webpages.
XQuery can be used to transform xml documents.
XQuery is ideal for XML-based databases and object-based databases. Object databases are much
more flexible and powerful than purely tabular databases.
XPATH:
XQuery is XPath compliant. It uses XPath expressions to restrict the search results on XML
collections.
XPath Examples:
We will use the books.xml file and apply XQuery to it.
books.xml
<?xml version="1.0" encoding="UTF-8"?>
<books>
<book category="JAVA">
<title lang="en">Learn Java in 24 Hours</title>
<author>Robert</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="DOTNET">
<title lang="en">Learn .Net in 24 hours</title>
<author>Peter</author>
<year>2011</year>
<price>40.50</price>
</book>
<book category="XML">
<title lang="en">Learn XQuery in 24 hours</title>
<author>Robert</author>
<author>Peter</author>
<year>2013</year>
<price>50.00</price>
</book>
<book category="XML">
<title lang="en">Learn XPath in 24 hours</title>
<author>Jay Ban</author>
<year>2010</year>
<price>16.50</price>
</book>
</books>
We have given here three versions of an XQuery statement that fulfil the same objective of
displaying the book titles having a price value greater than 30.
XQuery – Version 1
(: read the entire xml document :)
let $books := doc("books.xml")
for $x in $books/books/book
where $x/price > 30
return $x/title
Output
<title lang="en">Learn .Net in 24 hours</title>
<title lang="en">Learn XQuery in 24 hours</title>
XQuery – Version 2
(: read all books :)
let $books := doc("books.xml")/books/book
for $x in $books
where $x/price > 30
return $x/title
Output
<title lang="en">Learn .Net in 24 hours</title>
<title lang="en">Learn XQuery in 24 hours</title>
XQuery – Version 3
(: read books with price > 30 :)
let $books := doc("books.xml")/books/book[price > 30]
for $x in $books
return $x/title
Output
<title lang="en">Learn .Net in 24 hours</title>
<title lang="en">Learn XQuery in 24 hours</title>
Types of privileges:
System Privileges
Schema Object Privileges
Table Privileges
View Privileges
Procedure Privileges
Type Privileges
In some cases it is desirable to grant a privilege to a user temporarily. For example,
the owner of a relation may want to grant the SELECT privilege to a user for a specific
task and then revoke that privilege once the task is completed.
Hence, a mechanism for revoking privileges is needed. In SQL, a REVOKE command is
included for the purpose of canceling privileges.
Suppose that the DBA creates four accounts
A1, A2, A3, A4 and wants only A1 to be able to create base relations. Then the DBA
must issue the following GRANT command in SQL
GRANT CREATETAB TO A1;
In SQL the same effect can be accomplished by having the DBA issue a CREATE
SCHEMA command as follows:
CREATE SCHEMA EXAMPLE AUTHORIZATION A1;
User account A1 can create tables under the schema called EXAMPLE.
Suppose that A1 creates the two base relations EMPLOYEE and
DEPARTMENT
A1 is then the owner of these two relations and hence has all the relation privileges on
each of them.
Suppose that A1 wants to grant A2 the privilege to insert and delete tuples in both
of these relations, but A1 does not want A2 to be able to propagate these privileges
to additional accounts:
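The corresponding SQL is a sketch (table names follow the example above; the privileges are
deliberately granted without the GRANT OPTION, which is what prevents A2 from propagating them):
GRANT INSERT, DELETE ON EMPLOYEE TO A2;
GRANT INSERT, DELETE ON DEPARTMENT TO A2;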
Information retrieval:
Information retrieval (IR) may be defined as a software program that deals with the organization, storage,
retrieval and evaluation of information from document repositories particularly textual information. The
system assists users in finding the information they require but it does not explicitly return the answers of
the questions. It informs the existence and location of documents that might consist of the required
information. The documents that satisfy user’s requirement are called relevant documents. A perfect IR
system will retrieve only relevant documents.
The process of information retrieval works as follows: a user who needs information formulates
a request in the form of a query in natural language; the IR system then responds by retrieving
the relevant output, in the form of documents, about the required information.
Classical Problem in Information Retrieval (IR) System
The main goal of IR research is to develop a model for retrieving information from the repositories of
documents. Here, we are going to discuss a classical problem, named ad-hoc retrieval problem, related to
the IR system.
In ad-hoc retrieval, the user must enter a query in natural language that describes the required information.
Then the IR system will return the required documents related to the desired information. For example,
suppose we are searching something on the Internet and it gives some exact pages that are relevant as per
our requirement but there can be some non-relevant pages too. This is due to the ad-hoc retrieval problem.
Aspects of Ad-hoc Retrieval
Following are some aspects of ad-hoc retrieval that are addressed in IR research −
How can users, with the help of relevance feedback, improve the original formulation of a query?
How to implement database merging, i.e., how can results from different text databases be merged
into one result set?
How to handle partly corrupted data? Which models are appropriate for this?
Information Retrieval (IR) Model
Mathematically, models are used in many scientific areas having objective to understand some
phenomenon in the real world. A model of information retrieval predicts and explains what a user will find
in relevance to the given query. IR model is basically a pattern that defines the above-mentioned aspects of
retrieval procedure and consists of the following −
A model for documents.
A model for queries.
A matching function that compares queries to documents.
Mathematically, a retrieval model consists of −
D − Representation for documents.
Q − Representation for queries.
F − The modeling framework for D and Q, along with the relationship between them.
R(q, di) − A similarity (ranking) function that orders the documents with respect to the query.
Types of Information Retrieval (IR) Model
An information retrieval (IR) model can be classified into the following three models −
Classical IR Model
It is the simplest and easiest to implement IR model. This model is based on mathematical knowledge
that is easily recognized and understood. Boolean, Vector and Probabilistic are the three classical IR
models.
Non-Classical IR Model
It is the complete opposite of the classical IR model. Such IR models are based on principles other than
similarity, probability, and Boolean operations. Information logic model, situation theory model and
interaction models are the examples of non-classical IR model.
Alternative IR Model
It is the enhancement of the classical IR model, making use of some specific techniques from other
fields. Cluster model, fuzzy model and latent semantic indexing (LSI) models are examples of the
alternative IR model.
Design features of Information retrieval (IR) systems
Let us now learn about the design features of IR systems −
Inverted Index
The primary data structure of most IR systems is the inverted index. We can define an
inverted index as a data structure that lists, for every word, all documents that contain it and the
frequency of the occurrences in each document. It makes it easy to search for ‘hits’ of a query word.
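As a relational sketch of an inverted index (the posting table and its columns are hypothetical):

CREATE TABLE posting (
    term   VARCHAR(64),
    doc_id INTEGER,
    freq   INTEGER, -- occurrences of the term in the document
    PRIMARY KEY (term, doc_id)
);
-- All documents that contain the word 'economic', most frequent first
SELECT doc_id, freq
FROM posting
WHERE term = 'economic'
ORDER BY freq DESC;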
Stemming
Stemming, the simplified form of morphological analysis, is the heuristic process of extracting the base
form of words by chopping off the ends of words. For example, the words laughing, laughs, laughed
would be stemmed to the root word laugh.
In the subsequent sections, we discuss some important and useful IR models.
The Boolean Model
It is the oldest information retrieval (IR) model. The model is based on set theory and the Boolean algebra,
where documents are sets of terms and queries are Boolean expressions on terms. The Boolean model can
be defined as −
D − A set of words, i.e., the indexing terms present in a document. Here, each term is either present
(1) or absent (0).
Q − A Boolean expression, where terms are the index terms and operators are logical products −
AND, logical sum − OR and logical difference − NOT
F − Boolean algebra over sets of terms as well as over sets of documents
In the Boolean IR model, the relevance prediction can be defined as follows −
R − A document is predicted as relevant to the query expression if and only if it satisfies the query
expression, e.g.
((text ∨ information) ∧ retrieval ∧ ¬theory)
In this model, a query term is an unambiguous definition of a set of documents.
For example, the query term “economic” defines the set of documents that are indexed with the
term “economic”.
Now, what would be the result after combining terms with the Boolean AND operator? It will define a
document set that is smaller than or equal to the document sets of any of the single terms. For example,
the query with terms “social” and “economic” will produce the set of documents that are indexed with
both terms; in other words, the intersection of both sets.
Now, what would be the result after combining terms with the Boolean OR operator? It will define a
document set that is bigger than or equal to the document sets of any of the single terms. For example,
the query with terms “social” or “economic” will produce the set of documents that are indexed with
either the term “social” or “economic”; in other words, the union of both sets.
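A tiny worked instance with hypothetical document sets: if “social” is indexed in {d1, d2, d3} and
“economic” in {d2, d3, d4}, then social AND economic returns {d2, d3} (the intersection), while
social OR economic returns {d1, d2, d3, d4} (the union).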
Advantages of the Boolean Model
The advantages of the Boolean model are as follows −
The simplest model, which is based on sets.
Easy to understand and implement.
It only retrieves exact matches
It gives the user, a sense of control over the system.
Disadvantages of the Boolean Model
The disadvantages of the Boolean model are as follows −
The model’s similarity function is Boolean. Hence, there would be no partial matches. This can be
annoying for the users.
In this model, the Boolean operator usage has much more influence than a critical word.
The query language is expressive, but it is complicated too.
No ranking for retrieved documents.
Vector Space Model
Due to the above disadvantages of the Boolean model, Gerard Salton and his colleagues suggested a
model, which is based on Luhn’s similarity criterion. The similarity criterion formulated by Luhn states,
“the more two representations agreed in given elements and their distribution, the higher would be the
probability of their representing similar information.”
Consider the following important points to understand more about the Vector Space Model −
The index representations (documents) and the queries are considered as vectors embedded in a
high dimensional Euclidean space.
The similarity measure of a document vector to a query vector is usually the cosine of the angle
between them.
Cosine Similarity Measure Formula
Cosine is a normalized dot product, which can be calculated with the help of the following formula –
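A standard formulation, as a sketch (assuming document vector D = (d1, ..., dn) and query vector
Q = (q1, ..., qn) hold the term weights):

\[
\cos(D, Q) \;=\; \frac{D \cdot Q}{\lVert D \rVert \, \lVert Q \rVert}
\;=\; \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^{2}} \; \sqrt{\sum_{i=1}^{n} q_i^{2}}}
\]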
Term Weighting
Term weighting means the weights on the terms in vector space. The higher the weight of a term, the
greater its impact on the cosine similarity. More weights should be assigned to the more important
terms in the model. Now the question that arises here is how we can model this.
One way to do this is to count the words in a document as its term weight. However, do you think this
would be an effective method?
Another method, which is more effective, is to use term frequency (tfij), document frequency
(dfi) and collection frequency (cfi).
Here,
N = documents in the collection
nt = documents containing term t
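These counts are usually combined into the standard tf-idf weight, sketched here (the inverse
document frequency damps terms that occur in many of the N documents):

\[
\mathrm{idf}_t \;=\; \log \frac{N}{n_t},
\qquad
w_{t,d} \;=\; \mathrm{tf}_{t,d} \times \log \frac{N}{n_t}
\]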
User Query Improvement
The primary goal of any information retrieval system must be accuracy − to produce relevant documents
as per the user’s requirement. However, the question that arises here is how we can improve the output
by improving the user’s query formulation. Certainly, the output of any IR system is dependent on the
user’s query, and a well-formulated query will produce more accurate results. The user can improve
his/her query with the help of relevance feedback, an important aspect of any IR model.
Relevance Feedback
Relevance feedback takes the output that is initially returned from the given query. This initial output can
be used to gather user information and to know whether that output is relevant to perform a new query or
not. Feedback can be classified as follows −
Explicit Feedback
It may be defined as the feedback that is obtained from the assessors of relevance. These assessors will
also indicate the relevance of a document retrieved from the query. In order to improve query retrieval
performance, the relevance feedback information needs to be interpolated with the original query.
Assessors or other users of the system may indicate the relevance explicitly by using the following
relevance systems −
Binary relevance system − This relevance feedback system indicates that a document is either
relevant (1) or irrelevant (0) for a given query.
Graded relevance system − The graded relevance feedback system indicates the relevance of a
document, for a given query, on the basis of grading by using numbers, letters or descriptions. The
description can be like “not relevant”, “somewhat relevant”, “very relevant” or “relevant”.
Implicit Feedback
It is the feedback that is inferred from user behavior. The behavior includes the duration of time user spent
viewing a document, which document is selected for viewing and which is not, page browsing and
scrolling actions, etc. One of the best examples of implicit feedback is dwell time, which is a measure of
how much time a user spends viewing the page linked to in a search result.
Pseudo Feedback
It is also called Blind feedback. It provides a method for automatic local analysis. The manual part of
relevance feedback is automated with the help of Pseudo relevance feedback so that the user gets
improved retrieval performance without an extended interaction. The main advantage of this feedback
system is that it does not require assessors like in explicit relevance feedback system.
Consider the following steps to implement this feedback −
Step 1 − First, the results returned by the initial query must be taken as relevant results. The range of
relevant results should be the top 10-50 results.
Step 2 − Now, select the top 20-30 terms from those documents using, for instance, the term
frequency (tf) − inverse document frequency (idf) weight.
Step 3 − Add these terms to the query and match the returned documents. Then return the most
relevant documents.
Access control
The security mechanism of a DBMS must include provisions for restricting access to the
database as a whole.
This function is called access control and is handled by creating user
accounts and passwords to control the login process by the DBMS.
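An Oracle-style sketch of creating such an account (the user name and password are hypothetical):
CREATE USER a1 IDENTIFIED BY secret_pwd;
GRANT CREATE SESSION TO a1; -- allows a1 to log in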
Inference control
The security problem associated with databases is that of controlling
the access to a statistical database, which is used to provide
statistical information or summaries of values based on various
criteria.
The countermeasures to the statistical database security problem are called inference
control measures.
Flow control
Another security issue is that of flow control, which prevents information from flowing
in such a way that it reaches unauthorized users.
Encryption
A final security issue is data encryption, which is used to protect sensitive data (such
as credit card numbers) that is being transmitted via some type of communication
network.
The data is encoded using some encoding algorithm.
20.Discuss about the Access control mechanism and Cryptography methods to secure the
Database. (Nov’2014 & May 2015)
Mandatory Access Control and Role-Based Access Control for Multilevel Security
The discretionary access control techniques of granting and revoking privileges on
relations have traditionally been the main security mechanism for relational database
systems.
This is an all-or-nothing method: a user either has or does not have a certain privilege.
In many applications, an additional security policy is needed that classifies data and users
based on security classes.
This approach, known as mandatory access control, would typically be combined with the
discretionary access control mechanisms.
Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified
(U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U.
The commonly used model for multilevel security, known as the Bell-LaPadula model,
classifies each subject (user, account, program) and object (relation, tuple, column, view,
operation) into one of the security classifications TS, S, C, or U.
We refer to the clearance (classification) of a subject S as class(S) and to the classification
of an object O as class(O).
Two restrictions are enforced on data access based on the subject/object classifications:
Simple security property: A subject S is not allowed read access to an object O unless
class(S) ≥ class(O).
Star property (or * property): A subject S is not allowed to write an object O unless
class(S) ≤ class(O).
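A small worked instance: a subject with clearance class(S) = C may read objects classified C or U
(simple security property), but may write only objects classified C, S, or TS (star property);
together, the two rules prevent information from flowing from higher to lower classifications.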
Digital Signatures:
□A digital signature is an example of using encryption techniques to provide authentication
services in e-commerce applications.
□A digital signature is a means of associating a mark unique to an individual with a body of
text.
□The mark should be unforgeable, meaning that others should be able to check that the
signature really does come from the originator.
□A digital signature consists of a string of symbols.
□Signature must be different for each use.
□This can be achieved by making each digital signature a function of the message that it is
signing, together with a time stamp.
□Public key techniques are the means of creating digital signatures.
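A sketch of a public-key digital signature using the third-party Python "cryptography" package. The key size, the message (including the timestamp mentioned above), and the choice of RSA-PSS with SHA-256 are illustrative assumptions.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
message = b"order #42, 2024-01-01T12:00"   # message plus time stamp (assumed data)

signature = private_key.sign(
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# verify() raises InvalidSignature if the message or signature was altered.
private_key.public_key().verify(
    signature,
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
print("signature verified")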
21.Suppose an object-oriented database has an object A, which references object B, which in
turn references object C. Assume all objects are on disk initially. Suppose a program first
dereferences A, then dereferences B by following the reference from A, and then finally
dereferences C. Show the objects that are represented in memory after each dereference,
along with their state. (Nov’2015)
Dereferencing:
• A -> B denotes the B attribute of the object referred to by reference A. Following such a
reference loads the referenced object from disk into memory if it is not already there (an
object fault).
• For this question: after the program dereferences A, only A is in memory (B and C remain
on disk); after it follows the reference from A to B, both A and B are in memory; after it
dereferences C, all three objects A, B, and C are in memory, each holding the state read from disk.
• Example (illustrative schema) − find the name of the b object referenced by the element of c1
named 'Joe':
SELECT c -> b -> name
FROM c1 c
WHERE c -> name = 'Joe';
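The loading behaviour can be illustrated with a toy object-faulting sketch in Python: each dereference fetches the object from "disk" only on first use, so memory fills with A, then B, then C. The DISK dictionary and attribute names are assumptions for illustration.
DISK = {
    "A": {"name": "A", "ref": "B"},
    "B": {"name": "B", "ref": "C"},
    "C": {"name": "C", "ref": None},
}
MEMORY = {}

def deref(oid):
    if oid not in MEMORY:               # object fault: fetch from disk
        MEMORY[oid] = dict(DISK[oid])
    return MEMORY[oid]

a = deref("A"); print(sorted(MEMORY))        # ['A']
b = deref(a["ref"]); print(sorted(MEMORY))   # ['A', 'B']
c = deref(b["ref"]); print(sorted(MEMORY))   # ['A', 'B', 'C']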
22.Describe the GRANT functions and explain how they relate to security. What types of
privileges may be granted? How can rights be revoked? (Nov/Dec 2015)
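As a brief answer to the GRANT part of this question: object privileges such as SELECT, INSERT, UPDATE, and DELETE on a relation can be granted to users or roles, optionally WITH GRANT OPTION so the grantee may pass the privilege on, and rights are withdrawn with REVOKE. The sketch below issues these statements from Python via psycopg2; the connection settings, role names, and the customer table are assumptions.
import psycopg2

conn = psycopg2.connect(dbname="shop", user="admin", password="secret")  # assumed credentials
conn.autocommit = True
cur = conn.cursor()

# Grant object privileges on a relation to a role.
cur.execute("GRANT SELECT, INSERT ON customer TO clerk;")
# WITH GRANT OPTION lets the grantee pass the privilege on to others.
cur.execute("GRANT SELECT ON customer TO analyst WITH GRANT OPTION;")
# Rights are withdrawn with REVOKE.
cur.execute("REVOKE INSERT ON customer FROM clerk;")

cur.close(); conn.close()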
Object Query Language (OQL) is a query language standard for object-oriented databases, modeled
after SQL. OQL was developed by the Object Data Management Group (ODMG). Because of its overall
complexity, nobody has ever fully implemented the complete OQL. OQL has influenced the design of
some newer query languages such as JDOQL and EJB QL, but they cannot be considered different
flavors of OQL.
Simple query:
The following example illustrates how one might retrieve objects from the database:
SELECT pc.cpuspeed
FROM PCs pc
WHERE pc.ram > 64;
Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze measure(s)
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, to mine a description of customer purchasing habits −
mine characteristics as customerPurchasing
analyze count%
Discrimination
The syntax for discrimination is −
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
For example, a user may define big spenders as customers who purchase items that cost $100 or more on
average, and budget spenders as customers who purchase items costing less than $100 on average. The
mining of discriminant descriptions for customers from each of these categories can be specified in
DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
Association
The syntax for association is −
mine associations [as pattern_name]
{matching metapattern}
For example −
mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.
Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns classifying customer credit rating, where the classes are determined by
the attribute credit_rating and the pattern is named classifyCustomerCreditRating −
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
The syntax for prediction is −
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set attribute_or_dimension_i = value_i}
Concept hierarchies can also be defined in DMQL.
Set-grouping hierarchies −
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Operation-derived hierarchies −
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
Rule-based hierarchies −
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all