
Advanced Database Management

System Assignment

Submitted to
Sir
Mr. Prashant Bhardwaj
Department of Computer Science & Engineering

Submitted by
Aditi Majumder
Section - B
Enrollment no.- 16UCS011
Date - 20/04/2020

National Institute of Technology, Agartala.


ADVANCED DATABASE MANAGEMENT SYSTEM ASSIGNMENT

1. (a) Discuss the main characteristics of the database approach and
how it differs from a traditional file system.
Answer:

The main characteristics of the database approach are:-

● Self-describing nature of a database system: A fundamental
characteristic of the database approach is that the database
system contains not only the database itself but also a complete
definition or description of the database structure and
constraints. This definition is stored in the DBMS catalog, which
contains information such as the structure of each file, the type
and storage format of each data item, and various constraints on
the data. The information stored in the catalog is called meta-
data, and it describes the structure of the primary database. In
traditional file processing, data definition is typically part of
the application programs themselves. Hence, these programs are
constrained to work with only one specific database, whose
structure is declared in the application programs.
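
As an illustration of the catalog idea, the short Python/SQLite sketch
below creates a table and then reads its definition back out of the
catalog (sqlite_master in SQLite; other DBMSs expose similar metadata
through information_schema). The table and column names are invented
for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
# Define a schema; the DBMS records this definition, not the application.
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# The catalog (meta-data) describes the structure of the database itself.
for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name, "->", sql)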

● Insulation between programs and data, and data abstraction:
In traditional file processing, the structure of data files is
embedded in the application programs, so any changes to the
structure of a file may require changing all programs that access
that file. By contrast, DBMS access programs do not require such
changes in most cases. The structure of data files is stored in
the DBMS catalog separately from the access programs. This
property is called program-data independence. In some types of
database systems, such as object-oriented and object-relational
systems, users can define operations on data as part of the
database definitions. The implementation of the operation is
specified separately and can be changed without affecting the
interface. User application programs can operate on the data by
invoking these operations through their names and arguments,
regardless of how the operations are implemented. This may be
termed program-operation independence. The characteristic that
allows program-data independence and program-operation
independence is called data abstraction.

● Support of multiple views of the data: A database typically
has many users, each of whom may require a different perspective
or view of the database. A view may be a subset of the database or
it may contain virtual data that is derived from the database
files but is not explicitly stored. Some users may not need to be
aware of whether the data they refer to is stored or derived. A
multiuser DBMS whose users have a variety of distinct applications
must provide facilities for defining multiple views.
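
A view in SQL is the usual way such a user-specific perspective is
defined. The sketch below (Python with SQLite, illustrative names)
derives a virtual table of high earners that is not stored explicitly.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 90000), (2, 'Ravi', 40000)")

# The view contains no stored rows; its contents are derived on demand.
conn.execute("CREATE VIEW high_earners AS SELECT name, salary FROM employee WHERE salary > 50000")
print(conn.execute("SELECT * FROM high_earners").fetchall())   # [('Asha', 90000.0)]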

● Sharing of data and multiuser transaction processing: A
multiuser DBMS must allow multiple users to access the database at
the same time. This is essential if data for multiple applications
is to be integrated and maintained in a single database. The DBMS
must include concurrency control software to ensure that several
users trying to update the same data do so in a controlled manner
so that the result of the updates is correct. A fundamental role
of multiuser DBMS software is to ensure that concurrent
transactions operate correctly and efficiently. The concept of a
transaction has become central to many database applications. A
transaction is an executing program or process that includes one
or more database accesses, such as reading or updating of database
records. Each transaction is supposed to execute a logically
correct database access if executed in its entirety without
interference from other transactions. The DBMS must enforce
several transaction properties. The isolation property ensures
that each transaction appears to execute in isolation from other
transactions, even though hundreds of transactions may be
executing concurrently. The atomicity property ensures that either
all the database operations in a transaction are executed or none
are.
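
The atomicity property can be seen directly in the transaction
interface of a relational DBMS. A minimal Python/SQLite sketch (the
table and values are invented for the example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acc_id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    with conn:  # one transaction: both updates commit, or neither does
        conn.execute("UPDATE account SET balance = balance - 200 WHERE acc_id = 1")
        conn.execute("UPDATE account SET balance = balance + 200 WHERE acc_id = 2")
except sqlite3.IntegrityError:
    print("transfer rolled back")             # the CHECK fails, so nothing is applied

print(conn.execute("SELECT * FROM account").fetchall())  # balances unchanged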

#Difference between the database approach and the traditional file system:

A database is a collection of interrelated data stored on a database
server, usually in the form of tables. The primary aim of a database is
to provide a way to store and retrieve information quickly and
efficiently.
A number of characteristics distinguish it from a traditional file
management system. In the file system approach, each user defines and
implements the files needed for a specific application. For example, in
the sales department of an enterprise, one user may maintain details of
the sales personnel and their grades in one file, while another user
maintains the salary details of the same salespersons in a separate
file. Although both users are interested in data about salespersons,
each keeps the details in a separate file and needs different programs
to manipulate it. This leads to wasted space and redundancy
(replication of data), which in turn can cause confusion and data
inconsistency, and sharing of data among users is not possible. The
files have no inter-relationships among the data they store. In
traditional file processing, therefore, every user defines their own
constraints and implements the files needed for their applications.
In the database approach, a single repository of data is maintained
that is defined once and then accessed by many users. The fundamental
characteristic of the database approach is that the database system
contains not only the data but also a complete definition or
description of the database structure and constraints. These
definitions are stored in a system catalog, which holds information
about the structure and definitions of the database. The information
stored in the catalog is called metadata, and it describes the primary
database. Hence this approach works for any kind of database, for
example an insurance, airline, banking, finance, or enterprise
information database, whereas in a traditional file processing system
the application is developed for a specific purpose and accesses only
that specific data.
Another main characteristic of the database approach is that it allows
multiple users to access the database at the same time, so sharing of
data is possible. The database must include concurrency control
software to ensure that several users trying to update the same data at
the same time do so in a controlled manner. In the file system
approach, many programmers create files over a long period, and the
various files have different formats and are written in different
application languages. There is therefore a real possibility of
information being duplicated; this redundancy, storing the same data
multiple times, leads to higher costs and wasted space, and may result
in data inconsistency when an update is applied to some of the files
but not all of them. Moreover, in the database approach multiple views
can be created. A view is a tailored representation of the information
contained in one or more tables. It is also called a "virtual table"
because it does not contain physically stored records and does not
occupy space.
A multi-user database whose users have a variety of applications must
provide facilities for defining multiple views. In a traditional file
system, any change to the structure of a file affects all the programs
that access it, so such a change may require modifying every one of
those programs. In the database approach, by contrast, the structure of
the database is stored in the system catalog separately from the
application programs that access it. This property is known as
program-data independence.
The database can also be used to provide persistent storage for program
objects and data structures, which led to the object-oriented database
approach. Traditional systems suffered from the impedance mismatch
problem and difficulty in accessing data, which object-oriented
database systems avoid. A database can represent complex relationships
among data and can retrieve and update related data easily and
efficiently. It is possible to define and enforce integrity constraints
for the data stored in the database, and the database also provides
facilities for recovering from hardware and software failures through
its backup and recovery subsystem. Compared with the file system
approach, it reduces application development time considerably, keeps
up-to-date information available to all users, and provides security
for the data stored in the database system.

(b) What is data integrity? Explain the types of integrity
constraints.
Answer: Data integrity is the overall completeness, accuracy and
consistency of data. This can be indicated by the absence of
alteration between two instances or between two updates of a data
record, meaning data is intact and unchanged. Data integrity is
usually imposed during the database design phase through the use
of standard procedures and rules. The concept of data integrity
ensures that all data in a database can be traced and connected to
other data. This ensures that everything is recoverable and
searchable. Data integrity can be maintained through the use of
various error-checking methods and validation procedures.
#The following three integrity constraints are used in a
relational database structure to achieve data integrity:
Entity Integrity: This is concerned with the concept of primary
keys. The rule states that every table must have its own primary
key and that each has to be unique and not null.
Referential Integrity: This is the concept of foreign keys. The
rule states that the foreign key value can be in two states. The
first state is that the foreign key value would refer to a primary
key value of another table, or it can be null. Being null could
simply mean that there are no relationships, or that the
relationship is unknown.
Domain Integrity: This states that every value in a column of a
relational table must belong to the column's defined domain, that is,
its declared data type and permitted range of values.
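
All three kinds of constraint can be declared directly in SQL DDL. A
small Python/SQLite sketch with invented tables:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite checks foreign keys only when this is on
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,                  -- entity integrity: unique and not null
    name    TEXT NOT NULL
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    salary  REAL CHECK (salary > 0),              -- domain integrity: value restricted to a domain
    dept_id INTEGER REFERENCES department(dept_id)   -- referential integrity: matches a PK or is NULL
);
""")
# Violations raise sqlite3.IntegrityError, e.g. inserting an employee
# whose dept_id does not exist in department.
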
2. (a) What is a data mart? Differentiate between dependent and
independent data marts.
Answer: A data mart is a condensed version of Data Warehouse and
is designed for use by a specific department, unit or set of users
in an organization. E.g., Marketing, Sales, HR or finance. It is
often controlled by a single department in an organization. Data
Mart usually draws data from only a few sources compared to a Data
warehouse. Data marts are small in size and are more flexible
compared to a Datawarehouse.
The main difference between independent and dependent data marts
is how you populate the data mart; that is, how you get data out
of the sources and into the data mart. This step, called
the Extraction-Transformation-Transportation (ETT) process,
involves moving data from operational systems, filtering it, and
loading it into the data mart. With dependent data marts, this
process is somewhat simplified because formatted and summarized
data has already been loaded into the central data warehouse. The
ETT process for dependent data marts is mostly a process of
identifying the right subset of data relevant to the chosen data
mart subject and moving a copy of it, perhaps in a summarized
form. With independent data marts, on the other hand, we
must deal with all aspects of the ETT process, much as we do with
a central data warehouse. The number of sources is likely to be
fewer and the amount of data associated with the data mart is less
than the warehouse, given your focus on a single subject.
(b) A set of FDs for the relation R{A, B, C, D, E, F} is AB→C, C→A,
BC→D, ACD→B, BE→C, EC→FA, CF→BD, D→E. Find a minimum cover for this
set of FDs.
Answer:
Step 1: Right Hand Side (RHS) of all FDs should be single
attribute. So we write F as F1, as follows:
F1= {AB->C, C->A, BC->D, ACD->B, BE->C, EC->F, EC->A, CF->B, CF-
>D, D->E}
Step 2: Remove extraneous attributes. An extraneous attribute is a
redundant attribute on the LHS of a functional dependency. In the set
of FDs, AB->C, BC->D, ACD->B, BE->C, EC->F, EC->A, CF->B and CF->D have
more than one attribute on the LHS. Hence, we check whether any of
these LHS attributes is extraneous. To do so, we find the closure of
each single attribute appearing on an LHS.
i. A+=A
ii. B+=B
iii. C+=CA
iv. D+=DE
v. E+=E
vi. F+=F
From (iii), the closure of C contains the attribute A. So E is
extraneous in EC->A and E can be removed. So, we can write the FDs
as:
F2= {AB->C, C->A, BC->D, ACD->B, BE->C, EC->F, C->A, CF->B, CF->D,
D->E}
Step 3: Eliminate redundant functional dependencies. The second
occurrence of C->A in F2 is a duplicate and hence redundant. After
removing it, the final minimal cover we get is:
F3= {AB->C, C->A, BC->D, ACD->B, BE->C, EC->F, CF->B, CF->D, D->E}
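
The closures used in Step 2 can be checked mechanically. Below is a
small Python sketch of the standard attribute-closure algorithm (the
function name and the representation of the FDs are just for this
example):

# Each FD is a pair (lhs, rhs), written as strings of attribute names.
FDS = [("AB", "C"), ("C", "A"), ("BC", "D"), ("ACD", "B"), ("BE", "C"),
       ("EC", "F"), ("EC", "A"), ("CF", "B"), ("CF", "D"), ("D", "E")]

def closure(attrs, fds):
    """Return the closure of the attribute set attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

print(sorted(closure("C", FDS)))   # ['A', 'C'] -> C+ = CA, as used in Step 2
print(sorted(closure("D", FDS)))   # ['D', 'E'] -> D+ = DE
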
3. (a) What are the advantages of normalized relations over the
un-normalized relations?
Answer: The advantages of normalized relations over the un-
normalized relations are given below:
● A smaller database can be maintained as normalized
relation eliminates the duplicate data. Overall size of
the database is reduced as a result.
● Better performance is ensured which can be linked to the
above point. As databases become lesser in size, the
passes through the data becomes faster and shorter
thereby improving response time and speed.
● Narrower tables are possible as normalized tables will
be fine-tuned and will have lesser columns which allows
for more data records per page.
● Fewer indexes per table ensures faster maintenance tasks
(index rebuilds).
● Also realizes the option of joining only the tables that
are needed.
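
A concrete illustration of the points above (Python/SQLite, invented
data): splitting a table that repeats department information into two
normalized tables removes the duplication the bullets refer to.

import sqlite3

conn = sqlite3.connect(":memory:")
# Un-normalized: the department name is repeated for every employee.
conn.execute("CREATE TABLE emp_flat (emp_id INTEGER, emp_name TEXT, dept_name TEXT)")
conn.executemany("INSERT INTO emp_flat VALUES (?, ?, ?)",
                 [(1, "Asha", "Sales"), (2, "Ravi", "Sales"), (3, "Mira", "HR")])

# Normalized: each department name is stored exactly once.
conn.executescript("""
CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE emp  (emp_id INTEGER PRIMARY KEY, emp_name TEXT,
                   dept_id INTEGER REFERENCES dept(dept_id));
""")
# Renaming "Sales" now means updating one row in dept instead of many rows in emp_flat.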

(b) Write a short note on:


(i) Serializability: The types of schedules that are always
considered to be correct when concurrent transactions are
executing are known as serializable schedules. The concept of
serializability of schedules is used to identify which
schedules are correct when transaction executions have
interleaving of their operations in the schedules. Formally,
a schedule S is serial if, for every transaction T
participating in the schedule, all the operations of T are
executed consecutively in the schedule; otherwise, the
schedule is called nonserial. Serializability is a concept
that helps to identify which non-serial schedules are correct
and will maintain the consistency of the database. There are
2 types of serializability: Conflict and View
serializability.
● Conflict serializability: A schedule is called conflict
serializable if we can convert it into a serial schedule
after swapping its non-conflicting operations. Two
operations are said to be in conflict, if they satisfy
all the following three conditions:

a) Both the operations belong to different transactions.
b) Both the operations work on the same data item.
c) At least one of the operations is a write operation.

For example, let's consider this schedule:

T1        T2
----      ----
R(A)
R(B)
          R(A)
          R(B)
          W(B)
W(A)

To convert this schedule into a serial schedule we would have to swap
the R(A) operation of transaction T2 with the W(A) operation of
transaction T1. However, we cannot swap these two operations because
they are conflicting operations; thus this schedule is not conflict
serializable.
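
Conflict serializability is usually tested with a precedence graph:
add an edge Ti -> Tj whenever an operation of Ti conflicts with a
later operation of Tj; the schedule is conflict serializable iff the
graph is acyclic. A small Python sketch of that test (the schedule
encoding is invented for the example):

from itertools import combinations

# Each step: (transaction, action, item); this encodes the schedule above.
schedule = [("T1", "R", "A"), ("T1", "R", "B"),
            ("T2", "R", "A"), ("T2", "R", "B"), ("T2", "W", "B"),
            ("T1", "W", "A")]

def conflict_edges(sched):
    edges = set()
    for i, j in combinations(range(len(sched)), 2):
        ti, ai, xi = sched[i]
        tj, aj, xj = sched[j]
        if ti != tj and xi == xj and "W" in (ai, aj):
            edges.add((ti, tj))        # earlier op's txn -> later op's txn
    return edges

def has_cycle(edges, nodes):
    # Tiny DFS cycle check; fine for a handful of transactions.
    graph = {n: [b for a, b in edges if a == n] for n in nodes}
    def visit(n, stack):
        if n in stack:
            return True
        return any(visit(m, stack | {n}) for m in graph[n])
    return any(visit(n, set()) for n in nodes)

edges = conflict_edges(schedule)
print(edges)                           # both T1 -> T2 and T2 -> T1: a cycle
print("conflict serializable:", not has_cycle(edges, {"T1", "T2"}))   # False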

● View serializability: A schedule S is said to be view serializable
if it is view equivalent to a serial schedule. Two schedules S and S′
are said to be view equivalent if the following three conditions hold:
I. The same set of transactions participates in S and S′, and S and
S′ include the same operations of those transactions.
II. For any operation ri(X) of Ti in S, if the value of X read by the
operation has been written by an operation wj(X) of Tj (or if it is
the original value of X before the schedule started), the same
condition must hold for the value of X read by operation ri(X) of Ti
in S′.
III. If the operation wk(Y) of Tk is the last operation to write item
Y in S, then wk(Y) of Tk must also be the last operation to write
item Y in S′.

(ii) Recoverability: Sometimes a transaction may not execute
completely due to a software issue, system crash or hardware failure.
In that case, the failed transaction has to be rolled back. If, in a
schedule, a transaction performs a dirty read from an uncommitted
transaction and its commit operation is delayed until the uncommitted
transaction either commits or rolls back, then such a schedule is known
as a recoverable schedule. Because the commit of the transaction that
performed the dirty read is delayed, the schedule can still be
recovered if the uncommitted transaction fails later.
4. (a) Differentiate between Strict two-phase locking protocol and
conservative two-phase locking protocol for concurrency control in
databases with the help of an example.
Answer: Strict two-phase locking requires that, in addition to the
locking being two-phase, all exclusive (X) locks held by the
transaction are not released until after the transaction commits. The
conservative protocol, in contrast, requires the transaction to lock
all the items it accesses before it begins execution, by predeclaring
its read-set and write-set. If any of the predeclared items cannot be
locked, the transaction does not lock any of the items; instead, it
waits until all the items are available for locking. Moreover, the
strict two-phase locking protocol does not guarantee freedom from
deadlock, while the conservative protocol is deadlock free.
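
As an example of the difference, the sketch below (plain Python; the
class and function names are invented) shows conservative 2PL
acquiring every predeclared lock before execution and giving them all
back if any is unavailable, whereas strict 2PL acquires locks as items
are touched and holds its X locks until commit, which is where
deadlocks can arise.

class LockTable:
    def __init__(self):
        self.locks = {}                       # item -> (mode, transaction)

    def acquire(self, txn, item, mode):
        holder = self.locks.get(item)
        if holder and holder[1] != txn:
            return False                      # lock held by someone else: would block
        self.locks[item] = (mode, txn)
        return True

    def release_all(self, txn):
        self.locks = {k: v for k, v in self.locks.items() if v[1] != txn}

def conservative_2pl(lt, txn, read_set, write_set):
    # Predeclare and lock everything before the first read or write.
    wanted = [(x, "S") for x in read_set] + [(x, "X") for x in write_set]
    if not all(lt.acquire(txn, x, m) for x, m in wanted):
        lt.release_all(txn)                   # give everything back and retry later
        return txn + " waits (no deadlock possible)"
    # ... perform all reads and writes ...
    lt.release_all(txn)
    return txn + " committed"

def strict_2pl(lt, txn, read_set, write_set):
    for x in read_set:
        lt.acquire(txn, x, "S")               # locks taken as items are accessed,
    for x in write_set:                       # so two transactions can end up
        lt.acquire(txn, x, "X")               # waiting on each other (deadlock)
    # ... commit ...
    lt.release_all(txn)                       # X locks held until after commit
    return txn + " committed"

lt = LockTable()
print(conservative_2pl(lt, "T1", {"A"}, {"B"}))
print(strict_2pl(lt, "T2", {"B"}, {"A"}))
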
(b) Describe three-level schema architecture. Why do we need
mappings between schema levels? How do different schema
definition languages support this architecture?
Answer: The three-schema architecture is also called ANSI/SPARC
architecture or three-level architecture. This framework is used
to describe the structure of a specific database system. This
architecture is also used to separate the user applications and
physical database. The architecture contains three-levels. It
breaks the database down into three different categories which are
described below:

I. Internal Level

● The internal level has an internal schema which describes the
physical storage structure of the database.
● The internal schema is also known as a physical schema.
● It uses the physical data model and defines how the data will be
stored in blocks.
● The physical level is used to describe complex low-level data
structures in detail.

II. Conceptual Level


● The conceptual schema describes the design of a database at
the conceptual level. Conceptual level is also known as
logical level.
● The conceptual schema describes the structure of the whole
database.
● The conceptual level describes what data are to be stored in
the database and also describes what relationship exists
among those data.
● In the conceptual level, internal details such as an
implementation of the data structure are hidden.
● Programmers and database administrators work at this level.

III. External Level

● At the external level, a database contains several schemas that are
sometimes called subschemas. A subschema is used to describe a
particular view of the database.
● An external schema is also known as view schema.
● Each view schema describes the part of the database that a
particular user group is interested in and hides the remaining
database from that user group.
● The view schema describes the end user interaction with
database systems.

Mappings between the schema levels are needed so that the DBMS can
transform a request expressed against one level into a request against
the next level, and transform the resulting data back again. For
example, a query posed against an external view must be mapped to the
conceptual schema and then to the internal schema before it can be
executed, and the retrieved data must be reformatted to match the
user's external view. The mappings are also what make data
independence possible, since a schema at one level can change while
only the mapping, and not the schemas above it, has to be rewritten.

In a three-schema approach, most data-related description languages or
tools associated with schemas focus on the "physical level" and "view
level", with the "conceptual level" used mostly while designing the
schema itself. In relational databases, the physical model is
expressed using SQL DDL. Physical schemas in NoSQL systems can be
attached to particular records using JSON or XML. There are some
languages for describing conceptual schemas, but nowadays such
languages are rarely used; instead the entity-relationship model, a
diagramming and description tool, is used. The "view level" or
"external level" is often implemented outside the data manager, in
user-side code, using object-relational mapping tools such as
Hibernate. These tools convert "physical layer" schema structures into
"external-friendly" structures; user input from UI elements is handled
by the application code, and the "database layer" code sits behind the
API.
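
The separation can be made concrete with a tiny Python/SQLite script
(all names invented): the internal level shows up as storage
parameters, the conceptual level as the table definitions, and the
external level as views defined on top of them.

import sqlite3

conn = sqlite3.connect(":memory:")

# Internal level: physical storage parameters (here, the page size of the database file).
conn.execute("PRAGMA page_size = 4096")

# Conceptual level: the logical structure of the whole database.
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL, dept TEXT)")

# External level: one view per user group, hiding the rest of the database.
conn.execute("CREATE VIEW payroll_view AS SELECT name, salary FROM employee")
conn.execute("CREATE VIEW directory_view AS SELECT name, dept FROM employee")
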
5. Write short notes on any THREE of the following:
Answer:
(i) Categories of Data models: The data models can be classified
into four different categories:
a) Relational Model: The relational model uses a collection of
tables to represent both data and the relationships among
those data. Each table has multiple columns, and each column
has a unique name. Tables are also known as relations. The
relational model is an example of a record-based model.
Record-based models are so named because the database is
structured in
fixed-format records of several types. Each table contains
records of a particular type. Each record type defines a
fixed number of fields, or attributes. The columns of the
table correspond to the attributes of the record type. The
relational data model is the most widely used data model, and
a vast majority of current database systems are based on the
relational model.
b) Entity-Relationship Model: The entity-relationship (E-R) data
model uses a collection of basic objects, called entities,
and relationships among these objects. An entity is a “thing”
or “object” in the real world that is distinguishable from
other objects. The entity-relationship model is widely used
in database design.
c) Object-Based Data Model: Object-oriented programming
(especially in Java, C++, or C#) has become the dominant
software-development methodology. This led to the development
of an object-oriented data model that can be seen as
extending the E-R model with notions of encapsulation,
methods (functions), and object identity. The object-
relational data model combines features of the object-
oriented data model and relational data model.
d) Semi structured Data Model: The semi structured data model
permits the specification of data where individual data items
of the same type may have different sets of attributes. This
is in contrast to the data models mentioned earlier, where
every data item of a particular type must have the same set
of attributes. The Extensible Markup Language (XML) is widely
used to represent semi structured data.
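
For instance, the two records below (shown as Python dictionaries,
mirroring what JSON or XML would hold; the data is invented) are both
"book" items yet carry different attribute sets, which a fixed
relational schema would not allow directly.

import json

books = [
    {"title": "Database Systems", "year": 2010, "authors": ["Elmasri", "Navathe"]},
    {"title": "Data on the Web", "isbn": "1-55860-622-X"},   # different attributes, same "type"
]
print(json.dumps(books, indent=2))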

(ii) Using Heuristics in Query optimization: This optimization
technique applies heuristic rules to modify the internal
representation of a query—which is usually in the form of a query
tree or a query graph data structure—to improve its expected
performance. The scanner and parser of an SQL query first generate
a data structure that corresponds to an initial query
representation, which is then optimized according to heuristic
rules. This leads to an optimized query representation, which
corresponds to the query execution strategy. Following that, a
query execution plan is generated to execute groups of operations
based on the access paths available on the files involved in the
query.
One of the main heuristic rules is to apply SELECT and
PROJECT operations before applying the JOIN or other binary
operations, because the size of the file resulting from a binary
operation—such as JOIN—is usually a multiplicative function of the
sizes of the input files. The SELECT and PROJECT operations reduce
the size of a file and hence should be applied before a join or
other binary operation.
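
A tiny illustration of the rule (plain Python, invented relations):
filtering before the join keeps the intermediate result small, which
is exactly what pushing SELECT below JOIN achieves in the query tree.

# employee(emp_id, dept_id), department(dept_id, dname); invented sample data.
employee = [(i, i % 50) for i in range(10_000)]
department = [(d, "D" + str(d)) for d in range(50)]

# Unoptimized order: join first (10,000 intermediate tuples), then select.
joined = [(e, d) for e in employee for d in department if e[1] == d[0]]
result1 = [row for row in joined if row[1][1] == "D7"]

# Heuristic order: select on department first, then join the much smaller input.
small_dept = [d for d in department if d[1] == "D7"]
result2 = [(e, d) for e in employee for d in small_dept if e[1] == d[0]]

assert len(result1) == len(result2) == 200   # same answer, far less intermediate work
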
(iii) Multimedia Databases: Multimedia databases provide features
that allow users to store and query different types of multimedia
information, which includes images (such as photos or drawings),
video clips (such as movies, newsreels, or home videos), audio
clips (such as songs, phone messages, or speeches), and documents
(such as books or articles). The main types of database queries
that are needed involve locating multimedia sources that contain
certain objects of interest. For example, one may want to locate
all video clips in a video database that include a certain person,
say Michael Jackson.
One may also want to retrieve video clips based on certain
activities included in them, such as video clips where a soccer
goal is scored by a certain player or team.
The above types of queries are referred to as content-based
retrieval, because the multimedia source is being retrieved based
on its containing certain objects or activities. Hence, a
multimedia database must use some model to organize and index the
multimedia sources based on their contents. Identifying the
contents of multimedia sources is a difficult and time-consuming
task. There are two main approaches. The first is based on
automatic analysis of the multimedia sources to identify certain
mathematical characteristics of their contents. This approach uses
different techniques depending on the type of multimedia source
(image, video, audio, or text). The second approach depends on
manual identification of the objects and activities of interest in
each multimedia source and on using this information to index the
sources. This approach can be applied to all multimedia sources,
but it requires a manual pre-processing phase where a person has
to scan each multimedia source to identify and catalog the objects
and activities it contains so that they can be used to index the
sources.
(iv) Mobile Databases: Mobile Database is a database that is
transportable, portable, and physically separate or detached from
the corporate database server but has the capability to
communicate with those servers from remote sites allowing the
sharing of various kinds of data. With mobile databases, users
have access to corporate data on their laptop, PDA, or other
Internet access device that is required for applications at remote
sites.
The components of a mobile database environment include:

● A corporate database server and DBMS that manages and stores the
corporate data and provides corporate applications
● A remote database and DBMS that usually manages and stores the
mobile data and provides mobile applications
● A mobile database platform that includes a laptop, PDA, or other
Internet access devices
● Two-way communication links between the corporate and mobile DBMSs.
Depending on the particular requirements of mobile applications, in
many cases the user of a mobile device may log on to a corporate
database server and work with the data there. In contrast, in other
cases, the user may download data and work with it on the mobile device
or upload data captured at the remote site to the corporate database.
The communication between the corporate
and mobile databases is usually discontinuous and is typically
established or gets its connection for a short duration of time at
irregular intervals. Although unusual, some applications require
direct communication between mobile databases. The two main issues
associated with mobile databases are the management of the mobile
database and the communication between the mobile and corporate
databases. In the following section, we identify the requirements
of mobile DBMSs.
The additional functionality required for mobile DBMSs includes
the capability to:

● communicate with the centralized or primary database server through
various communication modes
● replicate data on the centralized database server and the mobile
device
● synchronize data on the centralized database server and the mobile
device
● capture data from a range of sources such as the Internet
● manage data on the mobile device
● analyse data on a mobile device
● create customized and personalized mobile applications

6. (a) What do you understand by distributed databases? Give the
various advantages and disadvantages of distributed database
management systems.
Answer: A distributed database is a database that consists of two
or more files located in different sites either on the same
network or on entirely different networks. Portions of the
database are stored in multiple physical locations and processing
is distributed among multiple database nodes. The DDBMS
synchronizes all the data periodically and ensures that data
updates and deletes performed at one location will be
automatically reflected in the data stored elsewhere.
Advantages of Distributed database management system are:
● Modular Development: If the system needs to be expanded to
new locations or new units, in centralized database systems,
the action requires substantial efforts and disruption in the
existing functioning. However, in distributed databases, the
work simply requires adding new computers and local data to
the new site and finally connecting them to the distributed
system, with no interruption in current functions.
● More Reliable: In case of database failures, the total system of
centralized databases comes to a halt. However, in distributed
systems, when a component fails, the functioning of the system
continues, possibly at reduced performance. Hence a DDBMS is more
reliable.
● Better Response: If data is distributed in an efficient
manner, then user requests can be met from local data itself,
thus providing faster response. On the other hand, in
centralized systems, all queries have to pass through the
central computer for processing, which increases the response
time.
● Lower Communication Cost: In distributed database systems, if
data is located locally where it is mostly used, then the
communication costs for data manipulation can be minimized.
This is not feasible in centralized systems.
● Improved Performance: As the data is located near the site of
'greatest demand', and given the inherent parallelism of
distributed DBMSs, speed of database access may be better
than that achievable from a remote centralized database.
Furthermore, since each site handles only a part of the
entire database, there may not be the same contention for CPU
and I/O services as characterized by a centralized DBMS.
● Improved shareability and local autonomy: The geographical
distribution of an organization can be reflected in the distribution
of the data; users at one site can access data stored at other sites.
Data can be placed at the site close to the users who normally use
that data. In this way, users have local control of the data, and they
can consequently establish and enforce local policies regarding its
use. A global database administrator (DBA) is responsible for the
entire system, but part of this responsibility is generally delegated
to the local level, so that the local DBA can manage the local DBMS.
Disadvantages of DDBMS are:
● Complexity: A distributed DBMS that hides the distributed nature
from the user and provides an acceptable level of performance,
reliability and availability is inherently more complex than a
centralized DBMS. The fact that data can be replicated also adds an
extra level of complexity to the distributed DBMS. If the software
does not handle data replication adequately, there will be degradation
in availability, reliability and performance compared with the
centralized system, and the advantages we cited above will become
disadvantages.
● Cost: Increased complexity means that we can expect the
procurement and maintenance costs for a DDBMS to be higher
than those for a centralized DBMS. Furthermore, a distributed
DBMS requires additional hardware to establish a network
between sites. There are ongoing communication costs incurred
with the use of this network. There are also additional labor
costs to manage and maintain the local DBMSs and the
underlying network.
● Security: In a centralized system, access to the data can be
easily controlled. However, in a distributed DBMS not only
does access to replicated data have to be controlled in
multiple locations but also the network itself has to be made
secure. In the past, networks were regarded as an insecure
communication medium. Although this is still partially true,
significant developments have been made to make networks more
secure.
● Integrity control more difficult: Database integrity refers
to the validity and consistency of stored data. Integrity is
usually expressed in terms of constraints, which are
consistency rules that the database is not permitted to
violate. Enforcing integrity constraints generally requires
access to a large amount of data that defines the
constraints. In a distributed DBMS, the communication and
processing costs that are required to enforce integrity
constraints are high as compared to centralized system.
● Lack of Standards: Although distributed DBMSs depend on
effective communication, we are only now starting to see the
appearance of standard communication and data access
protocols. This lack of standards has significantly limited
the potential of distributed DBMSs. There are also no tools
or methodologies to help users convert a centralized DBMS
into a distributed DBMS
● Lack of experience: General-purpose distributed DBMSs have
not been widely accepted, although many of the protocols and
problems are well understood. Consequently, we do not yet
have the same level of experience in industry as we have with
centralized DBMSs. For a prospective adopter of this
technology, this may be a significant deterrent.
● Database design more complex: Besides the normal difficulties
of designing a centralized database, the design of a
distributed database has to take account of fragmentation of
data, allocation of fragmentation to specific sites, and data
replication.
(b) How can a data mining system be integrated with a data warehouse
system?
Answer: The possible schemes for integrating a data mining (DM) system
with a database (DB) or data warehouse (DW) system include no coupling,
loose coupling, semi-tight coupling, and tight coupling. These are
described below:
i. No coupling: No coupling means that a DM system will not
utilize any function of a DB or DW system. It may fetch data
from a particular source (such as a file system), process
data using some data mining algorithms, and then store the
mining results in another file.
ii. Loose coupling: Loose coupling means that a DM system will
use some facilities of a DB or DW system, fetching data from
a data repository managed by these systems, performing data
mining, and then storing the mining results either in a file
or in a designated place in a database or data Warehouse.
Loose coupling is better than no coupling because it can
fetch any portion of data stored in databases or data
warehouses by using query processing, indexing, and other
system facilities. However, many loosely coupled mining
systems are main memory-based. Because mining does not
explore data structures and query optimization methods
provided by DB or DW systems, it is difficult for loose
coupling to achieve high scalability and good performance
with large data sets.
iii. Semi tight coupling: Semi tight coupling means that besides
linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives
(identified by the analysis of frequently encountered data
mining functions) can be provided in the DB/DW system. These
primitives can include sorting, indexing, aggregation,
histogram analysis, multi way join, and precomputation of
some essential statistical measures, such as sum, count, max,
min, standard deviation.
iv. Tight coupling: Tight coupling means that a DM system is
smoothly integrated into the DB/DW system. The data mining
subsystem is treated as one functional component of
information system. Data mining queries and functions are
optimized based on mining query analysis, data structures,
indexing schemes, and query processing methods of a DB or DW
system.
7. (a) What is logical data independence and why is it important?
Answer: Logical data independence is a key feature of a database
management system for maintaining data integrity and the overall
effectiveness of data usage. Logical data independence is an important
part of the three-schema architecture: it allows the conceptual schema
to be changed or modified without disturbing the external schemas. The
modifications to the conceptual schema may include alteration of
entities, attributes, relationships and so on. Changing any of these
elements does not disturb the external application programs, which is
the key advantage of the logical data independence feature of database
management systems. Owing to logical data independence, any of the
changes below will not affect the external layer:
● adding, modifying or deleting an attribute, entity or relationship
without a rewrite of existing application programs
● merging two records into one
● breaking an existing record into two or more records
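
A small Python/SQLite illustration of the first point (all names
invented): the application reads through an external view, so adding a
new attribute to the conceptual schema does not require changing the
application's query.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("CREATE VIEW emp_view AS SELECT emp_id, name FROM employee")   # external schema
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 90000)")

app_query = "SELECT * FROM emp_view"            # what the application program runs
print(conn.execute(app_query).fetchall())

# Conceptual schema changes: a new attribute is added to the entity.
conn.execute("ALTER TABLE employee ADD COLUMN dept TEXT")

# The external view, and hence the application query, is unaffected.
print(conn.execute(app_query).fetchall())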

(b) What do you understand by timestamp-based protocol? Discuss the
multi-version scheme also.
Answer: The timestamp ordering protocol is used to order transactions
based on their timestamps. The order of the transactions is simply the
ascending order of their creation times. An older transaction has a
higher priority, which is why its operations are given precedence. To
determine the timestamp of a transaction, the protocol uses the system
time or a logical counter. The timestamp ordering protocol also
maintains the timestamps of the last 'read' and 'write' operation on
each data item. The algorithm must ensure that, for each item accessed
by conflicting operations in the schedule, the order in which the item
is accessed does not violate the timestamp ordering. To ensure this,
two timestamp values are kept for each database item X:
● W_TS(X) is the largest timestamp of any transaction that
executed write(X) successfully.
● R_TS(X) is the largest timestamp of any transaction that
executed read(X) successfully.

The basic timestamp ordering protocol works as follows:

i. Whenever a transaction Ti issues a Read(X) operation, check:
● If W_TS(X) > TS(Ti), the operation is rejected and Ti is rolled
back.
● If W_TS(X) <= TS(Ti), the operation is executed and R_TS(X) is set
to max(R_TS(X), TS(Ti)).

ii. Whenever a transaction Ti issues a Write(X) operation, check:
● If TS(Ti) < R_TS(X) or TS(Ti) < W_TS(X), the operation is rejected
and Ti is rolled back.
● Otherwise the operation is executed and W_TS(X) is set to TS(Ti).

Here TS(Ti) denotes the timestamp of the transaction Ti.
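
A compact Python sketch of these rules (the data structures and names
are invented for the example):

class TimestampOrdering:
    def __init__(self):
        self.r_ts = {}   # item -> largest timestamp that read it
        self.w_ts = {}   # item -> largest timestamp that wrote it

    def read(self, ts, item):
        if self.w_ts.get(item, 0) > ts:
            return "reject: roll back T" + str(ts)   # a younger transaction already wrote X
        self.r_ts[item] = max(self.r_ts.get(item, 0), ts)
        return "read ok"

    def write(self, ts, item):
        if ts < self.r_ts.get(item, 0) or ts < self.w_ts.get(item, 0):
            return "reject: roll back T" + str(ts)   # a younger transaction already read or wrote X
        self.w_ts[item] = ts
        return "write ok"

to = TimestampOrdering()
print(to.read(2, "X"))    # read ok
print(to.write(1, "X"))   # reject: roll back T1 (the younger T2 has already read X)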

Multiversion concurrency control (MVCC) is a method of controlling the
consistency of data accessed by multiple users concurrently. MVCC
implements the snapshot isolation guarantee, which ensures that each
transaction always sees a consistent snapshot of data. Each transaction
obtains a consistent snapshot of data when it starts and can only view
and modify data in this snapshot. When the transaction updates an
entry, the system verifies that the entry has not been updated by other
transactions and creates a new version of the entry. The new version
becomes visible to other transactions only when and if this transaction
commits successfully. If the entry has been updated in the meantime,
the current transaction fails with a conflict. The snapshots are not
physical snapshots but logical snapshots generated by an MVCC
coordinator: a component that coordinates transactional activity,
keeps track of all active transactions, and is notified when each
transaction finishes. Operations on MVCC-managed data request a
snapshot of the data from the coordinator.
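
A toy sketch of the versioning idea in Python (names invented): each
write appends a new version tagged with the writing transaction's
commit timestamp, and a reader sees only the latest version committed
no later than its own snapshot timestamp.

class MVCCStore:
    def __init__(self):
        self.versions = {}                    # key -> list of (commit_ts, value)

    def write(self, key, value, commit_ts):
        # A new version is created; older versions stay available to older readers.
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, snapshot_ts):
        # Return the newest version whose commit timestamp is <= the reader's snapshot.
        visible = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return max(visible)[1] if visible else None

store = MVCCStore()
store.write("X", "v1", commit_ts=10)
store.write("X", "v2", commit_ts=20)
print(store.read("X", snapshot_ts=15))   # 'v1': the snapshot predates the second version
print(store.read("X", snapshot_ts=25))   # 'v2'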

8. (a) Define specialization and aggregation with the help of an
example and explain their purpose.
Answer: In specialization, an entity is divided into sub-entities
based on their characteristics. It is a top-down approach where
higher-level entity is specialized into two or more lower level
entities. For Example, EMPLOYEE entity in an Employee management
system can be specialized into DEVELOPER, TESTER etc. as shown in
Figure-1. In this case, common attributes like E_NAME, E_SAL etc.
become part of higher entity (EMPLOYEE) and specialized attributes
like TES_TYPE become part of specialized entity (TESTER).
An ER diagram is not capable of representing relationship between
an entity and a relationship which may be required in some
scenarios. In those cases, a relationship with its corresponding
entities is aggregated into a higher-level entity. For Example,
Employee working for a project may require some machinery. So,
REQUIRE relationship is needed between relationship WORKS_FOR and
entity MACHINERY. Using aggregation, WORKS_FOR relationship with
its entities EMPLOYEE and PROJECT is aggregated into single entity
and relationship REQUIRE is created between aggregated entity and
MACHINERY.
Specialization is used to identify the subset of an entity set
that shares some distinguishing characteristics. Generalization is
used to combine two or more entities of lower level to form a
higher-level entity if they have some attributes in common. In
generalization, an entity of a higher level can also combine with
the entities of the lower level to form a further higher-level
entity.

(b) Differentiate among candidate key, primary key, super key and
foreign key.

Answer: A set of attributes which can uniquely identify a tuple is
known as a super key, and a minimal set of attributes which can
uniquely identify a tuple is known as a candidate key. There can be
more than one candidate key in a relation, out of which one is chosen
as the primary key. If an attribute can only take values which are
present as values of some other attribute, it is a foreign key to the
attribute it refers to.

Example:

<STUDENT>
Student_Number  Student_Name  Student_Phone  Subject_Number
1               Andrew        6615927284     10
2               Sara          6583654865     20
3               Harry         4647567463     10

<SUBJECT>
Subject_Number  Subject_Name  Subject_Instructor
10              DBMS          Korth
20              Algorithms    Cormen
30              Algorithms    Leiserson

<ENROLL>
Student_Number  Subject_Number
1               10
2               20
3               10

The Super Keys in <Student> table are –


{Student_Number}
{Student_Phone}
{Student_Number,Student_Name}
{Student_Number,Student_Phone}
{Student_Number,Subject_Number}
{Student_Phone,Student_Name}
{Student_Phone,Subject_Number}
{Student_Number,Student_Name,Student_Phone}
{Student_Number,Student_Phone,Subject_Number}
{Student_Number,Student_Name,Subject_Number}
{Student_Phone,Student_Name,Subject_Number}

The Super Keys in <Subject> table are –


{Subject_Number}
{Subject_Number,Subject_Name}
{Subject_Number,Subject_Instructor}
{Subject_Number,Subject_Name,Subject_Instructor}
{Subject_Name,Subject_Instructor}

The Super Key in <Enroll> table is –


{Student_Number,Subject_Number}

The Candidate Keys in <Student> table are {Student_Number} and
{Student_Phone}
The Candidate Keys in <Subject> table are {Subject_Number} and
{Subject_Name,Subject_Instructor}
The Candidate Key in <Enroll> table is {Student_Number,
Subject_Number}

The Primary Key in <Student> table is {Student_Number}


The Primary Key in <Subject> table is {Subject_Number}
The Primary Key in <Enroll> table is {Student_Number,
Subject_Number}

Subject_Number in the <STUDENT> and <ENROLL> tables is a foreign key
referencing Subject_Number in the <SUBJECT> table.
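
The same keys can be declared directly in SQL; below is a short
Python/SQLite sketch of the three tables above (column spellings
adapted to SQL style for the example).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE subject (
    subject_number     INTEGER PRIMARY KEY,                 -- primary (and candidate) key
    subject_name       TEXT,
    subject_instructor TEXT,
    UNIQUE (subject_name, subject_instructor)               -- the other candidate key
);
CREATE TABLE student (
    student_number INTEGER PRIMARY KEY,
    student_name   TEXT,
    student_phone  TEXT UNIQUE,                             -- alternate candidate key
    subject_number INTEGER REFERENCES subject(subject_number)  -- foreign key
);
CREATE TABLE enroll (
    student_number INTEGER REFERENCES student(student_number),
    subject_number INTEGER REFERENCES subject(subject_number),
    PRIMARY KEY (student_number, subject_number)            -- composite primary key
);
""")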
