
MC4202 – ADVANCED DATABASE TECHNOLOGY
UNIT - II
Prepared by E. Janakiraman, MCA, M.Phil.
Assistant Professor - MCA/APEC
UNIT II SPATIAL AND TEMPORAL DATABASES
Active Databases Model – Design and Implementation Issues - Temporal Databases - Temporal Querying - Spatial Databases: Spatial Data Types, Spatial Operators and Queries – Spatial Indexing and Mining – Applications – Mobile Databases: Location and Handoff Management, Mobile Transaction Models – Deductive Databases - Multimedia Databases.
1.4 DISTRIBUTED DATABASES
 The central idea of a distributed database is that data is physically stored at different locations, but its distribution and access remain transparent to the user.
1.4.1 Introduction to DDBs:
A Distributed Database should exhibit the following properties:
1) Distributed Data Independence: - The user should be able to access the database without needing to know the location of the data.
2) Distributed Transaction Atomicity: - Atomicity should hold for transactions whose operations take place at the distributed sites.
 Types of Distributed Databases are: -
a) Homogeneous Distributed Database, where the data stored across multiple sites is managed by the same DBMS software at all the sites.
b) Heterogeneous Distributed Database, where multiple sites, which may be autonomous, are under the control of different DBMS software.
1.4.2 Architecture of DDBs:
There are three architectures: -
1.4.2.1 Client-Server:
 A Client-Server system has one or more client processes
and one or more server processes, and a client process
can send a query to any one server process. Clients are
responsible for user-interface issues, and servers
manage data and execute transactions.
 Thus, a client process could run on a personal computer
and send queries to a server running on a mainframe.
 Advantages: -
1. Simple to implement because of the centralized server and separation of functionality.
2. Expensive server machines are not underutilized by simple user interactions, which are instead pushed onto inexpensive client machines.
3. The users get a familiar and friendly client-side user interface rather than an unfamiliar and unfriendly server interface.
1.4.2.2 Collaborating Server:
 In the client-server architecture a single query cannot be split and executed across multiple servers, because the client process would have to be complex and intelligent enough to break a query into sub-queries to be executed at different sites and then piece their results together. The client's capabilities would then overlap with the server's, making it hard to distinguish between the client and the server.
 In Collaborating Server system, we can have collection of
database servers, each capable of running transactions
against local data, which cooperatively execute
transactions spanning multiple servers.
 When a server receives a query that requires access to
data at other servers, it generates appropriate sub queries
to be executed by other servers and puts the results
together to compute answers to the original query.
1.4.2.3 Middleware:
 A middleware system is a special server, a layer of software that coordinates the execution of queries and transactions across one or more independent database servers.
 The Middleware architecture is designed to allow a single
query to span multiple servers, without requiring all
database servers to be capable of managing such multi
site execution strategies. It is especially attractive when
trying to integrate several legacy systems, whose basic
capabilities cannot be extended.
 We need just one database server that is capable of
managing queries and transactions spanning multiple
servers; the remaining servers only need to handle local
queries and transactions.
1.5 STORING DATA IN DDBS
Data storage involves two concepts:
1. Fragmentation
2. Replication
1.5.1 Fragmentation:
 It is the process in which a relation is broken into smaller
relations called fragments and possibly stored at different sites.
 It is of two types:
1. Horizontal Fragmentation where the original relation is
broken into a number of fragments, where each fragment
is a subset of rows.
The union of the horizontal fragments should reproduce
the original relation.
2. Vertical Fragmentation where the original relation is
broken into a number of fragments, where each fragment
consists of a subset of columns.
The system often assigns a unique tuple id to each tuple in the original relation so that the fragments, when joined again, form a lossless join.
The collection of all vertical fragments should reproduce
the original relation.
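The two fragmentation styles and their reconstruction properties can be sketched in a few lines of Python. The relation, the fragmentation predicate, and the tuple ids below are illustrative, not from the text:

```python
# Toy relation: each tuple carries a system-assigned unique id ("tid")
# so that vertical fragments can be rejoined losslessly.
emp = [
    {"tid": 1, "name": "Ana",  "dept": 10, "salary": 4000},
    {"tid": 2, "name": "Ben",  "dept": 20, "salary": 5000},
    {"tid": 3, "name": "Caro", "dept": 10, "salary": 6000},
]

# Horizontal fragmentation: each fragment is a subset of rows.
frag_site1 = [t for t in emp if t["dept"] == 10]
frag_site2 = [t for t in emp if t["dept"] != 10]
# The union of the horizontal fragments reproduces the original relation.
reunion = frag_site1 + frag_site2

# Vertical fragmentation: each fragment is a subset of columns,
# with the tuple id kept in every fragment for a lossless join.
vfrag_a = [{"tid": t["tid"], "name": t["name"]} for t in emp]
vfrag_b = [{"tid": t["tid"], "dept": t["dept"], "salary": t["salary"]} for t in emp]
# Joining the vertical fragments on tid reconstructs the original tuples.
by_tid = {t["tid"]: t for t in vfrag_b}
rejoined = [{**a, **by_tid[a["tid"]]} for a in vfrag_a]
```

Horizontal fragments are recombined by union; vertical fragments are recombined by a join on the system-assigned tuple id.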
1.5.2 Replication:
 Replication occurs when we store more than one copy of a
relation or its fragment at multiple sites.
 Advantages:-
1. Increased availability of data: If a site that contains a
replica goes down, we can find the same data at other
sites. Similarly, if local copies of remote relations are
available, we are less vulnerable to failure of
communication links.
2. Faster query evaluation: Queries can execute faster by
using a local copy of a relation instead of going to a
remote site.
1.5.3 Distributed catalog management:
Naming Objects
 This concerns the unique identification of each fragment or replica of a relation that has been partitioned or replicated.
 This can be done by using a global name server that
can assign globally unique names.
 This can be implemented by using the following two
fields:-
1. Local name field – the name assigned locally by the site where the relation is created. Two objects at different sites can have the same local name.
2. Birth site field – indicates the site at which the relation is
created and where information about its fragments and
replicas is maintained.
Catalog Structure:
 A centralized system catalog can maintain the information about all the data in the distributed database, but it is vulnerable to the failure of the site containing the catalog.
 This could be avoided by maintaining a copy of the global system catalog at every site, but then every change to a local catalog must be broadcast to all its replicas.
 Another alternative is to maintain a local catalog at every site that keeps track of all the replicas of each relation.
Distributed Data Independence:
 It means that the user should be able to query the database without needing to specify the location of the fragments or replicas of a relation; that is the job of the DBMS.
 Users can be enabled to access relations without
considering how the relations are distributed as follows:
The local name of a relation in the system catalog is a
combination of a user name and a user-defined relation
name.
 When a query is issued, the DBMS adds the user name to the relation name to get a local name, then adds the user's site-id as the (default) birth site to obtain a global relation name. By looking up the global relation name in the local catalog (if it is cached there) or in the catalog at the birth site, the DBMS can locate replicas of the relation.
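The name-resolution steps above can be sketched as follows; the catalog contents, user name, and site ids are hypothetical:

```python
# Hypothetical catalog mapping a global relation name
# (user name, relation name, birth site) to its replica sites.
catalog = {
    ("maria", "Sailors", "site_A"): ["site_A", "site_C"],
}

def resolve(user, relation, user_site, cache=None):
    """Build the local name, then the global name, then look up
    replicas in the local cache first, else at the birth site."""
    local_name = (user, relation)            # user name + relation name
    global_name = (*local_name, user_site)   # default birth site: user's site
    if cache and global_name in cache:
        return cache[global_name]            # cached in the local catalog
    return catalog[global_name]              # lookup at the birth site

replicas = resolve("maria", "Sailors", "site_A")
```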
1.5.4 Distributed query processing:
 In a distributed system, several factors complicate query processing.
 One factor is the cost of transferring data over the network.
 This data includes intermediate files that are transferred to other sites for further processing, and the final result files that may have to be transferred to the site where the query result is needed.
 These costs may not be very high if the sites are connected by a high-speed local network, but they can become quite significant over other types of networks.
 Hence, DDBMS query optimization algorithms consider reducing the amount of data transfer as an optimization criterion in choosing a distributed query execution strategy.
 Consider the following EMPLOYEE and DEPARTMENT relations.
 The size of the EMPLOYEE relation is 100 * 10,000 = 10^6 bytes.
 The size of the DEPARTMENT relation is 35 * 100 = 3,500 bytes.
EMPLOYEE
Fname Lname SSN Bdate Add Gender Salary Dnum
 10,000 records
 Each record is 100 bytes
 Fname field is 15 bytes long
 SSN field is 9 bytes long
 Lname field is 15 bytes long
 Dnum field is 4 byte long
DEPARTMENT
Dname Dnumber MGRSSN MgrStartDate
 100 records
 Each record is 35 bytes long
 Dnumber field is 4 bytes long
 Dname field is 10 bytes long
 MGRSSN field is 9 bytes long
 Now consider the following query:
“For each employee, retrieve the employee name and the name
of the department for which the employee works.”
 Using relational algebra this query can be expressed as
π Fname, Lname, Dname (EMPLOYEE ⋈ Dnum=Dnumber DEPARTMENT)
 If we assume that every employee is related to a department
then the result of this query will include 10,000 records.
 Now suppose that each record in the query result is 40 bytes
long and the query is submitted at a distinct site which is the
result site.
 Then there are three strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to site 3 (the result site) and perform the join there. In this case a total of 1,000,000 + 3,500 = 1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2 (the site holding the DEPARTMENT relation), join there, and send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1 (the site holding the EMPLOYEE relation), join there, and send the result to site 3. In this case 400,000 + 3,500 = 403,500 bytes must be transferred.
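A quick check of the three strategies' transfer costs, with all sizes taken from the example above:

```python
# Relation and result sizes from the example (bytes).
EMP_SIZE = 100 * 10_000      # EMPLOYEE: 1,000,000 bytes
DEPT_SIZE = 35 * 100         # DEPARTMENT: 3,500 bytes
RESULT_SIZE = 40 * 10_000    # join result: 400,000 bytes

# Strategy 1: ship both relations to the result site (site 3).
s1 = EMP_SIZE + DEPT_SIZE
# Strategy 2: ship EMPLOYEE to the DEPARTMENT site, join there,
# then ship the result to site 3.
s2 = EMP_SIZE + RESULT_SIZE
# Strategy 3: ship DEPARTMENT to the EMPLOYEE site, join there,
# then ship the result to site 3.
s3 = DEPT_SIZE + RESULT_SIZE
```

Strategy 3 moves the least data, which is why an optimizer minimizing data transfer would pick it.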
1.5.4.1 Nonjoin Queries in a Distributed DBMS:
 Consider the following two relations:
Sailors (sid: integer, sname:string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
 Now consider the following query:
SELECT S.age
FROM Sailors S
WHERE S.rating > 3 AND S.rating < 7
 Now suppose that the Sailors relation is horizontally fragmented, with all the tuples having a rating less than 5 stored at Shanghai and all the tuples having a rating greater than 5 stored at Tokyo.
 The DBMS will answer this query by evaluating it at both sites and then taking the union of the answers.
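A minimal sketch of this fragment-and-union evaluation, with made-up Sailors tuples:

```python
# Hypothetical horizontal fragments of Sailors:
# rating < 5 stored at Shanghai, rating > 5 at Tokyo.
shanghai = [{"sid": 1, "age": 30.0, "rating": 4},
            {"sid": 2, "age": 45.5, "rating": 2}]
tokyo    = [{"sid": 3, "age": 22.0, "rating": 6},
            {"sid": 4, "age": 60.0, "rating": 9}]

def site_query(fragment):
    # SELECT S.age FROM Sailors S WHERE S.rating > 3 AND S.rating < 7
    return [t["age"] for t in fragment if 3 < t["rating"] < 7]

# Evaluate the query at both sites, then union the answers.
answer = site_query(shanghai) + site_query(tokyo)
```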
1.5.4.2 Joins in a Distributed DBMS:
 Joins of relations stored at different sites can be very expensive, so we now consider the evaluation options that must be weighed in a distributed environment.
 Suppose that the Sailors relation is stored at London and the Reserves relation is stored at Paris. We will consider the following strategies for computing the join of Sailors and Reserves.
 In the following examples, the time taken to read one page from disk (or to write one page to disk) is denoted td, and the time taken to ship one page (from any site to another site) is denoted ts.
Fetch as needed:
 We can do a page-oriented nested loops join in London, with Sailors as the outer relation, fetching all Reserves pages from Paris for each Sailors page.
 If we cache the fetched Reserves pages in London until the join is complete, pages are fetched only once; but let's assume that Reserves pages are not cached, just to see how bad the result can get.
 The cost of scanning Sailors is 500td, and for each of the 500 Sailors pages the cost of scanning and shipping all of Reserves is 1000(td + ts). Therefore the total cost is 500td + 500,000(td + ts).
 In addition, if the query was not submitted at the London site, we must add the cost of shipping the result to the query site; this cost depends on the size of the result.
 Because sid is a key for Sailors, the number of tuples in the result is 100,000 (the number of tuples in Reserves), and each result tuple is 40 + 50 = 90 bytes long.
 With a page size of 4,000 bytes, 4000/90 = 44 result tuples fit on a page, so the result size is 100,000/44 = 2,273 pages.
1.5.4.3 Ship to one site:
 There are three possibilities to compute the result at one site:
• Ship Sailors from London to Paris and carry out the join there.
• Ship Reserves from Paris to London and carry out the join there.
• Ship both Sailors and Reserves to the site where the query was posed and compute the join there.
 The costs are:
• The cost of scanning and shipping Sailors from London to Paris and doing the join at Paris is 500(2td + ts) + 4500td.
• The cost of shipping Reserves from Paris to London and then doing the join at London is 1000(2td + ts) + 4500td.
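The cost formulas above can be written as small functions. The page counts (500 for Sailors, 1,000 for Reserves) and the 4500td local-join cost are the figures used in the text; the sample td and ts values in the usage are arbitrary:

```python
# Page counts from the running example.
SAILORS, RESERVES = 500, 1000
JOIN_IO = 4500  # local join cost in page I/Os, as stated in the text

def fetch_as_needed(td, ts):
    # Scan Sailors once; for each Sailors page, scan and ship all of
    # Reserves from Paris (worst case: no caching of fetched pages).
    return SAILORS * td + SAILORS * RESERVES * (td + ts)

def ship_sailors_to_paris(td, ts):
    # Scan and ship Sailors, then join at Paris.
    return SAILORS * (2 * td + ts) + JOIN_IO * td

def ship_reserves_to_london(td, ts):
    # Scan and ship Reserves, then join at London.
    return RESERVES * (2 * td + ts) + JOIN_IO * td
```

For example, with td = 1 and ts = 10, fetch-as-needed costs 5,500,500 time units against 10,500 for shipping Sailors to Paris, which shows why the no-caching fetch strategy is so bad.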
Semijoins and Bloomjoins:
 Consider the strategy of shipping Reserves from Paris to London and computing the join at London.
 Some tuples in Reserves may not join with any tuple in Sailors; if we could identify the tuples that are guaranteed not to join with any Sailors tuple, we could avoid shipping them.
 Semijoins:
• The semijoin technique proceeds in three steps:
1) At London compute the projection of Sailors onto the
join columns, and ship this projection to Paris.
2) At Paris, compute the natural join of the projection received from the first site with the Reserves relation. The result of this join is called the reduction of Reserves with respect to Sailors, because only the Reserves tuples in the reduction can join with tuples in the Sailors relation. Ship the reduction of Reserves to London, rather than the entire Reserves relation.
3) At London, compute the join of the reduction of
Reserves with Sailors.
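The three semijoin steps can be sketched with toy data (the tuples below are illustrative, not from the text):

```python
# Toy data standing in for Sailors (at London) and Reserves (at Paris).
sailors  = [{"sid": 1, "rating": 9}, {"sid": 2, "rating": 3}]
reserves = [{"sid": 1, "bid": 101}, {"sid": 1, "bid": 102},
            {"sid": 7, "bid": 103}]   # sid 7 has no matching sailor

# Step 1 (London): project Sailors onto the join column, ship to Paris.
sid_projection = {t["sid"] for t in sailors}

# Step 2 (Paris): compute the reduction of Reserves w.r.t. Sailors --
# only tuples that can possibly join are kept and shipped back.
reduction = [r for r in reserves if r["sid"] in sid_projection]

# Step 3 (London): join the reduction with Sailors.
by_sid = {s["sid"]: s for s in sailors}
join = [{**r, **by_sid[r["sid"]]} for r in reduction]
```

The tuple with sid 7 is never shipped, which is exactly the saving the semijoin aims for.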
 Computing the cost of joining Sailors and Reserves using a semijoin:
• Assume the projection is computed by first scanning Sailors and creating a temporary relation with tuples that have only an sid field, then sorting the temporary relation and scanning the sorted temporary to eliminate duplicates.
• If we assume that the size of the sid field is 10 bytes, then the cost of the projection is 500td for scanning Sailors, plus 100td for creating the temporary, plus 400td for sorting it, plus 100td for the final scan, plus 100td for writing the result into another temporary relation; the total is 1200td.
500td + 100td + 400td + 100td + 100td = 1200td.
• The cost of computing the projection and shipping it to Paris is 1200td + 100ts.
• The cost of computing the reduction of Reserves is 3 *
(100+1000)=3300td.
 But what is the size of Reduction?
• If every sailor holds at least one reservation then the
reduction includes every tuple of Reserves and the effort
invested in shipping the projection and reducing Reserves is
a total waste.
• Because of this observation, a semijoin is especially useful in conjunction with a selection on one of the relations.
• For example, if we want to join only the Sailors tuples with rating > 8 with the Reserves relation, then the size of the projection on sid for tuples that satisfy the selection would be just 20% of the original projection, that is, 20 pages.
 Bloomjoin:
• A Bloomjoin is quite similar to a semijoin.
• The steps of a Bloomjoin are:
Step 1:
The main difference is that a bit-vector is shipped in the first step, instead of the projection of Sailors.
A bit-vector of (some chosen) size k is computed by hashing each tuple of Sailors into the range 0 to k-1 and setting bit i to 1 if some tuple hashes to i, and to 0 otherwise.
Step 2:
The reduction of Reserves is computed by hashing each tuple of Reserves (using the sid field) into the range 0 to k-1 with the same hash function used to construct the bit-vector, and discarding the tuples whose hash value corresponds to a 0 bit.
Because no Sailors tuple hashes to such an i, no Sailors tuple can join with any Reserves tuple that is not in the reduction.
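A sketch of the two Bloomjoin steps, using a simple modulo hash and made-up sids. A real system would use a better hash function; collisions only cause false positives (extra tuples shipped), never lost join tuples:

```python
K = 64  # bit-vector size k (illustrative choice)

def h(sid):
    # The same hash function must be used on both sides.
    return sid % K

sailors_sids = [1, 2, 5]
reserves = [{"sid": 1, "bid": 101}, {"sid": 7, "bid": 103},
            {"sid": 5, "bid": 104}]

# Step 1 (London): build the bit-vector from Sailors and ship it.
bits = [0] * K
for sid in sailors_sids:
    bits[h(sid)] = 1

# Step 2 (Paris): keep only Reserves tuples whose sid hashes to a set
# bit; tuples hashing to a 0 bit cannot join with any Sailors tuple.
reduction = [r for r in reserves if bits[h(r["sid"])] == 1]
```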
 Thus the cost of shipping a bit-vector and reducing Reserves using the vector is less than the corresponding cost in a semijoin.
1.5.4.4 Cost-Based Query Optimization:
A query involves several operations, and optimizing a query in a distributed database poses some challenges:
 Communication cost must be considered. If we have several copies of a relation, then we also have to decide which copy to use.
 If the individual sites run under the control of different DBMSs, then the autonomy of each site must be respected while doing global query planning.
1.6 DISTRIBUTED CONCURRENCY CONTROL AND
RECOVERY
The main issues with respect to distributed transactions are:
 Distributed Concurrency Control
 How can deadlocks be detected in a distributed database?
 How can locks for objects stored across several sites be
managed?
 Distributed Recovery
 When a transaction commits, all its actions across all the sites
at which it executes must persist.
 When a transaction aborts none of its actions must be allowed
to persist.
1.6.1 Concurrency Control and Recovery in Distributed Databases:
For concurrency control and recovery purposes, numerous problems arise in a distributed DBMS environment that are not encountered in a centralized DBMS environment.
These include the following:
Dealing with multiple copies of the data items:
The concurrency control method is responsible for maintaining consistency among these copies. The recovery method is responsible for making a copy consistent with the other copies if the site on which the copy is stored fails and later recovers.
Failure of individual sites:
The DBMS should continue to operate with its running sites, if possible, when one or more individual sites fail. When a site recovers, its local database must be brought up to date with the rest of the sites before it rejoins the system.
Failure of communication links:
The system must be able to deal with the failure of one or more of the communication links that connect the sites. An extreme case of this problem is that network partitioning may occur: the sites are broken up into two or more partitions, where the sites within each partition can communicate only with one another and not with sites in other partitions.
Distributed Commit:
Problems can arise with committing a transaction that accesses databases stored at multiple sites if some sites fail during the commit process. The two-phase commit protocol is often used to deal with this problem.
Distributed Deadlock:
Deadlock may occur among several sites so techniques for
dealing with deadlocks must be extended to take this into account.
1.6.2 Lock management can be distributed across sites in
many ways:
 Centralized: A single site is in charge of handling lock and
unlock requests for all objects.
 Primary copy: One copy of each object is designated as the primary copy. All requests to lock or unlock a copy of the object are handled by the lock manager at the site where the primary copy is stored, regardless of where the copy itself is stored.
 Fully Distributed: Requests to lock or unlock a copy of an object stored at a site are handled by the lock manager at the site where the copy is stored.
Conclusion:
• The Centralized scheme is vulnerable to failure of the single site that controls locking.
• The Primary copy scheme avoids this problem, but in general reading an object requires communication with two sites:
 The site where the primary copy resides.
 The site where the copy to be read resides.
• This problem is avoided in the fully distributed scheme, because locking is done at the site where the copy to be read resides.
1.6.3 Distributed Deadlock
 One issue that requires special attention when using either primary copy or fully distributed locking is deadlock detection.
 Each site maintains a local waits-for graph, and a cycle in a local graph indicates a deadlock.
For example:
 Suppose that we have two sites A and B, both containing copies of objects O1 and O2, and that the read-any write-all technique is used.
 T1, which wants to read O1 and write O2, obtains an S lock on O1 and an X lock on O2 at site A, and requests an X lock on O2 at site B.
 T2, which wants to read O2 and write O1, meanwhile obtains an S lock on O2 and an X lock on O1 at site B, and then requests an X lock on O1 at site A.
 T2 is waiting for T1 at site A and T1 is waiting for T2 at site B; neither local waits-for graph contains a cycle, but the global waits-for graph does, so we have a deadlock.
[Figure: local waits-for graphs at site A and site B, and the global waits-for graph]
To detect such deadlocks, a distributed deadlock detection
algorithm must be used and we have three types of algorithms:
1. Centralized Algorithm:
• It consists of periodically sending all local waits-for graphs to one site that is responsible for global deadlock detection.
• At this site, the global waits-for graph is generated by combining all local graphs: the set of nodes is the union of the nodes in the local graphs, and there is an edge from one node to another if such an edge appears in any of the local graphs.
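The centralized algorithm's core, combining the local graphs and searching the global waits-for graph for a cycle, can be sketched as follows; the edges are those of the two-site example above:

```python
# Local waits-for graphs (edges "Ti waits for Tj") from the example:
# at site A, T2 waits for T1; at site B, T1 waits for T2.
site_a = {("T2", "T1")}
site_b = {("T1", "T2")}

def has_deadlock(*local_graphs):
    """Combine the local graphs into the global waits-for graph and
    look for a cycle with a depth-first search."""
    edges = set().union(*local_graphs)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)

    visited, on_stack = set(), set()

    def dfs(u):
        visited.add(u)
        on_stack.add(u)
        for v in adj.get(u, []):
            # A back edge to a node on the current DFS stack is a cycle.
            if v in on_stack or (v not in visited and dfs(v)):
                return True
        on_stack.discard(u)
        return False

    return any(dfs(n) for n in adj if n not in visited)
```

Neither local graph alone has a cycle, but the combined graph does, which is why the local graphs must be shipped to one site and merged.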
2. Hierarchical Algorithm:
• This algorithm groups the sites into a hierarchy: the sites might be grouped by state, then by country, and finally into a single group that contains all sites.
• Every node in this hierarchy constructs a waits-for graph that
reveals deadlocks involving only sites contained in (the sub tree
rooted at) this node.
• Thus, all sites periodically (e.g., every 10 seconds) send their
local waits-for graph to the site constructing the waits-for graph
for their country.
• The sites constructing waits-for graphs at the country level periodically (e.g., every 10 minutes) send the country waits-for graph to the site constructing the global waits-for graph.
3. Simple Algorithm:
• If a transaction waits longer than some chosen time-out interval, it is aborted.
• Although this algorithm causes many unnecessary restarts, the overhead of deadlock detection is low.
1.6.4 Distributed Recovery:
Recovery in a distributed DBMS is more complicated than in a
centralized DBMS for the following reasons:
 New kinds of failure can arise: failure of communication links, and failure of a remote site at which a sub-transaction is executing.
 Either all sub transactions of a given transaction must commit
or none must commit and this property must be guaranteed
despite any combination of site and link failures. This
guarantee is achieved using a commit protocol.
Normal execution and Commit Protocols:
 During normal execution, each site maintains a log, and the actions of a sub-transaction are logged at the site where it executes.
 In addition to this regular logging activity, a commit protocol is followed to ensure that all sub-transactions of a given transaction either commit or abort uniformly.
 The transaction manager at the site where the transaction originated is called the Coordinator for the transaction, and the transaction managers at the sites where its sub-transactions execute are called Subordinates.
Two Phase Commit Protocol:
 When the user decides to commit a transaction, the commit command is sent to the coordinator for the transaction.
This initiates the 2PC protocol:
 The coordinator sends a Prepare message to each subordinate.
 When a subordinate receives a Prepare message, it decides whether to abort or commit its sub-transaction. It force-writes an abort or prepare log record and then sends a No or Yes message to the coordinator.
 Here we can have two conditions:
o If the coordinator receives a Yes message from all subordinates, it force-writes a commit log record and then sends a commit message to all the subordinates.
o If it receives even one No message, or no response from some subordinate within a specified time-out period, it force-writes an abort log record and then sends an abort message to all subordinates.
 Here again we can have two conditions:
o When a subordinate receives an abort message, it force-writes an abort log record, sends an ack message to the coordinator, and aborts the sub-transaction.
o When a subordinate receives a commit message, it force-writes a commit log record, sends an ack message to the coordinator, and commits the sub-transaction.
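A single-process sketch of the 2PC decision logic described above; the subordinate "sites" are stand-ins represented as vote functions, and the force-written log is modeled as a plain list:

```python
def two_phase_commit(subordinates):
    """Minimal sketch of the coordinator's side of 2PC.
    Each subordinate is a function returning its vote: "yes" or "no"."""
    log = []  # stand-in for the coordinator's force-written log records

    # Phase 1: send Prepare to each subordinate and collect votes
    # (a missing or timed-out vote would count as "no").
    votes = [sub() for sub in subordinates]

    # Phase 2: commit only if every subordinate voted yes.
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    log.append(decision)  # force-write the decision record

    # Each subordinate force-writes the decision and sends an ack.
    acks = [decision for _ in subordinates]
    return decision, log, acks
```

With all-yes votes the decision is commit; a single No vote (or a timeout treated as No) flips the decision to abort for every subordinate.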
2
DATA WAREHOUSING
Unit Structure:
2.1 Introduction
2.2 Data Marts
2.3 Extraction
2.4 Transformation
2.5 Loading
2.6 Metadata
2.7 Data Warehousing and ERP
2.8 Data Warehousing and CRM
2.9 Data Warehousing and KM
2.1 INTRODUCTION:
A data warehouse is a repo
