Lecture 1: Advanced Database Systems Concepts

The document provides an overview of distributed database concepts. It discusses that a distributed database system consists of multiple connected database sites that work together so data can be accessed from any site as if it were local. It then outlines twelve fundamental principles of distributed databases including local autonomy, continuous operation, location independence, and hardware/operating system independence. Finally, it discusses challenges like distributed query processing, transaction management, and concurrency control in distributed systems.

Uploaded by

Tabindah asif

Advanced Database Concepts

Instructor:
Dr. Muhammad Ali Memon
Associate Professor, IICT
Distributed DBs – Introduction

• A distributed database system consists of a collection of sites, connected together via some kind of communication network, in which
– Each site is a full database system site in its own right, but
– The sites have agreed to work together so that a user at any site can access data anywhere in the network exactly as if the data were all stored at the user's own site.
Distributed DBs – Intro (Cont.)

[Diagram: four sites – New York, London, Los Angeles, and San Francisco – connected by a communication network. Each site is a database system site in its own right.]

A Fundamental Principle
– A distributed database system should look exactly like a non-distributed system.
• All of the problems of distributed systems are – or should be – internal or implementation-level problems, not external or user-level problems.
– The above principle gives rise to twelve rules:
• Local autonomy
• No reliance on a central site
• Continuous operation
• Location independence
• Fragmentation independence
• Replication independence
• Distributed query processing
• Distributed transaction management
• Hardware independence
• Operating system independence
• Network independence
• DBMS independence
Twelve Objectives – Local Autonomy
• Local Autonomy
– Local autonomy means that all operations at a given site are controlled by that site; no site X should depend on some other site Y for its successful operation.

– Local autonomy also implies that local data is locally owned and managed, with local accountability.

– Security, integrity, and storage are under the control of the local site.
No Reliance on a Central Site
• All sites must be treated as equals.

• There is no particular reliance on a central "master" site.

• Two problems with a central site:
• The central site might be a bottleneck;
• The system would be vulnerable (susceptible) if the central site goes down.
Continuous Operation
• The system should provide reliability and greater availability.
• Reliability:
– The probability that the system is up and running at any given moment.
– The system should keep running even if it faces a crash or failure.
• Availability:
– The probability that the system is up and running continuously throughout a specified period.
Location Independence
• Location independence is also known as location transparency.
– Location independence is simple: users should not have to know where data is physically stored, but rather should be able to behave – at least from a logical standpoint – as if the data were all stored at their own local site.
Fragmentation Independence
• Data can be stored at the location where it is most frequently used, so that most operations are local and network traffic is reduced.

• A system supports data fragmentation if a given relvar can be divided up into pieces or fragments for physical storage purposes.
An example of fragmentation

Full EMP relvar:

EMP#  DEP#  SALARY
E1    D1    40K
E2    D1    42K
E3    D2    30K
E4    D2    35K
E5    D3    48K

New York fragment:          London fragment:

EMP#  DEP#  SALARY          EMP#  DEP#  SALARY
E1    D1    40K             E3    D2    30K
E2    D1    42K             E4    D2    35K
                            E5    D3    48K
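The split above can be sketched in a few lines of Python; the rule assigning departments to sites is an assumption made for illustration, not part of the original example.

```python
# Horizontal fragmentation sketch: split the EMP relvar into per-site
# fragments, mirroring the example above.
EMP = [
    ("E1", "D1", "40K"),
    ("E2", "D1", "42K"),
    ("E3", "D2", "30K"),
    ("E4", "D2", "35K"),
    ("E5", "D3", "48K"),
]

# Assumed placement rule: D1 tuples live in New York, D2/D3 in London.
SITE_OF_DEPT = {"D1": "New York", "D2": "London", "D3": "London"}

def fragment(relation, site_of_dept):
    """Partition tuples into disjoint fragments keyed by site."""
    fragments = {}
    for emp, dept, salary in relation:
        fragments.setdefault(site_of_dept[dept], []).append((emp, dept, salary))
    return fragments

fragments = fragment(EMP, SITE_OF_DEPT)
# The union of the two fragments reconstructs the original relvar,
# which is what fragmentation independence requires the system to hide.
```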
Replication Independence
• A system supports data replication if a given stored relvar – or, more generally, a given fragment of a given stored relvar – can be represented by many distinct copies or replicas, stored at many distinct sites.
• This provides better availability.
• The disadvantage is that when a given replicated object is updated, all copies of that object must be updated: the update propagation problem.
Distributed Query Processing
• If the user is at the New York site and the data is at the London site, the request must be sent from New York to London, and the response sent back from London to New York.
• Optimization is more important in a distributed system than it is in a centralized one.
Distributed Transaction Management
• A transaction consists of several agents, where an agent is the process performed on behalf of a given transaction at a given site.
• A transaction may be updating many sites at a time; this is done by means of agents.
• Recovery control is also needed, so that if a crash or error occurs at any site, the transaction can be rolled back or committed.
• This effect can be achieved by means of the two-phase commit protocol.
• Concurrency control in most distributed systems is typically based on locking, just as it is in non-distributed systems.
Hardware Independence
• The system should work on any hardware machine, e.g., Dell, IBM, or HP machines.

• A distributed database system should work independently of the hardware type.
Operating System Independence
• It is desirable not only to be able to run the same DBMS on different hardware platforms, but also to be able to run it on different operating system platforms.
• E.g., Windows, Unix, NT, etc.
Network Independence
• The DBMS should work on any type of network, e.g., wireless, wired, etc.
DBMS Independence
• The system must have the ability to communicate among different types of database systems, e.g., Oracle, DB2, MySQL, etc.
• The system can support homogeneous databases, or it can also support heterogeneous database systems.
Problems of Distributed Systems
• Since networks (especially WANs) are slow and network bandwidth may be limited, network utilization has to be minimized. This affects:
– Query processing
– Catalog management
– Update propagation
– Recovery control
– Concurrency control
Query Processing
• The objective of minimizing network utilization implies that the query optimization process itself needs to be distributed, as well as the query execution process.
• For example:
– Suppose a database (suppliers and parts):
• S { Sno, City } – 10,000 stored tuples at site A
• P { Pno, Color } – 100,000 stored tuples at site B
• SP { Sno, Pno } – 1,000,000 stored tuples at site A
Assume that every stored tuple is 25 bytes (200 bits) long.

Now say the query is:

( S.Sno = SP.Sno AND P.Pno = SP.Pno ) AND ( City = 'London' AND Color = 'Red' )
Query Processing (Summarizing the results)

Strategy  Technique                                              Communication time
1         Move P to A                                            6.67 mins
2         Move S and SP to B                                     1.12 hrs
3         For each London shipment, check if the part is red     5.56 hrs
4         For each red part, check if a London supplier exists   2.00 secs
5         Move London shipments to B                             6.67 mins
6         Move red parts to A                                    0.10 secs (best)
Query Processing
• Estimated cardinalities of certain intermediate results:
– Number of red parts = 10
– Number of shipments by London suppliers = 100,000

– Communication assumptions:
• Data rate = 50,000 bits per second
• Access delay = 0.1 second

• Formula for total communication time:
– (Total access delay) + (Total data volume / data rate)
– In seconds: (number of messages / 10) + (number of bits / 50,000)
Query Processing
• 1. Move parts to site A and process the query at A.
– T[1] = 0.1 + (100,000 * 200) / 50,000
       = 400 seconds approx. (6.67 minutes)
• 2. Move suppliers and shipments to site B and process the query at B.
– T[2] = 0.2 + ((10,000 + 1,000,000) * 200) / 50,000
       = 4040 seconds approx. (1.12 hours)
Query Processing
• 3. Join suppliers and shipments at site A, restrict the result to London suppliers, and then, for each of those suppliers in turn, check site B to see whether the corresponding part is red.
• Each of these checks will involve two messages, a query and a response. The transmission time for these messages will be small compared with the access delay.
• T[3] = 20,000 seconds approx. (5.56 hours)
Query Processing
• 4. Restrict parts at site B to those that are red, and then, for each of those parts in turn, check site A to see whether there exists a shipment relating the part to a London supplier.
• Each of these checks will involve two messages; again, the transmission time for these messages will be small compared with the access delay.
• T[4] = 2 seconds approx.
Query Processing
• 5. Join suppliers and shipments at site A, restrict the result to London suppliers, project the result over Sno and Pno, and move it to site B. Complete the processing at site B.
– T[5] = 0.1 + (100,000 * 200) / 50,000
       = 400 seconds approx. (6.67 minutes)
Query Processing
• 6. Restrict parts at site B to those that are red and move the result to site A. Complete the processing at site A.
• T[6] = 0.1 + (10 * 200) / 50,000
       = 0.14 seconds (0.1 seconds approx.)
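The six estimates above follow mechanically from the cost model on the earlier slide; this short script (a sketch using the stated cardinalities and assumptions, with the message counts taken from each strategy's description) reproduces them.

```python
DATA_RATE = 50_000   # bits per second
ACCESS_DELAY = 0.1   # seconds per message
TUPLE_BITS = 200     # every stored tuple is 25 bytes = 200 bits

def comm_time(messages, tuples=0):
    """Total access delay plus transmission time for the tuples moved."""
    return messages * ACCESS_DELAY + (tuples * TUPLE_BITS) / DATA_RATE

T = {
    1: comm_time(1, 100_000),             # move P (100,000 tuples) to A
    2: comm_time(2, 10_000 + 1_000_000),  # move S and SP to B
    3: comm_time(2 * 100_000),            # 2 messages per London shipment
    4: comm_time(2 * 10),                 # 2 messages per red part
    5: comm_time(1, 100_000),             # move London shipments to B
    6: comm_time(1, 10),                  # move the 10 red parts to A
}
# T[1] ≈ 400 s, T[2] ≈ 4040 s, T[3] = 20,000 s, T[4] = 2 s,
# T[5] ≈ 400 s, T[6] ≈ 0.14 s – strategy 6 is the cheapest.
```

Note how the winning strategies (4 and 6) move the small intermediate result (10 red parts) rather than the large base relations, which is exactly why distributed optimization matters.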
Catalog Management
• The system catalog will include not only the usual catalog data regarding base relvars, views, authorizations, etc., but also all the necessary control information to enable the system to provide the desired location, fragmentation, and replication independence.
• The question arises: where and how should the catalog itself be stored?
Some Possibilities
• 1. Centralized: The total catalog is stored exactly once, at a single central site.

• 2. Fully replicated: The total catalog is stored in its entirety at every site.

• 3. Partitioned: Each site maintains its own catalog for objects stored at that site. The total catalog is the union of all of those disjoint local catalogs.
Some Possibilities (Cont.)
• 4. Combination of 1 and 3: Each site maintains its own local catalog (as in the partitioned approach); in addition, a single central site maintains a unified copy of all of those local catalogs (as in the centralized approach).
Problems with each possibility
• Approach 1 obviously violates the "no reliance on a central site" objective.
• Approach 2 suffers from a severe loss of autonomy, in that every catalog update has to be propagated to every site.
• Approach 3 makes nonlocal operations very expensive (finding a remote object will require access to half the sites, on average).
Problems with each possibility (Cont.)
• Approach 4 is more efficient than Approach 3 (finding a remote object requires only one remote catalog access), but violates the "no reliance on a central site" objective again.
Object naming
• Object naming is a significant issue for distributed systems in general; the possibility that two distinct sites X and Y might both have an object – say a relvar (or view, etc.) – called A implies that some mechanism, typically qualification by site name, will be required in order to "disambiguate" (i.e., to guarantee system-wide name uniqueness).
Object naming (Cont.)
• If qualified names such as X.A and Y.A are exposed to the user, however, the location independence objective will clearly be violated.
• One solution to this problem is the R* approach.
R* Approach
• The name referred to by the user is mapped to a system-wide name, which is a globally unique internal identifier for the object, made up of:
– Creator ID (the ID of the user who created the object);
– Creator site ID (the ID of the site at which the CREATE operation was entered);
– Local name (the unqualified name of the object);
– Birth site ID (the ID of the site at which the object was initially stored).
• E.g., MARILYN @ NEWYORK . STATS @ LONDON
R* SQL
• CREATE SYNONYM MSTATS FOR MARILYN @ NEWYORK . STATS @ LONDON;

• Now the user can say (e.g.):

– SELECT … FROM MSTATS …;
Cases of names
• In the first case, the system is using the local name.
• The system determines the system-wide name by assuming all the obvious defaults – namely, that the object was created by this user, it was created at this site, and it was initially stored at this site.
Cases of names (Cont.)
• In the second case, the system is using a synonym.
• The system determines the system-wide name by interrogating the relevant synonym table.
• Each site maintains a synonym table. In addition, each site's catalog contains:
– An entry for every object born at that site;
– An entry for every object currently stored at that site.
Synonym table (Example)
• Suppose a user now issues a request referring to the synonym MSTATS.
• First, the system looks up the corresponding system-wide name in the appropriate synonym table.
• Now it knows the birth site – namely London in the example – and it can interrogate the London catalog.
Synonym table (Example) (Cont.)
• If the object has migrated to (say) LA, then the catalog entry in London will say as much, and so the system can now interrogate the LA catalog.
• If the object later migrates from LA to New York, the system will:
• 1. Insert a New York catalog entry;
• 2. Delete the LA catalog entry;
• 3. Update the London catalog entry.
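The lookup chain described above can be sketched as follows. The table layouts and the two-level dictionary structure are assumptions made for illustration; only the name format and the "birth site points at the current site" idea come from the slides.

```python
# R*-style name resolution sketch: a per-site synonym table maps a
# local synonym to the system-wide name; the birth site's catalog
# records where the object currently lives (at most one extra hop).

# Per-site synonym tables: site -> {synonym: system-wide name}.
synonyms = {"NEWYORK": {"MSTATS": "MARILYN@NEWYORK.STATS@LONDON"}}

# Per-site catalogs: system-wide name -> site where the object now is.
catalogs = {
    "LONDON": {"MARILYN@NEWYORK.STATS@LONDON": "LA"},  # birth site: "moved to LA"
    "LA":     {"MARILYN@NEWYORK.STATS@LONDON": "LA"},  # stored here
}

def resolve(site, synonym):
    """Return (system-wide name, current site) for a synonym used at `site`."""
    swn = synonyms[site][synonym]
    birth_site = swn.rsplit("@", 1)[1]   # e.g. "LONDON"
    current = catalogs[birth_site][swn]  # the birth site knows where it went
    if current != birth_site:
        current = catalogs[current][swn] # follow the single forwarding hop
    return swn, current
```

With these tables, `resolve("NEWYORK", "MSTATS")` follows the synonym to the London catalog and from there to LA, just as in the migration example above.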
Update Propagation
• A difficulty that arises immediately is that some site holding a copy of the object might be unavailable (because of a site or network failure) at the time of the update.
• One scheme for dealing with this kind of problem (not the only one possible) is the so-called primary copy scheme.
Update Propagation (Cont.)
• One copy of each replicated object is designated as the primary copy. The remainder are all secondary copies.
• Primary copies of different objects are at different sites (so this is a distributed scheme once again).
Update Propagation (Cont.)
• Update operations are considered to be logically complete as soon as the primary copy has been updated. The site holding that copy is then responsible for propagating the update to the secondary copies at some subsequent time. (That "subsequent time" must be prior to COMMIT, however, if the ACID properties of the transaction are to be preserved.)
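A minimal sketch of the primary copy scheme; the class structure and site names are illustrative assumptions, and real systems interleave the propagation step with commit processing rather than doing it inline.

```python
# Primary copy scheme sketch: the update is logically complete once the
# primary copy is written; the primary's site then propagates the new
# value to the secondary copies (before COMMIT, to preserve ACID).
class ReplicatedObject:
    def __init__(self, primary_site, secondary_sites, value=None):
        self.primary_site = primary_site
        self.copies = {primary_site: value}
        for site in secondary_sites:
            self.copies[site] = value

    def update(self, value):
        # Step 1: update the primary copy – logically complete here.
        self.copies[self.primary_site] = value
        # Step 2: the primary's site propagates to the secondaries.
        for site in self.copies:
            self.copies[site] = value

account = ReplicatedObject("London", ["New York", "LA"], value=100)
account.update(250)
# All three copies now hold 250.
```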
Recovery Control
• Recovery control in a distributed system is typically based on the two-phase commit protocol.
• A single transaction can interact with several autonomous resource managers;
• The local DBMSs are operating at distinct sites and hence are very autonomous.
• Points arising – see the next slides.
Recovery Control (Cont.)
• 1. The "no reliance on a central site" objective dictates that the coordinator function must not be assigned to one distinguished site in the network, but instead must be performed by different sites for different transactions.
• Typically it is handled by the site at which the transaction in question is initiated; thus, each site must be capable of acting as the coordinator site for some transactions and as a participant site for others.
Recovery Control (Cont.)
• 2. The two-phase commit process requires the coordinator to communicate with every participant site – which means more messages and more overhead.
• 3. If site Y acts as a participant in a two-phase commit process coordinated by site X, then site Y must do what it is told by site X (commit or rollback) – a loss of local autonomy (or independence).
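The voting logic of two-phase commit can be sketched in a few lines. This is a bare-bones illustration of the decision rule only: a real implementation adds prepare/commit logging, timeouts, and failure recovery, none of which are shown here.

```python
# Two-phase commit sketch: the coordinator (the site where the
# transaction originated) polls every participant; the transaction
# commits only if every participant votes to commit, and every
# participant must then obey the decision (the loss of local
# autonomy noted in point 3 above).
def two_phase_commit(participants):
    """participants: callables returning True (ready to commit) or False."""
    votes = [vote() for vote in participants]           # phase 1: prepare
    decision = "COMMIT" if all(votes) else "ROLLBACK"
    return decision                                     # phase 2: broadcast

ready = lambda: True     # a participant able to commit
failed = lambda: False   # a participant that must abort
```

For example, `two_phase_commit([ready, ready])` yields `"COMMIT"`, while a single `failed` participant forces `"ROLLBACK"` everywhere.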
Concurrency control
• Concurrency control in most distributed systems is based on locking.

• If each site is responsible for locks on objects stored at that site (as it will be under the local autonomy assumption), then a straightforward implementation will require at least 5n messages:
– n lock requests
– n lock grants
– n update messages
– n acknowledgments
– n unlock requests
Concurrency control (Cont.)
• Here n is the number of sites.
• E.g., suppose there are two sites, B and C (5(2) = 10 messages):
– Two lock requests, one to site B and one to site C
– Two lock grant responses, one from site B and one from site C
– Two update messages, one to site B and one to site C
– Two acknowledgment messages, one from site B and one from site C
– Two unlock requests, one to site B and one to site C
Concurrency control (Cont.)
• One solution to this problem is to use the primary copy update strategy.
• Messages are reduced from 5n to 2n + 3 (one lock request, one lock grant, n updates, n acknowledgments, and one unlock request).

• If the primary copy is unavailable, then the update cannot be applied.
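The two message counts above can be checked directly; this is a sketch of the arithmetic only, not a locking protocol implementation.

```python
def messages_naive(n):
    """Straightforward locking across n sites:
    n lock requests + n grants + n updates + n acks + n unlock requests."""
    return 5 * n

def messages_primary_copy(n):
    """Primary copy strategy: lock only the primary copy.
    1 lock request + 1 grant + n updates + n acks + 1 unlock request."""
    return 2 * n + 3

# With the two-site example (B and C): 10 messages vs. 7 messages.
```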
DBMS Independence
• Different sites connected together might run different DBMSs, but in a distributed database all those sites should behave like a single database.
Gateways
• Suppose we have two sites X and Y running DB2 and Oracle respectively.
• User U at site X wishes to see a single distributed database that includes data from the DB2 database at site X and the Oracle database at site Y.
• Since the user is a DB2 user, it is DB2's responsibility, not Oracle's, to provide the necessary support.
– How is this possible?
Gateways (Cont.)
• DB2 must provide a special program – usually called a gateway – whose effect is "to make Oracle look like DB2".

[Diagram: a DB2 user at PC1 sees a single distributed DB2 database; a gateway sits between DB2 (SQL) and Oracle (SQL), connecting the DB2 database and the Oracle database.]
Gateways (Cont.)
1. Implementing protocols for the exchange of information between DB2 and Oracle.
2. Translating DB2 SQL statements into Oracle SQL statements.
3. Mapping DB2 data types to Oracle data types (e.g., data types and their sizes).
4. Mapping data formats, e.g., date formats, string formats, currency, etc.
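Functions 3 and 4 amount to translation tables applied as statements pass through the gateway. The sketch below illustrates the idea; the specific type pairs and the date layouts are assumptions for illustration, not an authoritative DB2-to-Oracle correspondence.

```python
# Gateway translation sketch: rewrite type names and date literals
# when relaying from the DB2 side to the Oracle side.
# The mapping table is illustrative only.
TYPE_MAP = {
    "VARCHAR": "VARCHAR2",        # assumed pairings for the sketch
    "DOUBLE": "BINARY_DOUBLE",
}

def map_type(db2_type):
    """Map a DB2 type name to its (assumed) Oracle equivalent;
    pass unknown types through unchanged."""
    return TYPE_MAP.get(db2_type, db2_type)

def map_date(db2_date):
    """Reformat an ISO-style date (YYYY-MM-DD) as DD-MON-YYYY,
    a common Oracle default date layout."""
    months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
    year, month, day = db2_date.split("-")
    return f"{day}-{months[int(month) - 1]}-{year}"
```

A full gateway applies such mappings inside a SQL parser/rewriter (function 2), but the per-type and per-format tables are the core of functions 3 and 4.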
Data Access Middleware

[Diagram: a client computer (PC1) connects through data access middleware to multiple back-end databases – DB2, Oracle, SQL Server, MySQL, and Sybase.]
