DDB Unit 1-5
S.MOUNASRI
Asst. Prof.
CSE
Unit-1
Introduction
distributed data processing
distributed database system
promises of DDBSs
problem areas
Distributed DBMS Architecture
DDBMS architecture
Distributed database design
alternative design strategies
fragmentation
allocation
INTRODUCTION
1. Horizontal Fragmentation
2. Vertical Fragmentation
3. Hybrid Fragmentation
Hybrid Fragmentation
In most cases a simple horizontal or vertical fragmentation of a
database schema will not be sufficient to satisfy the
requirements of user applications.
In this case a vertical fragmentation may be followed by
a horizontal one, or vice versa, producing a tree-structured
partitioning.
Since two types of partitioning strategies are applied one after
the other, this alternative is called hybrid fragmentation (also
known as mixed or nested fragmentation).
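The idea of applying one fragmentation after the other can be sketched in a few lines. This is only an illustration: the EMP relation, its attribute names, and the helper functions are hypothetical and do not come from the notes. A vertical fragmentation is applied first, then one of the resulting fragments is split horizontally, and the original relation is rebuilt by a union followed by a join on the key.

```python
# Hypothetical EMP relation; attribute names and values are illustrative only.
EMP = [
    {"eno": 1, "name": "Asha", "dept": "CSE", "salary": 50000},
    {"eno": 2, "name": "Ravi", "dept": "ECE", "salary": 42000},
    {"eno": 3, "name": "Meena", "dept": "CSE", "salary": 61000},
]


def vertical_fragment(relation, key, attributes):
    """Project every tuple onto the key plus the chosen attributes."""
    return [{a: row[a] for a in (key, *attributes)} for row in relation]


def horizontal_fragment(relation, predicate):
    """Keep only the tuples that satisfy the selection predicate."""
    return [row for row in relation if predicate(row)]


# Step 1: vertical fragmentation (each fragment repeats the key "eno").
emp_public = vertical_fragment(EMP, "eno", ["name", "dept"])
emp_pay = vertical_fragment(EMP, "eno", ["salary"])

# Step 2: horizontal fragmentation of one vertical fragment.
emp_public_cse = horizontal_fragment(emp_public, lambda r: r["dept"] == "CSE")
emp_public_other = horizontal_fragment(emp_public, lambda r: r["dept"] != "CSE")


def reconstruct(horizontal_parts, vertical_part, key):
    """Union the horizontal fragments, then join with the vertical fragment on the key."""
    unioned = [row for part in horizontal_parts for row in part]
    by_key = {row[key]: row for row in vertical_part}
    joined = ({**row, **by_key[row[key]]} for row in unioned)
    return sorted(joined, key=lambda r: r[key])


rebuilt = reconstruct([emp_public_cse, emp_public_other], emp_pay, "eno")
```

The fragments form a small tree: EMP splits into emp_public and emp_pay, and emp_public splits again into two horizontal fragments.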
Allocation
Allocating resources across the nodes of a computer network, such as
deciding where to place individual files, is a difficult task.
Allocation Problem
Query Processing & Decomposition
• If the transaction manager generates a non-serializable schedule, we say that it has failed.
(ii) Reliability and Availability
• Reliability refers to the probability that the system under consideration does not experience
any failures in a given time interval.
• Formally, the reliability of a system, R(t), is defined as the following conditional
probability:
R(t) = Pr{0 failures in time [0,t] | no failures at t = 0}
• If we assume that failures follow a Poisson distribution (which is usually the case
for hardware), this formula reduces to
R(t) = Pr{0 failures in time [0,t]} = e^{-m(t)}
where m(t) is the integral of the hazard function z(x) over [0,t], and z(x) is the
time-dependent failure rate of the component.
(iii) Mean Time between Failures (MTBF) /Mean Time to Repair (MTTR)
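• MTBF is the expected time between subsequent failures in a system with repair.
• MTBF can be calculated either from empirical data or from the reliability function as:
MTBF = integral of R(t) dt over [0, infinity)
• MTTR is the expected time to repair a failed system.
• MTTD is the mean time to detect a failure.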
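As a small worked example, assume a constant failure rate λ (a simplifying assumption that the notes do not make). The reliability function then becomes R(t) = e^(-λt), and integrating R(t) from 0 to infinity gives MTBF = 1/λ. The sketch below checks this numerically; the failure rate value and the function names are hypothetical.

```python
import math


def reliability(t, failure_rate):
    """R(t) = e^(-lambda * t), valid only under the constant-failure-rate assumption."""
    return math.exp(-failure_rate * t)


def mtbf_from_reliability(failure_rate, horizon, steps=100_000):
    """Approximate the integral of R(t) over [0, horizon] with the trapezoidal rule."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(horizon, failure_rate))
    for i in range(1, steps):
        total += reliability(i * dt, failure_rate)
    return total * dt


FAILURE_RATE = 0.001  # hypothetical: on average one failure per 1000 hours
mtbf = mtbf_from_reliability(FAILURE_RATE, horizon=50_000.0)
```

The horizon stands in for infinity: it only needs to be large enough that R(t) is negligible beyond it.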
Failures in distributed DBMS
• Designing a reliable system that can recover from failures requires identifying the types
of failures with which the system has to deal.
• In a distributed database system, we need to deal with four types of failures:
i. transaction failures (aborts)
ii. site (system) failures
iii. media (disk) failures
iv. communication line failures
(i) Transaction failures:
• Transactions can fail for a number of reasons.
• Failure can be due to an error in the transaction caused by incorrect input data, or due to
deadlocks.
• Some concurrency control algorithms do not permit a transaction to proceed, or even to wait, if
the data that it attempts to access is currently being accessed by another transaction.
• This might also be considered a failure.
• The usual approach in cases of transaction failure is to abort the transaction, thus resetting the
database to its state prior to the start of this transaction.
(ii) Site (system) failures:
• The reasons for system failure can be traced back to a hardware or a software failure.
• A system failure is always assumed to result in the loss of main memory contents.
• Therefore, any part of the database that was in main memory buffers is lost as a result of a
system failure.
• In a distributed system, system failures are referred to as site failures, since they make the failed site unreachable.
• We differentiate between partial and total failures in a distributed system.
• Total failure refers to the simultaneous failure of all sites in the distributed system.
• Partial failure indicates the failure of only some sites while the others remain operational.
(iii) Media failures:
• Media failure refers to the failure of the secondary storage devices that store the database.
• These failures may be due to operating system errors as well as hardware faults.
• Such a failure means that all or part of the database on secondary storage is considered to be
destroyed and inaccessible.
(iv) Communication failures:
• The three types of failures described above are common to both centralized and distributed
DBMSs.
• Communication failures are unique to the distributed case, and there are a number of types of
communication failures.
• The most common ones are errors in the messages, improperly ordered messages, lost (or
undeliverable) messages, and communication line failures.
Local & distributed reliability protocols
Local reliability protocols
• This section discusses the functions performed by the local recovery manager (LRM) that exists at each site.
• These functions maintain the atomicity and durability properties of local transactions.
• They relate to the execution of the commands that are passed to the LRM, which are
begin_transaction, read, write, commit, and abort.
• When the LRM wants to read a page of data on behalf of a transaction it issues a fetch command,
indicating the page that it wants to read.
• The buffer manager checks to see if that page is already in the buffer and if so, makes it available for that
transaction; if not, it reads the page from the stable database into an empty database buffer
• In addition to the five commands above, there is a sixth interface command to the LRM: recover.
• The recover command is the interface that the operating system has to the LRM.
• It is used during recovery from system failures, when the operating system asks the DBMS to recover
the database to the state that existed when the failure occurred.
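The fetch behaviour described above (serve the page from the buffer if it is there, otherwise read it from the stable database into an empty buffer frame) can be sketched as follows. This is a toy model under stated assumptions: the class name, the dict that stands in for the stable database, and the `disk_reads` counter are all hypothetical, and page replacement when the buffer is full is deliberately left out.

```python
class BufferManager:
    """Toy buffer manager: a dict stands in for the stable database on disk."""

    def __init__(self, stable_database, capacity):
        self.stable_database = stable_database  # page_id -> page contents
        self.capacity = capacity                # number of buffer frames
        self.buffer = {}                        # page_id -> contents held in memory
        self.disk_reads = 0                     # counts reads from the stable database

    def fetch(self, page_id):
        # Case 1: the page is already buffered, so make it available directly.
        if page_id in self.buffer:
            return self.buffer[page_id]
        # Case 2: read the page from the stable database into an empty frame.
        if len(self.buffer) >= self.capacity:
            raise RuntimeError("no empty buffer frame; replacement is not modelled here")
        self.disk_reads += 1
        self.buffer[page_id] = self.stable_database[page_id]
        return self.buffer[page_id]


stable_db = {"P1": "page one data", "P2": "page two data"}
bm = BufferManager(stable_db, capacity=2)
first = bm.fetch("P1")   # not buffered yet: read from the stable database
second = bm.fetch("P1")  # already buffered: no further disk read
```

Because the buffer lives in main memory, everything in `bm.buffer` is exactly what a system failure would lose.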
Distributed reliability protocols
• The distributed versions also aim to maintain the atomicity and durability of distributed transactions that
execute over a number of databases.
• The protocols address the distributed execution of the begin_transaction, read, write, abort, commit,
and recover commands.
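• The begin_transaction, read, and write commands are executed in essentially the same manner as in a centralized system; abort, commit, and recover need special care in the distributed case.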
• We assume that at the originating site of a transaction there is a coordinator process and at each site
where the transaction executes there are participant processes.
• Thus, the distributed reliability protocols are implemented between the coordinator and the participants.
• If one of the sites involved in the execution of a distributed transaction fails,
we would like the other sites to be able to terminate the transaction.
• Recovery protocols deal with the procedure that the process (coordinator or participant) at the failed site
has to go through to recover its state once the site is restarted.
site failures and network partitioning
Parallel database systems
parallel database system architectures
• Many data-intensive applications, such as e-commerce, data warehousing, and data mining,
require support for very large databases.
• Very large databases are accessed through high numbers of concurrent transactions (e.g.,
performing on-line orders on an electronic store) or complex queries (e.g., decision-support
queries).
• The first kind of access is representative of On-Line Transaction Processing (OLTP)
applications, while the second is representative of On-Line Analytical Processing (OLAP)
applications.
• Supporting very large databases efficiently for either OLTP or OLAP can be addressed by
combining parallel computing and distributed database management.
(i) Parallel Database System Architectures
Objectives
• Parallel database systems combine database management and parallel processing to increase performance
and availability.
• A parallel database system can be loosely defined as a DBMS implemented on a parallel computer.
• The objectives of parallel database systems are covered by those of distributed DBMSs (performance,
availability, extensibility).
• Ideally, a parallel database system should provide the following advantages:
High performance: obtained through parallel data management, query optimization, and load balancing.
(Load balancing is the ability of the system to divide a given workload equally among all processors.)
High availability: because a parallel database system consists of many redundant components, it can
increase data availability and fault tolerance.
Extensibility: in a parallel system, accommodating increasing database sizes or increasing performance
demands should be easier.
Functional Architecture
1. Session Manager: provides support for client interactions with the server and performs
the connections and disconnections between the client processes and the two other
subsystems. It therefore initiates and closes user sessions.
2. Transaction Manager: receives client transactions related to query compilation and
execution. Depending on the transaction, it activates the various compilation phases,
triggers query execution, and returns the results as well as error codes to the client
application.
3. Data Manager: provides all the low-level functions needed to run compiled queries in
parallel, i.e., database operator execution, parallel transaction support, cache management,
etc.
Parallel DBMS Architectures
• There are three basic parallel computer architectures, depending on how main memory or disk is
shared: shared-memory, shared-disk, and shared-nothing.
1. Shared-memory: any processor has access to any memory module or disk unit through a fast
interconnect. All the processors are under the control of a single operating system.
2. Shared-disk: any processor has access to any disk unit through the interconnect but
exclusive (non-shared) access to its own main memory. Each processor-memory node is under the control
of its own copy of the operating system.
3. Shared-nothing: each processor has exclusive access to its main memory and disk
unit(s). As in shared-disk, each processor-memory-disk node is under the control of its own copy
of the operating system, so each node can be viewed as a local site.
Parallel data placement
• Data placement in a parallel database system exhibits similarities with data fragmentation in
distributed databases.
• In the parallel setting, the terms partitioning and partition are used instead of horizontal
fragmentation and horizontal fragment.
• There are three basic strategies for data partitioning: round-robin, hash, and range partitioning.
1. Round-robin partitioning is the simplest strategy and ensures uniform data distribution. It
enables sequential access to a relation to be done in parallel.
2. Hash partitioning applies a hash function to some attribute, which yields the partition number. This
strategy allows exact-match queries on the selection attribute to be processed by exactly one node
and all other queries to be processed by all the nodes in parallel.
3. Range partitioning distributes tuples based on the value intervals (ranges) of some attribute.
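The three strategies above can be compared side by side with a short sketch. The key values, the number of partitions, the range boundaries, and the modulo hash are all illustrative assumptions. A real system would use a stronger hash function and choose its ranges from the data distribution.

```python
import bisect


def round_robin_partition(index, num_partitions):
    """The i-th tuple in insertion order goes to partition i mod n."""
    return index % num_partitions


def hash_partition(value, num_partitions):
    """Illustrative hash for integer keys only."""
    return value % num_partitions


def range_partition(value, upper_bounds):
    """upper_bounds is sorted; partition k holds values <= upper_bounds[k],
    and the last partition holds everything above the final bound."""
    return bisect.bisect_left(upper_bounds, value)


keys = [5, 12, 27, 33, 48, 51, 64, 79]  # hypothetical partitioning-attribute values
N = 4
bounds = [20, 40, 60]                   # ranges: <=20, 21..40, 41..60, >60

placement = {name: [[] for _ in range(N)] for name in ("round_robin", "hash", "range")}
for i, key in enumerate(keys):
    placement["round_robin"][round_robin_partition(i, N)].append(key)
    placement["hash"][hash_partition(key, N)].append(key)
    placement["range"][range_partition(key, bounds)].append(key)
```

An exact-match query on key 33 only needs the node given by `hash_partition(33, N)`. Under round-robin the same query would have to be sent to every node.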
parallel query processing
• The objective of parallel query processing is to transform queries into execution plans that can be
efficiently executed in parallel.
• It focuses on both intra-operator parallelism (a single operator is distributed among
multiple processors) and inter-operator parallelism (different operators of the same query
run on different processors).
• A parallel query optimizer can be seen as three components: a search space, a cost model, and a
search strategy.
load balancing
• Good load balancing is crucial for the performance of a parallel system.
• The response time of a set of parallel operators is that of the longest one.
• Thus, minimizing the time of the longest one is important for minimizing response time.
• Balancing the load of different transactions and queries among different nodes is also essential to
maximize throughput.
• Solutions to these problems can be obtained at the intra- and inter-operator levels.
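The point that response time is set by the longest-running node can be made concrete with a small sketch. The task costs are hypothetical, and the longest-task-first greedy assignment is just one simple balancing heuristic chosen for illustration. It is not a technique described in the notes.

```python
def response_time(node_loads):
    """Nodes run in parallel, so the slowest (most loaded) node sets the response time."""
    return max(sum(tasks) for tasks in node_loads)


def greedy_balance(task_costs, num_nodes):
    """Longest-task-first heuristic: give each task to the currently least loaded node."""
    nodes = [[] for _ in range(num_nodes)]
    for cost in sorted(task_costs, reverse=True):
        min(nodes, key=sum).append(cost)
    return nodes


task_costs = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical cost units per task

# A skewed placement: one node receives most of the work.
skewed = [task_costs[:4], task_costs[4:6], task_costs[6:7], task_costs[7:]]
balanced = greedy_balance(task_costs, 4)

skewed_time = response_time(skewed)
balanced_time = response_time(balanced)
```

Both placements do the same total work. Only the distribution differs, and that alone changes the response time.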
database clusters
• A cluster can have a shared-disk or shared-nothing architecture.
• Shared-disk requires a special interconnect that provides a shared disk space to all nodes, with
provision for cache consistency.
• Shared-nothing can better support database autonomy without the additional cost of a special
interconnect and can scale up to very large configurations.
• Client applications interact with the middleware in a classical way to submit database transactions.
• The general processing of a transaction to a single database is as follows. First, the transaction is
authenticated and authorized using the directory. If successful, the transaction is routed to a
DBMS at some, possibly different, node to be executed.
• As in a parallel DBMS, the database cluster middleware has several software layers:
transaction load balancer, replication manager, query processor, and fault tolerance manager.
UNIT-5
DISTRIBUTED DATABASES
DISTRIBUTED OBJECT DATABASE MANAGEMENT SYSTEMS
Object identity
• Object identity is typically implemented via a unique, system-generated OID. The value
of the OID is not visible to the external user, but is used internally by the system to
identify each object uniquely and to create and manage inter-object references
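A minimal sketch of this idea, assuming a toy in-memory store (the `ObjectStore` class and its methods are hypothetical): the store generates each OID itself, two objects with identical state still get different OIDs, and one object refers to another by holding its OID. The OIDs appear as plain integers here only to keep the sketch short. A real system would keep them hidden from the external user.

```python
import itertools


class ObjectStore:
    """Toy object store that hands out system-generated OIDs."""

    def __init__(self):
        self._next_oid = itertools.count(1)  # OIDs are generated by the store, never by the user
        self._objects = {}                   # OID -> object state

    def create(self, state):
        oid = next(self._next_oid)
        self._objects[oid] = state
        return oid

    def deref(self, oid):
        """Follow a reference: return the state of the object with this OID."""
        return self._objects[oid]


store = ObjectStore()
dept = store.create({"name": "CSE"})
emp_a = store.create({"name": "Asha", "dept": dept})  # inter-object reference via OID
emp_b = store.create({"name": "Asha", "dept": dept})  # same state, different identity
```

Identity (the OID) and equality of state are independent: `emp_a` and `emp_b` hold equal values but remain two distinct objects.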
persistence of objects
• Persistence denotes a process or an object that continues to exist even after its parent
process, or the system that runs it, is turned off. For example, a web browser that restarts
after a crash attempts to reopen any tabs that were open when it crashed. A persistent
process thus exists even if it failed or was killed for some technical reason.
• There are two types of persistence: object persistence and process persistence.
• Object persistence refers to an object that is not deleted until a need emerges to remove
it from memory.
• Process persistence means that processes are not killed or shut down by other processes and exist
until the user kills them.
persistent programming languages
• A persistent programming language is a programming language extended with constructs
to handle persistent data.
• Using Embedded SQL, a programmer is responsible for writing explicit code to fetch
data into memory or store data back to the database.
• In a persistent programming language, a programmer can manipulate persistent data without
having to write such code explicitly.
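The contrast can be illustrated loosely in Python. Neither half is the real thing: the `sqlite3` module is a call-level API rather than Embedded SQL, and the `shelve` module is a persistence library rather than a persistent programming language. The sketch only shows the difference in programming style, and the table, key names, and values are hypothetical.

```python
import os
import shelve
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()

# Style 1: explicit code fetches data into memory and stores it back.
conn = sqlite3.connect(os.path.join(workdir, "emp.db"))
conn.execute("CREATE TABLE emp (eno INTEGER PRIMARY KEY, salary INTEGER)")
conn.execute("INSERT INTO emp VALUES (?, ?)", (1, 50000))
(salary,) = conn.execute("SELECT salary FROM emp WHERE eno = ?", (1,)).fetchone()  # explicit fetch
conn.execute("UPDATE emp SET salary = ? WHERE eno = ?", (salary + 1000, 1))        # explicit store
conn.commit()
sql_salary = conn.execute("SELECT salary FROM emp WHERE eno = ?", (1,)).fetchone()[0]
conn.close()

# Style 2: persistent data is manipulated like ordinary in-memory data.
shelf_path = os.path.join(workdir, "emp_shelf")
with shelve.open(shelf_path, writeback=True) as db:
    db["emp1"] = {"eno": 1, "salary": 50000}
    db["emp1"]["salary"] += 1000  # ordinary update; written back when the shelf closes

with shelve.open(shelf_path) as db:
    shelf_salary = db["emp1"]["salary"]  # the change survived closing and reopening
```

In the second style the program never issues an explicit fetch or store for the salary update. That is the convenience a persistent programming language aims to provide.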
COMPARISON OF OODBMS AND ORDBMS
OODBMS
• In an object-oriented database, data is stored in the form of objects.
• Relationships are represented by references via the object identifier (OID).
• Handles larger and more complex data than an RDBMS.
• The data management language is typically incorporated into a programming language such as C++.
• Stored data entries are described as objects.
ORDBMS
• In a relational database, data is stored in the form of tables, which contain rows and columns.
• Connections between two relations are represented by foreign key attributes.
• Handles comparatively simpler data.
• In relational database systems there are data manipulation languages such as SQL.
• Stored data entries are described as tables.