02 Distributed Data Management
Big Data Management
Knowledge objectives
1. Give a definition of Distributed System
2. Enumerate the 6 challenges of a Distributed System
3. Give a definition of Distributed Database
4. Explain the different transparency layers in DDBMS
5. Identify the requirements that distribution imposes on the ANSI/SPARC architecture
6. Draw a classical reference functional architecture for DDBMS
7. Enumerate the 8 main features of Cloud Databases
8. Explain the difficulties Cloud Database providers face in hosting multiple tenants
9. Enumerate the 4 main problems tenants/users need to tackle in Cloud Databases
10. Distinguish the cost of sequential and random access
11. Explain the difference between the cost of sequential and random access
12. Distinguish vertical and horizontal fragmentation
13. Recognize the complexity and benefits of data allocation
14. Explain the benefits of replication
15. Discuss the alternatives of a distributed catalog
Understanding Objectives
• Decide when a fragmentation strategy is correct
Distributed System
Distributed DBMS
Cloud DBMS
Distributed Systems
Distributed system
“One in which components located at networked computers communicate
and coordinate their actions only by passing messages.”
G. Coulouris et al.
• Characteristics:
• Concurrency of components
• Independent failures of components
• Lack of a global clock
Challenges of distributed systems
• Openness
• Scalability
• Quality of service
  • Performance/Efficiency
  • Reliability/Availability
  • Confidentiality
• Concurrency
• Transparency
• Heterogeneity of components
Scalability
Cope with large workloads
• Scale up
• Scale out
• Use:
• Automatic load-balancing
• Avoid:
• Bottlenecks
• Unnecessary communication
• Peer-to-peer
Performance/Efficiency
Efficient processing
• Minimize latencies
• Maximize throughput
• Use
• Parallelism
• Network optimization
• Specific techniques
Reliability/Availability
a) Keep consistency
b) Keep the system running
• Even in the case of failures
• Use
• Replication
• Flexible routing
• Heartbeats
• Automatic recovery
Concurrency
Share resources as much as possible
• Use
• Consensus Protocols
• Avoid
• Interferences
• Deadlocks
Transparency
a) Hide implementation (i.e., physical) details from the users
b) Make all the mechanisms that solve the other challenges transparent to the user
Further objectives
• Use
• Platform-independent software
• Avoid
• Complex configurations
• Specific hardware/software
Distributed database
“A Distributed DataBase (DDB) is an integrated collection of databases that is physically
distributed across sites in a computer network. A Distributed DataBase Management
System (DDBMS) is the software system that manages a distributed database such that
the distribution aspects are transparent to the users.”
Encyclopedia of Database Systems
Transparency layers (I)
• Fragmentation transparency
• The user must not be aware of the existence of different fragments
• Replication transparency
• The user must not be aware of the existing replicas
• Network transparency
• Data access must be independent of where the data is located
• Each data object must have a unique name
• Data independence at the logical and physical level must be guaranteed
• Inherited from centralized DBMSs (ANSI/SPARC)
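The transparency layers above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory catalog (`CATALOG`) that maps a global relation to its fragments and replica sites, and hypothetical site contents (`SITES`); none of these names come from a real DDBMS. The caller names only the global relation, while fragments, replicas, and locations are resolved internally.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical catalog: global relation -> fragment -> sites holding a replica.
CATALOG: Dict[str, Dict[str, List[str]]] = {
    "employee": {
        "employee_eu": ["site1", "site3"],  # replicated fragment
        "employee_us": ["site2"],           # single copy
    }
}

# Hypothetical site contents: site -> fragment -> rows.
SITES: Dict[str, Dict[str, List[Tuple[str, str]]]] = {
    "site1": {"employee_eu": [("e1", "Anna"), ("e2", "Marc")]},
    "site2": {"employee_us": [("e3", "Joan")]},
    "site3": {"employee_eu": [("e1", "Anna"), ("e2", "Marc")]},
}


def scan(relation: str, down: FrozenSet[str] = frozenset()) -> List[Tuple[str, str]]:
    """Read a global relation without exposing fragments, replicas, or sites."""
    rows: List[Tuple[str, str]] = []
    for fragment, replicas in CATALOG[relation].items():
        # Replication transparency: pick any replica that is reachable.
        site = next((s for s in replicas if s not in down), None)
        if site is None:
            raise RuntimeError(f"no reachable replica of {fragment}")
        # Network transparency: the caller never names the site.
        rows.extend(SITES[site][fragment])
    # Fragmentation transparency: the caller sees one relation (the union).
    return sorted(rows)


print(scan("employee"))
print(scan("employee", down=frozenset({"site1"})))  # same answer, served by site3
```

The user-facing call is identical whether or not `site1` is available, which is the point of hiding replication and location behind the catalog.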
Transparency layers (II)
Classification According to Degree of Autonomy
Extended ANSI-SPARC Architecture of Schemas
[Figure: functional components (Query Manager, Execution Manager, Scheduler)]
Distributed DBMS Functional Architecture
• One coordinator
  • Global Query Manager: View Manager, Security Manager, Constraint Checker, Query Optimizer
  • Global Execution Manager
  • Global Scheduler
  • GLOBAL CATALOG: External Schema, Global Conceptual Schema, Fragment Schema, Allocation Schema
• Many workers
  • Local Query Manager
  • Recovery Manager
  • Log
  • Data Manager
  • LOCAL CATALOG: Local Conceptual Schema, Internal Schema
  • Operating system
Cloud Databases
Parallel database architectures
Key Features of Cloud Databases
• Scalability
a) Ability to horizontally scale (scale out)
• Quality of service
• Performance/Efficiency
b) Fragmentation: Replication & Distribution
c) Indexing: Distributed indexes and RAM
• Reliability/Availability
• Concurrency
d) Weaker concurrency model than ACID
• Transparency
e) Simple call level interface or protocol
• No declarative query language
• Further objectives
f) Flexible schema
• Ability to dynamically add new attributes
g) Quick/Cheap set up
h) Multi-tenancy
Multi-tenancy platform problems (provider side)
• Difficulty: Unpredictable load characteristics
• Variable popularity
• Flash crowds
• Variable resource requirements
• Requirement: Support thousands of tenants
a) Maintain metadata about tenants (e.g., activated features)
b) Self-managing
c) Tolerating failures
d) Scale-out is necessary (sooner or later)
• Rolling upgrades one server at a time
e) Elastic load balancing
• Dynamic partitioning of databases
Data management problems (tenant side)
I. (Distributed) data design
• Data fragmentation
• Data allocation
• Data replication
II. (Distributed) catalog management
• Metadata fragmentation
• Metadata allocation
• Metadata replication
III. (Distributed) transaction management
• Enforcement of ACID properties
• Distributed recovery system
• Distributed concurrency control system
• Replica consistency
• Latency & Availability vs. Update performance
IV. (Distributed) query processing
• Optimization considering
1) Distribution/Parallelism
• Communication overhead
2) Replication
(Distributed) Data Design
Challenge I
DDB Design
• Given a DB and its workload, how should the DB be split and allocated to
sites so as to optimize certain objective functions?
• Minimize resource consumption for query processing
Data Fragmentation
• Usefulness
• An application typically accesses only a subset of data
• Different subsets are (naturally) needed at different sites
• The degree of concurrency is enhanced
• Facilitates parallelism
• Fragments can even be defined dynamically (i.e., at query time, not at design time)
• Difficulties
• Complicates the catalog management
• May lead to poorer performance when multiple fragments need to be joined
• Fragments likely to be used jointly can be colocated to minimize communication overhead
• Costly to enforce the dependency between attributes in different fragments
Fragmentation Correctness
• Completeness
• Every datum in the relation must be assigned to a fragment
• Disjointness
• There is no redundancy and every datum is assigned to only one fragment
• The decision to replicate data is in the allocation phase
• Reconstruction
• The original relation can be reconstructed from the fragments
• Union for horizontal fragmentation
• Join for vertical fragmentation
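The three correctness rules above can be checked mechanically. The following is a minimal sketch, assuming rows are Python dictionaries with a unique key attribute, horizontal fragments are defined by predicates, and every vertical fragment repeats the key (the usual convention, which is not counted against disjointness). The names `check_horizontal`, `check_vertical`, and the sample relation `EMP` are illustrative only.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def by_key(rows: List[Row], key: str) -> List[Row]:
    """Sort rows by key so that two relations can be compared as sets."""
    return sorted(rows, key=lambda r: r[key])


def check_horizontal(relation: List[Row], key: str,
                     predicates: List[Callable[[Row], bool]]) -> Dict[str, bool]:
    fragments = [[r for r in relation if p(r)] for p in predicates]
    keys = [r[key] for f in fragments for r in f]
    union = {r[key]: r for f in fragments for r in f}  # union removes duplicates
    return {
        "complete": {r[key] for r in relation} <= set(keys),
        "disjoint": len(keys) == len(set(keys)),
        "reconstructible": by_key(list(union.values()), key) == by_key(relation, key),
    }


def check_vertical(relation: List[Row], key: str,
                   fragment_attrs: List[List[str]]) -> Dict[str, bool]:
    non_key = [a for attrs in fragment_attrs for a in attrs if a != key]
    joined: Dict[object, Row] = {}
    for attrs in fragment_attrs:          # join of the projections on the key
        for r in relation:
            joined.setdefault(r[key], {}).update({a: r[a] for a in attrs})
    return {
        "complete": set(relation[0]) - {key} <= set(non_key),
        "disjoint": len(non_key) == len(set(non_key)),
        "reconstructible": by_key(list(joined.values()), key) == by_key(relation, key),
    }


EMP = [
    {"id": 1, "name": "Anna", "city": "BCN", "salary": 1800},
    {"id": 2, "name": "Marc", "city": "MAD", "salary": 2000},
    {"id": 3, "name": "Joan", "city": "BCN", "salary": 2500},
]

# Correct: the two predicates partition the rows.
print(check_horizontal(EMP, "id", [lambda r: r["city"] == "BCN",
                                   lambda r: r["city"] != "BCN"]))
# Incorrect: the row with salary == 2000 belongs to no fragment.
print(check_horizontal(EMP, "id", [lambda r: r["salary"] < 2000,
                                   lambda r: r["salary"] > 2000]))
# Correct vertical fragmentation: every attribute is covered exactly once.
print(check_vertical(EMP, "id", [["id", "name", "city"], ["id", "salary"]]))
```

The second call fails completeness (and therefore reconstruction), which shows why the predicates of a horizontal fragmentation must cover the whole domain.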
Finding the best fragmentation strategy
• Consider it per table
• The problem is NP-hard
• Needed information
• Workload
• Frequency of each query
• Access plan and cost of each query
• Take intermediate results and repetitive access into account
• Value distribution and selectivity of predicates
• Work in three phases
1. Determine primary partitions (i.e., attribute subsets often accessed together)
2. Generate a disjoint and covering combination of primary partitions
3. Evaluate the cost of all combinations generated in the previous phase
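The first two phases above can be sketched for vertical fragmentation as follows. This is an illustration under simplifying assumptions, not the algorithm of any particular system: the workload is a hypothetical mapping from query names to the attributes they access, a primary partition is taken to be a group of attributes used by exactly the same queries, and phase 3 is left out because it needs a cost model that the slides do not specify.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set


def primary_partitions(workload: Dict[str, Set[str]]) -> List[List[str]]:
    """Phase 1: group attributes that are accessed by the same set of queries."""
    used_by: Dict[str, Set[str]] = defaultdict(set)
    for query, attrs in workload.items():
        for attr in attrs:
            used_by[attr].add(query)
    groups: Dict[frozenset, List[str]] = defaultdict(list)
    for attr, queries in used_by.items():
        groups[frozenset(queries)].append(attr)
    return sorted(sorted(g) for g in groups.values())


def set_partitions(items: List[List[str]]) -> Iterator[List[List[List[str]]]]:
    """Phase 2: every disjoint and covering combination of the primary partitions."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        yield [[first]] + partition                      # 'first' alone in a new fragment
        for i in range(len(partition)):                  # or merged into an existing one
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]


# Hypothetical workload over a single table.
WORKLOAD = {
    "q1": {"name", "email"},
    "q2": {"salary", "bonus"},
    "q3": {"name", "email", "city"},
}

parts = primary_partitions(WORKLOAD)
print(parts)
candidates = list(set_partitions(parts))
print(len(candidates), "candidate fragmentations to evaluate in phase 3")
```

With three primary partitions there are five candidates; the count grows very quickly with the number of primary partitions, which is why exhaustive evaluation in phase 3 is only feasible for small inputs.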
Data Allocation
• Given a set of fragments and a set of sites on which a number of applications are
running, allocate each fragment such that some optimization criterion is met (subject
to certain constraints)
• It is known to be an NP-hard problem
• The optimal solution depends on many factors
• Location in which the query originates
• The query processing strategies (e.g., join methods)
• Furthermore, in a dynamic environment the workload and access patterns may change
• The problem is typically simplified with certain assumptions
• E.g., only communication cost considered
• Typical approaches build a cost model; any optimization algorithm can then be
adapted to solve it
• Sub-optimal solutions
• Heuristics are also available
• E.g., best-fit for non-replicated fragments
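A greedy heuristic in the spirit of best-fit for non-replicated fragments can be sketched as follows. The sketch makes the simplification mentioned above (only communication cost is considered, measured as the number of remote accesses) and adds a storage capacity per site as the constraint. The access counts, sizes, and capacities are made-up inputs, and the result is not guaranteed to be optimal.

```python
from typing import Dict

Accesses = Dict[str, Dict[str, int]]  # fragment -> site -> number of accesses


def best_fit_allocation(accesses: Accesses, sizes: Dict[str, int],
                        capacity: Dict[str, int]) -> Dict[str, str]:
    """Place each fragment at the site that accesses it most and still has room."""
    remaining = dict(capacity)
    allocation: Dict[str, str] = {}
    # Handle the most heavily accessed fragments first.
    for frag in sorted(accesses, key=lambda f: -sum(accesses[f].values())):
        candidates = [s for s in remaining if remaining[s] >= sizes[frag]]
        if not candidates:
            raise ValueError(f"no site has room for {frag}")
        site = max(candidates, key=lambda s: accesses[frag].get(s, 0))
        allocation[frag] = site
        remaining[site] -= sizes[frag]
    return allocation


def remote_accesses(accesses: Accesses, allocation: Dict[str, str]) -> int:
    """Communication cost: accesses issued from a site that does not hold the fragment."""
    return sum(n for frag, per_site in accesses.items()
               for site, n in per_site.items() if site != allocation[frag])


ACCESSES = {
    "F1": {"A": 50, "B": 5},
    "F2": {"A": 30, "B": 20},
    "F3": {"A": 2, "B": 40},
}
SIZES = {"F1": 10, "F2": 10, "F3": 10}
CAPACITY = {"A": 10, "B": 20}

allocation = best_fit_allocation(ACCESSES, SIZES, CAPACITY)
print(allocation, remote_accesses(ACCESSES, allocation))
```

In this example F2 would prefer site A, but A is full after F1 is placed, so F2 goes to B. The constraint is what makes the greedy answer differ from simply picking the most frequent site per fragment.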
Data Replication
• Generalization of Allocation (for more than one location)
• Provides execution alternatives
• Improves availability
• Generates consistency problems
• Especially useful for read-only workloads
• No synchronization required
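The availability benefit and the update cost of replication can be made concrete with a small calculation. It assumes that sites fail independently and that each site is up with the same probability, which is a simplification. It also assumes that every update is propagated synchronously to all replicas. The function names are illustrative.

```python
def read_availability(p_site_up: float, replicas: int) -> float:
    """Probability that at least one replica is reachable (independent failures)."""
    return 1.0 - (1.0 - p_site_up) ** replicas


def synchronous_update_messages(replicas: int) -> int:
    """Number of copies an update has to reach when all replicas are kept in sync."""
    return replicas


for n in (1, 2, 3):
    print(n, round(read_availability(0.9, n), 6), synchronous_update_messages(n))
```

Under these assumptions each extra replica cuts the probability of a fragment being unreachable by a factor of ten at 90% site availability, while every update must touch one more copy. For read-only workloads the second column is irrelevant, which is why replication suits them particularly well.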
(Distributed) Catalog Management
Challenge II
DDBMS Catalog Characteristics
• Fragmentation
  • Global metadata
    • External schemas
    • Global conceptual schema
    • Fragment schema
    • Allocation schema
  • Local metadata
    • Local conceptual schema
    • Physical schema
• Allocation
  • Global metadata in the coordinator node
  • Local metadata in the workers
• Replication
  a) Single-copy (Coordinator node)
    • Single point of failure
    • Poor performance (potential bottleneck)
  b) Multi-copy (Mirroring, Secondary node)
    • Requires synchronization
Closing
Summary
• Distributed Systems
• Distributed Database Systems
• Distributed Database Systems Architectures
• Cloud Databases
• Distributed Database Design
• Fragmentation
• Kinds
• Characteristics
• Allocation
• Replication
• Distributed Catalog
References
• D. DeWitt & J. Gray. Parallel Database Systems: The future of High
Performance Database Processing. Communications of the ACM, June
1992
• N. J. Gunther. A Simple Capacity Model of Massively Parallel Transaction
Systems. CMG National Conference, 1993
• L. Liu, M.T. Özsu (Eds.). Encyclopedia of Database Systems. Springer, 2009
• M. T. Özsu & P. Valduriez. Principles of Distributed Database Systems, 3rd
Ed. Springer, 2011
• G. Coulouris et al. Distributed Systems: Concepts and Design, 5th Ed.
Addison-Wesley, 2012