
Mount Kenya University

P.O. Box 342-01000 Thika

Email: [email protected]

Web: www.mku.ac.ke

DEPARTMENT OF INFORMATION
TECHNOLOGY

COURSE CODE: BIT 4207


COURSE TITLE: DISTRIBUTED DATABASES

Instructional manual for BBIT – Distance Learning

By David Kibaara
TABLE OF CONTENTS
TABLE OF CONTENTS...................................................................................................................................2

COURSE OUTLINE........................................................................................................................................7

CHAPTER ONE: DATABASE DESIGN AND IMPLEMENTATION METHODOLOGIES.......................................13

Introduction...........................................................................................................................................13

Logical Database Design........................................................................................................................15

Physical Database Design......................................................................................................................35

Database Development Life Cycle.........................................................................................................43

Chapter Review Questions....................................................................................................................46

CHAPTER TWO: DATABASE RECOVERY AND DATABASE SECURITY............................................................47

Introduction to Database Recovery.......................................................................................................47

Causes of Failures..................................................................................................................................47

Recovery Procedures.............................................................................................................................47

Database Recovery Features.................................................................................................................48

DBMS Recovery Facilities.......................................................................................................................48

Recovery Techniques.............................................................................................................................49

Introduction to Database Security.........................................................................................................49

Threats...................................................................................................................................................50

Database Security Mechanisms.............................................................................................................50

Backups.................................................................................................................................................52

Internal Consistency..............................................................................................................................54

Proposals for Multilevel Security...........................................................................................................55

Chapter Review Questions....................................................................................................................56

CHAPTER THREE: TRANSACTION CONTROL...............................................................................................57

Introduction to Transactions.................................................................................................................57
The ACID Properties...............................................................................................................................58

Two-Phase Commits (2PC).....................................................................................................................59

Nested Transactions..............................................................................................................................59

Implementing Transactions...................................................................................................................59

Database Transactions......................................................................................................................60

Object Transactions...........................................................................................................................60

Distributed Object Transactions........................................................................................................61

Including Non-Transactional Steps...................................................................................................61

Chapter Review Questions....................................................................................................................62

CHAPTER FOUR: CONCURRENCY CONTROL...............................................................................................63

Introduction to Concurrency Control.....................................................................................................63

Concurrency Control Locking Strategies................................................................................................63

Lock Problems.......................................................................................................................................64

Collision Resolution Strategies..............................................................................................................66

Optimistic concurrency control.............................................................................................................67

Timestamp ordering..............................................................................................................................67

Why is concurrency control needed?....................................................................................................68

Chapter Review Questions....................................................................................................................69

CHAPTER FIVE: OVERVIEW OF QUERY PROCESSING..................................................................................70

Query Processing Overview...................................................................................................................70

Query Optimization...............................................................................................................................73

Query Optimization Issues.....................................................................................................................74

Distributed Query Processing Steps.......................................................................................................76

Chapter Review Questions....................................................................................................................76


CHAPTER SIX: QUERY DECOMPOSITION AND DATA LOCALIZATION..........................................................77

Query Decomposition............................................................................................................................77

Data Localization...................................................................................................................................83

Data Localization Issues........................................................................................................84

Chapter Review Questions....................................................................................................................84

CHAPTER SEVEN: DISTRIBUTED DATABASES..............................................................................................85

Introduction to Distributed Databases..................................................................................................85

Data Independence...............................................................................................................................85

Applications of Distributed Databases...................................................................................................89

Promises of DDBSs.................................................................................................................................89

Transparency.........................................................................................................................................90

Distributed database Complicating Factors...........................................................................................93

Chapter Review Questions....................................................................................................................94

CHAPTER EIGHT: DISTRIBUTED DATABASE DESIGN...................................................................................95

Design Problem.....................................................................................................................................95

Framework of Distribution....................................................................................................................95

Design Strategies...................................................................................................................................96

Fragmentation.......................................................................................................................................98

Correctness Rules of Fragmentation....................................................................................................100

Correctness of Vertical Fragmentation................................................................................................102

Replication and Allocation...................................................................................................................102

Fragment Allocation............................................................................................................................103

Chapter Review Questions..................................................................................................................103

CHAPTER NINE: DDBMS ARCHITECTURE.................................................................................................104

Introduction to DDBMS Architecture...................................................................................................104


Standardization...................................................................................................................................104

ANSI/SPARC Architecture of DBMS......................................................................................................105

Architectural Models for DDBMSs.......................................................................................................107

Client-Server Architecture for DDBMS (Data-based)...........................................................................109

Multi-DBMS Architecture (Data-based)...............................................................................................111

Chapter Review Questions..................................................................................................................113

CHAPTER TEN: SEMANTIC DATA CONTROL.............................................................................................114

Semantic Data Control.........................................................................................................................114

View Management..............................................................................................................................114

Data Security.......................................................................................................................................117

Data Protection...................................................................................................................................117

Authorization Control..........................................................................................................................117

Distributed Authorization Control.......................................................................................................118

Semantic Integrity Constraints............................................................................................................119

Semantic Integrity Constraints Enforcement.......................................................................................121

Distributed Constraints........................................................................................................................122

Chapter Review Questions..................................................................................................................123

CHAPTER ELEVEN: DISTRIBUTED DBMS RELIABILITY...............................................................................124

Reliability.............................................................................................................................................124

Local Recovery Management..............................................................................................................124

Commit Protocols................................................................................................................................129

Centralized Two Phase Commit Protocol (2PC)...................................................................................130

Global Commit Rule.............................................................................................................................130

Linear 2PC Protocol.............................................................................................................................131

2PC Protocol and Site Failures.............................................................................................................133


Chapter Review Questions..................................................................................................................137

CHAPTER TWELVE: SAMPLE PAPERS........................................................................................................141


COURSE OUTLINE

BIT 4207: Distributed Databases

Course Description

This course investigates the architecture, design, and implementation of massive-scale data
systems. The course discusses foundational concepts of distributed database theory including
design and architecture, security, integrity, query processing and optimization, transaction
management, concurrency control, and fault tolerance. It then applies these concepts to both
large-scale data warehouse and cloud computing systems. The course blends theory with
practice, with each student developing both distributed database and cloud computing projects.

Prerequisites

BIT 2203 - Data Structures and Algorithms

Course Goal

The goal of this course is to teach distributed database management system theory.

Course Objectives

 To provide an understanding of the architecture and design tradeoffs of all aspects of distributed database management systems.

 To explore distributed database design methods and heuristics.

 To examine issues of distributed query execution, including optimization, transaction management, and fault tolerance.

 To provide hands-on experience programming portions of a distributed database management system.
Course Outline

ONE: DATABASE DESIGN AND IMPLEMENTATION METHODOLOGIES

 Logical Database Design

 Physical Database Design

 Database Development Life Cycle

TWO: DATABASE RECOVERY AND DATABASE SECURITY

 Introduction to Database Recovery

 Causes of Failures

 Recovery Procedures

 Database Recovery Features

 DBMS Recovery Facilities

 Recovery Techniques

 Introduction to Database Security

 Threats

 Database Security Mechanisms

 Backups

 Internal Consistency

 Proposals for Multilevel Security

THREE: TRANSACTION CONTROL


 Introduction to Transactions

 The ACID Properties

 Two-Phase Commits (2PC)

 Nested Transactions

 Implementing Transactions

 Database Transactions

 Object Transactions

 Distributed Object Transactions

 Including Non-Transactional Steps

FOUR: CONCURRENCY CONTROL

 Introduction to Concurrency Control

 Concurrency Control Locking Strategies

 Lock Problems

 Collision Resolution Strategies

 Optimistic concurrency control

 Timestamp ordering

 Why is concurrency control needed?

FIVE: OVERVIEW OF QUERY PROCESSING

 Query Processing Overview


 Query Optimization

 Query Optimization Issues

 Distributed Query Processing Steps

SIX: QUERY DECOMPOSITION AND DATA LOCALIZATION

 Query Decomposition

 Data Localization

 Data Localization Issues

SEVEN: DISTRIBUTED DATABASES

 Introduction to Distributed Databases

 Data Independence

 Applications of Distributed Databases

 Promises of DDBSs

 Transparency

 Distributed database Complicating Factors

EIGHT: DISTRIBUTED DATABASE DESIGN

 Design Problem

 Framework of Distribution

 Design Strategies

 Fragmentation

 Correctness Rules of Fragmentation


 Correctness of Vertical Fragmentation

 Replication and Allocation

 Fragment Allocation

NINE: DDBMS ARCHITECTURE

 Introduction to DDBMS Architecture

 Standardization

 ANSI/SPARC Architecture of DBMS

 Architectural Models for DDBMSs

 Client-Server Architecture for DDBMS (Data-based)

 Multi-DBMS Architecture (Data-based)

TEN: SEMANTIC DATA CONTROL

 Semantic Data Control

 View Management

 Data Security

 Data Protection

 Authorization Control

 Distributed Authorization Control

 Semantic Integrity Constraints

 Semantic Integrity Constraints Enforcement

 Distributed Constraints
ELEVEN: DISTRIBUTED DBMS RELIABILITY

 Reliability

 Local Recovery Management

 Commit Protocols

 Centralized Two Phase Commit Protocol (2PC)

 Global Commit Rule

 Linear 2PC Protocol

 2PC Protocol and Site Failures

Student Assessment Criteria

Exam 70%

CATS 30%

Reference

Raghu Ramakrishnan and Johannes Gehrke. 2003. Database Management Systems, 3rd edition,
McGraw-Hill. ISBN: 978-0-07-246563-1.

Supplementary Reading

Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems: The
Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th edition,
Addison Wesley. ISBN-13: 978-0136086208

Module prepared by David Kibaara


CHAPTER ONE: DATABASE DESIGN AND
IMPLEMENTATION METHODOLOGIES

Learning Objectives:

By the end of this chapter the learner shall be able to:

i. Understand database design methodologies

ii. Understand logical database design

iii. Understand physical database design

Introduction
Database design is the process of producing a detailed data model of a database. This data model
contains all the logical and physical design choices and physical storage parameters needed to
generate a design in a Data Definition Language (DDL), which can then be used to create the
database. A fully attributed data model contains detailed attributes for each entity.
Database design is a technique that involves the analysis, design, description, and specification
of data intended for automated business data processing. This technique uses models to enhance
communication between developers and customers.
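
For illustration, one fully attributed entity from such a model might be expressed in a Data
Definition Language; the sketch below uses standard SQL, and the table and column names are
purely hypothetical:

-- Hypothetical fully attributed entity, ready to be used to create a table.
CREATE TABLE CUSTOMER (
    customer_id     INTEGER       NOT NULL,      -- identifying attribute
    full_name       VARCHAR(60)   NOT NULL,      -- descriptive attributes follow
    postal_address  VARCHAR(120),
    phone_number    VARCHAR(20),
    date_opened     DATE          NOT NULL,
    CONSTRAINT pk_customer PRIMARY KEY (customer_id)
);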

The term database design can be used to describe many different parts of the design of an overall
database system. Principally, and most correctly, it can be thought of as the logical design of the
base data structures used to store the data. In the relational model these are the tables and views.
In an object database the entities and relationships map directly to object classes and named
relationships. However, the term database design could also be used to apply to the overall
process of designing, not just the base data structures, but also the forms and queries used as part
of the overall database application within the database management system (DBMS).

The process of doing database design generally consists of a number of steps which will be
carried out by the database designer. Usually, the designer must:
 Determine the relationships between the different data elements.
 Superimpose a logical structure upon the data on the basis of these relationships.

Data models and supporting descriptions are the tools used in database design. These tools
become the deliverables that result from applying database design. There are two primary
objectives for developing these deliverables. The first objective is to produce documentation
that describes a customer’s perspective of data and the relationships among this data. The second
objective is to produce documentation that describes the customer organization's environment,
operations and data needs. In accomplishing these objectives, the following deliverables result:

 Decision Analysis and Description Forms


 Task Analysis and Description Forms
 Task/Data Element Usage Matrix
 Data Models
 Entity-Attribute Lists
 Data Definition Lists
 Physical Database Specifications Document

Consider a database approach if one or more of the following conditions exist in the user
environment:

 Multiple applications are to be supported by the system.
 Multiple processes or activities use multiple data sources.
 Multiple data sources are used in the reports produced.
 The data, based on their data definitions, are known to exist in existing database(s).
 The development effort is to enhance the capabilities of an existing database.

If it appears that conditions would support database development, then undertake the activities of
logical database analysis and design. When the logical schema and subschemas are completed,
they are translated into their physical counterparts. Then the physical subschemas are supplied
as part of the data specifications for program design. The exact boundary between the last stages
of logical design and the first stages of physical analysis is difficult to assess because of the lack
of standard terminology. However, there seems to be general agreement that logical design
encompasses a DBMS-independent view of data and that physical design results in a
specification for the database structure, as it is to be physically stored. The design step between
these two that produces a schema that can be processed by a DBMS is called implementation
design.

Do not limit database development considerations to providing random access or ad hoc query
capabilities for the system. However, even if conditions appear to support database development,
postpone the decision to implement or not implement a DBMS until after completing a thorough
study of the current environment. This study must clarify any alternatives that may or may not be
preferable to DBMS implementation.

Logical Database Design


To develop a logical database, analyze the business of the organization that the database would
support, how the operations relate to each other, and what data is used in business operations.
After this analysis, model the data. This modeling involves studying data usage and grouping
data elements into logical units so that a task supported by one or more organizational units is
independent of support provided for other tasks.

By providing each task with its own data groups, changes in the data requirements of one task
will have minimal, if any, impact on the data provided for another task. By managing the data as
an integrated whole, data redundancy is minimized and data consistency among tasks and activities
is improved.

Logical database design uses two methods to derive the design. The first method is used to
analyze the business performed by the organization; following this analysis, the second method is
used to model the data that supports the business. These methods are:

 Business Analysis
 Data Modeling

Business Analysis

Business analysis is a method for analyzing and understanding a customer's business. In
applying this method, the objectives are to:
 Gain a clear understanding of an organization's objectives and how it performs its
mission.
 Focus the analysis on identifying specific requirements that must be reflected in the
database. This involves decision analysis and task analysis. In identifying each decision
and task, the analyst focuses on the information requirements and how they are related.
The intent is to gain understanding, not to provide a critique of the operations.
 Identify not only stated data needs but also various indicators such as organizational
structure, environmental policies, interaction between functions, etc., which may indicate
additional data requirements.
 Define the scope of the database and the environment that the database will support
including any constraints on the database operation.
 Produce documentation that presents a valid picture of the organization's operation.

Prior to applying this method, acquire and study the following documentation:

 A high level data flow diagram (DFD) depicting the major applications to be supported
and the major data sources and outputs of these applications.
 Detailed DFDs depicting the functions and tasks performed and a list of the documents,
files, and informal references (e.g., memos, verbal communications, etc.) used to perform
each function.

Business analysis involves the following steps:

 Identify mission, functions and operations;


 Identify tasks performed and data usage;
 Identify task/data relationships;
 Develop list of constraints;
 Develop list of potential future changes.

Identify Mission, Functions and Operations

Identify the mission, functions and operations of the organizational element that the database is
to support. The purpose of this step is to define the scope of the potential database's current and
future needs and to develop a reference point for further analysis. This step covers all relevant
functional areas and should be carried out separately from any single application design effort.

In examining an organizational element, which may range in size from a branch to an entire
organization, the following may provide sources of information:

 If available, the organization's "information plan" would be the best source. These
plans vary widely in content but must articulate the organization's current and future
management information strategy, a discussion of each system's scope, and
definitions of the dependencies between major systems (both automated and
manual) and groups of data. With this information, it is possible to determine which
functional areas must be included within the scope of the design.
 If an information plan is not available, or this plan does exist but does not contain
diagrams of systems and data dependencies, it will be the designer's responsibility to
determine the scope. In this case, persons within the relevant functional areas must
be interviewed to determine how they relate to the rest of the organization. After the
areas to which they relate are determined, additional interviews can be conducted in
these newly identified areas to ascertain the extent to which they share data with the
application(s) under design.
 Other potential sources of information are the Requests for Information Services
(RIS), mission, functional statements, internal revenue manuals, and senior staff
interviews.
 Future changes to the organization must be considered when defining the scope of
the design effort, i.e., any major changes in operating policy, regulations, etc. Each
potential change must be identified and further defined to determine whether it could
change the definition, usage, or relationships of the data. Where a change could
affect the database in the future, the design scope should be expanded to consider
the effects of this change.

After determining the scope of the database, construct a high-level DFD to graphically depict the
boundaries.

Identify Tasks Performed and Data Usage


Identify the tasks performed in each of the functions and operations. The purpose is to identify
tasks performed in each function of the organizational element that the database would support
and to identify the data usage or "data needs" of these tasks. The functions and their related tasks
can be divided into two categories: operational and control/planning.

Decompose each function into the lowest levels of work that require, on a repetitive basis,
unique sets of data. Work at this level is considered a "task", a unique unit of work consisting of
a set of steps performed in sequence. All these steps are directed toward a common goal and use
and/or create a common set of data.

Once a task has been defined, decompose it into subtasks. This decomposition must occur if one
or more of the following conditions exist:

 More than one person is needed to carry out the task and each of them is required to
have a different skill and/or carries out his/her part independently.
 There are different levels of authorization, i.e., different people authorize different
parts of the task.
 Different frequencies or durations apply to different parts of the task.
 Input documents are not used uniformly within the task.
 Totally different documents are used for different parts of the task.
 Many different operations are carried out within the task.
 There are different primitive operations which each have separate input/output
requirements.

However, when a subtask has been defined, make certain it is limited to that particular task. If it
spans two or more tasks, it cannot be considered a subtask.

Collect all information in a precise manner using interviews and documentation techniques. This
approach is especially important when identifying operational functions because they provide the
basic input to the database design process. These functions and their associated tasks must be
identified first. Therefore, begin by identifying the organizational areas within the scope of the
design effort which perform the functions essential to conducting business. Once these functional
areas have been determined, the persons to be interviewed can be specified. The recommended
approach is as follows:

 Determine the key individuals within these areas and send out questionnaires
requesting: job titles of persons within their areas of responsibility; functions
performed in each job; and a brief statement of the objective(s) of each job.
 After receiving the results of the questionnaire, develop a document showing job
title, functions performed, and the objectives of these functions. Then review and
classify each job as either operational or control and planning. Once this is
completed, the contact the supervisor of each job which is identified as "operational"
and ask to select one, preferably two, persons performing that job who can be
interviewed.
 Conduct the operational interviews. Keep the following three objectives in mind:
identify each operational function; identify the data associated with each of these
functions; and identify the implicit and explicit rules determining when and how
each function occurs.

When conducting operational interviews, accomplish the following steps during the interviews:

1. Begin by having each interviewee describe, in detail, the functions and tasks that are
performed on a daily or potentially daily basis. Document these major actions, decisions,
and interfaces on task analysis and decision analysis forms. These actions, decisions, and
interfaces must also be reflected on a detailed data flow diagram. This documentation can
subsequently be used to verify that all operational functions and their sequence are
correct. Repeat this same procedure for functions that occur weekly, monthly, quarterly,
and annually.
2. As the functions, tasks, and other activities are defined, determine the documents, files
and informal references (memos, verbal communications, etc.) used to perform them and
indicate these in a separate numbered list. A task/document usage matrix may also be
used specifying a task's inputs and outputs in terms of documents.
3. Once the person interviewed agrees to the contents of the documentation, discuss more
specifically each action, decision, and interface point to determine what specific
documents or references are required. Then request a copy of each document that has
been discussed.
4. Finally, identify the data elements actually used or created on each document and
compile a list of these elements. Include their definitions and lengths. Any data elements
that are not included in the dictionary must be entered.

The second type of information required for conceptual database development involves the
organization's control and planning functions and their related data needs. An in-depth
investigation of the organization's explicit and implicit operating policies is necessary. Such
information can be obtained through interviews with management. Since the nature of the
information collected will vary according to the organization and persons involved, there is no
rigid format in which the interview must be documented. However, in order to minimize the
possibility of losing or missing information, it is recommended that there be two interviewers
who could alternate posing questions and taking notes.

Conduct interviews for control and planning functions with persons whose responsibilities
include defining the goals and objectives of the organization, formulating strategies to achieve
these goals, and managing plans to implement these strategies; and with those persons directly
responsible for the performance of one or more operating areas. The objective of these
interviews is to gain, where appropriate, an overall understanding of:

 The basic components of the organization and how they interact with one another.
 The external environment that affects the organization directly or indirectly (i.e.,
Congressional directives, Treasury policies, etc.).
 Explicit or implicit operating policies that determine how the mission is performed;
some of these may be identified when discussing the internal and external
environment.
 Information used currently or required to plan organizational activities and measure
and control performance. If available, obtain examples.
 Changes that are forecast that may affect the organization.

The following are steps for conducting control and planning interviews:
1. Present the designer's perception of functions and operations and seek confirmation and
clarification, i.e., clarify which are main functions, support functions, and sub-functions
or tasks.
2. Ask what additional functions, if any, are performed.
3. Ask what monitoring functions are performed and what critical indicators are used to
trigger intervention.
4. Ask what planning functions are performed and what data is used for planning purposes.
5. Express appreciation by thanking the person interviewed for his/her time.
6. If any new data elements are defined during the interviews, make certain they are
incorporated in the Enterprise Data Dictionary so that they may be properly cross-
referenced.

Identify Task/Data Relationships

Collect information about data usage and identify task/data relationships. Once all functions and
tasks are identified as either operational or control and planning and their data usage has been
determined, add specificity to the task/data relationships. A task/data relationship is defined as
the unique relationship created between data items when they are used to perform a specific task.
It is critical that these relationships be carefully and thoughtfully defined.

The process of defining task/data relationships begins with analyzing the documentation
developed during the interviews. When identifying a series of unique tasks, follow and apply
these rules:

 A task must be performed within one functional area. Each task must consist of a set
of serially performed steps (or serially positioned symbols on a DFD). If a decision
point occurs and one path of the decision involves a new action, in effect the current
task ends and a new one begins. Each step within a single task must be performed
within a reasonable period. If a significant amount of time can elapse between two
steps, more than one task must be defined.
 Each step within the task must use the same set of data. However, if new data is
created in one step of the task and used in the next step, it may be considered part of
the same set of data.
After all the data flows and other documentation have been analyzed and assigned to tasks,
compare the tasks for each duplicate interview to determine whether the same ones were defined
(this assumes that two persons with the same job title were interviewed in each relevant area).
When conflicts are found, compare the two sets of documentation to determine if one is merely
more detailed:

 If the DFDs, etc., appear to be the same and differ only on levels of detail, choose
the one that best defines a complete unit of work.
 If real functional differences are found, review the documents (and notes) associated
with each. Sometimes people with similar titles perform different functions due to
their seniority or competence. When major differences are found, separate any
unique tasks and add them to the list.
 If differences are found and it is difficult to determine why they exist, request that
the appropriate supervisor review the task definitions developed during the
interviews. (However, do not include any portions of the interviews that are
confidential).

Once any conflicting definitions have been resolved, document the task/data relationships
specifically. Because it is likely that redundant tasks have been defined, arrange the
documentation already produced by department or area. This method increases the likelihood
that redundant tasks will be identified. It is suggested that the documentation of task/data
element relationships begin with a table such as the one shown below.

Task #   Task Definition          Type          Frequency   Average Volume   Department    Data Elements
1        Examine order request    operational   daily       500              Order Entry   410, 200, 201-225
...      ...                      ...           ...         ...              ...           ...

The documentation must:

 Numerically identify each task
 Briefly define each task by a verb-object type command (e.g., fill out error report, request
alternate items, etc.)
 Classify tasks as operational or control/planning
 Identify the frequency and average volume for each task
 Relate each task to a specific functional area
 Then construct a task/data element matrix to specify each task's inputs and outputs in
terms of data elements (a minimal sketch of such a matrix follows this list)
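
As a sketch only (the table, column, and element names below are hypothetical, and a
spreadsheet would serve equally well), the task/data element usage matrix can itself be recorded
in SQL so that a task's inputs and outputs can be queried during later design steps:

-- Hypothetical table recording the task/data element usage matrix.
CREATE TABLE TASK_DATA_USAGE (
    task_number   INTEGER      NOT NULL,   -- numeric task identifier
    data_element  VARCHAR(30)  NOT NULL,   -- element name from the data dictionary
    usage_type    CHAR(1)      NOT NULL,   -- 'I' = input to the task, 'O' = output
    PRIMARY KEY (task_number, data_element, usage_type)
);

-- Example rows for task 1, "Examine order request".
INSERT INTO TASK_DATA_USAGE VALUES (1, 'ORDER_NUMBER',  'I');
INSERT INTO TASK_DATA_USAGE VALUES (1, 'CUSTOMER_CODE', 'I');
INSERT INTO TASK_DATA_USAGE VALUES (1, 'ORDER_STATUS',  'O');

-- Which tasks use a given data element as input?
SELECT task_number
FROM   TASK_DATA_USAGE
WHERE  data_element = 'CUSTOMER_CODE' AND usage_type = 'I';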

Develop a List of Constraints

Develop a list of all implicit and explicit constraints such as security, data integrity, response or
cyclic processing time requirements. The purpose of developing a list of all implicit and explicit
constraints is to provide information for the physical database designer to use in determining
operational considerations such as access restrictions, interfaces to other packages, and recovery
capabilities. Document constraints using either a tabular or a memo format. Examples of items to
be considered are:

 Data security needs


 Access and processing cycle time requirements
 Special display or calculation requirements
 Special equipment utilization

Develop a List of Potential Future Changes

Develop a list of potential future changes and the way in which they may affect operations. The
purpose of this step is to include in the database design considerations that may affect operations
in the future. Consider future changes to include anything that may affect the scope of the
organization, present operating policies, or the relationship of the organization to the external
environment. When reviewing the interviews to identify changes, highlight anything that implies
change, and, if possible, the effect(s) of that change.

Data Modeling
Data modeling is a technique that involves the analysis of data usage and the modeling of the
relationships among entities. These relationships are modeled independently of any particular
hardware or software system. The objective of logical design is to clearly define and depict user
perspectives of data relationships and information needs.

The various approaches to logical database design involve two major design methodologies:
entity analysis and attribute synthesis.

In applying this method, the primary tool used is the data relationship diagram. This type of
diagram is used to facilitate agreement between the designer and users on the specific data
relationships and to convey those relationships to the physical database designer. It is a graphic
representation of data relationships. The format used must be either the data structure diagram or
entity-relationship diagram.

Simplify the modeling process by partitioning the model into the following four design
perspectives:

 The organizational perspective reflects senior and middle management's view of the
organization's information requirements. It is based on how the organization
operates.
 The application perspective represents the processing that must be performed to
meet organizational goals, i.e., reports, updates, etc.
 The information perspective depicts the generic information relationships necessary
to support decision-making and long-term information requirements. It is
represented by user ad hoc queries, long-range information plans, and general
management requirements.
 The event perspective deals with time and scheduling requirements. It represents
when things happen, e.g., frequency of reports.

There are two general rules that provide the foundation for design perspectives:

 The design perspectives are modeled by three types of constructs: entity, attribute and
relationship;
 In the design perspective, each component of information must be represented by one,
and only one, of these constructs.

An entity refers to an object about which information is collected, e.g., a person, place, thing, or
event. A relationship is an association between the occurrences of two or more entities. An
attribute is a property of an entity, that is, a characteristic of the entity, e.g., size, color, name,
age, etc.

It is important to remember that data usage is dynamic. Perceptions change, situations change
and rigid concepts of data use are not realistic. Data involves not only values but relationships as
well and must be divided into logical groups before being molded into whatever structures are
appropriate: matrices, entity-relationship diagrams, data structure diagrams, etc. If at any point it
becomes apparent that a database approach is definitely not suitable or practical for whatever
reason, take an alternative path as soon as possible to save vital resources.

Data modeling involves the following steps:

1. Identify local views of the data.


2. Formulate entities.
3. Specify relationships.
4. Add descriptive attributes.
5. Consolidate local views and design perspectives.
6. Verify the data model.

Identify Local Views of the Data

Identify local views of the data. Develop local views for the organization, application,
information, and event design perspectives.

For each of the functions, activities and tasks identified, there exists what may be called a
"sub-perspective" or local view of the data. Normally there will be several local views of the data
depending on the perspective. These views correspond to self-contained areas of data that are
related to functional areas. The selection of a local view will depend on the particular perspective
and the size of the functional area. Factors which must be considered in formulating local views
include a manageable scope and minimum dependence on, or interaction with, other views.
The primary vehicles for determining local views will be the task/data element matrices and the
task analysis and description forms constructed during logical database analysis.

Formulate Entities

For each local view, formulate the entities that are required to capture the necessary information
about that particular view.

At this point the designer is confronted with two major considerations. The first consideration
deals with the existence of multiple entity instances and can be addressed by using the concept of
"type" or "role". For example, the population of the entity EMPLOYEE can be categorized into
employees of "type": computer systems analyst, secretary, auditor, etc. It is important, at this
stage of logical design, to capture the relevant types and model each as a specific entity. The
generalization of these types into the generic entity EMPLOYEE will be considered in the next
stage of conceptual design where user views are consolidated.

The second consideration deals with the use of the entity construct itself. Often a piece of
information can be modeled as either an entity, attribute, or relationship. For example, the fact
that two employees are married can be modeled using the entity MARRIAGE, the relationship
IS-MARRIED-TO, or the attribute CURRENT-SPOUSE. Therefore, at this point in the design
process the designer must be guided by two rules. First, use the construct that seems most
natural. If this later proves to be wrong, it will be factored out in subsequent design steps.
Second, avoid redundancy in the use of modeling constructs; use one and only one construct to
model a piece of information.
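
As a hedged illustration of this choice (all names below are hypothetical, and only one construct
would be kept in a real design), the married-employees fact could be carried into a relational
schema in any of the three forms:

-- As an attribute (CURRENT_SPOUSE) or, once the foreign key is added, as the
-- relationship IS-MARRIED-TO: both end up as a column on EMPLOYEE.
CREATE TABLE EMPLOYEE (
    employee_number  INTEGER      NOT NULL PRIMARY KEY,
    full_name        VARCHAR(60)  NOT NULL,
    current_spouse   INTEGER,
    FOREIGN KEY (current_spouse) REFERENCES EMPLOYEE (employee_number)
);

-- As an entity: MARRIAGE becomes a table of its own, able to carry facts
-- about the marriage itself (such as the date).
CREATE TABLE MARRIAGE (
    marriage_id   INTEGER NOT NULL PRIMARY KEY,
    employee_a    INTEGER NOT NULL,
    employee_b    INTEGER NOT NULL,
    date_married  DATE,
    FOREIGN KEY (employee_a) REFERENCES EMPLOYEE (employee_number),
    FOREIGN KEY (employee_b) REFERENCES EMPLOYEE (employee_number)
);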

One rule of thumb, which has been successfully used to restrict the number of entities identified
so that a local view can be properly represented, is the "magic number seven, plus or minus two."
This states that the number of facts (information clusters) that a person can manage at any one
time is about seven, give or take two. Therefore, when this is applied to the database design
process, the number of entities contained in a local view must, at the most, be nine, but probably
closer to six or seven. If this restriction cannot be met, perhaps the scope of the local view is too
large.
Give careful consideration to the selection and assignment of an entity name. Since an entity
represents a fact, give a precise name to this fact. This is also important later when views are
consolidated because that subsequent stage deals with homonyms and synonyms. If the name
given to an entity does not clearly distinguish that entity, the integration and consolidation
process will carry this distortion even further.

Finally, select identifying attributes for each entity. Although a particular collection of attributes
may be used as the basis for formulating entities, the significant attribute is the identifier (or
primary key) that uniquely distinguishes the individual entity instances (occurrences), for
example, employee number. This entity identifier is composed of one or more attributes whose
value set is unique. This is also important later in the consolidation phase because the identifying
attribute values are in a one-to-one correspondence with the entity instances. Therefore, two
entities with the same identifiers may to some degree be redundant. However, this will depend
on their descriptive attributes and the degree of generalization.

Specify Relationships

Identify relationships between the entities. In this step, additional information is added to the
local view by forming associations among the entity instances. There are several types of
relationships that can exist between entities. These include:

 Optional relationships
 Mandatory relationships
 Exclusive relationships
 Contingent relationships
 Conditional relationships

In an optional relationship the existence of either entity in the relationship is not dependent on
that relationship. For example, there are two entities, OFFICE and EMPLOYEE. Although an
office may be occupied by an employee, they can exist independently.
In a mandatory relationship, the existence of both entities is dependent on that relationship.

An exclusive relationship is a relationship of three entities where one is considered the prime
entity that can be related to either one of the other entities but not both.

In a contingent relationship the existence of one of the entities in the relationship is dependent on
that relationship. For example, a VEHICLE is made from many PARTS.
A conditional relationship is a special case of the contingent relationship. When it occurs the
arrow must be labeled with the condition of existence.

Relationships can exist in several forms. The associations can be one-to-one (1:1), one-to-many
(1:N) or many-to-many (N:N). A one-to-one association is shown by a single-headed arrow and
indicates that the relationship involves only one logical record, entity or entity class of each type.
A one-to-many association is shown by a double-headed arrow and documents the fact that a
single entity, entity class or logical record of one type can be related to more than one of another
type. A many-to-many association is shown by a double-headed arrow in both directions.
An informal procedure for identifying relationships is to pair each entity in the local view with
all other entities contained in that view. Then for each pair, ask if a meaningful question can be
proposed involving both entities or if both entities may be used in the same transaction. If the
answer is yes to either question, determine the type of relationship that is needed to form the
association. Next, determine which relationships are most significant and which are redundant.
Of course, this can be done only with a detailed understanding of the design perspective under
consideration.
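
In relational terms these associations are usually realized with foreign keys. The sketch below is
illustrative only (hypothetical names, standard SQL): an optional one-to-many association
between OFFICE and EMPLOYEE, a contingent association in which a PART row exists only for its
VEHICLE, and a many-to-many association resolved through a separate associative table.

CREATE TABLE OFFICE (
    office_id  INTEGER NOT NULL PRIMARY KEY
);

-- Optional 1:N association: an employee may occupy an office, but both can
-- exist independently, so the foreign key column allows NULL.
CREATE TABLE EMPLOYEE (
    employee_number  INTEGER NOT NULL PRIMARY KEY,
    office_id        INTEGER,
    FOREIGN KEY (office_id) REFERENCES OFFICE (office_id)
);

-- Contingent 1:N association: in this sketch a PART row is recorded only for
-- the VEHICLE it is made for, so the foreign key is mandatory.
CREATE TABLE VEHICLE (
    vehicle_id  INTEGER NOT NULL PRIMARY KEY
);

CREATE TABLE PART (
    part_id     INTEGER NOT NULL PRIMARY KEY,
    vehicle_id  INTEGER NOT NULL,
    FOREIGN KEY (vehicle_id) REFERENCES VEHICLE (vehicle_id)
);

-- N:N association: an employee works on many projects and a project involves
-- many employees, so the association becomes a table with a composite key.
CREATE TABLE PROJECT (
    project_id  INTEGER NOT NULL PRIMARY KEY
);

CREATE TABLE EMPLOYEE_PROJECT (
    employee_number  INTEGER NOT NULL,
    project_id       INTEGER NOT NULL,
    PRIMARY KEY (employee_number, project_id),
    FOREIGN KEY (employee_number) REFERENCES EMPLOYEE (employee_number),
    FOREIGN KEY (project_id)      REFERENCES PROJECT (project_id)
);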

Add Descriptive Attributes

Add descriptive attributes. Attributes can be divided into two classes: those that serve to identify
entity instances and those that provide the descriptive properties of entities. The identifier
attributes, which uniquely identify an entity, were added when the entities were formulated.
Descriptive attributes help describe the entity. Examples of descriptive attributes are color, size,
location, date, name and amount.

In this step of local view modeling, the descriptive attributes are added to the previously defined
entities. Only single-valued attributes are allowed for the description of an entity.
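
A brief, purely hypothetical SQL sketch of this rule: descriptive attributes become single-valued
columns on the entity, while a property that can hold several values at once is factored out into
an entity of its own rather than added as an attribute.

-- Identifying attribute plus single-valued descriptive attributes.
CREATE TABLE PRODUCT (
    product_id  INTEGER      NOT NULL PRIMARY KEY,   -- identifier
    name        VARCHAR(60)  NOT NULL,               -- descriptive attributes
    color       VARCHAR(20),
    size_code   CHAR(3),
    unit_price  DECIMAL(8,2)
);

-- A multi-valued property (a product stocked at many locations) cannot be a
-- single-valued attribute, so it is modeled as a separate entity.
CREATE TABLE PRODUCT_LOCATION (
    product_id  INTEGER      NOT NULL,
    location    VARCHAR(30)  NOT NULL,
    PRIMARY KEY (product_id, location),
    FOREIGN KEY (product_id) REFERENCES PRODUCT (product_id)
);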

Consolidate Local Views and Design Perspectives

Consolidate local views and design perspectives. Consolidation of the local views into a single
information structure is the major effort in logical database design. It is here that the separate
views and applications are unified into a potential database. Three underlying concepts form
the basis for consolidating design perspectives: identity, aggregation, and generalization.

Identity is a concept which refers to synonymous elements. Two or more elements are said to be
identical, or to have an identity relationship, if they are synonyms. Although the identity concept
is quite simple, the determination of synonyms is not. Owing to inadequate data representation
methods, the knowledge of data semantics is really quite limited. Typically, an in-depth
understanding of the user environments is required to determine if synonyms exist. Determining
whether similar definitions may be resolved to identical definitions, or if one of the other element
relationships really applies, requires a clear and detailed understanding of user functions and data
needs.

Aggregation is a concept in which a relation between elements is considered to become another,
higher-level element. For example, EMPLOYEE may be thought of as an aggregation of NAME,
SSN, and ADDRESS. Actually, many aggregations are easy to identify since the major data
models incorporate syntax that can represent aggregations.

Generalization is a concept in which a group of similar elements is thought of as a single generic
element by suppressing the differences between them. For example, the entity "EMPLOYEE"
may be thought of as a generalization of "FACTORY-WORKER", "OFFICE-WORKER", and
"EXECUTIVE". An instance of any of these three types is also an instance of the generalized
"EMPLOYEE". This is the most difficult concept to grasp and care must be taken not to confuse
it with aggregation. Whereas aggregation can be thought of as parts making up a "whole",
generalization is concerned only with "wholes".

Since aggregation and generalization are quite similar in structure and application, one element
may participate in both aggregation and generalization relationships.

Inferences can be drawn about the aggregation dimension from the generalization dimension and
vice versa, e.g., it can be inferred that each instance of "EXECUTIVE" is also an aggregation of
Name, SSN, and Address.
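
One common way, though certainly not the only one, to carry generalization and aggregation into
a relational schema is a table for the generic entity holding the shared (aggregated) attributes
plus one table per specific entity; the column names below are hypothetical:

-- Generic entity: the generalization of the three worker types, and an
-- aggregation of the attributes they share.
CREATE TABLE EMPLOYEE (
    employee_number  INTEGER      NOT NULL PRIMARY KEY,
    name             VARCHAR(60)  NOT NULL,
    ssn              CHAR(11)     NOT NULL,
    address          VARCHAR(120)
);

-- Each specific entity keeps only what distinguishes it and refers back to
-- the generic EMPLOYEE it specializes.
CREATE TABLE FACTORY_WORKER (
    employee_number  INTEGER NOT NULL PRIMARY KEY,
    shift_code       CHAR(1),
    FOREIGN KEY (employee_number) REFERENCES EMPLOYEE (employee_number)
);

CREATE TABLE OFFICE_WORKER (
    employee_number  INTEGER NOT NULL PRIMARY KEY,
    office_location  VARCHAR(30),
    FOREIGN KEY (employee_number) REFERENCES EMPLOYEE (employee_number)
);

CREATE TABLE EXECUTIVE (
    employee_number  INTEGER NOT NULL PRIMARY KEY,
    bonus_plan       VARCHAR(20),
    FOREIGN KEY (employee_number) REFERENCES EMPLOYEE (employee_number)
);
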
There are three consolidation types. These types may be combined in various ways to construct
any type of relationship between objects (elements) in different user views. By combining
consolidation types, powerful and complex relationships can be represented. In fact, we
recommend that most semantic relationships be represented by some combination of these types
of consolidation. The consolidation types are:

 Identity Consolidation - Two objects may be semantically identical, with the
additional option of having identical names. Homonyms must be guarded against as
well as similar, but not identical, objects. Similarity is best expressed using
aggregation and generalization. As a check on the consistency of the consolidation
and also on user views, if an object from User 1's view is found to be identical to an
object from User 2's view, neither of these objects can participate further in any other
identity consolidations between these two views. This is true because each object is
assumed to be unique within the context of its own local user view.
 Aggregation Consolidation - This may occur in two forms. The difference depends
on whether or not one of the users has specified the aggregated "whole" object. An
example of the simpler form is where User 1 has specified a number of objects
without making any consolidation type relationships between them, e.g., an
inventory view of HANDLE BARS, WHEELS, SEATS, and FRAMES. However,
User 2 has specified an object, BICYCLE, which is an aggregation of User 1's objects
(a minimal sketch of this case follows the list).
The conceptually more difficult version of aggregation occurs when both users have
specified some or all of the parts of an unmentioned "whole". As an example, when
separate inventory functions are maintained for basic, non-variable parts (FRAMES,
WHEELS) and for parts that may be substituted by customer request (SEATS,
HANDLE BARS). This type of aggregation is more difficult to recognize since
neither user has defined a BICYCLE object.
 Generalization Consolidation - This may also occur in two forms. Again, the
difference lies in whether either of the users has specified the generalized or generic
object.

The consolidation process comprises four steps:

1. Select perspectives.
2. Order local views within each perspective.
3. Consolidate local views within each perspective.
4. Resolve conflicts.

Select perspectives. First, confirm the sequence of consolidation by following the order of design
perspectives. Since this order is general, check it against the objectives of the database being
designed. For example, if you are designing a database for a process-oriented organization, you
might consider the key perspectives to be the application and event perspectives and therefore
begin the process with these.

Order local views within each perspective. Once the design perspectives have been ordered,
focus the consolidation process on local views within each perspective. Several views comprise
the perspective chosen and this second step orders these views for the consolidation process. The
order must correspond to each local view's importance with respect to specific design objectives
for the database.

Consolidate local views within each perspective. This step is the heart of the consolidation
process. For simplicity and convenience, use binary consolidation, i.e., integrating only two user
views at a time. This avoids the confusion of trying to consolidate too many views. The order of
consolidation is determined by the previous step where the local views within a perspective have
been placed in a particular order. The process proceeds as follows:

1. Take the top two views in the perspective being considered and consolidate these using
the basic consolidation principles.
2. Using the binary approach, merge the next local view with the previously consolidated
local views. Continue this process until the last view is merged. When the consolidation
process is completed for the first design perspective, the next design perspective is
introduced and this process continues until all perspectives are integrated.

Resolve conflicts. Conflicts can arise in the consolidation process for a number of reasons,
primarily because of the number of people involved and the lack of semantic power in our
modeling constructs. They may also be caused by incomplete or erroneous specification of
requirements. Although the majority of these conflicts are dealt with in the consolidation step
using the rules previously discussed, any remaining conflicts that have to be dealt with by
designer decisions are taken care of in this step. When a design decision is made, it is important
to "backtrack" to the point in the consolidation process where these constructs were entered into
the design. At this point the implications of the design decision are considered and also their
effects on the consolidation process.

Present Data Model

The purpose of this step is to present the data model. Use data relationship diagrams to document
local views and their consolidation. These must take the form of an entity-relationship diagram
or a data structure diagram.

When constructing either of these diagrams, use these rules of thumb:

 Each entity and relationship must be clearly labeled.


 Each entity must be related to at least one other entity.
 Show attributes only if they uniquely identify the entity or are a common access path
to the entity.
 Limit entities and relationships for a given activity to a single page.
 Identify the name of the activity supported at the top of the page.
 If a data relationship diagram for an activity needs to span more than one page, use
the same off-page connectors as used in DFDs.

Verify Data Model

Verify the data model. The purpose of this step is to verify the accuracy of the data model and
obtain user concurrence on the proposed database design.

The process of developing the information structure involves summarizing and interpreting large
amounts of data concerning how different parts of an organization create and/or use that data.
However, it is extremely difficult to identify and understand all data relationships and the
conditions under which they may or may not exist.

Although the design process is highly structured, it is still probable that some relationships will
be missed and/or expressed incorrectly. In addition, since the development of the information
structure is the only mechanism that defines explicitly how different parts of an organization use
and manage data, it is reasonable to expect that management, with this newfound knowledge,
might possibly consider some changes. Because of this possibility, it is necessary to provide
management with an understanding of the data relationships shown in the information structure
and how these relationships affect the way in which the organization performs, or can perform,
its mission. Each relationship in the design and each relationship excluded from the design must
be identified and expressed in very clear statements that can be reviewed and approved by
management. Following management's review, the design will, if necessary, be adjusted to
reflect its decisions.

The verification process is separated into two parts, self-analysis and user review. In self-
analysis, the analyst must ensure that:

 All entities have been fully defined for each function and activity identified.
 All entities have at least one relation to another entity.
 All attributes have been associated with their respective entities.
 All data elements have been defined in the Enterprise Data Dictionary.
 All processes in the data flow diagram can be supported by the database when the
respective data inputs and outputs are automated.
 All previously identified potential changes have been assessed for their impact on
the database and necessary adjustments to the database have been determined.

To obtain user concurrence on the design of the database, perform the following steps, in the
form of a walk-through, to interpret the information structure for the user:

1. State what each entity is dependent upon (i.e., if an arrow points to it). Example: All
ORDERS must be from CUSTOMERS with established accounts;
2. State what attributes are used to describe each entity;
3. Define the perceived access path for each entity;
4. Define the implications of each arrow (i.e., one-to-one, one-to-many etc.);
5. Define what information cannot exist if an occurrence of an entity is removed from the
database.
Give the user an opportunity to comment on any perceived discrepancies in the actual operations
or usage. If changes need to be made, then give the user the opportunity to review the full design
at the completion of the changes. Once all changes have been made and both the relationship
diagram and the data definitions have been updated, obtain user concurrence on the design
specifications.

Physical Database Design


The boundary between logical and physical database design is difficult to assess because of the
lack of standard terminology. However, there seems to be general agreement that logical design
encompasses a DBMS-independent view of data and that physical design results in a
specification for the database structure, as it will be physically stored. The design step between
these two that produces a schema that can be processed by a DBMS can be called
implementation design. The DBMS-independent schema developed during logical design is one
of the major inputs. Refinements to the database structure that occur during this design phase are
developed from the viewpoint of satisfying DBMS-dependent constraints as well as the more
general constraints specified in the user requirements.

The major objective of implementation design is to produce a schema that satisfies the full range
of user requirements and that can be processed by a DBMS. These extend from integrity and
consistency constraints to the ability to efficiently handle any projected growth in the size and/or
complexity of the database. However, there must be considerable interaction with the application
program design activities that are going on simultaneously with database design. Analyze the
high-level program specifications and the program design guidance supplied to ensure they
correspond to the proposed database structure.

The guidance provided in this section serves a dual purpose. First, it provides general guidelines
for physical database design. Various techniques and options used in physical design are
discussed as well as when and how they must be used to meet specific requirements. These
guidelines are generic in nature and for this reason are intended to provide a basic understanding
of physical database design prior to using specific vendor documentation. Second, since database
management systems vary according to the physical implementation techniques and options they
support, these guidelines will prove useful during the database management software
procurement. They will provide a means for evaluating whether a DBMS under consideration
can physically handle data in a manner that will meet user requirements.

The usefulness of these guidelines is directly related to where one is in the development life
cycle and the level of expertise of a developer. This document assumes the reader is familiar
with database concepts and terminology since designers will most likely be database
administrators or senior computer specialists. To aid the reader, a glossary and bibliography are
provided.

The criterion for determining physical design is quite different from that of logical design.
Selection of placement and structure is determined by evaluating such requirements as
operational efficiency, response time, system constraints and security concerns. This physical
design layout must be routinely adjusted to improve the system operation, while maintaining the
user's logical view of data. The physical structuring or design will often be quite different from
the user's perception of how the data is stored.

The following steps provide general guidance for physical database design. Since much of the
effort will depend on the availability of data and resources, the sequence of these steps is
flexible:

1. Determine user requirements;


2. Determine processing environment;
3. Select database management system software;
4. Design physical placement of data;
5. Perform sizing of data;
6. Consider security and recovery.

Determine the User’s Requirements

These critical factors will dictate the weight placed on the various physical design considerations
to be discussed. The following are examples of user requirements. Notice that with each
requirement there is an example of an associated trade-off.
Retrieval time decreases with a simple database structure; however, to meet the logical design
requirements, it may be necessary to implement a more complex multilevel structure.

Ease of recovery increases with a simple structure but satisfying data relationships may require
more complex mechanisms.

Due to increased pointer or index requirements, hardware cost is increased if information is
spread over many storage devices; however, data clustering and compacting degrade
performance.

Privacy requirements may require stringent security such as encryption or data segmentation.
These procedures decrease performance, however, in terms of update and retrieval time.

Active files, especially those accessed in real time, dictate high-speed devices; however, this will
represent increased cost.

Determine the Processing Environment

By attempting to determine the primary type of processing, the designer has a framework of
physical requirements with which to begin design. Three environments will be discussed;
however, keep in mind that these are merely guidelines since most systems will not fit neatly into
one general set of requirements. These considerations will often conflict with the user's
requirements or security needs (to be discussed), thus forcing the designer to make decisions
regarding priority.

A high volume on-line environment normally requires fast response time, and multiple run units will actively share
DBMS facilities. In order to meet response time specifications, cost may increase due to the
necessity for additional system resources. Recovery may be critical in such a volatile
environment, and whenever possible, use a simple structure. In a CODASYL (network)
structure, this time specification would translate into reduced data levels, number of network
relationships, number of sets and size of set occurrences. In a high volume processing
environment requests are most frequently random in nature requiring small amounts of
information transfer; thus affecting page and buffering considerations.
Low volume systems generally process more data per request, indicating run units may remain in
the system longer. There is the likelihood of more sequential requests and reports, and response
time is probably not the critical issue. Resources may be more limited in this environment,
implying smaller buffers and perhaps fewer peripherals. With the possibility of fewer resources,
those resources may need to be more highly utilized. On-line recovery techniques may be
unavailable since the resource requirements are costly. Although the number of transactions is
low in this environment, the probability of multiple simultaneous run units accessing the same
data may be high.

When a batch environment is indicated, the designer is left with maximum flexibility since the
requirement is reasonable turnaround time and effective use of resources. Because of job
scheduling options, concurrency problems can be controlled. Recovery tends to be less critical
and will be determined by such factors as file volatility, the time necessary to rerun update
programs, and the availability of input data. For example, if the input data is readily available,
the update programs are short, and processing is 85 percent retrieval, the choice may be made to avoid
the overhead of maintaining an on-line recovery file.

Select DBMS Software

The DBMS must first physically support the logical design requirements. That is, based on the
logical data model, the package must support the required hierarchical, network or relational
structure. Early stages of analysis must provide enough information to determine this basic
structure. From a physical database design point of view, an analysis must then be made as to
how effectively the DBMS can handle the organizational and environmental considerations. If
the proposed package fails to provide adequate support of the requirements, the project manager
must be notified. The notification must include the specific point(s) of failure, anticipated
impact(s), and any suggestions or alternatives for alleviating the failure(s).

Design the Physical Placement of Data

This procedure involves selecting the physical storage and access methods as well as secondary
and multiple key implementation techniques. DBMS packages vary as to the options offered.
The use of vendor documentation, providing specific software handling details, will be necessary
to complete this process.
Perform Sizing of Data

Obtain the specifics of sizing from vendor documentation as each DBMS handles space
requirements in a different manner. Consider sizing in conjunction with designing the placement
of data. Once data records, files and other DBMS specifics have been sized according to a
proposed design, a decision may be made, because of the space allocation involved, to change
the design. Data compaction techniques may be considered at this point. Flexibility to make
changes and reevaluate trade-offs during this entire procedure is of critical importance.

Consider Security and Recovery

The DBMS selected must have the options necessary to implement security and recovery
requirements. Implementation of these considerations will often cause trade-offs in other design
areas.

Deliverables

In applying database design, the following deliverables result:

1. Decision Analysis and Description Forms


2. Task Analysis and Description Forms
3. Task/Data Element Usage Matrix
4. Data Models
5. Entity-Attribute Lists
6. Data Definition Lists
7. Physical Data Base Specifications Document

Decision Analysis and Description Forms

Decision Analysis and Description Forms must identify such items as the type of decision, the
decision maker, and the nature of the decision.

Task Analysis and Description Forms

Task Analysis and Description Forms must include the name of the task, its description
(overview), the people/departments involved, and subtasks and their relationships.
Task/Data Element Usage Matrix

A task/data element usage matrix relates each data element to one or more tasks.

Data Models

Data relationship diagrams depict the relationships between entities. These are tools that provide
one way of logically showing how data within an organization is related. They must be modeled
using the conventions for either data structure diagrams or entity relationship diagrams. The term
"entity" refers to an object of interest (person, place, thing, event) about which information is
collected. When constructing either of these diagrams it is recommended that the entities be
limited to those of fundamental importance to the organization.

Entity-Attribute Lists

Entity-attribute relation lists may be derived from the Enterprise Data Dictionary listings.

Data Definition Lists

Data definition lists may be derived from Enterprise Data Dictionary listings.

Physical Database Specifications Document

Duplication of effort can be eliminated if there are existing documents available containing
physical specifications. Use the following resources for developing documentation:

1. DBMS provided documentation - For example, a listing of the schema from a CODASYL
DBMS will provide such details as data names, sizing, placement and access methods.
2. Data Dictionary – The Data Dictionary listing provides certain physical specifications,
such as, data format and length.
3. Project documentation - All documentation submitted as Physical Database
Specifications must be organized in a macro to micro manner or global to specific. That
is, begin at the schema level, moving to subschema, indices, data elements, etc. The
objective is to organize documentation in a manner that is clear and easy for the user to
read.

Names
Where appropriate in the documentation, identify names of physical database items. Specifically,
the items will be all those defined to the DBMS software, such as: Schema; Subschema; Set;
Record; Field; Key; Index Names

Where data naming standards are applicable, these standards shall be met to the extent possible
with the DBMS software. For example, if the DBMS does not permit hyphens in naming, an
exception would be made to the standard "all words in a name must be separated with a hyphen".

Data Structure/Sizing

Data elements, associations within and between record types, and sizing requirements are
identified and documented during the logical database design process. The
physical representation of data structures will vary from the logical, however, since the
physically stored data must adhere to specific DBMS characteristics. As applicable to the
DBMS, the following structures must be documented: Records; Blocks/Pages; Files

Describe the physical record layout. In this description, include the data fields, embedded
pointers, spare data fields, and database management system overhead (flags, codes, etc).
Besides data record types, document any other record types such as index records. If records are
handled as physical blocks or pages, provide the following:

 Calculations determining block/page size


 Total number of blocks/pages allocated

Describe the strategy for determining block/page sizes.

Specify the amount of space allocated for each database file. This must be consistent with the
total record and block/page sizing documentation described above.

Data Placement

1. For each record type: state the storage and access method used; describe the storage and
access method; where applicable, identify the physical data location (track, cylinder).
2. Where an algorithm (hashing) access method is used: give the primary record key to be
used by the algorithm; describe the algorithm used, including the number and size of
randomized address spaces available to the algorithm; give the packing density, and the
strategy for its determination.
3. Where an index sequential access method is used: give the primary record key; state the
indexing strategy/levels; state the initial load strategy.
4. Where a chain access method is used: give the access path to the record type (i.e., is this
primary access of a detail record through the master record, or is this a chain of secondary
keys?); list the pointer options used (i.e., forward, backward, owner, etc.); indicate
whether the chain is scattered or stored contiguously with the master.
5. Where an index access method is used, identify the keys used to index a record type.

Database Development Life Cycle


The life cycle of a relational database is the cycle of development and changes that a relational
database goes through during the course of its life. The cycle typically consists of several stages.
The database development process comprises a series of phases. The major phases in information
engineering are:
 Data base Planning
 Requirements Collection & Analysis
 Design
 DBMS Selection
 Implementation
 Maintenance

Database planning
The database-planning phase begins when a customer requests to develop a database project. It is
a set of tasks or activities that determine the resources required for the database development and
the time limits of the different activities. During the planning phase, four major activities are performed.
 Review and approve the database project request.
 Prioritize the database project request.
 Allocate resources such as money, people and tools.
 Arrange a development team to develop the database project.

Database planning should also include the development of standards that govern how data will
be collected, how its format should be specified, and what documentation will be needed.
Summary
 Define the problems and constraints
 Define the objectives
 Define scope and boundaries

Requirements Analysis
Requirements analysis is done in order to understand the problem that is to be solved. It is a
very important activity for the development of the database system. The person responsible for the
requirements analysis is often called the "Analyst". In the requirements analysis phase, the
requirements and expectations of the users are collected and analyzed. The collected requirements
help to understand the system that does not yet exist. There are two major activities in
requirements analysis:
 Problem understanding or analysis
 Requirement specification
Requirements analysis is the most important stage and the most labor-intensive; it involves
assessing the information needs of an organization so the database can be designed to meet
those needs.
Summary
 Examine the current system operation.

 Try to establish how and why the current system fails.


Design
The database design is the major phase of information engineering. In this phase, the information
models that were developed during analysis are used to design a conceptual schema for the
database and to design transactions and applications.
 In conceptual schema design, the data requirements collected in Requirement Analysis
phase are examined and a conceptual database schema is produced.
 In transaction and application design, the database applications analyzed in Requirement
Analysis phase are examined and specifications of these applications are produced.

DBMS Selection
In this phase an appropriate DBMS is selected to support the information system. A number of
factors are involved in DBMS selection. They may be technical and economic factors. The
technical factors are concerned with the suitability of the DBMS for information system. The
following technical factors are considered.
 Type of DBMS such as relational, object-oriented etc
 Storage structure and access methods that the DBMS supports.
 User and programmer interfaces available.
 Type of query languages.
 Development tools etc.

Implementation
After the design phase and selecting a suitable DBMS, the database system is implemented. The
purpose of this phase is to construct and install the information system according to the plan and
design as described in previous phases. Implementation involves a series of steps leading to
operational information system that includes creating database definitions (such as tables,
indexes etc), developing applications, testing the system, data conversion and loading,
developing operational procedures and documentation, training the users and populating the
database. In the context of information engineering, it involves two steps.
 Database definitions: the tables developed in the ER diagram are converted into SQL
statements (DDL) and executed against the selected DBMS, as sketched below.
 Creating applications: once the system administrator has installed and configured the
RDBMS, the application programs that use the database are developed.
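
As a minimal sketch of the database-definition step, the following JDBC fragment creates one table
derived from the ER diagram. The connection URL, the credentials and the EMPLOYEE table itself are
hypothetical placeholders for whatever RDBMS was selected.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateSchema {
        public static void main(String[] args) throws Exception {
            // The JDBC URL, user and password are placeholders for the selected RDBMS.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/hr", "dba", "secret");
                 Statement stmt = con.createStatement()) {
                // One table from the ER diagram: EMPLOYEE(SSN, NAME, ADDRESS).
                stmt.execute("CREATE TABLE employee ("
                           + "  ssn     CHAR(9) PRIMARY KEY,"
                           + "  name    VARCHAR(60) NOT NULL,"
                           + "  address VARCHAR(120))");
                // A supporting index chosen during physical design.
                stmt.execute("CREATE INDEX employee_name_idx ON employee (name)");
            }
        }
    }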

Operational Maintenance
Once the database system is implemented, the operational maintenance phase of the database
system begins. The operational maintenance is the process of monitoring and maintaining the
database system. Maintenance includes activities such as adding new fields, changing the size of
an existing field, adding new tables, and so on. As the database system requirements change, it
becomes necessary to add new tables or remove existing tables and to reorganize some files by
changing primary access methods or by dropping old indexes and constructing new ones. Some
queries or transactions may be rewritten for better performance. Database tuning or
reorganization continues throughout the life of the database as the requirements keep changing.

Chapter Review Questions


1. Discuss the logical database design

2. Discuss the Physical Database Design

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER TWO: DATABASE RECOVERY AND DATABASE SECURITY

Learning Objectives:

By the end of this chapter the learner shall be able to;


i. Understand database security and recovery

ii. Understand database recovery features

iii. Understand database access controls

Introduction to Database Recovery


Database Recovery is the process of restoring the database to a correct state in the event of a
failure

Causes of Failures
Transaction-Local - The failure of an individual transaction can be caused in several ways:
 Transaction – Induced Abort e.g. insufficient funds
 Unforeseen Transaction Failure, arising from bugs in the application programs
 System-Induced Abort – e.g. when Transaction Manager explicitly aborts a
transaction because it conflicts with another transaction or to break a deadlock.
Site Failures - Occurs due to failure of local CPU and results in a System Crashing.
 Total Failure – all sites in DDS are down
 Partial Failure – some sites are down
Media Failures e.g. head crashes
Network Failures - a failure may occur in the communications links.
Disasters e.g. fire or power failures
Carelessness - unintentional destruction of data by users
Sabotage - intentional corruption or destruction of data, h/w or s/w facilities
Recovery Procedures

Recovering from any type of system failure requires the following:

 Determining which data structures are intact and which ones need recovery.

 Following the appropriate recovery steps.

 Restarting the database so that it can resume normal operations.

 Ensuring that no work has been lost and no incorrect data has been entered in the database.

Database Recovery Features

 recovery from system, software, or hardware failure

 automatic database instance recovery at database start up

 recovery of individual offline table spaces or files while the rest of a database is
operational

 time-based and change-based recovery operations to recover to a transaction-consistent
state specified by the database administrator

 increased control over recovery time in the event of system failure

 the ability to apply redo log entries in parallel to reduce the amount of time for recovery

 Export and Import utilities for archiving and restoring data in a logical data format, rather
than a physical file backup

DBMS Recovery Facilities


 A backup mechanism
 Logging Facilities
 Checkpoint facility to enable updates to the database that are in progress to be made permanent
 Recovery Manager - it is the role of the Recovery Manager to guarantee two of the four
A.C.I.D. properties, i.e. durability and atomicity, in the presence of unpredictable failures.
Log File

All operations on the database carried out by all transactions are recorded in the log file
(journal). Each log record contains
 Transaction identifier

 Type of log record e.g. begin, write, commit, abort etc.

 Identifier of data object affected

 Before-image of data object

 After-image of data object

 Log management information


The Log is also used for monitoring and audit purposes.
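
As a rough sketch, a log record can be pictured as a simple data structure whose fields mirror the
list above; the field names are hypothetical and the actual layout is DBMS-specific.

    // Hypothetical in-memory representation of one log (journal) record.
    public class LogRecord {
        enum Type { BEGIN, WRITE, COMMIT, ABORT, CHECKPOINT }

        long   transactionId;   // transaction identifier
        Type   type;            // type of log record
        String objectId;        // identifier of the data object affected (WRITE records only)
        byte[] beforeImage;     // value before the update, used to undo
        byte[] afterImage;      // value after the update, used to redo
        long   lsn;             // log sequence number (management information)
    }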

Recovery Techniques
Restart Procedures – No transactions are accepted until the database has been repaired.
Includes:
 Emergency Restart - follows when a system fails without warning e.g. due to power
failure.
 Cold Restart - system restarted from archive when the log and/or restart file has
been corrupted.
 Warm Restart - follows controlled shutdown of system

Backups - particularly useful if there has been extensive damage to database.


Mirroring - Two complete copies of the DB are maintained on-line on different stable storage
devices. Used in environments with non-stop, fault-tolerant operations.
Undo/Redo - Undoing and Redoing a transaction after failure. The Transaction Manager keeps
an active-list, an abort-list and a commit-list: transactions that have begun, transactions that have
been aborted, and transactions that have committed, respectively.
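
A simplified sketch of the undo/redo pass over such a log is given below. It builds on the
hypothetical LogRecord structure sketched earlier and assumes the whole log fits in memory, which a
real recovery manager would not.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of restart recovery: redo committed work, undo everything else.
    public class UndoRedoRecovery {
        void recover(List<LogRecord> log) {
            Set<Long> committed = new HashSet<>();
            for (LogRecord r : log) {                       // pass 1: build the commit-list
                if (r.type == LogRecord.Type.COMMIT) committed.add(r.transactionId);
            }
            for (LogRecord r : log) {                       // pass 2 (forward): redo committed writes
                if (r.type == LogRecord.Type.WRITE && committed.contains(r.transactionId))
                    apply(r.objectId, r.afterImage);
            }
            for (int i = log.size() - 1; i >= 0; i--) {     // pass 3 (backward): undo uncommitted writes
                LogRecord r = log.get(i);
                if (r.type == LogRecord.Type.WRITE && !committed.contains(r.transactionId))
                    apply(r.objectId, r.beforeImage);
            }
        }

        void apply(String objectId, byte[] image) { /* write the image back to stable storage */ }
    }
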
Introduction to Database Security
Database security can be defined as the protection of the database against: -
1. unauthorized access to or modification of the database,
2. denial of service to authorized users and
3. provision of service to unauthorized users
It also includes the measures necessary to detect, document, and counter threats
Characteristics of database security
 Confidentiality – protection against disclosure to unauthorized parties
 Integrity – data is not accidentally or maliciously manipulated, altered or corrupted
 Availability – accessibility, reliability and assurance of continuity of operation

Threats

 Browsing - accessing info

 Misuse – malice, errors of omission etc.

 Penetration – Unauthorized access

 Systems Flaws – h/w and s/w errors

 Component Failure – malfunctioning of h/w, s/w or media

 Tampering – attacks to physical and logical components

 Eavesdropping – passive surveillance of telecomm channel e.g. tapping, sniffing

 Denial of Service – preventing or delaying performance e.g. jamming, traffic flooding

Database Security Mechanisms


Two types of database security mechanisms

 Discretionary security
 Mandatory security

Discretionary Access Control


Discretionary access control in a database system is based on granting and revoking privileges
Database Security & Database Administrator (DBA)
The DBA has the following privileges

 Account creation

 Privilege granting

 Privilege revocation

 Security level assignment

Discretionary Privileges
Two levels of assigning privileges

 Account level: CREATE Acc, ALTER Acc, DROP Acc, SELECT Acc

 Relation level: SELECT on R, MODIFY on R, REFERENCES on R

Users are authenticated with user ids and passwords. System privileges allow a user to create or
manipulate objects, but do not give access to actual database objects. Object privileges are used
to allow access to a specific database object, such as a particular table or view, and can be given at
the view level. Privileges can be granted and revoked, and they can also be propagated to other
users (for example, via the GRANT OPTION), as sketched below.
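
A minimal sketch of granting, propagating and revoking object privileges through JDBC follows; the
employee table and the accounts alice and bob are hypothetical, and the exact GRANT syntax varies
slightly between DBMS products.

    import java.sql.Connection;
    import java.sql.Statement;

    public class Privileges {
        // Grant, propagate and revoke object privileges on a hypothetical EMPLOYEE table.
        static void manage(Connection con) throws Exception {
            try (Statement stmt = con.createStatement()) {
                // Object privileges on a relation, granted to account 'alice'.
                stmt.execute("GRANT SELECT, UPDATE ON employee TO alice");
                // WITH GRANT OPTION lets 'bob' propagate the privilege to others.
                stmt.execute("GRANT SELECT ON employee TO bob WITH GRANT OPTION");
                // Revocation removes the privilege again.
                stmt.execute("REVOKE UPDATE ON employee FROM alice");
            }
        }
    }
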
Roles
Roles are used to ease the management task of assigning a multitude of privileges to users. Roles
are first created and then given sets of privileges that can be assigned to users and other roles.
Users can be given multiple roles.
Three default roles:
 Connect Role allows user login and the ability to create their own tables, indexes, etc.
 Resource Role is similar to the Connect Role, but allows for more advanced rights such
as the creation of triggers and procedures.
 Database Administrator Role is granted all system privileges needed to administer the
database and users.
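
A sketch of creating a role, giving it privileges, and assigning it to a user follows; the orders
table and the user alice are hypothetical, and role syntax differs between DBMS products.

    import java.sql.Connection;
    import java.sql.Statement;

    public class Roles {
        // Bundle privileges into a role and assign the role to a user.
        static void setUp(Connection con) throws Exception {
            try (Statement stmt = con.createStatement()) {
                stmt.execute("CREATE ROLE clerk");
                stmt.execute("GRANT SELECT, INSERT ON orders TO clerk");
                stmt.execute("GRANT clerk TO alice");
            }
        }
    }
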
Profiles
Profiles allow the administrator to place specific restrictions and controls on a number of system
resources, password use etc. These profiles can be defined, named, and then assigned to specific
users or groups of users.
Two types of profiles: system resource profiles and product profiles
 System resource profiles can be used to put user limits on certain system resources such
as CPU time, No. of data blocks that can be read per session or program call, the number
of concurrent active sessions, idle time, and the maximum connection time for a user.
 Product profiles can be used to prevent users from accessing specific commands or all
commands
Profiles can be used to prevent intentional or unintentional system resource "hogs"
Access Control
Access control for Operating Systems
 Deals with unrelated data
 Deals with entire files
Access control for Databases
 Deals with records and fields
 Concerned with inference of one field from another
An access control list for several hundred files is easier to implement than an access control list
for a database!

Backups

 "Cold" backups allow backups when the database is down.

 "Hot" backups allow backups to be done while the database is up.

 Logical backups or "exports" take a snapshot of the database at a given point in time by
user or specific table(s) and allow recovery of the full database or of single tables if needed.
Database Audits
Audit Trail - A database log that is used mainly for security purpose. Audit trail desirable in
order to:
 Determine who did what
 Prevent incremental access
Audit trail of all accesses is impractical:
 Slow
 Large
Over-reporting is also possible (the pass-through problem): a field may be accessed during a select
operation but its values are never reported to the user.
Replication
Database replication facilities can be used to create a duplicate fail-over database site in case of
system failure of the primary database. A replicated database can also be useful for off-loading
large processing intensive queries.
Parallel Servers
Parallel Server makes use of two or more servers in a cluster which access a single database. A
cluster can provide load balancing, can scale up more easily, and if a server in the cluster fails
only a sub-set of users may be affected.
Data Partitioning
Data partitioning can be used by administrators to aid in the management of very large tables.
Large tables can be broken into smaller tables by using data partitioning. One advantage of
partitioning is that data that is more frequently accessed can be partitioned and placed on faster
hard drives. This helps to ensure faster access times for users.
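
As an illustrative sketch only, the statements below use PostgreSQL-style declarative range
partitioning to split a hypothetical ORDERS table by date; other DBMS products use different
partitioning syntax.

    import java.sql.Connection;
    import java.sql.Statement;

    public class Partitioning {
        // Range-partition a large ORDERS table so current data can sit in its own partition
        // (and, where the DBMS allows it, on faster storage).
        static void partition(Connection con) throws Exception {
            try (Statement stmt = con.createStatement()) {
                stmt.execute("CREATE TABLE orders (order_id BIGINT, order_date DATE, amount NUMERIC)"
                           + " PARTITION BY RANGE (order_date)");
                stmt.execute("CREATE TABLE orders_2024 PARTITION OF orders"
                           + " FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')");
                stmt.execute("CREATE TABLE orders_old PARTITION OF orders"
                           + " FOR VALUES FROM ('2000-01-01') TO ('2024-01-01')");
            }
        }
    }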

Database Integrity
Concern that the database as a whole is protected from damage
Element Integrity - Concern that the value of a specific element is written or changed only by
actions of authorized users
Element Accuracy - Concern that only correct values are written into the elements of a database
Problem with DDBS - Failure of system while modifying data
Results
 Single field - half of a field being updated may show the old data
 Multiple fields - no single field reflects an obvious error
Solution - Update in two phases
First phase - Intent Phase
 DBMS gathers the information and other resources needed to perform the update
 Makes no changes to database.
Second Phase - Commit Phase
 Write commit flag to database
 DBMS make permanent changes
If the system fails during second phase, the database may contain incomplete data, but this can
be repaired by performing all activities of the second phase
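
A sketch of the two-phase (intent/commit) update is given below; the stable-storage helper methods
are hypothetical stubs.

    // Sketch of the two-phase (intent/commit) update described above.
    public class TwoPhaseUpdate {
        void update(String key, byte[] newValue) {
            // Intent phase: gather resources and stage the change without touching the database.
            byte[] staged = newValue.clone();
            writeIntentRecord(key, staged);        // recorded outside the database proper

            // Commit phase: write the commit flag, then make the change permanent.
            setCommitFlag(key, true);
            writeToDatabase(key, staged);
            setCommitFlag(key, false);
            // If the system fails after the flag is set, restart simply repeats the
            // commit phase from the intent record, so no inconsistent state survives.
        }

        void writeIntentRecord(String key, byte[] value) { /* stable storage */ }
        void setCommitFlag(String key, boolean on)       { /* stable storage */ }
        void writeToDatabase(String key, byte[] value)   { /* the real update */ }
    }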

Internal Consistency
Error Detection and Correction Code
 Parity checks
 Cyclic redundancy checks (CRC)
 Hamming codes
Shadow Fields
 Copy of entire attributes or records
 Second copy can provide replacement
Constraints
State Constraints
 Describes the condition of the entire database.
Transition Constraints
 Describes conditions necessary before changes can be applied to database
Multilevel Data Bases
Three characteristics of database security:
 The security of a single element may differ from the security of other elements of the
same record or from values of the same attribute (implies security should be
implemented for individual elements)
 Several grades of security may be needed and may represent ranges of allowable
knowledge, which may overlap
 The security of an aggregate may differ from the security of the individual elements
Every combination of elements in a database may also have a distinct sensitivity. The
combination may be more or less sensitive than any of its individual elements
 "Manhattan" (not sensitive)

 "project" (not sensitive)

 "Manhattan project" (sensitive)


An access control policy must dictate which users may have access to what data (each data
element is marked to show its access limitation). A means is needed to guarantee that the value
has not been changed by an unauthorized person

Proposals for Multilevel Security


Partitioning - The database is divided into separate databases, each at its own security level
 This destroys basic advantages of databases, i.e. elimination of redundancy and
improved accuracy
Encryption -If sensitive data is encrypted, a user who accidentally receives sensitive data cannot
interpret the data
Integrity Lock - A way to provide both integrity and limited access for a database.
 Each data item consists of three elements
 Data itself

 Classification to indicate sensitivity(e.g. concealed)

 Cryptographic Checksum
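
A minimal sketch of computing such a checksum over the data value together with its classification
label, using the standard Java MessageDigest API; in practice a keyed MAC would be used so that an
attacker cannot simply recompute the checksum.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class IntegrityLock {
        // Checksum over the data value plus its sensitivity label, so that neither can be
        // altered without detection. (A keyed MAC would be used in practice; this is a sketch.)
        static byte[] checksum(String value, String classification) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(value.getBytes(StandardCharsets.UTF_8));
            md.update(classification.getBytes(StandardCharsets.UTF_8));
            return md.digest();
        }
    }
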
Trusted Front-End (Guard)
 User identifies self to front-end; front-end authenticates user's identity
 User issues a query to front-end
 Front-end verifies user's authorization to data
 Front-end issues query to database manager
 Database manager performs I/O access
 Database manager returns result of query to front-end
 Front-end verifies validity of data via checksum and checks classification of data
against security level of user
 Front-end transmits data to untrusted front-end for formatting.
 Untrusted front-end transmits data to user
View
 A subset of a database, containing exactly the information that a user is entitled to
access
 Can represent a single user's subset database, so that all of a user's queries access only
that data
Layered Implementation
 Integrated with a trusted operating system to form trusted database manager base

 First level -Performs user authentication

 Second level - Performs basic indexing and computation functions

 Third level - Translates views into the base relations of the database

Chapter Review Questions


i. Discuss the layered implementation of database security
ii. How do you achieve database integrity
iii. Discuss database recovery procedures

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER THREE: TRANSACTION CONTROL

Learning Objectives:

By the end of this chapter the learner shall be able to;

i. Understand database transaction management

ii. Understand transaction properties

iii. Understand nested transactions

Introduction to Transactions
Transactions are collections of actions that potentially modify two or more entities.  An example
of a transaction is a transfer of funds between two bank accounts.  The transaction consists of
debiting the source account, crediting the target account, and recording the fact that this
occurred. An important message of this chapter is that transactions are not simply the domain of
databases; instead they are issues that are potentially pertinent to all of your architectural tiers.

Let’s start with a few definitions. Bernstein and Newcomer (1997) distinguish between:

 Business transactions. A business transaction is an interaction in the real world, usually
between an enterprise and a person, where something is exchanged.
 Online transaction. An online transaction is the execution of a program that performs
an administrative or real-time function, often by accessing shared data sources, usually
on behalf of an online user (although some transactions are run offline in batch). This
transaction program contains the steps involved in the business transaction. This
definition of an online transaction is important because it makes it clear that there is far
more to this topic than database transactions.

A transaction-processing (TP) system is the hardware and software that implements the
transaction programs. A TP monitor is a portion of a TP system that acts as a kind of funnel or
concentrator for transaction programs, connecting multiple clients to multiple server programs
(potentially accessing multiple data sources). In a distributed system, a TP monitor will also
optimize the use of the network and hardware resources. Examples of TP monitors include
IBM’s Customer Information Control System (CICS), IBM’s Information Management System
(IMS), BEA’s Tuxedo, and Microsoft Transaction Server (MTS).

The focus of this chapter is on the fundamentals of online transactions (e.g. the technical side of
things). The critical concepts are:

 ACID
 Two-phase commits
 Nested transactions

The ACID Properties


An important fundamental of transactions is the set of four properties that they must exhibit:

 Atomicity. The whole transaction occurs or nothing in the transaction occurs; there is
no in between. In SQL, the changes become permanent when a COMMIT statement is
issued, and they are aborted when a ROLLBACK statement is issued. For example, the
transfer of funds between two accounts is a transaction. If we transfer $20 from account
A to account B, then at the end of the transaction A’s balance will be $20 lower and B’s
balance will be $20 higher (if the transaction is completed) or neither balance will have
changed (if the transaction is aborted).
 Consistency. When the transaction starts the entities are in a consistent state, and when
the transaction ends the entities are once again in a consistent, albeit different, state.
The implication is that the referential integrity rules and applicable business rules still
apply after the transaction is completed.
 Isolation. All transactions work as if they alone were operating on the entities. For
example, assume that a bank account contains $200 and each of us is trying to
withdraw $50. Regardless of the order of the two transactions, at the end of them the
account balance will be $100, assuming that both transactions work. This is true even if
both transactions occur simultaneously. Without the isolation property two
simultaneous withdrawals of $50 could result in a balance of $150 (both transactions
saw a balance of $200 at the same time, so both wrote a new balance of $150). Isolation
is often referred to as serializability.
 Durability. The entities are stored in a persistent media, such as a relational database
or file, so that if the system crashes the transactions are still permanent.

Two-Phase Commits (2PC)


As the name suggests there are two phases to the 2PC protocol: the attempt phase where each
system tries its part of the transaction and the commit phase where the systems are told to persist
the transaction. The 2PC protocol requires the existence of a transaction manager to coordinate
the transaction. The transaction manager will assign a unique transaction ID to the transaction to
identify it. The transaction manager then sends the various transaction steps to each system of
record so they may attempt them, each system responding back to the transaction manager with
the result of the attempt. If an attempted step succeeds then at this point the system of record
must lock the appropriate entities and persist the potential changes in some manner (to ensure
durability) until the commit phase. Once the transaction manager hears back from all systems of
record that the steps succeeded, or once it hears back that a step failed, then it either sends out a
commit or abort request to every system involved.
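
A sketch of the coordinator's side of this protocol is shown below; the Participant interface is a
hypothetical stand-in for each system of record, and a real implementation would also log its
decisions so the protocol survives a coordinator crash.

    import java.util.List;

    // Sketch of the coordinator side of two-phase commit.
    public class TwoPhaseCommitCoordinator {
        interface Participant {
            boolean attempt(long txId);   // phase 1: try the step, lock and persist it tentatively
            void commit(long txId);       // phase 2: make the step permanent
            void abort(long txId);        // phase 2: undo the step
        }

        void run(long txId, List<Participant> participants) {
            boolean allSucceeded = true;
            for (Participant p : participants) {          // phase 1: attempt every step
                if (!p.attempt(txId)) { allSucceeded = false; break; }
            }
            for (Participant p : participants) {          // phase 2: commit or abort everywhere
                if (allSucceeded) p.commit(txId); else p.abort(txId);
            }
        }
    }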

Nested Transactions
So far I have discussed flat transactions, transactions whose steps are individual activities. A
nested transaction is a transaction where some of its steps are other transactions, referred to as
sub transactions. Nested transactions have several important features:

 When a program starts a new transaction, if it is already inside of an existing transaction
then a sub transaction is started; otherwise a new top level transaction is started.
 There does not need to be a limit on the depth of transaction nesting.
 When a sub transaction aborts then all of its steps are undone, including any of its sub
transactions. However, this does not cause the abort of the parent transaction; instead
the parent transaction is simply notified of the abort.
 When a sub transaction is executing the entities that it is updating are not visible to
other transactions or sub transactions (as per the isolation property).
 When a sub transaction commits then the updated entities are made visible to other
transactions and sub transactions.
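
Relational DBMSs and JDBC do not expose true nested transactions, but savepoints give a rough
approximation of sub-transaction rollback. A sketch, assuming a hypothetical account table and a
connection with auto-commit disabled:

    import java.sql.Connection;
    import java.sql.Savepoint;
    import java.sql.Statement;

    public class NestedLikeTransaction {
        static void run(Connection con) throws Exception {
            con.setAutoCommit(false);
            try (Statement stmt = con.createStatement()) {
                stmt.executeUpdate("UPDATE account SET balance = balance - 20 WHERE id = 'A'");

                Savepoint sub = con.setSavepoint();   // start of the "sub transaction"
                try {
                    stmt.executeUpdate("UPDATE account SET balance = balance + 20 WHERE id = 'B'");
                } catch (Exception e) {
                    con.rollback(sub);                // abort only the sub transaction's steps
                }
                con.commit();                         // top-level commit makes the rest permanent
            } catch (Exception e) {
                con.rollback();                       // abort of the top-level transaction
                throw e;
            }
        }
    }
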
Implementing Transactions
Although transactions are often thought of as purely a database issue, the reality could not be
further from the truth. From the introduction of TP monitors such as CICS and Tuxedo in the 1970s
and 1980s, to the CORBA-based object request brokers (ORBs) of the early 1990s, to the EJB
application servers of the early 2000s, transactions have clearly been far more than a database
issue. This section explores several approaches to implementing transactions that involve both
object and relational technology. This material is aimed at application developers as well as
Agile DBAs who need to explore strategies that they may not have run across in traditional data-
oriented literature. These implementation options are:

1. Database transactions
2. Object transactions
3. Distributed object transactions
4. Including non-transactional steps

Database Transactions

The simplest way for an application to implement transactions is to use the features supplied by
the database. Transactions can be started, attempted, then committed or aborted via SQL code.
Better yet, database APIs such as Java Database Connectivity (JDBC) and Open Database
Connectivity (ODBC) provide classes that support basic transactional functionality.
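
A minimal sketch of the funds-transfer transaction using plain JDBC follows; the account table and
the connection URL are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class TransferFunds {
        // The classic transfer-of-funds transaction using plain JDBC.
        static void transfer(String url, String from, String to, double amount) throws Exception {
            try (Connection con = DriverManager.getConnection(url)) {
                con.setAutoCommit(false);             // start the transaction
                try (PreparedStatement debit  = con.prepareStatement(
                         "UPDATE account SET balance = balance - ? WHERE id = ?");
                     PreparedStatement credit = con.prepareStatement(
                         "UPDATE account SET balance = balance + ? WHERE id = ?")) {
                    debit.setDouble(1, amount);  debit.setString(2, from);  debit.executeUpdate();
                    credit.setDouble(1, amount); credit.setString(2, to);   credit.executeUpdate();
                    con.commit();                     // both updates become permanent together
                } catch (Exception e) {
                    con.rollback();                   // neither update survives
                    throw e;
                }
            }
        }
    }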

Object Transactions

At the time of this writing support for transaction control is one of the most pressing issues in the
web services community and full support for nested transactions is underway within the EJB
community. As you see in Figure below, databases aren’t the only things that can be involved in
transactions. The fact is that objects, services, components, legacy applications, and non-
relational data sources can all be included in transactions.
The advantage of adding behaviors implemented by objects (and similarly services, components,
and so on) to transactions is that they become far more robust. Can you imagine using a code
editor, word processor, or drawing program without an undo function? If not, then I believe it
becomes reasonable to expect both behavior invocation as well as data transformations as steps
of a transaction. Unfortunately this strategy comes with a significant disadvantage – increased
complexity. For this to work your business objects need to be transactionally aware. Any
behavior that can be invoked as a step in a transaction requires supporting attempt, commit, and
abort/rollback operations. Adding support for object-based transactions is a non-trivial
endeavor.
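
As a sketch, the contract that such a transactionally aware object might honour can be captured in
a small, hypothetical interface:

    // Hypothetical contract a business object must honour to take part in a transaction.
    public interface TransactionalStep {
        void attempt() throws Exception;   // do the work tentatively (and remember how to undo it)
        void commit();                     // make the work permanent
        void rollback();                   // undo the tentative work
    }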

Distributed Object Transactions

Just like it is possible to have distributed data transactions it is possible to have distributed object
transactions as well. To be more accurate, as you see in Figure above it’s just distributed
transactions period – it’s not just about databases any more, but it’s databases plus objects plus
services plus components plus… and so on.

Including Non-Transactional Steps

Sometimes you find that you need to include a non-transactional source within a transaction. A
perfect example is an update to information contained in an LDAP directory or the invocation of
a web service, neither of which at the time of this writing support transactions. The problem is
as soon as a step within a transaction is non-transactional the transaction really isn’t a transaction
any more. You have four basic strategies available to you for dealing with this situation:

1. Remove the non-transactional step from your transaction. In practice this is rarely an
option, but if it's a viable strategy then consider doing so.
2. Implement commit.  This strategy, which could be thought of as the “hope the parent
transaction doesn’t abort” strategy, enables you to include a non-transactional step within
your transaction. You will need to simulate the attempt, commit, and abort protocol used
by the transaction manager. The attempt and abort behaviors are simply stubs that do
nothing other than implement the requisite protocol logic. The one behavior that you do
implement, the commit, will invoke the non-transactional functionality that you want. A
different flavor of this approach, which I’ve never seen used in practice, would put the
logic in the attempt phase instead of the commit phase.

3. Implement attempt and abort.  This is an extension to the previous technique whereby
you basically implement the “do” and “undo” logic but not the commit. In this case, the
work is done in the attempt phase; the assumption is that the rest of the transaction will
work, but if it doesn’t, you still support the ability to roll back the work. This is an
“almost transaction” because it doesn’t avoid the problems with collisions described
earlier.

4. Make it transactional. With this approach, you fully implement the requisite attempt,
commit, and abort behaviors. The implication is that you will need to implement all the
logic to lock the affected resources and to recover from any collisions. An example of
this approach is supported by the J2EE Connector Architecture (JCA), in particular by the
LocalTransaction interface.

Chapter Review Questions

i. Discuss the nested transactions

ii. Discuss the transaction properties


References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER FOUR: CONCURRENCY CONTROL

Learning Objectives:

By the end of this chapter the learner shall be able to;

i. Understand database concurrency control

ii. Understand concurrency control locking strategies

Introduction to Concurrency Control


Concurrency control is a database management systems (DBMS) concept that is used to address
conflicts with the simultaneous accessing or altering of data that can occur with a multi-user
system. Concurrency control, when applied to a DBMS, is meant to coordinate simultaneous
transactions while preserving data integrity. Concurrency control governs multi-user access to the
database.

To illustrate the concept of concurrency control, consider two travelers who go to electronic
kiosks at the same time to purchase a train ticket to the same destination on the same train.
There's only one seat left in the coach, but without concurrency control, it's possible that both
travelers will end up purchasing a ticket for that one seat. However, with concurrency control,
the database wouldn't allow this to happen. Both travelers would still be able to access the train
seating database, but concurrency control would preserve data accuracy and allow only one
traveler to purchase the seat.

This example also illustrates the importance of addressing this issue in a multi-user database.
Obviously, one could quickly run into problems with the inaccurate data that can result from
several transactions occurring simultaneously and writing over each other. The following section
provides strategies for implementing concurrency control.
Concurrency Control Locking Strategies

Pessimistic Locking: This concurrency control strategy involves keeping an entity in a database
locked the entire time it exists in the database's memory. This limits or prevents users from
altering the data entity that is locked. There are two types of locks that fall under the category of
pessimistic locking: write lock and read lock.

With write lock, everyone but the holder of the lock is prevented from reading, updating, or
deleting the entity. With read lock, other users can read the entity, but no one except for the lock
holder can update or delete it.
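
A sketch of pessimistic (write) locking through JDBC, using the widely supported SELECT ... FOR
UPDATE syntax on a hypothetical customer table; the lock is held until the transaction commits or
rolls back.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PessimisticLock {
        // Acquire a write lock on one customer row for the duration of the transaction.
        static void updatePreferences(Connection con, long id) throws Exception {
            con.setAutoCommit(false);
            try (PreparedStatement lock = con.prepareStatement(
                     "SELECT preferences FROM customer WHERE id = ? FOR UPDATE");
                 PreparedStatement update = con.prepareStatement(
                     "UPDATE customer SET preferences = ? WHERE id = ?")) {
                lock.setLong(1, id);
                try (ResultSet rs = lock.executeQuery()) { rs.next(); }  // row is now locked
                update.setString(1, "email");
                update.setLong(2, id);
                update.executeUpdate();
                con.commit();                                            // commit releases the lock
            } catch (Exception e) {
                con.rollback();
                throw e;
            }
        }
    }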

Optimistic Locking: This strategy can be used when instances of simultaneous transactions, or
collisions, are expected to be infrequent. In contrast with pessimistic locking, optimistic locking
doesn't try to prevent the collisions from occurring. Instead, it aims to detect these collisions and
resolve them on the chance occasions when they occur.

Pessimistic locking provides a guarantee that database changes are made safely. However, it
becomes less viable as the number of simultaneous users or the number of entities involved in a
transaction increase because the potential for having to wait for a lock to release will increase.

Optimistic locking can alleviate the problem of waiting for locks to release, but then users have
the potential to experience collisions when attempting to update the database.
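
A common way to detect such collisions is a version column, as in the sketch below (hypothetical
customer table): the UPDATE succeeds only if the row still carries the version that was originally
read.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class OptimisticLock {
        // Detect collisions with a version column: the update only succeeds if nobody
        // changed the row since it was read.
        static boolean updateName(Connection con, long id, String newName, int versionRead)
                throws Exception {
            try (PreparedStatement update = con.prepareStatement(
                     "UPDATE customer SET name = ?, version = version + 1"
                   + " WHERE id = ? AND version = ?")) {
                update.setString(1, newName);
                update.setLong(2, id);
                update.setInt(3, versionRead);
                return update.executeUpdate() == 1;   // 0 rows means a collision was detected
            }
        }
    }

If zero rows are updated, a collision has occurred and the application can re-read the row and
either retry or report the conflict to the user.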

Lock Problems

Deadlock:

When dealing with locks two problems can arise, the first of which being deadlock. Deadlock
refers to a particular situation where two or more processes are each waiting for another to
release a resource, or more than two processes are waiting for resources in a circular chain.
Deadlock is a common problem in multiprocessing where many processes share a specific type
of mutually exclusive resource. Some computers, usually those intended for the time-sharing
and/or real-time markets, are often equipped with a hardware lock, or hard lock, which
guarantees exclusive access to processes, forcing serialization. Deadlocks are particularly
disconcerting because there is no general solution to avoid them.

A fitting analogy of the deadlock problem could be a situation like when you go to unlock your
car door and your passenger pulls the handle at the exact same time, leaving the door still locked.
If you have ever been in a situation where the passenger is impatient and keeps trying to open the
door, it can be very frustrating. Basically you can get stuck in an endless cycle, and since both
actions cannot be satisfied, deadlock occurs.

Livelock:

Livelock is a special case of resource starvation. A livelock is similar to a deadlock, except that
the states of the processes involved constantly change with regard to one another while never
progressing. The general definition only states that a specific process is not progressing. For
example, the system keeps selecting the same transaction for rollback causing the transaction to
never finish executing. Another livelock situation can come about when the system is deciding
which transaction gets a lock and which waits in a conflict situation.

An illustration of livelock occurs when numerous people arrive at a four way stop, and are not
quite sure who should proceed next. If no one makes a solid decision to go, and all the cars just
keep creeping into the intersection afraid that someone else will possibly hit them, then a kind of
livelock can happen.

Basic Timestamping:

Basic timestamping is a concurrency control mechanism that eliminates deadlock. This method
doesn’t use locks to control concurrency, so it is impossible for deadlock to occur. According to
this method a unique timestamp is assigned to each transaction, usually showing when it was
started. This effectively assigns an age to each transaction and imposes an order on the transactions.
Data items have both a read-timestamp and a write-timestamp. These timestamps are updated
each time the data item is read or updated, respectively.
Problems arise in this system when a transaction tries to read a data item which has been written
by a younger transaction. This is called a late read: the data item has changed since the initial
transaction start time, and the solution is to roll back (abort) the transaction and restart it with a
new timestamp. Another problem occurs when a transaction tries to write a data item which has
been read by a younger transaction. This is called a late write: the data item has been read by
another transaction since the start time of the transaction that is altering it. The solution for this
problem is the same as for the late read problem: the transaction is rolled back and restarted with
a new timestamp.

Adhering to the rules of the basic timestamping process allows the transactions to be serialized
and a chronological schedule of transactions can then be created. Timestamping may not be
practical in the case of larger databases with high levels of transactions. A large amount of
storage space would have to be dedicated to storing the timestamps in these cases.

Collision Resolution Strategies

You have five basic strategies that you can apply to resolve collisions:

1. Give up.
2. Display the problem and let the user decide.

3. Merge the changes.

4. Log the problem so someone can decide later.

5. Ignore the collision and overwrite.

It is important to recognize that the granularity of a collision counts. Assume that both of us are
working with a copy of the same Customer entity. If you update a customer’s name and I update
their shopping preferences, then we can still recover from this collision. In effect the collision
occurred at the entity level, we updated the same customer, but not at the attribute level. It is
very common to detect potential collisions at the entity level then get smart about resolving them
at the attribute level.

 Optimistic concurrency control

Kung and Robinson (1981) proposed an alternative technique for achieving concurrency control,
called optimistic concurrency control. This is based on the observation that, in most applications,
the chance of two transactions accessing the same object is low. We will allow transactions to
proceed as if there were no possibility of conflict with other transactions: a transaction does not
have to obtain or check for locks.

This is the working phase. Each transaction has a tentative version (private workspace) of the
objects it updates - copy of the most recently committed version. Write operations record new
values as tentative values. Before a transaction can commit, a validation is performed on all the
data items to see whether the data conflicts with operations of other transactions. This is the
validation phase.
If the validation fails, then the transaction will have to be aborted and restarted later. If the
transaction succeeds, then the changes in the tentative version are made permanent. This is the
update phase. Optimistic control is deadlock free and allows for maximum parallelism (at the
expense of possibly restarting transactions)

Timestamp ordering

Another approach to concurrency control was presented by Reed in 1983. This is called
timestamp ordering. Each transaction is assigned a unique timestamp when it begins (can be
from a physical or logical clock). Each object in the system has a read and write timestamp
associated with it (two timestamps per object). The read timestamp is the timestamp of the last
committed transaction that read the object. The write timestamp is the timestamp of the last
committed transaction that modified the object (note - the timestamps are obtained from the
transaction timestamp - the start of that transaction)

The rule of timestamp ordering is:

 If a transaction wants to write an object, it compares its own timestamp with the object’s
read and write timestamps. If the object’s timestamps are older, then the ordering is good.
 If a transaction wants to read an object, it compares its own timestamp with the object’s
write timestamp. If the object’s write timestamp is older than the current transaction, then
the ordering is good.

If a transaction attempts to access an object and does not detect proper ordering, the transaction
is aborted and restarted (improper ordering means that a newer transaction came in and modified
data before the older one could access the data or read data that the older one wants to modify).
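
Stated slightly more formally (a sketch; the notation TS(T) for the timestamp of transaction T and RTS(x), WTS(x) for the read and write timestamps of object x is introduced here for convenience):
– Read rule: T may read x if TS(T) ≥ WTS(x); then RTS(x) is set to max(RTS(x), TS(T)). Otherwise T is aborted and restarted with a new timestamp.
– Write rule: T may write x if TS(T) ≥ RTS(x) and TS(T) ≥ WTS(x); then WTS(x) is set to TS(T). Otherwise T is aborted and restarted with a new timestamp.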

Why is concurrency control needed?

If transactions are executed serially, i.e., sequentially with no overlap in time, no transaction
concurrency exists. However, if concurrent transactions with interleaving operations are allowed
in an uncontrolled manner, some unexpected, undesirable result may occur. Here are some
typical examples:
 The lost update problem: A second transaction writes a second value of a data-item
(datum) on top of a first value written by a first concurrent transaction, and the first
value is lost to other transactions running concurrently which need, by their
precedence, to read the first value. The transactions that have read the wrong value end
with incorrect results.
 The dirty read problem: Transactions read a value written by a transaction that has
been later aborted. This value disappears from the database upon abort, and should not
have been read by any transaction ("dirty read"). The reading transactions end with
incorrect results.

 The incorrect summary problem: While one transaction takes a summary over the
values of all the instances of a repeated data-item, a second transaction updates some
instances of that data-item. The resulting summary does not reflect a correct result for
any (usually needed for correctness) precedence order between the two transactions (if
one is executed before the other), but rather some random result, depending on the
timing of the updates, and whether certain update results have been included in the
summary or not.

Most high-performance transactional systems need to run transactions concurrently to meet their
performance requirements. Thus, without concurrency control such systems can neither provide
correct results nor maintain their databases consistent.
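
To make the lost update problem concrete, consider the following interleaving of two transactions T1 and T2 over the PROJ(PNO, PNAME, BUDGET) relation used later in this manual (the project number and the amounts are invented for this sketch):

SELECT BUDGET FROM PROJ WHERE PNO = "P1";          -- T1 reads 250000
SELECT BUDGET FROM PROJ WHERE PNO = "P1";          -- T2 reads 250000
UPDATE PROJ SET BUDGET = 300000 WHERE PNO = "P1";  -- T1 writes 250000 + 50000
UPDATE PROJ SET BUDGET = 270000 WHERE PNO = "P1";  -- T2 writes 250000 + 20000

The final budget is 270000, so T1’s increase of 50000 is silently lost. With a concurrency control mechanism (locking or timestamping), one of the two transactions would have been delayed or restarted instead.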

Chapter Review Questions


i. Discuss the various concurrency control locking mechanisms

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant B. Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison-Wesley. ISBN-13: 978-0136086208
CHAPTER FIVE: OVERVIEW OF QUERY PROCESSING

Learning Objectives:

By the end of this chapter the learner shall be able to;

i. Understand what query processing is

ii. Understand the steps in query processing

Query Processing Overview


Query processing: A 3-step process that transforms a high-level query (of relational
calculus/SQL) into an equivalent and more efficient lower-level query (of relational algebra).
1. Parsing and translation – Check syntax and verify relations. Translate the query into an
equivalent relational algebra expression.
2. Optimization – Generate an optimal evaluation plan (with lowest cost) for the query
plan.
3. Evaluation – The query-execution engine takes an (optimal) evaluation plan, executes
that plan, and returns the answers to the query.
The success of RDBMSs is due, in part, to the availability of declarative query languages that
allow users to express complex queries easily without knowing the details of the physical data
organization, and to advanced query processing technology that transforms the high-level
user/application queries into efficient lower-level query execution strategies.
The query transformation should achieve both correctness and efficiency. The main difficulty is
to achieve efficiency; this is also one of the most important tasks of any DBMS.
Distributed query processing: Transform a high-level query (of relational calculus/SQL) on a
distributed database (i.e., a set of global relations) into an equivalent and efficient lower-level
query (of relational algebra) on relation fragments.
Distributed query processing is more complex: Fragmentation/replication of relations; Additional
communication costs; Parallel execution.
Query Processing Example
Example: Transformation of an SQL-query into an RA-query.
Relations: EMP(ENO, ENAME, TITLE), ASG(ENO,PNO,RESP,DUR)
Query: Find the names of employees who are assigned to a project for a duration greater than 37.
– High level query
SELECT ENAME
FROM EMP,ASG
WHERE EMP.ENO = ASG.ENO AND DUR > 37
– Two possible transformations of the query are:
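Written in relational algebra (reconstructed here from the SQL above), the two expressions are:

Expression 1 (via a Cartesian product): Π_ENAME( σ_(DUR>37 ∧ EMP.ENO=ASG.ENO) (EMP × ASG) )
Expression 2 (via a join): Π_ENAME( EMP ⋈_ENO σ_(DUR>37) (ASG) )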

– Expression 2 avoids the expensive and large intermediate Cartesian product, and therefore
typically is better.
We make the following assumptions about the data fragmentation
– Data is (horizontally) fragmented:

– Relations ASG and EMP are fragmented in the same way


– Relations ASG and EMP are locally clustered on attributes RESP and ENO, respectively
Now consider evaluating the query over these fragments, and calculate the cost of two alternative
strategies under the following assumptions:
– Tuples are uniformly distributed to the fragments; 20 tuples satisfy DUR>37
– size(EMP) = 400, size(ASG) = 1000
– tuple access cost = 1 unit; tuple transfer cost = 10 units
– ASG and EMP have a local index on DUR and ENO
Strategy 1

Strategy 2
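One plausible costing under the stated assumptions is sketched below; it additionally assumes that EMP and ASG are each split over two sites, that the answer is needed at a separate result site, and that each indexed access touches one tuple. These extra assumptions, and the per-step counts, are for illustration only.

Strategy 1 (evaluate on the fragments and ship only the qualifying tuples):
– select ASG through the index on DUR: 20 × 1 = 20
– transfer the 20 qualifying ASG tuples to the EMP fragment sites: 20 × 10 = 200
– join with EMP through the index on ENO: 20 × 1 = 20
– transfer the 20 result tuples to the result site: 20 × 10 = 200
– total ≈ 440 units

Strategy 2 (ship both whole relations to the result site and evaluate there):
– transfer EMP: 400 × 10 = 4,000
– transfer ASG: 1,000 × 10 = 10,000
– select ASG by scanning it (the local indexes are of no use at the result site): 1,000 × 1 = 1,000
– join EMP with the 20 selected ASG tuples by scanning EMP for each: 20 × 400 = 8,000
– total ≈ 23,000 units

Under these assumptions Strategy 1 is cheaper by roughly a factor of 50, which illustrates why exploiting fragmentation and minimizing data transfer dominates distributed query optimization.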
Query Optimization
Query optimization is a crucial and difficult part of the overall query processing. Objective of
query optimization is to minimize the following cost function:
I/O cost + CPU cost + communication cost
Two different scenarios are considered:
– Wide area networks - Communication cost dominates
· low bandwidth
· low speed
· high protocol overhead
Most algorithms ignore all other cost components
– Local area networks - Communication cost not that dominant; Total cost function should be
considered
Ordering of the operators of relational algebra is crucial for efficient query processing. Rule of
thumb: move expensive operators (e.g., joins, Cartesian products) to the end of query processing,
since the costs of the individual RA operations differ widely.
Query Optimization Issues
Several issues have to be considered in query optimization:
• Types of query optimizers
– wrt the search techniques (exhaustive search, heuristics)
– wrt the time when the query is optimized (static, dynamic)
• Statistics
• Decision sites
• Network topology
• Use of semijoins
Types of Query Optimizers wrt Search Techniques
– Exhaustive search: Cost-based; Optimal; Combinatorial complexity in the number of relations
– Heuristics: Not optimal; Regroups common sub-expressions; Performs selection, projection
first; Replaces a join by a series of semijoins; Reorders operations to reduce intermediate relation
size; Optimizes individual operations.
Types of Query Optimizers wrt Optimization Timing
– Static: Query is optimized prior to the execution; As a consequence it is difficult to estimate
the size of the intermediate results; Typically amortizes over many executions
– Dynamic: Optimization is done at run time; Provides exact information on the intermediate
relation sizes; Have to re-optimize for multiple executions
– Hybrid - First, the query is compiled using a static algorithm; Then, if the error in estimate
sizes greater than threshold, the query is re-optimized at run time
Statistics
– Relation/fragments: Cardinality; Size of a tuple; Fraction of tuples participating in a join with
another relation/fragment
– Attribute: Cardinality of domain; Actual number of distinct values; Distribution of attribute
values (e.g., histograms)
– Common assumptions: Independence between different attribute values; Uniform distribution
of attribute values within their domain
Decision sites
– Centralized: Single site determines the ”best” schedule; Simple; Knowledge about the entire
distributed database is needed
– Distributed: Cooperation among sites to determine the schedule; Only local information is
needed; Cooperation comes with an overhead cost
– Hybrid: One site determines the global schedule; Each site optimizes the local sub-queries
Network topology
– Wide area networks (WAN) point-to-point
 Characteristics: Low bandwidth; Low speed; High protocol overhead
 Communication cost dominate; all other cost factors are ignored
 Global schedule to minimize communication cost
 Local schedules according to centralized query optimization
– Local area networks (LAN)
 Communication cost not that dominant
 Total cost function should be considered
 Broadcasting can be exploited (joins)
 Special algorithms exist for star networks

Use of Semijoins
 Reduce the size of the join operands by first computing semijoins
 Particularly relevant when the main cost is the communication cost
 Improves the processing of distributed join operations by reducing the size of data
exchange between sites
 However, the number of messages as well as local processing time is increased
Distributed Query Processing Steps
In outline, a distributed query passes through four layers: (1) query decomposition on the global
conceptual schema, (2) data localization, which rewrites the query in terms of fragments, (3) global
query optimization, which uses fragment statistics and communication costs to choose an execution
plan, and (4) distributed query execution, in which each site locally optimizes and executes its
sub-queries.
Chapter Review Questions

i. Discuss the Distributed Query Processing Steps

ii. Discuss Query optimization

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant B. Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison-Wesley. ISBN-13: 978-0136086208
CHAPTER SIX: QUERY DECOMPOSITION AND DATA
LOCALIZATION

Learning Objectives:

By the end of this chapter the learner shall be able to;


i. Understand the process of query decomposition

ii. Understand data independence

iii. Understand the classification of distributed systems

Query Decomposition
Query decomposition: Mapping of calculus query (SQL) to algebra operations (select, project,
join, rename). Both input and output queries refer to global relations, without knowledge of the
distribution of data. The output query is semantically correct and good in the sense that
redundant work is avoided. Query decomposition consists of 4 steps:
1. Normalization: Transform query to a normalized form
2. Analysis: Detect and reject ”incorrect” queries; possible only for a subset of relational
calculus
3. Elimination of redundancy: Eliminate redundant predicates
4. Rewriting: Transform query to RA and optimize query
Normalization: Transform the query to a normalized form to facilitate further processing.
Consists mainly of two steps.
1. Lexical and syntactic analysis
– Check validity (similar to compilers)
– Check for attributes and relations
– Type checking on the qualification
2. Put into normal form
– With SQL, the query qualification (WHERE clause) is the most difficult part, as it might
be an arbitrarily complex predicate preceded by quantifiers


– Conjunctive normal form
– Disjunctive normal form

– In the disjunctive normal form, the query can be processed as independent conjunctive
subqueries linked by unions (corresponding to the disjunction)
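
In symbols, with simple predicates p_ij (a standard formulation, supplied here as a sketch):

Conjunctive normal form: (p_11 ∨ p_12 ∨ … ∨ p_1n) ∧ … ∧ (p_m1 ∨ p_m2 ∨ … ∨ p_mn)
Disjunctive normal form: (p_11 ∧ p_12 ∧ … ∧ p_1n) ∨ … ∨ (p_m1 ∧ p_m2 ∧ … ∧ p_mn)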

Analysis: Identify and reject type incorrect or semantically incorrect queries


• Type incorrect – Checks whether the attributes and relation names of a query are defined in the
global schema, and whether the operations on attributes do not conflict with the types of the
attributes, e.g., a comparison > operation with an attribute of type string
• Semantically incorrect – Checks whether the components contribute in any way to the
generation of the result. Only a subset of relational calculus queries can be tested for correctness,
i.e., those that do not contain disjunction and negation. Typical data structures used to detect
semantically incorrect queries are:
∗ Connection graph (query graph)
∗ Join graph
Elimination of redundancy: Simplify the query by eliminating redundancies, e.g., redundant
predicates. Redundancies are often due to semantic integrity constraints expressed in the query
language, e.g., queries on views are expanded into queries on relations that satisfy certain
integrity and security constraints. Transformation rules are used, e.g., the equivalence rules
sketched below.
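
A representative set of such equivalence rules (p, p1, p2 denote predicates; ⟺ denotes logical equivalence):

p ∧ p ⟺ p                 p ∨ p ⟺ p
p ∧ true ⟺ p              p ∨ false ⟺ p
p ∧ false ⟺ false         p ∨ true ⟺ true
p ∧ ¬p ⟺ false            p ∨ ¬p ⟺ true
p1 ∧ (p1 ∨ p2) ⟺ p1       p1 ∨ (p1 ∧ p2) ⟺ p1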
Rewriting: Convert relational calculus query to relational algebra query and find an efficient
expression. Example: Find the names of employees other than J. Doe who worked on the
CAD/CAM project for either 1 or 2 years.
• SELECT ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME <> "J. Doe"
AND PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
A query tree represents the RA-expression:
– Relations are the leaves (FROM clause)
– The result attributes form the root (SELECT clause)
– Intermediate nodes are the algebra operations; results flow from the leaves to the root
Data Localization
Input: Algebraic query on global conceptual schema
Purpose: Apply data distribution information to the algebra operations and determine which
fragments are involved. Substitute global query with queries on fragments. Optimize the global
query.

In general, the generic fragment query produced by this substitution is inefficient, since important
restructurings and simplifications can still be done.

Data Localization Issues


Various more advanced reduction techniques are possible to generate simpler and optimized
queries.
• Reduction of horizontal fragmentation (HF) – Reduction with selection. Reduction with join
• Reduction of vertical fragmentation (VF) – Find empty relations

Chapter Review Questions

i. Discuss the process of query decomposition

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant B. Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison-Wesley. ISBN-13: 978-0136086208
CHAPTER SEVEN: DISTRIBUTED DATABASES

Learning Objectives:

By the end of this chapter the learner shall be able to;

i. Understand the basics of distributed databases

ii. Understand the database design problems

iii. Understand fragmentation

Introduction to Distributed Databases


In the old days, programs stored data in regular files, and each program had to maintain its own data:
 huge overhead
 error-prone
Data Independence
The development of DBMSs helped to fully achieve data independence (transparency): they provide
centralized and controlled data maintenance and access, and applications are immune to changes in
the physical and logical file organization.

A distributed database system is the union of what appear to be two diametrically opposed
approaches to data processing: database systems and computer networks. Computer networks
promote a mode of work that goes against centralization, so some key issues must be understood
about this combination. The most important objective of DB technology is integration, not
centralization. Integration is possible without centralization, i.e., the integration of databases and
networking does not mean centralization (in fact quite the opposite). The goal of distributed
database systems is to achieve data integration and data distribution transparency.
A distributed computing system is a collection of autonomous processing elements that are
interconnected by a computer network. The elements cooperate in order to perform the assigned
task. The term “distributed” is very broadly used. The exact meaning of the word depends on the
context.
What can be distributed?
 Processing logic
 Functions
 Data
 Control
Classification of distributed systems with respect to various criteria
 Degree of coupling, i.e., how closely the processing elements are connected e.g.,
measured as ratio of amount of data exchanged to amount of local processing;
weak coupling, strong coupling
 Interconnection structure: point-to-point connection between processing elements;
common interconnection channel
 Synchronization: synchronous; asynchronous

A distributed database (DDB) is a collection of multiple, logically interrelated databases


distributed over a computer network. A distributed database management system (DDBMS)
is the software that manages the DDB and provides an access mechanism that makes this
distribution transparent to the users. The terms DDBMS and DDBS are often used
interchangeably
Implicit assumptions:
– Data is stored at a number of sites; each site logically consists of a single processor
– Processors at different sites are interconnected by a computer network (we do not consider
multiprocessors in DDBMS, cf. parallel systems)
– DDBS is a database, not a collection of files (cf. relational data model); placement and querying
of data is impacted by the access patterns of the users
– DDBMS is a collection of DBMSs (not a remote file system)

Note that parallel database systems, which run on tightly coupled multiprocessor or cluster
hardware, are quite different from (though related to) distributed DB systems.
Applications of Distributed Databases
• Manufacturing, especially multi-plant manufacturing
• Military command and control
• Airlines
• Hotel chains
• Any organization which has a decentralized organization structure

Promises of DDBSs
Distributed Database Systems deliver the following advantages:
• Higher reliability
• Improved performance
• Easier system expansion
• Transparency of distributed and replicated data
Higher reliability
• Replication of components
• No single points of failure
• e.g., a broken communication link or processing element does not bring down the entire system
• Distributed transaction processing guarantees the consistency of the database and concurrency
Improved performance
• Proximity of data to its points of use
– Reduces remote access delays
– Requires some support for fragmentation and replication
• Parallelism in execution
– Inter-query parallelism
– Intra-query parallelism
• Update and read-only queries influence the design of DDBSs substantially
– If mostly read-only access is required, as much as possible of the data should be
replicated
– Writing becomes more complicated with replicated data
Easier system expansion
• Issue is database scaling
• Emergence of microprocessor and workstation technologies
– Network of workstations much cheaper than a single mainframe computer
• Data communication cost versus telecommunication cost
• Increasing database size

Transparency
• Refers to the separation of the higher-level semantics of the system from the lower-level
implementation issues
• A transparent system “hides” the implementation details from the users.
• A fully transparent DBMS provides high-level support for the development of complex
applications.
Various forms of transparency can be distinguished for DDBMSs:
• Network transparency (also called distribution transparency)
– Location transparency
– Naming transparency
• Replication transparency
• Fragmentation transparency
• Transaction transparency
– Concurrency transparency
– Failure transparency
• Performance transparency
Network/Distribution transparency allows a user to perceive a DDBS as a single, logical
entity. The user is protected from the operational details of the network (or even does not know
about the existence of the network). The user does not need to know the location of data items
and a command used to perform a task is independent from the location of the data and the site
the task is performed (location transparency). A unique name is provided for each object in the
database (naming transparency); In absence of this, users are required to embed the location
name as part of an identifier
Different ways to ensure naming transparency:
• Solution 1: Create a central name server; however, this results in
– loss of some local autonomy
– central site may become a bottleneck
– low availability (if the central site fails remaining sites cannot create new objects)
• Solution 2: Prefix object with identifier of site that created it
– e.g., branch created at site S1 might be named S1.BRANCH
– Also need to identify each fragment and its copies
– e.g., copy 2 of fragment 3 of Branch created at site S1 might be referred to as
S1.BRANCH.F3.C2
• An approach that resolves these problems uses aliases for each database object
– Thus, S1.BRANCH.F3.C2 might be known as local branch by user at site S1
– DDBMS has task of mapping an alias to appropriate database object
Replication transparency ensures that the user is not involved in the management of copies of
some data. The user should not even be aware of the existence of replicas, but rather should work
as if there exists a single copy of the data. Replication of data is needed for various reasons, e.g.,
increased efficiency for read-only data access
Fragmentation transparency ensures that the user is not aware of and is not involved in the
fragmentation of the data. The user is not involved in finding query processing strategies over
fragments or formulating queries over fragments. The evaluation of a query that is specified over
an entire relation but now has to be performed on top of the fragments requires an appropriate
query evaluation strategy. Fragmentation is commonly done for reasons of performance,
availability, and reliability. There are two fragmentation alternatives: horizontal fragmentation
divides a relation into subsets of tuples; vertical fragmentation divides a relation by columns.
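As an illustration of how fragmentation can be hidden from users, the global EMP relation could be exposed as a view over two horizontal fragments stored at different sites. This is a sketch only; the fragment names EMP1 and EMP2 and the split on ENO are assumptions for this example, not part of the manual's design:

-- EMP1 holds the tuples with ENO ≤ "E3" (stored at site 1), EMP2 holds the rest (site 2)
CREATE VIEW EMP(ENO, ENAME, TITLE) AS
SELECT ENO, ENAME, TITLE FROM EMP1
UNION ALL
SELECT ENO, ENAME, TITLE FROM EMP2;

-- Users simply query EMP without knowing about the fragments:
SELECT ENAME FROM EMP WHERE TITLE = "Programmer";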
Transaction transparency ensures that all distributed transactions maintain integrity and
consistency of the DDB and support concurrency. Each distributed transaction is divided into a
number of sub-transactions (a sub-transaction for each site that has relevant data) that
concurrently access data at different locations. DDBMS must ensure the indivisibility of both the
global transaction and each of the sub-transactions. Can be further divided into: -
– Concurrency transparency
– Failure transparency
Concurrency transparency guarantees that transactions must execute independently and are
logically consistent, i.e., executing a set of transactions in parallel gives the same result as if the
transactions were executed in some arbitrary serial order. Same fundamental principles as for
centralized DBMS, but more complicated to realize:
– DDBMS must ensure that global and local transactions do not interfere with each other
– DDBMS must ensure consistency of all sub-transactions of global transaction
Replication makes concurrency even more complicated. If a copy of a replicated data item is
updated, update must be propagated to all copies
– Option 1: Propagate changes as part of original transaction, making it an atomic
operation; however, if one site holding a copy is not reachable, then the transaction is
delayed until the site is reachable.
– Option 2: Limit update propagation to only those sites currently available; remaining
sites are updated when they become available again.
– Option 3: Allow updates to copies to happen asynchronously, sometime after the
original update; delay in regaining consistency may range from a few seconds to several
hours
Failure transparency: DDBMS must ensure atomicity and durability of the global transaction,
i.e., the sub-transactions of the global transaction either all commit or all abort. Thus, DDBMS
must synchronize global transaction to ensure that all sub-transactions have completed
successfully before recording a final COMMIT for the global transaction. The solution should be
robust in presence of site and network failures
Performance transparency: DDBMS must perform as if it were a centralized DBMS. DDBMS
should not suffer any performance degradation due to the distributed architecture. DDBMS should
determine the most cost-effective strategy to execute a request.
Distributed Query Processor (DQP) maps data request into an ordered sequence of operations on
local databases. DQP must consider fragmentation, replication, and allocation schemas. DQP has
to decide:
– which fragment to access
– which copy of a fragment to use
– which location to use
DQP produces execution strategy optimized with respect to some cost function. Typically, costs
associated with a distributed request include: I/O cost, CPU cost, and communication cost

Distributed database Complicating Factors


• Complexity
• Cost
• Security
• Integrity control more difficult
• Lack of standards
• Lack of experience
• Database design more complex

Chapter Review Questions


i. What is a Distributed Database
ii. Discuss transparency in distributed database

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant B. Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison-Wesley. ISBN-13: 978-0136086208
CHAPTER EIGHT: DISTRIBUTED DATABASE DESIGN

Learning Objectives:

By the end of this chapter the learner shall be able to;

i. Understand the distributed database design

ii. Understand the database standardization process

Design Problem
Design problem of distributed systems: Making decisions about the placement of data and
programs across the sites of a computer network as well as possibly designing the network
itself. In DDBMS, the distribution of applications involves
– Distribution of the DDBMS software
– Distribution of applications that run on the database
Distribution of applications will not be considered in the following; instead, the distribution of
data is studied.

Framework of Distribution
Dimension for the analysis of distributed systems:
– Level of sharing: no sharing, data sharing, data and program sharing
– Behavior of access patterns: static, dynamic
– Level of knowledge on access pattern behavior: no information, partial information, complete
information
Design Strategies
• Top-down approach – Designing systems from scratch. Homogeneous systems
• Bottom-up approach – The databases already exist at a number of sites. The databases should
be connected to solve common tasks
Top-down design strategy
Distribution design is the central part of the design in DDBMSs (the other tasks are similar to
traditional databases). Objective: Design the LCSs by distributing the entities (relations) over
the sites. Two main aspects have to be designed carefully:
∗ Fragmentation - Relation may be divided into a number of sub-relations, which are
distributed
∗ Allocation and replication- Each fragment is stored at site with ”optimal” distribution. Copy
of fragment may be maintained at several sites
Distribution design issues:
– Why fragment at all?
– How to fragment?
– How much to fragment?
– How to test correctness?
– How to allocate?
Bottom-up design strategy
Here the databases already exist at a number of sites; the main design task is to integrate the
existing local (conceptual) schemas into a global conceptual schema (schema integration).
Fragmentation
What is a reasonable unit of distribution? Relation or fragment of relation?
• Relations as unit of distribution:
If the relation is not replicated, we get a high volume of remote data accesses. If the relation is
replicated, we get unnecessary replications, which cause problems in executing updates and
waste disk space. This might be an acceptable solution if queries need all the data in the relation
and the data stays only at the sites that use it.
• Fragments of relations as unit of distribution:
Application views are usually subsets of relations. Thus, locality of accesses of applications is
defined on subsets of relations. Permits a number of transactions to execute concurrently, since
they will access different portions of a relation. Parallel execution of a single query (intra-query
concurrency). However, semantic data control (especially integrity enforcement) is more
difficult. Fragments of relations are (usually) the appropriate unit of distribution. Fragmentation
aims to improve:
– Reliability
– Performance
– Balanced storage capacity and costs
– Communication costs
– Security
The following information is used to decide fragmentation:
– Quantitative information: frequency of queries, site, where query is run, selectivity of the
queries, etc.
– Qualitative information: types of access of data, read/write, etc.
Types of Fragmentation
– Horizontal: partitions a relation along its tuples

– Vertical: partitions a relation along its attributes

– Mixed/hybrid: a combination of horizontal and vertical fragmentation


Correctness Rules of Fragmentation
• Completeness – Decomposition of relation R into fragments R1,R2, . . . ,Rn is complete if each
data item in R can also be found in some Ri.
• Reconstruction – If relation R is decomposed into fragments R1,R2, . . . ,Rn, then there should
exist some relational operator ∇ that reconstructs R from its fragments, i.e., R = R1∇. . .∇Rn
∗ Union to combine horizontal fragments
∗ Join to combine vertical fragments
• Disjointness - If relation R is decomposed into fragments R1,R2, . . . ,Rn and data item di
appears in fragment Rj, then di should not appear in any other fragment Rk, k ≠ j
(exception: primary key attribute for vertical fragmentation)
∗ For horizontal fragmentation, data item is a tuple
∗ For vertical fragmentation, data item is an attribute
Horizontal Fragmentation
Intuition behind horizontal fragmentation. Every site should hold all information that is used to
query at the site. The information at the site should be fragmented so the queries of the site run
faster. Horizontal fragmentation is defined as a selection operation, σ_p(R)
Example:
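Since the original example is not shown, one possible horizontal fragmentation of the PROJ(PNO, PNAME, BUDGET, LOC) relation used elsewhere in this manual is sketched below; the predicate on BUDGET and the 200000 threshold are purely illustrative:

PROJ1 = σ_(BUDGET ≤ 200000) (PROJ)
PROJ2 = σ_(BUDGET > 200000) (PROJ)
PROJ = PROJ1 ∪ PROJ2 (reconstruction by union)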

Computing horizontal fragmentation (idea). Compute the frequency of the individual queries of
the site q1, . . . , qQ. Rewrite the queries of the site in disjunctive normal form (a disjunction
of conjunctions); the conjunctions are called minterms. Compute the selectivity of the
minterms. Find the minimal and complete set of minterms (predicates)
∗ The set of predicates is complete if and only if any two tuples in the same fragment are
referenced with the same probability by any application
∗ The set of predicates is minimal if and only if there is at least one query that accesses the
fragment
Vertical Fragmentation
Objective of vertical fragmentation is to partition a relation into a set of smaller relations so that
many of the applications will run on only one fragment. Vertical fragmentation of a relation R
produces fragments R1,R2, . . . , each of which contains a subset of R’s attributes. Vertical
fragmentation is defined using the projection operation of the relational algebra:

Example:
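A sketch of the definition and an example, consistent with the PROJ fragments used in the correctness discussion below; each fragment keeps the key PNO so that the relation can be reconstructed by a join:

R_i = Π_(A_i) (R), where A_i is a subset of the attributes of R that includes the key of R
Example: PROJ1 = Π_(PNO, BUDGET) (PROJ) and PROJ2 = Π_(PNO, PNAME, LOC) (PROJ)
PROJ = PROJ1 ⋈_PNO PROJ2 (reconstruction by join)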

Vertical fragmentation has also been studied for (centralized) DBMS


– Smaller relations, and hence less page accesses
– e.g., MONET system
Two types of heuristics for vertical fragmentation exist:
Grouping: assign each attribute to one fragment, and at each step, join some of the fragments
until some criteria is satisfied.
∗ Bottom-up approach
Splitting: starts with a relation and decides on beneficial partitionings based on the access
behaviour of applications to the attributes.
∗ Top-down approach
∗ Results in non-overlapping fragments
∗ “Optimal” solution is probably closer to the full relation than to a set of small relations with
only one attribute
∗ Only the splitting approach is considered here
Application information: The major information required as input for vertical fragmentation is
related to applications. Since vertical fragmentation places in one fragment those attributes
usually accessed together, there is a need for some measure that would define more precisely the
notion of “togetherness”, i.e., how closely related the attributes are. This information is obtained
from queries and collected in the Attribute Usage Matrix and Attribute Affinity Matrix.
Correctness of Vertical Fragmentation
• Relation R is decomposed into fragments R1,R2, . . . ,Rn
– e.g., PROJ = {PNO,BUDGET, PNAME,LOC} into
PROJ1 = {PNO,BUDGET} and PROJ2 = {PNO, PNAME,LOC}
• Completeness – Guaranteed by the partitioning algorithm, which assigns each attribute in A to
one partition
• Reconstruction – Join to reconstruct vertical fragments

• Disjointness – Attributes have to be disjoint in VF. Two cases are distinguished:


∗ If tuple IDs are used, the fragments are really disjoint
∗ Otherwise, key attributes are replicated automatically by the system
∗ e.g., PNO in the above example
Mixed Fragmentation
In most cases simple horizontal or vertical fragmentation of a DB schema will not be sufficient
to satisfy the requirements of the applications. Mixed fragmentation (hybrid fragmentation):
Consists of a horizontal fragment followed by a vertical fragmentation, or a vertical
fragmentation followed by a horizontal fragmentation. Fragmentation is defined using the
selection and projection operations of relational algebra:
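One possible mixed fragmentation of PROJ (purely illustrative; the predicate and the attribute groups are assumptions for this sketch) first fragments horizontally on BUDGET and then vertically within each horizontal fragment:

PROJ11 = Π_(PNO, BUDGET) (σ_(BUDGET ≤ 200000) (PROJ))
PROJ12 = Π_(PNO, PNAME, LOC) (σ_(BUDGET ≤ 200000) (PROJ))
PROJ21 = Π_(PNO, BUDGET) (σ_(BUDGET > 200000) (PROJ))
PROJ22 = Π_(PNO, PNAME, LOC) (σ_(BUDGET > 200000) (PROJ))

Reconstruction: join the vertical fragments on PNO within each horizontal piece, then take the union of the two pieces.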

Replication and Allocation


Replication: Which fragements shall be stored as multiple copies?
Complete Replication - Complete copy of the database is maintained in each site
Selective Replication - Selected fragments are replicated in some sites
Allocation: On which sites to store the various fragments?
Centralized - Consists of a single DB and DBMS stored at one site with users distributed across
the network
Partitioned - Database is partitioned into disjoint fragments, each fragment assigned to one site
Replicated DB:
– fully replicated: each fragment at each site
– partially replicated: each fragment at some of the sites
Non-replicated DB (= partitioned DB)
– partitioned: each fragment resides at only one site
Comparison of replication alternatives

Fragment Allocation
Optimality of fragment allocation depend on the following:
Minimal cost - Communication + storage + processing (read and update). Cost in terms of time
(usually)
Performance - Response time and/or throughput.
Constraints - Per site constraints (storage and processing)

Chapter Review Questions


i. Discuss the Distributed database design
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant B. Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison-Wesley. ISBN-13: 978-0136086208
CHAPTER NINE: DDBMS ARCHITECTURE

Learning Objectives:

By the end of this chapter the learner shall be able to;

i. Understand the DDBMS architecture

ii. Understand the database standardization process

Introduction to DDBMS Architecture


The architecture of a system defines its structure: the components of the system are identified;
the function of each component is specified; the interrelationships and interactions among the
components are defined. Applies both for computer systems as well as for software systems, e.g,
division into modules, description of modules, etc. architecture of a computer. There is a close
relationship between the architecture of a system, standardization efforts, and a reference model.
DDBMS might be implemented as homogeneous or heterogeneous DDBMS
• Homogeneous DDBMS: All sites use same DBMS product; It is much easier to design and
manage; The approach provides incremental growth and allows increased performance
• Heterogeneous DDBMS: Sites may run different DBMS products, with possibly different
underlying data models; This occurs when sites have implemented their own databases first, and
integration is considered later; Translations are required to allow for different hardware and/or
different DBMS products; Typical solution is to use gateway. A common standard to implement
DDBMS is needed!

Standardization
The standardization efforts in databases developed reference models of DBMS.
Reference Model: A conceptual framework whose purpose is to divide standardization work into
manageable pieces and to show at a general level how these pieces are related to each other. A
reference model can be thought of as an idealized architectural model of the system.
Commercial systems might deviate from reference model, still they are useful for the
standardization process A reference model can be described according to 3 different approaches:
 component-based
 function-based
 data-based

Components-based
Components of the system are defined together with the interrelationships between the
components. Good for design and implementation of the system. It might be difficult to
determine the functionality of the system from its components
Function-based
Classes of users are identified together with the functionality that the system will provide for
each class. Typically a hierarchical system with clearly defined interfaces between different
layers. The objectives of the system are clearly identified. Not clear how to achieve the
objectives. Example: ISO/OSI architecture of computer networks
Data-based
Identify the different types of the data and specify the functional units that will realize and/or use
data according to these views. Gives central importance to data (which is also the central
resource of any DBMS). Claimed to be the preferable choice for standardization of DBMS. The
full architecture of the system is not clear without the description of functional modules.
Example: ANSI/SPARC architecture of DBMS

ANSI/SPARC Architecture of DBMS


ANSI/SPARC architecture is based on data. 3 views of data: external view, conceptual view,
internal view. Defines a total of 43 interfaces between these views
Conceptual schema: Provides enterprise view of entire database
Internal schema: Describes the storage details of the relations. Relation EMP is stored on an
indexed file; Index is defined on the key attribute ENO and is called EMINX; A HEADER field
is used that might contain flags (delete, update, etc.).

External view: Specifies the view of different users/applications


Application 1: Calculates the payroll payments for engineers

Application 2: Produces a report on the budget of each project

Architectural Models for DDBMSs


Architectural Models for DDBMSs (or more generally for multiple DBMSs) can be classified
along three dimensions:
– Autonomy
– Distribution
– Heterogeneity
Autonomy: Refers to the distribution of control (not of data) and indicates the degree to which
individual DBMSs can operate independently.
– Tight integration: a single-image of the entire database is available to any user who wants to
share the information (which may reside in multiple DBs); realized such that one data manager is
in control of the processing of each user request.
– Semiautonomous systems: individual DBMSs can operate independently, but have decided to
participate in a federation to make some of their local data sharable.
– Total isolation: the individual systems are stand-alone DBMSs, which know neither of the
existence of other DBMSs nor how to comunicate with them; there is no global control.
Autonomy has different dimensions: -
– Design autonomy: each individual DBMS is free to use the data models and transaction
management techniques that it prefers.
– Communication autonomy: each individual DBMS is free to decide what information to
provide to the other DBMSs
– Execution autonomy: each individual DBMS can execute the transactions that are submitted to
it in any way that it wants to.
Distribution: Refers to the physical distribution of data over multiple sites.
– No distribution: No distribution of data at all
– Client/Server distribution: Data are concentrated on the server, while clients provide
application environment/user interface.
– Peer-to-peer distribution (also called full distribution): No distinction between client and
server machine. Each machine has full DBMS functionality
Heterogeneity: Refers to heterogeneity of the components at various levels
– hardware
– communications
– operating system
– DB components (e.g., data model, query language, transaction management algorithms)
Client-Server Architecture for DDBMS (Data-based)
General idea: Divide the functionality into two
classes:
– server functions - mainly data management, including
query processing, optimization, transaction management, etc.
– client functions - might also include some data management functions (consistency checking,
transaction management, etc.) not just user interface
Provides a two-level architecture. More efficient division of work. Different types of
client/server architecture
– Multiple client/single server
– Multiple client/multiple server
Peer-to-Peer Architecture for DDBMS (Data-based)
 Local internal schema (LIS) – Describes the local physical data organization (which
might be different on each machine)
 Local conceptual schema (LCS) – Describes logical data organization at each site.
Required since the data are fragmented and replicated.
 Global conceptual schema (GCS) – Describes the global logical view of the data. Union
of the LCSs.
 External schema (ES) – Describes the user/application view on the data
Multi-DBMS Architecture (Data-based)
Fundamental difference to peer-to-peer DBMS is in the definition of the global conceptual
schema (GCS). In a MDBMS the GCS represents only the collection of those parts of the local
databases that each local DBMS wants to share. This leads to the question of whether the GCS
should even exist in a MDBMS. Two different architecture models:
– Models with a GCS
– Models without GCS
Model with a GCS
GCS is the union of parts of the LCSs. Local DBMS define their own views on the local DB.
Model without a GCS
The local DBMSs present to the multi-database layer the part of their local DB they are willing
to share. External views are defined on top of LCSs.
Chapter Review Questions
i. Discuss the various Architectural Models for DDBMSs

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant B. Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison-Wesley. ISBN-13: 978-0136086208
CHAPTER TEN: SEMANTIC DATA CONTROL

Learning Objectives:

By the end of this chapter the learner shall be able to;

i. Understand database semantic data control

ii. Understand view control

Semantic Data Control


Semantic data control typically includes view management, security control, and semantic
integrity control. Informally, these functions must ensure that authorized users perform correct
operations on the database, contributing to the maintenance of database integrity. In RDBMS
semantic data control can be achieved in a uniform way; views, security constraints, and
semantic integrity constraints can be defined as rules that the system automatically enforces

View Management
Views enable full logical data independence. Views are virtual relations that are defined as the
result of a query on base relations. Views are typically not materialized. Can be considered a
dynamic window that reflects all relevant updates to the database. Views are very useful for
ensuring data security in a simple way. By selecting a subset of the database, views hide some
data. Users cannot see the hidden data
View Management in Centralized Databases
A view is a relation that is derived from a base relation via a query. It can involve selection,
projection, aggregate functions, etc. Example: The view of system analysts derived from
relation EMP
CREATE VIEW SYSAN(ENO,ENAME) AS
SELECT ENO,ENAME
FROM EMP
WHERE TITLE="Syst. Anal."
Queries expressed on views are translated into queries expressed on base relations. Example:
“Find the names of all the system analysts with their project number and
responsibility?” – Involves the view SYSAN and the relation ASG(ENO,PNO,RESP,DUR)
SELECT ENAME, PNO, RESP
FROM SYSAN, ASG
WHERE SYSAN.ENO = ASG.ENO
is translated into
SELECT ENAME,PNO,RESP
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND TITLE = "Syst. Anal."
Automatic query modification is required, i.e., ANDing query qualification with view
qualification
All views can be queried as base relations, but not all view can be updated as such. Updates
through views can be handled automatically only if they can be propagated correctly to the base
relations. We classify views as updatable or not-updatable
• Updatable view: The updates to the view can be propagated to the base relations without
ambiguity.
CREATE VIEW SYSAN(ENO,ENAME) AS
SELECT ENO,ENAME
FROM EMP
WHERE TITLE="Syst. Anal."
– e.g, insertion of tuple (201,Smith) can be mapped into the insertion of a new employee (201,
Smith, “Syst. Anal.”)
– If attributes other than TITLE were hidden by the view, they would be assigned the value null
• Non-updatable view: The updates to the view cannot be propagated to the base relations
without ambiguity.
CREATE VIEW EG(ENAME,RESP) AS
SELECT ENAME,RESP
FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO
– e.g, deletion of (Smith, ”Syst. Anal.”) is ambiguous, i.e., since deletion of “Smith” in EMP and
deletion of “Syst. Anal.” in ASG are both meaningful, but the system cannot decide. Current
systems are very restrictive about supporting updates through views. Views can be updated only
if they are derived from a single relation by selection and projection. However, it is theoretically
possible to automatically support updates of a larger class of views, e.g., joins
View Management in Distributed Databases
Definition of views in DDBMS is similar as in centralized DBMS. However, a view in a
DDBMS may be derived from fragmented relations stored at different sites. Views are
conceptually the same as the base relations, therefore we store them in the (possibly) distributed
directory/catalogue. Thus, views might be centralized at one site, partially replicated, fully
replicated. Queries on views are translated into queries on base relations, yielding distributed
queries due to possible fragmentation of data. Views derived from distributed relations may be
costly to evaluate. Optimizations are important, e.g., snapshots. A snapshot is a static view:
∗ does not reflect the updates to the base relations
∗ managed as temporary relations: the only access path is sequential scan
∗ typically used when selectivity is small (no indices can be used efficiently)
∗ is subject to periodic recalculation

Data Security
Data security protects data against unauthorized access and has two aspects:
– Data protection
– Authorization control

Data Protection
Data protection prevents unauthorized users from understanding the physical content of data.
Well established standards exist
– Data encryption standard
– Public-key encryption schemes

Authorization Control
Authorization control must guarantee that only authorized users perform operations they are
allowed to perform on the database. Three actors are involved in authorization
– users, who trigger the execution of application programms
– operations, which are embedded in applications programs
– database objects, on which the operations are performed
Authorization control can be viewed as a triple (user, operation type, object) which specifies that
the user has the right to perform an operation of operation type on an
object. Authentication of (groups of) users is typically done by username and password.
Authorization control in (D)DBMS is more complicated than in operating systems:
– In a file system: data objects are files
– In a DBMS: Data objects are views, (fragments of) relations, tuples, attributes
GRANT and REVOKE statements are used to authorize triplets (user, operation, data object)
– GRANT <operations> ON <object> TO <users>
– REVOKE <operations> ON <object> FROM <users>
Typically, the creator of objects gets all permissions
– Might even have the permission to GRANT permissions
– This requires a recursive revoke process
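A concrete illustration (the user name budget_clerk is invented for this sketch):

GRANT SELECT ON PROJ TO budget_clerk;
GRANT SELECT, UPDATE ON EMP TO budget_clerk;
REVOKE UPDATE ON EMP FROM budget_clerk;   -- withdrawn later; SELECT on EMP remains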
Privileges are stored in the directory/catalogue, conceptually as a matrix

Different materializations of the matrix are possible (by row, by columns, by element), allowing
for different optimizations e.g., by row makes the enforcement of authorization efficient, since
all rights of a user are in a single tuple

Distributed Authorization Control


Additional problems of authorization control in a distributed environment stem from the fact
that objects and subjects are distributed:
– remote user authentication
– management of distributed authorization rules
– handling of views and of user groups
Remote user authentication is necessary since any site of a DDBMS may accept programs
initiated and authorized at remote sites. Two solutions are possible:
– (username, password) is replicated at all sites and are communicated between the sites,
whenever the relations at remote sites are accessed; beneficial if the users move from a site to a
site
– All sites of the DDBMS identify and authenticate themselves similarly as users do: intersite
communication is protected by the use of the site password; (username, password) is authorized
by application at the start of the session; no remote user authentication is required for accessing
remote relations once the start site has been authenticated; beneficial if users are static

Semantic Integrity Constraints


A database is said to be consistent if it satisfies a set of constraints, called semantic integrity
constraints. Keeping a database consistent by enforcing a set of constraints is a difficult
problem. Semantic integrity control evolved from procedural methods (in which the controls
were embedded in application programs) to declarative methods, which avoid the data dependency
problem, code redundancy, and the poor performance of the procedural methods.
Two main types of constraints can be distinguished:
– Structural constraints: basic semantic properties inherent to a data model e.g., unique key
constraint in relational model
– Behavioral constraints: regulate application behavior e.g., dependencies (functional,
inclusion) in the relational model
A semantic integrity control system has 2 components:
– Integrity constraint specification
– Integrity constraint enforcement
Integrity constraints specification
In RDBMS, integrity constraints are defined as assertions, i.e., expressions in tuple relational
calculus. Variables are either universally (∀) or existentially (∃) quantified. This is a declarative
method that makes constraints easy to define. An assertion can be seen as a query qualification
which is either true or false.
Definition of database consistency clear. 3 types of integrity constraints/assertions are
distinguished:
∗ predefined
∗ precompiled
∗ general constraints
• In the following examples we use the following relations:
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET)
ASG(ENO, PNO, RESP, DUR)
• Predefined constraints are based on simple keywords and specify the more common
constraints of the relational model
• Not-null attribute: – e.g., Employee number in EMP cannot be null
ENO NOT NULL IN EMP
• Unique key: – e.g., the pair (ENO,PNO) is the unique key in ASG
(ENO, PNO) UNIQUE IN ASG
• Foreign key:– e.g., PNO in ASG is a foreign key matching the primary key PNO in PROJ
PNO IN ASG REFERENCES PNO IN PROJ
• Functional dependency: – e.g., employee number functionally determines the employee name
ENO IN EMP DETERMINES ENAME
• Precompiled constraints express preconditions that must be satisfied by all tuples in a relation
for a given update type
• General form:
CHECK ON <relation> [WHEN <update type>] <qualification>
• Domain constraint, e.g., constrain the budget:
CHECK ON PROJ(BUDGET>500000 AND BUDGET≤1000000)
• Domain constraint on deletion, e.g., only tuples with budget 0 can be deleted:
CHECK ON PROJ WHEN DELETE (BUDGET = 0)
• Transition constraint, e.g., a budget can only increase:
CHECK ON PROJ (NEW.BUDGET > OLD.BUDGET AND
NEW.PNO = OLD.PNO)
– OLD and NEW are implicitly defined variables to identify the tuples that are subject to update
• General constraints may involve more than one relation. General form:
CHECK ON <variable>:<relation> (<qualification>)
• Functional dependency:
CHECK ON e1:EMP, e2:EMP
(e1.ENAME = e2.ENAME IF e1.ENO = e2.ENO)
• Constraint with aggregate function:
e.g., The total duration for all employees in the CAD project is less than 100
CHECK ON g:ASG, j:PROJ
( SUM(g.DUR WHERE g.PNO=j.PNO) < 100
IF j.PNAME="CAD/CAM" )

Semantic Integrity Constraints Enforcement


Enforcing semantic integrity constraints consists of rejecting update programs that violate
some integrity constraints. Thereby, the major problem is to find efficient algorithms. Two
methods to enforce integrity constraints:
– Detection:
1. Execute update u : D → Du
2. If Du is inconsistent then compensate Du → D′u or undo Du → D
∗ Also called posttest
∗ May be costly if undo is very large
– Prevention:
Execute u : D → Du only if Du will be consistent
∗ Also called pretest
∗ Generally more efficient
∗ The query modification algorithm by Stonebraker (1975) is a preventive method that is
particularly efficient in enforcing domain constraints: it adds the assertion qualification
(constraint) to the update query so that the constraint is checked immediately for each tuple
Example: Consider a query for increasing the budget of CAD/CAM projects by 10% (written in SQL-like syntax):
UPDATE PROJ
SET BUDGET = BUDGET * 1.1
WHERE PNAME = "CAD/CAM"
and the domain constraint on PROJ given earlier:
CHECK ON PROJ (BUDGET > 500000 AND BUDGET ≤ 1000000)
The query modification algorithm transforms the query into:
UPDATE PROJ
SET BUDGET = BUDGET * 1.1
WHERE PNAME = "CAD/CAM"
AND NEW.BUDGET > 500000
AND NEW.BUDGET ≤ 1000000
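The same preventive (pretest) check can also be sketched procedurally. The following Python fragment is a minimal illustration only; the in-memory relation, attribute names, and helper functions are assumptions made for this example and do not correspond to any DBMS API. The constraint qualification is evaluated against the new value of every qualified tuple, and the update is executed only if all of them pass.

PROJ = [
    {"PNO": "P1", "PNAME": "CAD/CAM", "BUDGET": 600000},
    {"PNO": "P2", "PNAME": "Database", "BUDGET": 135000},
]

def budget_constraint(new_budget):
    # the domain constraint: CHECK ON PROJ (BUDGET > 500000 AND BUDGET <= 1000000)
    return 500000 < new_budget <= 1000000

def increase_cadcam_budget(projects, factor=1.1):
    qualified = [p for p in projects if p["PNAME"] == "CAD/CAM"]
    # Pretest (prevention): execute the update only if every NEW budget value
    # satisfies the constraint; otherwise the whole update is rejected.
    if not all(budget_constraint(p["BUDGET"] * factor) for p in qualified):
        raise ValueError("update rejected: domain constraint violated")
    for p in qualified:
        p["BUDGET"] = p["BUDGET"] * factor

increase_cadcam_budget(PROJ)   # succeeds: the new CAD/CAM budget stays within the allowed range
print(PROJ[0]["BUDGET"])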


Distributed Constraints
Three classes of distributed integrity constraints/assertions are distinguished:
• Individual assertions - single relation, single variable; refer only to the tuples to be updated,
independently of the rest of the DB; e.g., domain constraints
• Set-oriented assertions - single relation, multi-variable (e.g., functional dependencies) or
multi-relation, multi-variable (e.g., foreign key constraints); multiple tuples from
possibly different relations are involved
• Assertions involving aggregates - special, costly processing of aggregates is required
Particular difficulties with distributed constraints arise from the fact that relations are fragmented
and replicated:
– Definition of assertions
– Where to store the assertions?
– How to enforce the assertions?
Definition and storage of assertions
The definition of a new integrity assertion can be started at one of the sites that store the relations
involved in the assertion, but needs to be propagated to sites that might store fragments of that
relation. Individual assertions:
∗ The assertion definition is sent to all other sites that contain fragments of the relation involved
in the assertion.
∗ At each fragment site, check for compatibility of assertion with data
∗ If compatible, store; otherwise reject
∗ If any of the sites rejects, globally reject
Set-oriented assertions:
∗ Involves joins (between fragments or relations)
∗ May be necessary to perform joins to check for compatibility
∗ Store if compatible
Enforcement of assertions in a DDBMS is more complex than in a centralized DBMS. The main
problem is to decide where (at which site) to enforce each assertion; this depends on the type of
assertion, the type of update, and where the update is issued.
• Individual assertions:
– Update = insert
∗ Enforce at the site where the update is issued (i.e., where the user inserts the tuples)
– Update = delete or modify
∗ Send the assertions to all the sites involved (i.e., where qualified tuples are updated)
∗ Each site enforces its own assertion
• Set-oriented assertions
– Single relation
∗ Similar to individual assertions with qualified updates
– Multi-relation
∗ Move data between sites to perform the joins
∗ Then send the result to the query master site (the site where the update is issued); the sketch below summarizes these placement rules
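As a summary of the placement rules above, the following Python sketch encodes the decision of where an assertion is enforced. The assertion and update labels, site names, and the function itself are purely illustrative assumptions, not part of any DDBMS.

def enforcement_sites(assertion_kind, update_kind, issuing_site, fragment_sites):
    if assertion_kind == "individual":
        if update_kind == "insert":
            return [issuing_site]      # check at the site where the tuples are inserted
        return fragment_sites          # delete/modify: each fragment site checks its own tuples
    if assertion_kind == "set_single_relation":
        return fragment_sites          # handled like individual assertions with qualified updates
    if assertion_kind == "set_multi_relation":
        # joins are computed by moving data between sites; the final check is
        # performed at the query master site (where the update was issued)
        return [issuing_site]
    raise ValueError("unknown assertion kind")

print(enforcement_sites("individual", "insert", "S1", ["S1", "S2", "S3"]))   # ['S1']
print(enforcement_sites("individual", "delete", "S1", ["S1", "S2", "S3"]))   # ['S1', 'S2', 'S3']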

Chapter Review Questions


i. What is semantic data control?
ii. Discuss view management.

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER ELEVEN: DISTRIBUTED DBMS RELIABILITY

Learning Objectives:

By the end of this chapter the learner shall be able to;

i. Understand distributed database reliability

ii. Understand local recovery management

Reliability
A reliable DDBMS is one that can continue to process user requests even when the underlying
system is unreliable, i.e., failures occur
Failures to consider include:
– Transaction failures
– System (site) failures, e.g., system crash, power supply failure
– Media failures, e.g., hard disk failures
– Communication failures, e.g., lost/undeliverable messages
Reliability is closely related to the problem of how to maintain the atomicity and durability
properties of transactions
Recovery system: Ensures atomicity and durability of transactions in the presence of failures
(and concurrent transactions). Recovery algorithms have two parts
1. Actions taken during normal transaction processing to ensure enough information
exists to recover from failures
2. Actions taken after a failure to recover the DB contents to a state that ensures
atomicity, consistency and durability

Local Recovery Management


The local recovery manager (LRM) maintains the atomicity and durability properties of local
transactions at each site.
Architecture – Volatile storage: the main memory of the computer system (RAM). Stable
storage: storage that “never” loses its contents; in reality this can only be approximated by a
combination of hardware (non-volatile storage) and software (stable-write, stable-read,
clean-up) components.
Two ways for the LRM to deal with update/write operations
In-place update - Physically changes the value of the data item in the stable database. As a
result, previous values are lost. Mostly used in databases
Out-of-place update - The new value(s) of updated data item(s) are stored separately from the
old value(s). Periodically, the updated values have to be integrated into the stable DB
In-Place Update
Since in-place updates cause previous values of the affected data items to be lost, it is necessary
to keep enough information about the DB updates in order to allow recovery in the case of
failures. Thus, every action of a transaction must not only perform the action, but must also write
a log record to an append-only log file
A log is the most popular structure for recording DB modifications on stable storage. Consists of
a sequence of log records that record all the update activities in the DB. Each log record
describes a significant event during transaction processing. Typical types of log records are: a
begin record for each transaction, an update record containing the affected data item together
with its before image and after image, and a commit or abort record marking the end of the
transaction.
With the information in the log file the recovery manager can restore the consistency of the DB
in case of a failure.
Assume, for example, the following situation when a system crash occurs: transaction T1 has
committed before the crash, while transaction T2 is still active (it has not committed) at the
time of the crash.
Upon recovery:
– All effects of transaction T1 should be reflected in the database (⇒ REDO)
– None of the effects of transaction T2 should be reflected in the database (⇒ UNDO)
REDO Protocol – REDO’ing an action means performing it again. The REDO operation uses
the log information and performs the action that might have been done before, or not done due to
failures. The REDO operation generates the new image.
UNDO Protocol – UNDO’ing an action means to restore the object to its image before the
transaction has started. The UNDO operation uses the log information and restores the old value
of the object
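A minimal, illustrative sketch of how a recovery manager can use such a log is given below. The record layout (tuples tagged "begin", "write", "commit", "abort") and the in-memory database are assumptions made only for this example; they do not correspond to any particular DBMS format.

def recover(log, stable_db):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # UNDO: scan the log backwards and restore the before-images written
    # by transactions that never committed
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, _, item, before, _after = rec
            stable_db[item] = before
    # REDO: scan the log forwards and re-apply the after-images written
    # by committed transactions
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, item, _before, after = rec
            stable_db[item] = after
    return stable_db

log = [("begin", "T1"), ("write", "T1", "x", 0, 5), ("commit", "T1"),
       ("begin", "T2"), ("write", "T2", "y", 1, 9)]   # T2 was still active at the crash
print(recover(log, {"x": 5, "y": 9}))                 # {'x': 5, 'y': 1}: T1 redone, T2 undone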

Logging Interface
Log pages/buffers can be written to stable storage in two ways:
Synchronously - The addition of each log record requires that the log is written to stable storage.
When the log is written synchronously, the execution of the transaction is suspended until the
write is complete, which delays the response time.
Asynchronously - Log is moved to stable storage either at periodic intervals or when the buffer
fills up.
When to write log records into stable storage?
Assume a transaction T updates a page P
• Fortunate case
– System writes P in stable database
– System updates stable log for this update
– SYSTEM FAILURE OCCURS!... (before T commits)
– We can recover (undo) by restoring P to its old state by using the log
• Unfortunate case
– System writes P in stable database
– SYSTEM FAILURE OCCURS!... (before stable log is updated)
– We cannot recover from this failure because there is no log record to restore the old
value
If a system crashes before a transaction is committed, then all the operations must be undone. We
need only the before images (undo portion of the log). Once a transaction is committed, some of
its actions might have to be redone. We need the after images (redo portion of the log)
Write-Ahead-Log (WAL) Protocol – Before a stable database is updated, the undo portion of
the log should be written to the stable log. When a transaction commits, the redo portion of the
log must be written to stable log prior to the updating of the stable database
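The two WAL ordering rules can be illustrated with the simplified Python sketch below. The in-memory log buffer and the "stable" structures are plain Python objects assumed only for this illustration; this is not a real buffer manager.

stable_log, stable_db, log_buffer = [], {}, []

def log_write(txn, page, before, after):
    # append an undo/redo record to the volatile log buffer
    log_buffer.append((txn, page, before, after))

def flush_log():
    # force the buffered log records to stable storage
    stable_log.extend(log_buffer)
    log_buffer.clear()

def write_page_to_stable_db(page, value):
    flush_log()                    # WAL rule 1: undo information reaches the stable log first
    stable_db[page] = value

def commit(txn):
    flush_log()                    # WAL rule 2: redo information reaches the stable log before commit
    stable_log.append((txn, "commit"))

log_write("T1", "P", 10, 42)       # T1 changes page P from 10 to 42
write_page_to_stable_db("P", 42)
commit("T1")
print(stable_db, stable_log[-1])   # {'P': 42} ('T1', 'commit')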
Two out-of-place strategies are shadowing and differential files
Shadowing – When an update occurs, don’t change the old page, but create a shadow page with
the new values and write it into the stable database. Update the access paths so that subsequent
accesses are to the new shadow page. The old page is retained for recovery
Differential files – For each DB file F, maintain a read-only part FR and a differential file
consisting of an insertions part (DF+) and a deletions part (DF−). Thus, F = (FR ∪ DF+) − DF−
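The differential-file view can be illustrated with plain set operations, modelling each part of the file as a set of tuples (an assumption made only for this sketch).

FR  = {("e1", "Smith"), ("e2", "Jones")}    # read-only part of file F
DFp = {("e3", "Brown")}                     # insertions part DF+
DFm = {("e2", "Jones")}                     # deletions part DF−

F = (FR | DFp) - DFm                        # F = (FR ∪ DF+) − DF−
print(sorted(F))                            # [('e1', 'Smith'), ('e3', 'Brown')]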
Distributed Reliability Protocols
As with local reliability protocols, the distributed versions aim to maintain the atomicity and
durability of distributed transactions. Most problematic issues in a distributed transaction are
commit, termination, and recovery
Commit protocols - How to execute a commit command for distributed transactions. How to
ensure atomicity (and durability)?
Termination protocols - If a failure occurs at a site, how can the other operational sites deal
with it. Non-blocking: the occurrence of failures should not force the sites to wait until the
failure is repaired to terminate the transaction
Recovery protocols - When a failure occurs, how do the sites where the failure occurred deal
with it. Independent: a failed site can determine the outcome of a transaction without having to
obtain remote information.

Commit Protocols
The primary requirement of commit protocols is that they maintain the atomicity of distributed
transactions (atomic commitment), i.e., even though the execution of the distributed transaction
involves multiple sites, some of which might fail while executing, the effects of the transaction
on the distributed DB are all-or-nothing. In the following we distinguish two roles.
– Coordinator: The process at the site where the transaction originates and which controls the
execution
– Participant: The process at the other sites that participate in executing the transaction
Centralized Two Phase Commit Protocol (2PC)
Very simple protocol that ensures the atomic commitment of distributed transactions.
Phase 1: The coordinator gets the participants ready to write the results into the database
Phase 2: Everybody writes the results into the database

Global Commit Rule


– The coordinator aborts a transaction if and only if at least one participant votes to abort
it
– The coordinator commits a transaction if and only if all of the participants vote to
commit it
Centralized since communication is only between coordinator and the participants
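A hedged sketch of the global commit rule is shown below; message passing, logging, and failure handling are omitted, and the participants are simulated by simple vote functions (an assumption made only for this illustration).

def two_phase_commit(participants):
    # Phase 1: the coordinator asks every participant to prepare and collects the votes
    votes = [vote() for vote in participants]        # each returns "commit" or "abort"
    # Global commit rule: commit iff all participants voted to commit
    decision = "commit" if all(v == "commit" for v in votes) else "abort"
    # Phase 2: the coordinator would now send the global decision to every participant,
    # which writes its results (commit) or undoes them (abort) and acknowledges
    return decision

print(two_phase_commit([lambda: "commit", lambda: "commit"]))   # commit
print(two_phase_commit([lambda: "commit", lambda: "abort"]))    # abort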
Linear 2PC Protocol
There is a linear ordering between the sites for the purpose of communication. This minimizes
the number of messages, but increases the response time, as it does not allow any parallelism.
Distributed 2PC Protocol
The distributed 2PC protocol increases the communication between the nodes. The second phase
is not needed, since each participant sends its vote to all other participants (and the coordinator),
so each participant can derive the global decision itself.
2PC Protocol and Site Failures
Site failures in the 2PC protocol might lead to timeouts. Timeouts are handled by termination
protocols. We use the state transition diagrams of 2PC for the analysis. Coordinator
timeouts: one of the participants is down. Depending on its state, the coordinator can take the
following actions:
Timeout in INITIAL - Do nothing
Timeout in WAIT - Coordinator is waiting for local decisions. Cannot unilaterally commit. Can
unilaterally abort and send an appropriate message to all participants
Timeout in ABORT or COMMIT - Stay blocked and wait for the acks (indefinitely, if the site is
down indefinitely)
Participant timeouts: The coordinator site is down. Depending on its state, a participant site can
take the following actions:
Timeout in INITIAL - Participant waits for “prepare”, thus coordinator must have failed in
INITIAL state. Participant can unilaterally abort
Timeout in READY - Participant has voted to commit, but does not know the global decision.
Participant stays blocked (indefinitely, if the coordinator is permanently down), since participant
cannot change its vote or unilaterally decide to commit
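For reference, the timeout actions just described can be summarised as a small lookup table. The state labels follow the discussion above and are used only for this sketch.

COORDINATOR_TIMEOUT = {
    "INITIAL": "do nothing",
    "WAIT":    "unilaterally abort and notify all participants",
    "ABORT":   "stay blocked and wait for the acknowledgements",
    "COMMIT":  "stay blocked and wait for the acknowledgements",
}

PARTICIPANT_TIMEOUT = {
    "INITIAL": "unilaterally abort (no 'prepare' was ever received)",
    "READY":   "stay blocked: cannot change the vote or decide unilaterally",
}

print(COORDINATOR_TIMEOUT["WAIT"])
print(PARTICIPANT_TIMEOUT["READY"])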

The actions to be taken after a recovery from a failure are specified in the recovery protocol.
Coordinator site failure: Upon recovery, it takes the following actions:
Failure in INITIAL - Start the commit process upon recovery (since the coordinator did not send
anything to the sites)
Failure in WAIT - Restart the commit process upon recovery (by sending “prepare” again to the
participants)
Failure in ABORT or COMMIT - Nothing special if all the acks have been received from the
participants. Otherwise the termination protocol is invoked (re-request the acks)

Participant site failure: The failed participant site recovers and takes the following actions:


Failure in INITIAL - Unilaterally abort upon recovery as the coordinator will eventually timeout
since it will not receive the participant’s decision due to the failure
Failure in READY - The coordinator has been informed about the local decision. Treat as
timeout in READY state and invoke the termination protocol (re-ask the
status)
Failure in ABORT or COMMIT - Nothing special needs to be done
Problems with 2PC Protocol
A protocol is non-blocking if it permits a transaction to terminate at the operational sites without
waiting for recovery of the failed site. Significantly improves the response-time of transactions
The 2PC protocol is blocking – the READY state implies that the participant waits for the
coordinator. If the coordinator fails, the site is blocked until recovery; independent recovery is
not possible. The underlying problem is that the READY state is adjacent to both the COMMIT
and the ABORT states.
Three Phase Commit Protocol (3PC)
3PC is a non-blocking protocol when failures are restricted to single site failures. Its state
transition diagram contains:
• no state which is ”adjacent” to both a commit and an abort state
• no non-committable state which is ”adjacent” to a commit state
Adjacent: it is possible to go from one state to the other with a single state transition.
Committable: all sites have voted to commit the transaction (e.g., the COMMIT state).
Solution: insert another state (PRE-COMMIT) between the WAIT (READY) and COMMIT states
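The effect of the extra state can be illustrated with a simplified sketch of the participant state graphs. The transition sets below are assumptions that capture only the adjacency property discussed above, not the full protocols.

TWO_PC   = {"INITIAL": {"READY", "ABORT"},
            "READY":   {"COMMIT", "ABORT"},        # READY is adjacent to both -> blocking
            "COMMIT":  set(), "ABORT": set()}

THREE_PC = {"INITIAL":   {"READY", "ABORT"},
            "READY":     {"PRECOMMIT", "ABORT"},   # extra state between READY and COMMIT
            "PRECOMMIT": {"COMMIT"},
            "COMMIT":    set(), "ABORT": set()}

def has_blocking_state(graph):
    # a state adjacent to both COMMIT and ABORT prevents independent recovery
    return any({"COMMIT", "ABORT"} <= neighbours for neighbours in graph.values())

print(has_blocking_state(TWO_PC))     # True
print(has_blocking_state(THREE_PC))   # False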
Chapter Review Questions
i. Discuss the Problems with 2PC Protocol
ii. Discuss Linear 2PC Protocol

References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4

2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER TWELVE: SAMPLE PAPERS

MOUNT KENYA UNIVERSITY

SCHOOL OF PURE AND APPLIED SCIENCES

DEPARTMENT OF INFORMATION TECHNOLOGY

EXAMINATION FOR BACHELOR OF BUSINESS INFORMATION TECHNOLOGY

BIT 4207: Distributed Databases

Instructions

Answer question ONE and any other TWO questions. Time: 2 Hours

QUESTION 1

a) Define the following terms (10 Marks)


a. Distributed database
b. DDBMS
c. Fragmentation
d. Replication
e. ANSI/SPARC Architecture
b) Discuss the major steps in logical database design (10 Marks)
c) Discuss the database recovery procedures (10 Marks)
QUESTION 2

a) Discuss the major steps in physical database design (10 Marks)


b) Outline the causes of database failure (10 Marks)
QUESTION 3
a) Discuss the transaction properties (10 Marks)
b) Discuss the various Architectural Models for DDBMSs (10 Marks)
QUESTION 4

a) Discuss concurrency control locking strategies (10 Marks)


b) Discuss the Distributed database design (10 Marks)
QUESTION 5

a) Discuss the steps in query processing (10 Marks)


b) Discuss the applications of distributed databases (10 Marks)

End of Exam
