Databases II
Email: [email protected]
Web: www.mku.ac.ke
DEPARTMENT OF INFORMATION TECHNOLOGY
By David Kibaara
TABLE OF CONTENTS
TABLE OF CONTENTS
COURSE OUTLINE
Introduction
Causes of Failures
Recovery Procedures
Recovery Techniques
Threats
Backups
Internal Consistency
Introduction to Transactions
The ACID Properties
Nested Transactions
Implementing Transactions
Database Transactions
Object Transactions
Lock Problems
Timestamp ordering
Query Optimization
Query Decomposition
Data Localization
Data Independence
Promises of DDBSs
Transparency
Design Problem
Framework of Distribution
Design Strategies
Fragmentation
Fragment Allocation
View Management
Data Security
Data Protection
Authorization Control
Distributed Constraints
Reliability
Commit Protocols
Course Description
This course investigates the architecture, design, and implementation of massive-scale data
systems. The course discusses foundational concepts of distributed database theory including
design and architecture, security, integrity, query processing and optimization, transaction
management, concurrency control, and fault tolerance. It then applies these concepts to both
large-scale data warehouse and cloud computing systems. The course blends theory with
practice, with each student developing both distributed database and cloud computing projects.
Prerequisites
Course Goal
The goal of this course is to teach distributed database management system theory.
Course Objectives
Causes of Failures
Recovery Procedures
Recovery Techniques
Threats
Backups
Internal Consistency
Nested Transactions
Implementing Transactions
Database Transactions
Object Transactions
Lock Problems
Timestamp ordering
Query Decomposition
Data Localization
Data Independence
Promises of DDBSs
Transparency
Design Problem
Framework of Distribution
Design Strategies
Fragmentation
Fragment Allocation
Standardization
View Management
Data Security
Data Protection
Authorization Control
Distributed Constraints
ELEVEN: DISTRIBUTED DBMS RELIABILITY
Reliability
Commit Protocols
Exam 70%
CATS 30%
Reference
Raghu Ramakrishnan and Johannes Gehrke. 2003. Database Management Systems, 3rd edition,
McGraw-Hill. ISBN: 978-0-07-246563-1.
Supplementary Reading
Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems: The
Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th edition,
Addison Wesley. ISBN-13: 978-0136086208
Learning Objectives:
Introduction
Database design is the process of producing a detailed data model of a database. This data
model contains all the logical and physical design choices and physical storage parameters
needed to generate a design in a Data Definition Language (DDL), which can then be used to
create the database. A fully attributed data model contains detailed attributes for each entity.
Database design is a technique that involves the analysis, design, description, and specification
of data designed for automated business data processing. This technique uses models to enhance
communication between developers and customers.
The term database design can be used to describe many different parts of the design of an overall
database system. Principally, and most correctly, it can be thought of as the logical design of the
base data structures used to store the data. In the relational model these are the tables and views.
In an object database the entities and relationships map directly to object classes and named
relationships. However, the term database design could also be used to apply to the overall
process of designing, not just the base data structures, but also the forms and queries used as part
of the overall database application within the database management system (DBMS).
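As a minimal illustration (the table, column, and view names are hypothetical), the base data
structures of a relational design are eventually expressed in a Data Definition Language such as
SQL:

-- Base table for a hypothetical CUSTOMER entity
CREATE TABLE customer (
    customer_id INTEGER     PRIMARY KEY,  -- entity identifier
    name        VARCHAR(80) NOT NULL,
    city        VARCHAR(40)
);

-- A view presenting one restricted perspective over the base table
CREATE VIEW customers_per_city AS
SELECT city, COUNT(*) AS customer_count
FROM customer
GROUP BY city;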
The process of doing database design generally consists of a number of steps which will be
carried out by the database designer. Usually, the designer must:
Determine the relationships between the different data elements.
Superimpose a logical structure upon the data on the basis of these relationships.
Data models and supporting descriptions are the tools used in database design. These tools
become the deliverables that result from applying database design. There are two primary
objectives for developing these deliverables. The first objective is to produce documentation
that describes a customer’s perspective of data and the relationships among this data. The second
objective is to produce documentation that describes the customer organization's environment,
operations and data needs. In accomplishing these objectives, the following deliverables result:
Consider a database approach if one or more of the following conditions exist in the user
environment:
If it appears that conditions would support database development, then undertake the activities of
logical database analysis and design. When the logical schema and subschemas are completed,
they are translated into their physical counterparts. Then the physical subschemas are supplied
as part of the data specifications for program design. The exact boundary between the last stages
of logical design and the first stages of physical analysis is difficult to assess because of the lack
of standard terminology. However, there seems to be general agreement that logical design
encompasses a DBMS-independent view of data and that physical design results in a
specification for the database structure, as it is to be physically stored. The design step between
these two that produces a schema that can be processed by a DBMS is called implementation
design.
Do not limit database development considerations to providing random access or ad hoc query
capabilities for the system. However, even if conditions appear to support database development,
postpone the decision to implement or not implement a DBMS until after completing a thorough
study of the current environment. This study must clarify any alternatives that may or may not be
preferable to DBMS implementation.
By providing each task with its own data groups, changes in the data requirements of one task
will have minimal, if any, impact on data provided for another task. By having data managed as
an integrated whole, data redundancy is minimized and data consistency among tasks and
activities is improved.
Logical database design is derived using two methods. The first method is used to analyze the
business performed by an organization. Following this analysis, the second method is used to
model the data that supports the business. These methods are:
Business Analysis
Data Modeling
Business Analysis
Prior to applying this method, acquire and study the following documentation:
A high level data flow diagram (DFD) depicting the major applications to be supported
and the major data sources and outputs of these applications.
Detailed DFDs depicting the functions and tasks performed and a list of the documents,
files, and informal references (e.g., memos, verbal communications, etc.) used to perform
each function.
Identify the mission, functions and operations of the organizational element that the database is
to support. The purpose of this step is to define the scope of the potential database's current and
future needs and develop a reference point for further analysis. This step should cover all relevant
functional areas and be developed separately from any single application design effort.
In examining an organizational element, which may range in size from a branch to an entire
organization, the following may provide sources of information:
If available, the organization's "information plan" would be the best source. These
plans vary widely in content but must articulate the organization's current and future
management information strategy, a discussion of each system's scope, and
definitions of the dependencies between major systems (both automated and
manual) and groups of data. With this information, it is possible to determine which
functional areas must be included within the scope of the design.
If an information plan is not available, or the plan exists but does not contain
diagrams of systems and data dependencies, it will be the designer's responsibility to
determine the scope. In this case, persons within the relevant functional areas must
be interviewed to determine how they relate to the rest of the organization. After the
areas to which they relate are determined, additional interviews can be conducted in
these newly identified areas to ascertain the extent to which they share data with the
application(s) under design.
Other potential sources of information are the Requests for Information Services
(RIS), mission, functional statements, internal revenue manuals, and senior staff
interviews.
Future changes to the organization must be considered when defining the scope of
the design effort, i.e., any major changes in operating policy, regulations, etc. Each
potential change must be identified and further defined to determine whether it could
change the definition, usage, or relationships of the data. Where a change could
affect the database in the future, the design scope should be expanded to consider
the effects of this change.
After determining the scope of the database, construct a high-level DFD to graphically depict the
boundaries.
Decompose each function into the lowest levels of work that require, on a repetitive basis,
unique sets of data. Work at this level is considered a "task", a unique unit of work consisting of
a set of steps performed in sequence. All these steps are directed toward a common goal and use
and/or create a common set of data.
Once a task has been defined, decompose it into subtasks. This decomposition must occur if one
or more of the following conditions exist:
More than one person is needed to carry out the task and each of them is required to
have a different skill and/or carries out his/her part independently.
There are different levels of authorization, i.e., different people authorize different
parts of the task.
Different frequencies or durations apply to different parts of the task.
Input documents are not used uniformly within the task.
Totally different documents are used for different parts of the task.
Many different operations are carried out within the task.
There are different primitive operations which each have separate input/output
requirements.
However, when a subtask has been defined, make certain it is limited to that particular task. If it
spans two or more tasks, it cannot be considered a subtask.
Collect all information in a precise manner using interviews and documentation techniques. This
approach is especially important when identifying operational functions because they provide the
basic input to the database design process. These functions and their associated tasks must be
identified first. Therefore, begin by identifying the organizational areas within the scope of the
design effort which perform the functions essential to conducting business. Once these functional
areas have been determined, the persons to be interviewed can be specified. The recommended
approach is as follows:
Determine the key individuals within these areas and send out questionnaires
requesting: job titles of persons within their areas of responsibility; functions
performed in each job; and a brief statement of the objective(s) of each job.
After receiving the results of the questionnaire, develop a document showing job
title, functions performed, and the objectives of these functions. Then review and
classify each job as either operational or control and planning. Once this is
completed, contact the supervisor of each job identified as "operational" and ask
him or her to select one, preferably two, persons performing that job who can be
interviewed.
Conduct the operational interviews. Keep the following three objectives in mind:
identify each operational function; identify the data associated with each of these
functions; and identify the implicit and explicit rules determining when and how
each function occurs.
When conducting operational interviews, accomplish the following steps during the interviews:
1. Begin by having each interviewee describe, in detail, the functions and tasks that are
performed on a daily or potentially daily basis. Document these major actions, decisions,
and interfaces on task analysis and decision analysis forms. These actions, decisions, and
interfaces must also be reflected on a detailed data flow diagram. This documentation can
subsequently be used to verify that all operational functions and their sequence are
correct. Repeat this same procedure for functions that occur weekly, monthly, quarterly,
and annually.
2. As the functions, tasks, and other activities are defined, determine the documents, files
and informal references (memos, verbal communications, etc.) used to perform them and
indicate these in a separate numbered list. A task/document usage matrix may also be
used specifying a task's inputs and outputs in terms of documents.
3. Once the person interviewed agrees to the contents of the documentation, discuss more
specifically each action, decision, and interface point to determine what specific
documents or references are required. Then request a copy of each document that has
been discussed.
4. Finally, identify the data elements actually used or created on each document and
compile a list of these elements. Include their definitions and lengths. Any data elements
that are not included in the dictionary must be entered.
The second type of information required for conceptual database development involves the
organization's control and planning functions and their related data needs. An in-depth
investigation of the organization's explicit and implicit operating policies is necessary. Such
information can be obtained through interviews with management. Since the nature of the
information collected will vary according to the organization and persons involved, there is no
rigid format in which the interview must be documented. However, in order to minimize the
possibility of losing or missing information, it is recommended that there be two interviewers
who could alternate posing questions and taking notes.
Conduct interviews for control and planning functions with persons whose responsibilities
include defining the goals and objectives of the organization, formulating strategies to achieve
these goals, and managing plans to implement these strategies; and with those persons directly
responsible for the performance of one or more operating areas. The objective of these
interviews is to gain, where appropriate, an overall understanding of:
The basic components of the organization and how they interact with one another.
The external environment that affects the organization directly or indirectly (i.e.,
Congressional directives, Treasury policies, etc.).
Explicit or implicit operating policies that determine how the mission is performed;
some of these may be identified when discussing the internal and external
environment.
Information used currently or required to plan organizational activities and measure
and control performance. If available, obtain examples.
Changes that are forecast that may affect the organization.
The following are steps for conducting control and planning interviews:
1. Present the designer's perception of functions and operations and seek confirmation and
clarification, i.e., clarify which are main functions, support functions, and sub-functions
or tasks.
2. Ask what additional functions, if any, are performed.
3. Ask what monitoring functions are performed and what critical indicators are used to
trigger intervention.
4. Ask what planning functions are performed and what data is used for planning purposes.
5. Express appreciation by thanking the person interviewed for his/her time.
6. If any new data elements are defined during the interviews, make certain they are
incorporated in the Enterprise Data Dictionary so that they may be properly cross-
referenced.
Collect information about data usage and identify task/data relationships. Once all functions and
tasks are identified as either operational or control and planning and their data usage has been
determined, add specificity to the task/data relationships. A task/data relationship is defined as
the unique relationship created between data items when they are used to perform a specific task.
It is critical that these relationships be carefully and thoughtfully defined.
The process of defining task/data relationships begins with analyzing the documentation
developed during the interviews. When identifying a series of unique tasks, follow and apply
these rules:
A task must be performed within one functional area. Each task must consist of a set
of serially performed steps (or serially positioned symbols on a DFD). If a decision
point occurs and one path of the decision involves a new action, in effect the current
task ends and a new one begins. Each step within a single task must be performed
within a reasonable period. If a significant amount of time can elapse between two
steps, more than one task must be defined.
Each step within the task must use the same set of data. However, if new data is
created in one step of the task and used in the next step, they may be considered as
the same set of data.
After all the data flows and other documentation have been analyzed and assigned to tasks,
compare the tasks for each duplicate interview to determine whether the same ones were defined
(this assumes that two persons with the same job title were interviewed in each relevant area).
When conflicts are found, compare the two sets of documentation to determine if one is merely
more detailed:
If the DFDs, etc., appear to be the same and differ only on levels of detail, choose
the one that best defines a complete unit of work.
If real functional differences are found, review the documents (and notes) associated
with each. Sometimes people with similar titles perform different functions due to
their seniority or competence. When major differences are found, separate any
unique tasks and add them to the list.
If differences are found and it is difficult to determine why they exist, request that
the appropriate supervisor review the task definitions developed during the
interviews. (However, do not include any portions of the interviews that are
confidential).
Once any conflicting definitions have been resolved, the task/data relationships must be
specifically documented. Because it is likely that redundant tasks have been defined, arrange the
documentation already produced by department or area. This method increases the likelihood
that redundant tasks will be identified. It is suggested that the documentation of task/data
element relationships begin with a table such as the one shown below.
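A hedged sketch of such a task/data element cross-reference, expressed here as SQL so that it can
also be queried later during consolidation (all table, column, and task names are hypothetical):

CREATE TABLE task_data_usage (
    task_name    VARCHAR(60) NOT NULL,  -- task identified during business analysis
    data_element VARCHAR(60) NOT NULL,  -- element defined in the Enterprise Data Dictionary
    usage_type   CHAR(1)     NOT NULL   -- C = created, R = read, U = updated
                 CHECK (usage_type IN ('C', 'R', 'U')),
    PRIMARY KEY (task_name, data_element, usage_type)
);

-- Example rows for a hypothetical order-entry task
INSERT INTO task_data_usage VALUES ('Record Order', 'order_number',  'C');
INSERT INTO task_data_usage VALUES ('Record Order', 'customer_name', 'R');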
Develop a list of all implicit and explicit constraints such as security, data integrity, response or
cyclic processing time requirements. The purpose of developing a list of all implicit and explicit
constraints is to provide information for the physical database designer to use in determining
operational considerations such as access restrictions, interfaces to other packages, and recovery
capabilities. Document constraints using either a tabular or a memo format. Examples of items to
be considered are:
Develop a list of potential future changes and the way in which they may affect operations. The
purpose of this step is to include in the database design considerations that may affect operations
in the future. Consider future changes to include anything that may affect the scope of the
organization, present operating policies, or the relationship of the organization to the external
environment. When reviewing the interviews to identify changes, highlight anything that implies
change, and, if possible, the effect(s) of that change.
Data Modeling
Data modeling is a technique that involves the analysis of data usage and the modeling of the
relationships among entities. These relationships are modeled independently of any particular
hardware or software system. The objective of logical design is to clearly define and depict user
perspectives of data relationships and information needs.
The various approaches to logical database design involve two major design methodologies:
entity analysis and attribute synthesis.
In applying this method, the primary tool used is the data relationship diagram. This type of
diagram is used to facilitate agreement between the designer and users on the specific data
relationships and to convey those relationships to the physical database designer. It is a graphic
representation of data relationships. The format used must be either the data structure diagram or
entity-relationship diagram.
Simplify the modeling process by partitioning the model into the following four design
perspectives:
The organizational perspective reflects senior and middle management's view of the
organization's information requirements. It is based on how the organization
operates.
The application perspective represents the processing that must be performed to
meet organizational goals, i.e., reports, updates, etc.
The information perspective depicts the generic information relationships necessary
to support decision-making and long-term information requirements. It is
represented by user ad hoc queries, long-range information plans, and general
management requirements.
The event perspective deals with time and scheduling requirements. It represents
when things happen, e.g., frequency of reports.
There are two general rules that provide the foundation for design perspectives:
The design perspectives are modeled by three types of constructs: entity, attribute and
relationship;
In the design perspective, each component of information must be represented by one,
and only one, of these constructs.
An entity refers to an object about which information is collected, e.g., a person, place, thing, or
event. A relationship is an association between the occurrences of two or more entities. An
attribute is a property of an entity, that is, characteristic about the entity, e.g., size, color, name,
age, etc.
It is important to remember that data usage is dynamic. Perceptions change, situations change
and rigid concepts of data use are not realistic. Data involves not only values but relationships as
well and must be divided into logical groups before being molded into whatever structures are
appropriate: matrices, entity-relationship diagrams, data structure diagrams, etc. If at any point it
becomes apparent that a database approach is definitely not suitable or practical for whatever
reason, take an alternative path as soon as possible to save vital resources.
Identify local views of the data. Develop local views for the organization, application,
information, and event design perspectives.
For each of the functions, activities, and tasks identified, there exist what may be called "sub-
perspectives" or local views of the data. Normally there will be several local views of the data
depending on the perspective. These views correspond to self-contained areas of data that are
related to functional areas. The selection of a local view will depend on the particular perspective
and the size of the functional area. Factors which must be considered in formulating local views
include a manageable scope and minimum dependence on, or interaction with, other views.
The primary vehicles for determining local views will be the task/data element matrices and the
task analysis and description forms constructed during logical database analysis.
Formulate Entities
For each local view, formulate the entities that are required to capture the necessary information
about that particular view.
At this point the designer is confronted with two major considerations. The first consideration
deals with the existence of multiple entity instances and can be addressed by using the concept of
"type" or "role". For example, the population of the entity EMPLOYEE can be categorized into
employees of "type": computer systems analyst, secretary, auditor, etc. It is important, at this
stage of logical design to capture the relevant types and model each as a specific entity. The
generalization of these types into the generic entity EMPLOYEE will be considered in the next
stage of conceptual design where user views are consolidated.
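To make the notion of "type" concrete, a hedged sketch (entity and attribute names are
hypothetical) models each relevant employee type as its own entity at this stage, deferring the
generic EMPLOYEE entity to the consolidation stage:

-- Each employee "type" captured as a specific entity during local view modeling
CREATE TABLE systems_analyst (
    employee_no   INTEGER     PRIMARY KEY,
    name          VARCHAR(80) NOT NULL,
    certification VARCHAR(40)
);

CREATE TABLE auditor (
    employee_no  INTEGER     PRIMARY KEY,
    name         VARCHAR(80) NOT NULL,
    audit_region VARCHAR(40)
);
-- Generalization of these types into a generic EMPLOYEE entity is considered
-- later, when user views are consolidated.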
The second consideration deals with the use of the entity construct itself. Often a piece of
information can be modeled as either an entity, attribute, or relationship. For example, the fact
that two employees are married can be modeled using the entity MARRIAGE, the relationship
IS-MARRIED-TO, or the attribute CURRENT-SPOUSE. Therefore, at this point in the design
process the designer must be guided by two rules. First, use the construct that seems most
natural. If this later proves to be wrong, it will be factored out in subsequent design steps.
Second, avoid redundancy in the use of modeling constructs; use one and only one construct to
model a piece of information.
One rule of thumb, which has been successfully used to restrict the number of entities identified
so that a local view can be properly represented, is the "magic number seven, plus or minus two."
This states that the number of facts (information clusters) that a person can manage at any one
time is about seven, give or take two. Therefore, when this is applied to the database design
process, the number of entities contained in a local view must, at the most, be nine, but probably
closer to six or seven. If this restriction cannot be met, perhaps the scope of the local view is too
large.
Give careful consideration to the selection and assignment of an entity name. Since an entity
represents a fact, give a precise name to this fact. This is also important later when views are
consolidated because that subsequent stage deals with homonyms and synonyms. If the name
given to an entity does not clearly distinguish that entity, the integration and consolidation
process will carry this distortion even further.
Finally, select identifying attributes for each entity. Although a particular collection of attributes
may be used as the basis for formulating entities, the significant attribute is the identifier (or
primary key) that uniquely distinguishes the individual entity instances (occurrences), for
example, employee number. This entity identifier is composed of one or more attributes whose
value set is unique. This is also important later in the consolidation phase because the identifying
attribute values are in a one-to-one correspondence with the entity instances. Therefore, two
entities with the same identifiers may to some degree be redundant. However, this will depend
on their descriptive attributes and the degree of generalization.
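As a brief sketch (the attribute choices are illustrative, not prescribed), the entity identifier
becomes the primary key, and any alternate identifier whose value set is also unique can be
declared as such:

CREATE TABLE employee (
    employee_no INTEGER     PRIMARY KEY,     -- entity identifier, unique per instance
    national_id VARCHAR(20) NOT NULL UNIQUE, -- alternate identifier with a unique value set
    name        VARCHAR(80) NOT NULL         -- descriptive attribute
);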
Specify Relationships
Identify relationships between the entities. In this step, additional information is added to the
local view by forming associations among the entity instances. There are several types of
relationships that can exist between entities. These include:
Optional relationships
Mandatory relationships
Exclusive relationships
Contingent relationships
Conditional relationships
In an optional relationship the existence of either entity in the relationship is not dependent on
that relationship. For example, there are two entities, OFFICE and EMPLOYEE. Although an
office may be occupied by an employee, they can exist independently.
In a mandatory relationship, the existence of both entities is dependent on that relationship.
An exclusive relationship is a relationship of three entities where one is considered the prime
entity that can be related to either one of the other entities but not both.
In a contingent relationship the existence of one of the entities in the relationship is dependent on
that relationship. A VEHICLE is made from many PARTS.
A conditional relationship is a special case of the contingent relationship. When it occurs the
arrow must be labeled with the condition of existence.
Relationships can exist in several forms. The associations can be one-to-one (1:1), one-to-many
(1:N) or many-to-many (N:N). A one-to-one association is shown by a single-headed arrow and
indicates that the relationship involves only one logical record, entity or entity class of each type.
A one-to-many association is shown by a double-headed arrow and documents the fact that a
single entity, entity class or logical record of one type can be related to more than one of another
type. A many-to-many association is shown by a double-headed arrow in both directions.
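As a hedged illustration of how these relationship types and association forms commonly surface
in a relational schema (all names are hypothetical, and EMPLOYEE is redefined here so the
sketch stands alone): an optional relationship becomes a nullable foreign key, a contingent one a
NOT NULL foreign key with cascading deletion, a one-to-many association a foreign key on the
"many" side, and a many-to-many association a junction table:

-- Optional (1:N): an OFFICE may be occupied by an EMPLOYEE; both exist independently
CREATE TABLE employee (
    employee_no INTEGER     PRIMARY KEY,
    name        VARCHAR(80) NOT NULL
);
CREATE TABLE office (
    office_no   INTEGER PRIMARY KEY,
    occupant_no INTEGER REFERENCES employee (employee_no)  -- NULL when unoccupied
);

-- Contingent (1:N): one reading of "a VEHICLE is made from many PARTS",
-- where a recorded PART depends on the VEHICLE it belongs to
CREATE TABLE vehicle (
    vehicle_id INTEGER PRIMARY KEY
);
CREATE TABLE part (
    part_no    INTEGER PRIMARY KEY,
    vehicle_id INTEGER NOT NULL REFERENCES vehicle (vehicle_id) ON DELETE CASCADE
);

-- Many-to-many (N:N): EMPLOYEEs are assigned to many PROJECTs and vice versa
CREATE TABLE project (
    project_no INTEGER PRIMARY KEY
);
CREATE TABLE assignment (
    employee_no INTEGER REFERENCES employee (employee_no),
    project_no  INTEGER REFERENCES project (project_no),
    PRIMARY KEY (employee_no, project_no)
);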
An informal procedure for identifying relationships is to pair each entity in the local view with
all other entities contained in that view. Then for each pair, ask if a meaningful question can be
proposed involving both entities or if both entities may be used in the same transaction. If the
answer is yes to either question, determine the type of relationship that is needed to form the
association. Next, determine which relationships are most significant and which are redundant.
Of course, this can be done only with a detailed understanding of the design perspective under
consideration.
Add descriptive attributes. Attributes can be divided into two classes-those that serve to identify
entity instances and those that provide the descriptive properties of entities. The identifier
attributes, which uniquely identify an entity, were added when the entities were formulated.
Descriptive attributes help describe the entity. Examples of descriptive attributes are color, size,
location, date, name and amount.
In this step of local view modeling, the descriptive attributes are added to the previously defined
entities. Only single-valued attributes are allowed for the description of an entity.
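Because only single-valued attributes are allowed, a multi-valued property cannot remain an
attribute of its entity; it is factored into a separate entity (in relational terms, its own table). A
minimal hedged sketch with hypothetical names:

-- An employee's several phone numbers cannot be one attribute of EMPLOYEE;
-- they become a separate, related table (in a full schema, employee_no would
-- reference the EMPLOYEE entity's identifier).
CREATE TABLE employee_phone (
    employee_no  INTEGER     NOT NULL,
    phone_number VARCHAR(20) NOT NULL,
    PRIMARY KEY (employee_no, phone_number)
);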
Consolidate local views and design perspectives. Consolidation of the local views into a single
information structure is the major effort in logical database design. It is here that the separate
views and applications are unified into a potential database. Three underlying concepts form the
basis for consolidating design perspectives: identity, aggregation, and generalization.
Identity is a concept which refers to synonymous elements. Two or more elements are said to be
identical, or to have an identity relationship, if they are synonyms. Although the identity concept
is quite simple, the determination of synonyms is not. Owing to inadequate data representation
methods, the knowledge of data semantics is really quite limited. Typically, an in-depth
understanding of the user environments is required to determine if synonyms exist. Determining
whether similar definitions may be resolved to identical definitions, or if one of the other element
relationships really applies, requires a clear and detailed understanding of user functions and data
needs.
Since aggregation and generalization are quite similar in structure and application, one element
may participate in both aggregation and generalization relationships.
Inferences can be drawn about the aggregation dimension from the generalization dimension and
vice versa, e.g., it can be inferred that each instance of "EXECUTIVE" is also an aggregation of
Name, SSN, and Address. See Figure below.
There are three consolidation types. These types may be combined in various ways to construct
any type of relationship between objects (elements) in different user views. By combining
consolidation types, powerful and complex relationships can be represented. In fact, we
recommend that most semantic relationships be represented by some combination of these types
of consolidation. The consolidation types are identity, aggregation, and generalization, as
introduced above. The consolidation process itself consists of four steps:
1. Select perspectives.
2. Order local views within each perspective.
3. Consolidate local views within each perspective.
4. Resolve conflicts.
Select perspectives. First, confirm the sequence of consolidation by following the order of design
perspectives. Since this order is general, check it against the objectives of the database being
designed. For example, if you are designing a database for a process-oriented organization, you
might consider the key perspectives to be the application and event perspectives and therefore
begin the process with these.
Order local views within each perspective. Once the design perspectives have been ordered,
focus the consolidation process on local views within each perspective. Several views comprise
the perspective chosen and this second step orders these views for the consolidation process. The
order must correspond to each local view's importance with respect to specific design objectives
for the database.
Consolidate local views within each perspective. This step is the heart of the consolidation
process. For simplicity and convenience, use binary consolidation, i.e., integrating only two user
views at a time. This avoids the confusion of trying to consolidate too many views. The order of
consolidation is determined by the previous step where the local views within a perspective have
been placed in a particular order. The process proceeds as follows:
1. Take the top two views in the perspective being considered and consolidate these using
the basic consolidation principles.
2. Using the binary approach, merge the next local view with the previously consolidated
local views. Continue this process until the last view is merged. When the consolidation
process is completed for the first design perspective, the next design perspective is
introduced and this process continues until all perspectives are integrated.
Resolve conflicts. Conflicts can arise in the consolidation process for a number of reasons,
primarily because of the number of people involved and the lack of semantic power in our
modeling constructs. They may also be caused by incomplete or erroneous specification of
requirements. Although the majority of these conflicts are dealt with in the consolidation step
using the rules previously discussed, any remaining conflicts that have to be dealt with by
designer decisions are taken care of in this step. When a design decision is made, it is important
to "backtrack" to the point in the consolidation process where these constructs were entered into
the design. At this point the implications of the design decision are considered and also their
effects on the consolidation process.
The purpose of this step is to present the data model. Use data relationship diagrams to document
local views and their consolidation. These must take the form of an entity-relationship diagram
or a data structure diagram.
Verify the data model. The purpose of this step is to verify the accuracy of the data model and
obtain user concurrence on the proposed database design.
The process of developing the information structure involves summarizing and interpreting large
amounts of data concerning how different parts of an organization create and/or use that data.
However, it is extremely difficult to identify and understand all data relationships and the
conditions under which they may or may not exist.
Although the design process is highly structured, it is still probable that some relationships will
be missed and/or expressed incorrectly. In addition, since the development of the information
structure is the only mechanism that defines explicitly how different parts of an organization use
and manage data, it is reasonable to expect that management, with this newfound knowledge,
might possibly consider some changes. Because of this possibility, it is necessary to provide
management with an understanding of the data relationships shown in the information structure
and how these relationships affect the way in which the organization performs, or can perform,
its mission. Each relationship in the design and each relationship excluded from the design must
be identified and expressed in very clear statements that can be reviewed and approved by
management. Following management's review, the design will, if necessary, be adjusted to
reflect its decisions.
The verification process is separated into two parts: self-analysis and user review. In self-
analysis, the analyst must ensure that:
All entities have been fully defined for each function and activity identified.
All entities have at least one relation to another entity.
All attributes have been associated with their respective entities.
All data elements have been defined in the Enterprise Data Dictionary.
All processes in the data flow diagram can be supported by the database when the
respective data inputs and outputs are automated.
All previously identified potential changes have been assessed for their impact on
the database and necessary adjustments to the database have been determined.
To obtain user concurrence on the design of the database, perform the following steps, in the
form of a walk-through, to interpret the information structure for the user:
1. State what each entity is dependent upon (i.e., whether an arrow points to it). Example: all
ORDERS must be from CUSTOMERS with established accounts (a constraint sketch of this
dependency appears after the list);
2. State what attributes are used to describe each entity;
3. Define the perceived access path for each entity;
4. Define the implications of each arrow (i.e., one-to-one, one-to-many etc.);
5. Define what information cannot exist if an occurrence of an entity is removed from the
database.
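A hedged sketch of how the dependency in step 1 could eventually be enforced in the
implemented schema (names are hypothetical; the walk-through itself concerns only the logical
model):

CREATE TABLE customer (
    account_no INTEGER     PRIMARY KEY,
    status     VARCHAR(12) NOT NULL   -- e.g. 'ESTABLISHED'
);

-- Every ORDER must come from an existing CUSTOMER; the "established account"
-- condition would additionally require a check or trigger at implementation time.
CREATE TABLE customer_order (
    order_no   INTEGER PRIMARY KEY,
    account_no INTEGER NOT NULL REFERENCES customer (account_no)
);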
Give the user an opportunity to comment on any perceived discrepancies in the actual operations
or usage. If changes need to be made, then give the user the opportunity to review the full design
at the completion of the changes. Once all changes have been made and both the relationship
diagram and the data definitions have been updated, obtain user concurrence on the design
specifications.
The major objective of implementation design is to produce a schema that satisfies the full range
of user requirements and that can be processed by a DBMS. These extend from integrity and
consistency constraints to the ability to efficiently handle any projected growth in the size and/or
complexity of the database. However, there must be considerable interaction with the application
program design activities that go on simultaneously with database design. Analyze the high-level
program specifications and the program design guidance supplied to ensure they correspond to
the proposed database structure.
The guidance provided in this section serves a dual purpose. First, it provides general guidelines
for physical database design. Various techniques and options used in physical design are
discussed as well as when and how they must be used to meet specific requirements. These
guidelines are generic in nature and for this reason are intended to provide a basic understanding
of physical database design prior to using specific vendor documentation. Second, since database
management systems vary according to the physical implementation techniques and options they
support, these guidelines will prove useful during the database management software
procurement. They will provide a means for evaluating whether a DBMS under consideration
can physically handle data in a manner that will meet user requirements.
The usefulness of these guidelines is directly related to where one is in the development life
cycle and the level of expertise of the developer. This document assumes the reader is familiar
with database concepts and terminology since designers will most likely be database
administrators or senior computer specialists. To aid the reader, a glossary and bibliography are
provided.
The criterion for determining physical design is quite different from that of logical design.
Selection of placement and structure is determined by evaluating such requirements as
operational efficiency, response time, system constraints and security concerns. This physical
design layout must be routinely adjusted to improve the system operation, while maintaining the
user's logical view of data. The physical structuring or design will often be quite different from
the user's perception of how the data is stored.
The following steps provide general guidance for physical database design. Since much of the
effort will depend on the availability of data and resources, the sequence of these steps is
flexible:
These critical factors will dictate the weight placed on the various physical design considerations
to be discussed. The following are examples of user requirements. Notice that with each
requirement there is an example of an associated trade-off.
Retrieval time decreases with a simple database structure; however, to meet the logical design
requirements, it may be necessary to implement a more complex multilevel structure.
Ease of recovery increases with a simple structure but satisfying data relationships may require
more complex mechanisms.
Privacy requirements may require stringent security such as encryption or data segmentation.
These procedures decrease performance, however, in terms of update and retrieval time.
Active files, especially those accessed in real time, dictate high-speed devices; however, this will
represent increased cost.
By attempting to determine the primary type of processing, the designer has a framework of
physical requirements with which to begin design. Three environments will be discussed;
however, keep in mind that these are merely guidelines since most systems will not fit neatly into
one general set of requirements. These considerations will often conflict with the user's
requirements or security needs (to be discussed), thus forcing the designer to make decisions
regarding priority.
Normally, a high-volume processing environment requires fast response time, and multiple run units will actively share
DBMS facilities. In order to meet response time specifications, cost may increase due to the
necessity for additional system resources. Recovery may be critical in such a volatile
environment, and whenever possible, use a simple structure. In a CODASYL (network)
structure, this time specification would translate into reduced data levels, number of network
relationships, number of sets and size of set occurrences. In a high volume processing
environment requests are most frequently random in nature requiring small amounts of
information transfer; thus affecting page and buffering considerations.
Low volume systems generally process more data per request, indicating run units may remain in
the system longer. There is the likelihood of more sequential requests and reports, and response
time is probably not the critical issue. Resources may be more limited in this environment,
implying smaller buffers and perhaps fewer peripherals. With the possibility of fewer resources,
those resources may need to be more highly utilized. On-line recovery techniques may be
unavailable since the resource requirements are costly. Although the number of transactions is
low in this environment, the probability of multiple simultaneous run units accessing the same
data may be high.
When a batch environment is indicated, the designer is left with maximum flexibility since the
requirement is reasonable turnaround time and effective use of resources. Because of job
scheduling options, concurrency problems can be controlled. Recovery tends to be less critical
and will be determined by such factors as file volatility, the time necessary to rerun update
programs, and the availability of input data. For example, if the input data is readily available,
the update programs are short, and processing is 85 percent retrieval, the choice may be made to
avoid the overhead of maintaining an on-line recovery file.
The DBMS must first physically support the logical design requirements. That is, based on the
logical data model, the package must support the required hierarchical, network or relational
structure. Early stages of analysis must provide enough information to determine this basic
structure. From a physical database design point of view, an analysis must then be made as to
how effectively the DBMS can handle the organizational and environmental considerations. If
the proposed package fails to provide adequate support of the requirements, the project manager
must be notified. The notification must include the specific point(s) of failure, anticipated
impact(s), and any suggestions or alternatives for alleviating the failure(s).
This procedure involves selecting the physical storage and access methods as well as secondary
and multiple key implementation techniques. DBMS packages vary as to the options offered.
The use of vendor documentation, providing specific software handling details, will be necessary
to complete this process.
Perform Sizing of Data
Obtain the specifics of sizing from vendor documentation as each DBMS handles space
requirements in a different manner. Consider sizing in conjunction with designing the placement
of data. Once data records, files and other DBMS specifics have been sized according to a
proposed design, a decision may be made, because of the space allocation involved, to change
the design. Data compaction techniques may be considered at this point. Flexibility to make
changes and reevaluate trade-offs during this entire procedure is of critical importance.
The DBMS selected must have the options necessary to implement security and recovery
requirements. Implementation of these considerations will often cause trade-offs in other design
areas.
Deliverables
Decision Analysis and Description Forms must identify such items as the type of decision, the
decision maker, and the nature of the decision.
Task Analysis and Description Forms must include the name of the task, its description
(overview), the people/departments involved, and subtasks and their relationships.
Task/Data Element Usage Matrix
A task/data element usage matrix relates each data element to one or more tasks.
Data Models
Data relationship diagrams depict the relationships between entities. These are tools that provide
one way of logically showing how data within an organization is related. They must be modeled
using the conventions for either data structure diagrams or entity-relationship diagrams. The term
"entity" refers to an object of interest (a person, place, thing, or event) about which information is
collected. When constructing either of these diagrams it is recommended that the entities be
limited to those of fundamental importance to the organization.
Entity-Attribute Lists
Entity-attribute relation lists may be derived from the Enterprise Data Dictionary listings.
Data definition lists may be derived from Enterprise Data Dictionary listings.
Duplication of effort can be eliminated if there are existing documents available containing
physical specifications. Use the following resources for developing documentation:
1. DBMS-provided documentation - For example, a listing of the schema from a CODASYL
DBMS will provide such detail as data names, sizing, placement, and access methods.
2. Data Dictionary – The Data Dictionary listing provides certain physical specifications,
such as, data format and length.
3. Project documentation - All documentation submitted as Physical Database
Specifications must be organized in a macro to micro manner or global to specific. That
is, begin at the schema level, moving to subschema, indices, data elements, etc. The
objective is to organize documentation in a manner that is clear and easy for the user to
read.
Names
Where appropriate in the documentation, identify names of physical database items. Specifically,
the items will be all those defined to the DBMS software, such as: Schema; Subschema; Set;
Record; Field; Key; Index Names
Where data naming standards are applicable, these standards shall be met to the extent possible
with the DBMS software. For example, if the DBMS does not permit hyphens in naming, an
exception would be made to the standard "all words in a name must be separated with a hyphen".
Data Structure/Sizing
Identification of data elements, associations within and between record types as well as sizing
requirements are identified and documented during the logical database design process. The
physical representation of data structures will vary from the logical, however, since the
physically stored data must adhere to specific DBMS characteristics. As applicable to the
DBMS, the following structures must be documented: Records; Blocks/Pages; Files
Describe the physical record layout. In this description, include the data fields, embedded
pointers, spare data fields, and database management system overhead (flags, codes, etc).
Besides data record types, document any other record types such as index records. If records are
handled as physical blocks or pages, provide the following:
Specify the amount of space allocated for each database file. This must be consistent with the
total record and block/page sizing documentation described above.
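As a purely illustrative sizing calculation (all figures are assumed): a record of roughly 200 bytes,
including embedded pointers and DBMS overhead, stored in 4,096-byte pages yields about 20
records per page; one million such records therefore require about 50,000 pages, or roughly
200 MB of allocated file space before indexes and free-space reserves are added.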
Data Placement
1. For each record type: state the storage and access method used; describe that method; and,
where applicable, identify the physical data location (track, cylinder).
2. Where an algorithmic (hashed) access method is used: give the primary record key used by
the algorithm; describe the algorithm, including the number and size of the randomized
address spaces available to it; and give the packing density and the strategy for its
determination.
3. Where an indexed sequential access method is used: give the primary record key; state the
indexing strategy/levels; and state the initial load strategy.
4. Where a chained access method is used: give the access path to the record type (i.e., is this
primary access of a detail record through the master record, or a chain of secondary keys?);
list the pointer options used (i.e., forward, backward, owner, etc.); and indicate whether the
chain is scattered or stored contiguously with the master.
5. Where an index access method is used, identify the keys used to index a record type. (An
illustrative sketch follows this list.)
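Access-method choices are expressed differently by every DBMS. As a rough relational
illustration only (syntax and behavior vary by product; object names are hypothetical), a primary
key typically provides the primary access path, while secondary keys are implemented as
additional indexes:

-- Primary access path via the record key (commonly a B-tree or hashed structure)
CREATE TABLE account (
    account_no INTEGER        PRIMARY KEY,
    branch_no  INTEGER        NOT NULL,
    balance    DECIMAL(12,2)  NOT NULL
);

-- Secondary key implemented as an additional index
CREATE INDEX account_branch_idx ON account (branch_no);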
Database planning
The database-planning phase begins when a customer requests the development of a database
project. It is a set of tasks or activities that determine the resources required for database
development and the time limits of the different activities. During the planning phase, four major
activities are performed:
Review and approve the database project request.
Prioritize the database project request.
Allocate resources such as money, people and tools.
Arrange a development team to develop the database project.
Database planning should also include the development of standards that govern how data will
be collected, how its format should be specified, and what documentation will be needed.
Summary
Define the problems and constraints
Define the objectives
Define scope and boundaries
Requirements Analysis
Requirements analysis is done in order to understand the problem that is to be solved. It is a very
important activity in the development of a database system. The person responsible for
requirements analysis is often called the "analyst". In the requirements analysis phase, the
requirements and expectations of the users are collected and analyzed. The collected requirements help to
understand the system that does not yet exist. There are two major activities in requirements
analysis.
Problem understanding or analysis
Requirement specifications.
Most important stage
Labor-intensive
Involves assessing the information needs of an organization so the database can be
designed to meet those needs
Summary
Examine the current system operation.
DBMS Selection
In this phase an appropriate DBMS is selected to support the information system. A number of
factors are involved in DBMS selection; they may be technical or economic. The technical
factors are concerned with the suitability of the DBMS for the information system. The
following technical factors are considered.
Type of DBMS such as relational, object-oriented etc
Storage structure and access methods that the DBMS supports.
User and programmer interfaces available.
Type of query languages.
Development tools etc.
Implementation
After the design phase and selecting a suitable DBMS, the database system is implemented. The
purpose of this phase is to construct and install the information system according to the plan and
design as described in the previous phases. Implementation involves a series of steps leading to an
operational information system that includes creating database definitions (such as tables,
indexes etc), developing applications, testing the system, data conversion and loading,
developing operational procedures and documentation, training the users and populating the
database. In the context of information engineering, it involves two steps.
Database definitions: the tables developed in the ER diagram are converted into SQL
statements (a small JDBC sketch follows below).
Creating applications: application programs are developed once the system administrator has
installed and configured an RDBMS.
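As a small illustration of the database-definition step, the sketch below uses the standard JDBC API to create one table and one index; the table, connection URL and credentials are invented for the example and are not part of the methodology described above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSchema {
    public static void main(String[] args) throws Exception {
        // Connection URL, user and password are placeholders; any JDBC-compliant RDBMS will do.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/coursedb", "dba", "secret");
             Statement st = con.createStatement()) {
            // A hypothetical EMPLOYEE entity from the ER diagram becomes a CREATE TABLE statement.
            st.executeUpdate(
                "CREATE TABLE EMPLOYEE (" +
                "  ENO    INTEGER PRIMARY KEY," +
                "  ENAME  VARCHAR(50) NOT NULL," +
                "  TITLE  VARCHAR(30))");
            // An index supporting a frequently used access path.
            st.executeUpdate("CREATE INDEX EMP_TITLE_IDX ON EMPLOYEE(TITLE)");
        }
    }
}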
Operational Maintenance
Once the database system is implemented, the operational maintenance phase of the database
system begins. The operational maintenance is the process of monitoring and maintaining the
database system. Maintenance includes activities such as adding new fields, changing the size of
existing fields, adding new tables, and so on. As the database system requirements change, it
becomes necessary to add new tables or remove existing tables and to reorganize some files by
changing primary access methods or by dropping old indexes and constructing new ones. Some
queries or transactions may be rewritten for better performance. Database tuning or
reorganization continues throughout the life of the database as the requirements keep
changing.
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER TWO: DATABASE RECOVERY AND DATABASE
SECURITY
Learning Objectives:
Causes of Failures
Transaction-Local - The failure of an individual transaction can be caused in several ways:
Transaction – Induced Abort e.g. insufficient funds
Unforeseen Transaction Failure, arising from bugs in the application programs
System-Induced Abort – e.g. when Transaction Manager explicitly aborts a
transaction because it conflicts with another transaction or to break a deadlock.
Site Failures - Occurs due to failure of local CPU and results in a System Crashing.
Total Failure – all sites in DDS are down
Partial Failure – some sites are down
Media Failures e.g. head crashes
Network Failures - a failure may occur in the communications links.
Disasters e.g. fire or power failures
Carelessness - unintentional destruction of data by users
Sabotage - intentional corruption or destruction of data, h/w or s/w facilities
Recovery Procedures
Determining which data structures are intact and which ones need recovery.
Ensuring that no work has been lost and no incorrect data has been entered into the database.
recovery of individual offline table spaces or files while the rest of a database is
operational
the ability to apply redo log entries in parallel to reduce the amount of time for recovery
Export and Import utilities for archiving and restoring data in a logical data format, rather
than a physical file backup
All operations on the database carried out by all transactions are recorded in the log file
(journal). Each log record typically contains the transaction identifier, the type of log action, the
identifier of the data item affected, and the before- and after-images of that item.
Recovery Techniques
Restart Procedures – No transactions are accepted until the database has been repaired.
Includes:
Emergency Restart - follows when a system fails without warning e.g. due to power
failure.
Cold Restart - system restarted from archive when the log and/or restart file has
been corrupted.
Warm Restart - follows controlled shutdown of system
Threats
Discretionary security
Mandatory security
Account creation
Privilege granting
Privilege revocation
Discretionary Privileges
Two levels of assigning privileges:
Account level: privileges that apply to the account as a whole, e.g., CREATE, ALTER, DROP and SELECT
Relation (table) level: privileges that apply to individual relations or views
Authentication using user ids and passwords. System privileges allow a user to create or
manipulate objects, but do not give access to actual database objects. Object privileges are used
to allow access to a specific database object, such as a particular table or view, and are given at
the table or view level. Privileges can be granted and revoked, and they can also be propagated
to other accounts (the GRANT OPTION).
Roles
Roles are used to ease the management task of assigning a multitude of privileges to users. Roles
are first created and then given sets of privileges that can be assigned to users and other roles.
Users can be given multiple roles.
Three default roles:
Connect Role allows user login and the ability to create their own tables, indexes, etc.
Resource Role is similar to the Connect Role, but allows for more advanced rights such
as the creation of triggers and procedures.
Database Administrator Role is granted all system privileges needed to administer the
database and users.
Profiles
Profiles allow the administrator to place specific restrictions and controls on a number of system
resources, password use etc. These profiles can be defined, named, and then assigned to specific
users or groups of users.
Two types of profiles: system resource profiles and product profiles
System resource profiles can be used to put user limits on certain system resources such
as CPU time, No. of data blocks that can be read per session or program call, the number
of concurrent active sessions, idle time, and the maximum connection time for a user.
Product profiles can be used to prevent users from accessing specific commands or all
commands
Profiles can be used to prevent intentional or unintentional "hogging" of system resources
Access Control
Access control for Operating Systems
Deals with unrelated data
Deals with entire files
Access control for Databases
Deals with records and fields
Concerned with inference of one field from another
An access control list for several hundred files is easier to implement than an access control list
for a database!
Backups
Logical backups or "exports" take a snapshot of the database at a given point in time by
user or specific table(s) and allow recovery of the full database or of single tables if needed.
Recovery
Database Audits
Audit Trail - A database log that is used mainly for security purposes. An audit trail is desirable in
order to:
Determine who did what
Prevent incremental access
Audit trail of all accesses is impractical:
Slow
Large
Possible over-reporting (the pass-through problem): a field may be accessed during a select
operation even though its values are never reported to the user
Replication
Database replication facilities can be used to create a duplicate fail-over database site in case of
system failure of the primary database. A replicated database can also be useful for off-loading
large processing intensive queries.
Parallel Servers
Parallel Server makes use of two or more servers in a cluster which access a single database. A
cluster can provide load balancing, can scale up more easily, and if a server in the cluster fails
only a sub-set of users may be affected.
Data Partitioning
Data partitioning can be used by administrators to aid in the management of very large tables.
Large tables can be broken into smaller tables by using data partitioning. One advantage of
partitioning is that data that is more frequently accessed can be partitioned and placed on faster
hard drives. This helps to ensure faster access times for users.
Database Integrity
Concern that the database as a whole is protected from damage
Element Integrity - Concern that the value of a specific element is written or changed only by
actions of authorized users
Element Accuracy - Concern that only correct values are written into the elements of a database
Problem with DDBS - Failure of system while modifying data
Results
Single field - half of a field being updated may show the old data
Multiple fields - no single field reflects an obvious error
Solution - Update in two phases
First phase - Intent Phase
DBMS gathers the information and other resources needed to perform the update
Makes no changes to database.
Second Phase - Commit Phase
Write commit flag to database
DBMS make permanent changes
If the system fails during the second phase, the database may contain incomplete data, but this can
be repaired by re-performing all activities of the second phase
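A minimal sketch of the intent/commit idea is given below; the class and the Database interface are invented for illustration, since a real DBMS implements this inside its recovery manager rather than in application code.

public class TwoPhaseUpdate {
    private boolean commitFlag = false;     // stands for a durably written commit marker
    private String pendingNewValue;         // gathered during the intent phase

    // Phase 1 (intent): gather everything needed, but change nothing in the database.
    public void intent(String newValue) {
        pendingNewValue = newValue;
    }

    // Phase 2 (commit): write the commit flag first, then apply the changes.
    public void commit(Database db) {
        commitFlag = true;                  // must itself be written durably in a real system
        db.write("account.balance", pendingNewValue);
        commitFlag = false;
    }

    // On restart after a crash: if the flag is set, phase two is simply repeated;
    // repeating it is safe because the phase-two writes are designed to be repeatable.
    public void recover(Database db) {
        if (commitFlag) {
            commit(db);
        }
    }

    // Hypothetical persistent store interface, standing in for the real DBMS.
    interface Database { void write(String field, String value); }
}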
Internal Consistency
Error Detection and Correction Code
Parity checks
Cyclic redundancy checks (CRC)
Hamming codes
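As a small illustration of an error-detecting code, the sketch below uses Java's built-in CRC32 class to compute a checksum that can be stored with a record and re-checked on every read; the record format is invented.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class RecordChecksum {
    // Compute a CRC over the stored bytes of a record.
    static long checksum(String record) {
        CRC32 crc = new CRC32();
        crc.update(record.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        String record = "1001|Smith|Syst. Anal.";
        long stored = checksum(record);                    // written alongside the record
        // Later, on read: recompute and compare; a mismatch signals corruption.
        System.out.println(checksum(record) == stored);    // true if the record is intact
    }
}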
Shadow Fields
Copy of entire attributes or records
Second copy can provide replacement
Constraints
State Constraints
Describes the condition of the entire database.
Transition Constraints
Describes conditions necessary before changes can be applied to database
Multilevel Data Bases
Three characteristics of database security:
The security of a single element may differ from the security of other elements of the
same record or from values of the same attribute (implies security should be
implemented for individual elements)
Several grades of security may be needed and may represent ranges of allowable
knowledge, which may overlap
The security of an aggregate may differ from the security of the individual elements
Every combination of elements in a database may also have a distinct sensitivity. The
combination may be more or less sensitive than any of its individual elements
"Manhattan" (not sensitive)
Cryptographic Checksum
Trusted Front-End (Guard)
User identifies self to front-end; front-end authenticates user's identity
User issues a query to front-end
Front-end verifies user's authorization to data
Front-end issues query to database manager
Database manager performs I/O access
Database manager returns result of query to front-end
Front-end verifies validity of data via checksum and checks classification of data
against security level of user
Front-end transmits data to untrusted front-end for formatting.
Untrusted front-end transmits data to user
View
A subset of a database, containing exactly the information that a user is entitled to
access
Can represent a single user's subset database, so that all of a user's queries access only
that data
Layered Implementation
The database manager is integrated with a trusted operating system to form a trusted database
manager base; a further layer translates views into the base relations of the database
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER THREE: TRANSACTION CONTROL
Learning Objectives:
Introduction to Transactions
Transactions are collections of actions that potentially modify two or more entities. An example
of a transaction is a transfer of funds between two bank accounts. The transaction consists of
debiting the source account, crediting the target account, and recording the fact that this
occurred. An important message of this chapter is that transactions are not simply the domain of
databases; instead they are issues that are potentially pertinent to all of your architectural tiers.
Let's start with a few definitions, following Bernstein and Newcomer (1997).
A transaction-processing (TP) system is the hardware and software that implements the
transaction programs. A TP monitor is a portion of a TP system that acts as a kind of funnel or
concentrator for transaction programs, connecting multiple clients to multiple server programs
(potentially accessing multiple data sources). In a distributed system, a TP monitor will also
optimize the use of the network and hardware resources. Examples of TP monitors include
IBM’s Customer Information Control System (CICS), IBM’s Information Management System
(IMS), BEA’s Tuxedo, and Microsoft Transaction Server (MTS).
The focus of this chapter is on the fundamentals of online transactions (i.e., the technical side of
things). The critical concepts are:
ACID
Two-phase commits
Nested transactions
Atomicity. The whole transaction occurs or nothing in the transaction occurs; there is
no in between. In SQL, the changes become permanent when a COMMIT statement is
issued, and they are aborted when a ROLLBACK statement is issued. For example, the
transfer of funds between two accounts is a transaction. If we transfer $20 from account
A to account B, then at the end of the transaction A’s balance will be $20 lower and B’s
balance will be $20 higher (if the transaction is completed) or neither balance will have
changed (if the transaction is aborted).
Consistency. When the transaction starts the entities are in a consistent state, and when
the transaction ends the entities are once again in a consistent, albeit different, state.
The implication is that the referential integrity rules and applicable business rules still
apply after the transaction is completed.
Isolation. All transactions work as if they alone were operating on the entities. For
example, assume that a bank account contains $200 and each of us is trying to
withdraw $50. Regardless of the order of the two transactions, at the end of them the
account balance will be $100, assuming that both transactions work. This is true even if
both transactions occur simultaneously. Without the isolation property two
simultaneous withdrawals of $50 could result in a balance of $150 (both transactions
saw a balance of $200 at the same time, so both wrote a new balance of $150). Isolation
is often referred to as serializability.
Durability. The entities are stored in a persistent media, such as a relational database
or file, so that if the system crashes the transactions are still permanent.
Nested Transactions
So far I have discussed flat transactions, transactions whose steps are individual activities. A
nested transaction is a transaction where some of its steps are other transactions, referred to as
sub-transactions. In practice, transactions can be implemented in several ways, discussed in turn below:
1. Database transactions
2. Object transactions
3. Distributed object transactions
4. Including non-transactional steps
Database Transactions
The simplest way for an application to implement transactions is to use the features supplied by
the database. Transactions can be started, attempted, then committed or aborted via SQL code.
Better yet, database APIs such as Java Database Connectivity (JDBC) and Open Database
Connectivity (ODBC) provide classes that support basic transactional functionality.
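A minimal JDBC sketch of the funds-transfer transaction used in this chapter is shown below; the ACCOUNT table and its columns are assumed for illustration.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferFunds {
    public static void transfer(Connection con, int from, int to, double amount) throws SQLException {
        con.setAutoCommit(false);            // start an explicit transaction
        try (PreparedStatement debit = con.prepareStatement(
                 "UPDATE ACCOUNT SET BALANCE = BALANCE - ? WHERE ACC_NO = ?");
             PreparedStatement credit = con.prepareStatement(
                 "UPDATE ACCOUNT SET BALANCE = BALANCE + ? WHERE ACC_NO = ?")) {
            debit.setDouble(1, amount);  debit.setInt(2, from);   debit.executeUpdate();
            credit.setDouble(1, amount); credit.setInt(2, to);    credit.executeUpdate();
            con.commit();                    // both updates become permanent together
        } catch (SQLException e) {
            con.rollback();                  // neither update survives (atomicity)
            throw e;
        }
    }
}

If either UPDATE fails, or the connection is lost before the commit, the rollback ensures that neither balance changes, which is exactly the atomicity property described above.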
Object Transactions
At the time of this writing support for transaction control is one of the most pressing issues in the
web services community and full support for nested transactions is underway within the EJB
community. Databases aren't the only things that can be involved in transactions: objects,
services, components, legacy applications, and non-relational data sources can all be included in
transactions.
The advantage of adding behaviors implemented by objects (and similarly services, components,
and so on) to transactions is that applications become far more robust. Can you imagine using a code
editor, word processor, or drawing program without an undo function? If not, then I believe it
becomes reasonable to expect both behavior invocation as well as data transformations as steps
of a transaction. Unfortunately this strategy comes with a significant disadvantage – increased
complexity. For this to work your business objects need to be transactionally aware. Any
behavior that can be invoked as a step in a transaction requires supporting attempt, commit, and
abort/rollback operations. Adding support for object-based transactions is a non-trivial
endeavor.
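One possible sketch of what a transactionally aware business behavior could look like is given below; the interface and class names are my own, not a standard API.

// A step that can participate in a transaction must expose the three operations
// named above: attempt the work, make it permanent, or undo it.
public interface TransactionalStep {
    void attempt() throws Exception;   // do the work tentatively
    void commit();                     // make the tentative work permanent
    void abort();                      // roll the tentative work back
}

// Example: a rename command that can be undone, so it can safely join a transaction.
class RenameCustomerStep implements TransactionalStep {
    private String oldName;
    private final String newName;
    private final Customer customer;
    RenameCustomerStep(Customer c, String newName) { this.customer = c; this.newName = newName; }
    public void attempt() { oldName = customer.getName(); customer.setName(newName); }
    public void commit()  { /* nothing further to do: the change is already in place */ }
    public void abort()   { customer.setName(oldName); }
}

// Minimal stand-in for a business object.
class Customer {
    private String name;
    public String getName() { return name; }
    public void setName(String n) { name = n; }
}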
Just as it is possible to have distributed data transactions, it is possible to have distributed object
transactions as well. To be more accurate, these are simply distributed transactions, period – it is
not just about databases any more, but databases plus objects plus services plus components, and
so on.
Sometimes you find that you need to include a non-transactional source within a transaction. A
perfect example is an update to information contained in an LDAP directory or the invocation of
a web service, neither of which at the time of this writing support transactions. The problem is
as soon as a step within a transaction is non-transactional the transaction really isn’t a transaction
any more. You have four basic strategies available to you for dealing with this situation:
1. Remove the non-transactional step from your transaction. In practice this is rarely an
option, but if it's a viable strategy then consider doing so.
2. Implement commit. This strategy, which could be thought of as the “hope the parent
transaction doesn’t abort” strategy, enables you to include a non-transactional step within
your transaction. You will need to simulate the attempt, commit, and abort protocol used
by the transaction manager. The attempt and abort behaviors are simply stubs that do
nothing other than implement the requisite protocol logic. The one behavior that you do
implement, the commit, will invoke the non-transactional functionality that you want. A
different flavor of this approach, which I’ve never seen used in practice, would put the
logic in the attempt phase instead of the commit phase.
3. Implement attempt and abort. This is an extension to the previous technique whereby
you basically implement the “do” and “undo” logic but not the commit. In this case, the
work is done in the attempt phase; the assumption is that the rest of the transaction will
work, but if it doesn’t, you still support the ability to roll back the work. This is an
“almost transaction” because it doesn’t avoid the problems with collisions described
earlier.
4. Make it transactional. With this approach, you fully implement the requisite attempt,
commit, and abort behaviors. The implication is that you will need to implement all the
logic to lock the affected resources and to recover from any collisions. An example of
this approach is supported by the J2EE Connector Architecture (JCA), in particular by the
LocalTransaction interface.
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER FOUR: CONCURRENCY CONTROL
Learning Objectives:
To illustrate the concept of concurrency control, consider two travelers who go to electronic
kiosks at the same time to purchase a train ticket to the same destination on the same train.
There's only one seat left in the coach, but without concurrency control, it's possible that both
travelers will end up purchasing a ticket for that one seat. However, with concurrency control,
the database wouldn't allow this to happen. Both travelers would still be able to access the train
seating database, but concurrency control would preserve data accuracy and allow only one
traveler to purchase the seat.
This example also illustrates the importance of addressing this issue in a multi-user database.
Obviously, one could quickly run into problems with the inaccurate data that can result from
several transactions occurring simultaneously and writing over each other. The following section
provides strategies for implementing concurrency control.
Concurrency Control Locking Strategies
Pessimistic Locking: This concurrency control strategy involves keeping an entity in a database
locked the entire time it exists in the database's memory. This limits or prevents users from
altering the data entity that is locked. There are two types of locks that fall under the category of
pessimistic locking: write lock and read lock.
With write lock, everyone but the holder of the lock is prevented from reading, updating, or
deleting the entity. With read lock, other users can read the entity, but no one except for the lock
holder can update or delete it.
Optimistic Locking: This strategy can be used when instances of simultaneous transactions, or
collisions, are expected to be infrequent. In contrast with pessimistic locking, optimistic locking
doesn't try to prevent the collisions from occurring. Instead, it aims to detect these collisions and
resolve them on the chance occasions when they occur.
Pessimistic locking provides a guarantee that database changes are made safely. However, it
becomes less viable as the number of simultaneous users or the number of entities involved in a
transaction increases, because the potential for having to wait for a lock to release will increase.
Optimistic locking can alleviate the problem of waiting for locks to release, but then users have
the potential to experience collisions when attempting to update the database.
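A common way to implement optimistic locking is a version (or timestamp) column that is checked at update time. The JDBC sketch below assumes a CUSTOMER table with ID, NAME and VERSION columns (invented for illustration); an update count of zero signals a collision. Pessimistic write locking, by contrast, is usually obtained by reading the row with SELECT ... FOR UPDATE inside a transaction, so that other writers block until the lock is released.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticUpdate {
    // Returns false if another transaction changed the row since we read it (a collision).
    public static boolean updateName(Connection con, int id, String newName, int versionRead)
            throws SQLException {
        String sql = "UPDATE CUSTOMER SET NAME = ?, VERSION = VERSION + 1 " +
                     "WHERE ID = ? AND VERSION = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, newName);
            ps.setInt(2, id);
            ps.setInt(3, versionRead);           // the version we saw when we read the row
            return ps.executeUpdate() == 1;      // 0 rows updated => collision detected
        }
    }
}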
Lock Problems
Deadlock:
When dealing with locks, two problems can arise, the first of which is deadlock. Deadlock
refers to a situation where two or more processes are each waiting for another to
release a resource, or more than two processes are waiting for resources in a circular chain.
Deadlock is a common problem in multiprocessing where many processes share a specific type
of mutually exclusive resource. Some computers, usually those intended for the time-sharing
and/or real-time markets, are often equipped with a hardware lock, or hard lock, which
guarantees exclusive access to processes, forcing serialization. Deadlocks are particularly
disconcerting because there is no general solution to avoid them.
A fitting analogy of the deadlock problem could be a situation like when you go to unlock your
car door and your passenger pulls the handle at the exact same time, leaving the door still locked.
If you have ever been in a situation where the passenger is impatient and keeps trying to open the
door, it can be very frustrating. Basically you can get stuck in an endless cycle, and since both
actions cannot be satisfied, deadlock occurs.
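The two-resource deadlock can be reproduced directly in Java; in the self-contained illustration below (not from the text), each thread acquires the locks in the opposite order, so each may end up waiting forever for the lock the other holds and the program usually hangs.

public class DeadlockDemo {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (lockA) {                    // T1 holds A ...
                sleep(100);
                synchronized (lockB) {                // ... and waits for B
                    System.out.println("T1 done");
                }
            }
        }).start();
        new Thread(() -> {
            synchronized (lockB) {                    // T2 holds B ...
                sleep(100);
                synchronized (lockA) {                // ... and waits for A: deadlock
                    System.out.println("T2 done");
                }
            }
        }).start();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}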
Livelock:
Livelock is a special case of resource starvation. A livelock is similar to a deadlock, except that
the states of the processes involved constantly change with regard to one another while never
progressing. The general definition only states that a specific process is not progressing. For
example, the system keeps selecting the same transaction for rollback causing the transaction to
never finish executing. Another livelock situation can come about when the system is deciding
which transaction gets a lock and which waits in a conflict situation.
An illustration of livelock occurs when numerous people arrive at a four way stop, and are not
quite sure who should proceed next. If no one makes a solid decision to go, and all the cars just
keep creeping into the intersection afraid that someone else will possibly hit them, then a kind of
livelock can happen.
Basic Timestamping:
Basic timestamping is a concurrency control mechanism that eliminates deadlock. This method
doesn’t use locks to control concurrency, so it is impossible for deadlock to occur. According to
this method a unique timestamp is assigned to each transaction, usually showing when it was
started. This effectively allows an age to be assigned to transactions and an order to be assigned.
Data items have both a read-timestamp and a write-timestamp. These timestamps are updated
each time the data item is read or updated respectively.
Problems arise in this system when a transaction tries to read a data item which has been written
by a younger transaction. This is called a late read: the data item has changed since the
transaction started, and the solution is to roll the transaction back and restart it with a new
timestamp. Another problem occurs when a transaction tries to write a data item which has been
read by a younger transaction. This is called a late write: the data item has been read by another
transaction since the start time of the transaction that is altering it. The solution is the same as
for the late read problem: the transaction is rolled back and restarted with a new timestamp.
Adhering to the rules of the basic timestamping process allows the transactions to be serialized
and a chronological schedule of transactions can then be created. Timestamping may not be
practical in the case of larger databases with high levels of transactions. A large amount of
storage space would have to be dedicated to storing the timestamps in these cases.
Collision Resolution Strategies
You have five basic strategies that you can apply to resolve collisions:
1. Give up.
2. Display the problem and let the user decide.
It is important to recognize that the granularity of a collision counts. Assume that both of us are
working with a copy of the same Customer entity. If you update a customer’s name and I update
their shopping preferences, then we can still recover from this collision. In effect the collision
occurred at the entity level, we updated the same customer, but not at the attribute level. It is
very common to detect potential collisions at the entity level then get smart about resolving them
at the attribute level.
Kung and Robinson (1981) proposed an alternative technique for achieving concurrency control,
called optimistic concurrency control. This is based on the observation that, in most applications,
the chance of two transactions accessing the same object is low. We will allow transactions to
proceed as if there were no possibility of conflict with other transactions: a transaction does not
have to obtain or check for locks.
This is the working phase. Each transaction has a tentative version (private workspace) of the
objects it updates - copy of the most recently committed version. Write operations record new
values as tentative values. Before a transaction can commit, a validation is performed on all the
data items to see whether the data conflicts with operations of other transactions. This is the
validation phase.
If the validation fails, then the transaction will have to be aborted and restarted later. If the
transaction succeeds, then the changes in the tentative version are made permanent. This is the
update phase. Optimistic control is deadlock free and allows for maximum parallelism (at the
expense of possibly restarting transactions)
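A compact sketch of this validation step is given below, using in-memory sets to stand in for the read and write sets; the class and method names are mine, not Kung and Robinson's notation.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OptimisticValidator {
    // Read/write sets recorded during a transaction's working phase.
    public static class Txn {
        final Set<String> readSet = new HashSet<>();
        final Set<String> writeSet = new HashSet<>();
    }

    // Validation phase: the validating transaction must not have read any item
    // written by a transaction that committed while it was running.
    public static boolean validate(Txn validating, List<Txn> committedWhileRunning) {
        for (Txn other : committedWhileRunning) {
            for (String item : other.writeSet) {
                if (validating.readSet.contains(item)) {
                    return false;               // conflict: abort and restart later
                }
            }
        }
        return true;                            // update phase may make the tentative writes permanent
    }
}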
Timestamp ordering
Another approach to concurrency control was presented by Reed in 1983. This is called
timestamp ordering. Each transaction is assigned a unique timestamp when it begins (can be
from a physical or logical clock). Each object in the system has a read and write timestamp
associated with it (two timestamps per object). The read timestamp is the timestamp of the last
committed transaction that read the object. The write timestamp is the timestamp of the last
committed transaction that modified the object (note - the timestamps are obtained from the
transaction timestamp - the start of that transaction)
If a transaction wants to write an object, it compares its own timestamp with the object’s
read and write timestamps. If the object’s timestamps are older, then the ordering is good.
If a transaction wants to read an object, it compares its own timestamp with the object’s
write timestamp. If the object’s write timestamp is older than the current transaction, then
the ordering is good.
If a transaction attempts to access an object and does not detect proper ordering, the transaction
is aborted and restarted (improper ordering means that a newer transaction came in and modified
data before the older one could access the data or read data that the older one wants to modify).
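A minimal sketch of these two rules is shown below; the DataItem descriptor is an in-memory stand-in for the timestamps that a real system would keep with the stored data items, and a rejected operation is signalled by an exception that the caller would translate into abort-and-restart.

public class TimestampOrdering {
    static class DataItem {
        long readTS = 0;    // timestamp of the last committed reader
        long writeTS = 0;   // timestamp of the last committed writer
        Object value;
    }

    // Read rule: the transaction may read the item only if no younger transaction has written it.
    static Object read(DataItem item, long txnTS) {
        if (txnTS < item.writeTS) {
            throw new IllegalStateException("late read: abort and restart with a new timestamp");
        }
        item.readTS = Math.max(item.readTS, txnTS);
        return item.value;
    }

    // Write rule: the transaction may write the item only if no younger transaction has read or written it.
    static void write(DataItem item, long txnTS, Object newValue) {
        if (txnTS < item.readTS || txnTS < item.writeTS) {
            throw new IllegalStateException("late write: abort and restart with a new timestamp");
        }
        item.value = newValue;
        item.writeTS = txnTS;
    }
}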
If transactions are executed serially, i.e., sequentially with no overlap in time, no transaction
concurrency exists. However, if concurrent transactions with interleaving operations are allowed
in an uncontrolled manner, some unexpected, undesirable result may occur. Here are some
typical examples:
The lost update problem: A second transaction writes a second value of a data-item
(datum) on top of a first value written by a first concurrent transaction, and the first
value is lost to other transactions running concurrently which need, by their
precedence, to read the first value. The transactions that have read the wrong value end
with incorrect results.
The dirty read problem: Transactions read a value written by a transaction that has
been later aborted. This value disappears from the database upon abort, and should not
have been read by any transaction ("dirty read"). The reading transactions end with
incorrect results.
The incorrect summary problem: While one transaction takes a summary over the
values of all the instances of a repeated data-item, a second transaction updates some
instances of that data-item. The resulting summary does not reflect a correct result for
any (usually needed for correctness) precedence order between the two transactions (if
one is executed before the other), but rather some random result, depending on the
timing of the updates, and whether certain update results have been included in the
summary or not.
Most high-performance transactional systems need to run transactions concurrently to meet their
performance requirements. Thus, without concurrency control such systems can neither provide
correct results nor keep their databases consistent.
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER FIVE: OVERVIEW OF QUERY PROCESSING
Learning Objectives:
A query can usually be evaluated by several equivalent relational algebra expressions; an
expression that avoids a large intermediate Cartesian product (for example, by applying
selections before the join) is typically much cheaper to evaluate. For distributed query
processing we additionally assume that the data is (horizontally) fragmented across sites.
Query Optimization
Query optimization is a crucial and difficult part of the overall query processing. Objective of
query optimization is to minimize the following cost function:
I/O cost + CPU cost + communication cost
Two different scenarios are considered:
– Wide area networks - Communication cost dominates
· low bandwidth
· low speed
· high protocol overhead
Most algorithms ignore all other cost components
– Local area networks - Communication cost not that dominant; Total cost function should be
considered
Ordering of the operators of relational algebra is crucial for efficient query processing. Rule of
thumb: apply cheap, size-reducing operators (selection, projection) early and move expensive
operators (join, Cartesian product) to the end of query processing, taking the estimated cost of
each relational algebra operation into account.
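A toy numeric illustration of the weighted cost function follows; all numbers and weights are invented. Under WAN-style weights the communication term dominates the total, which is why WAN optimizers largely ignore the other components.

public class QueryCost {
    // Weighted cost of a candidate execution plan (arbitrary units).
    static double total(double ioCost, double cpuCost, double commCost,
                        double wIO, double wCPU, double wComm) {
        return wIO * ioCost + wCPU * cpuCost + wComm * commCost;
    }

    public static void main(String[] args) {
        // The same plan evaluated in two environments: in the WAN the communication weight dwarfs the rest.
        System.out.println("LAN total cost: " + total(100, 50, 20, 1.0, 1.0, 2.0));
        System.out.println("WAN total cost: " + total(100, 50, 20, 1.0, 1.0, 100.0));
    }
}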
Query Optimization Issues
Several issues have to be considered in query optimization. Types of query optimizers
– wrt the search techniques (exhaustive search, heuristics)
– wrt the time when the query is optimized (static, dynamic)
• Statistics
• Decision sites
• Network topology
• Use of semijoins
Types of Query Optimizers wrt Search Techniques
– Exhaustive search: Cost-based; Optimal; Combinatorial complexity in the number of relations
– Heuristics: Not optimal; Regroups common sub-expressions; Performs selection, projection
first; Replaces a join by a series of semijoins; Reorders operations to reduce intermediate relation
size; Optimizes individual operations.
Types of Query Optimizers wrt Optimization Timing
– Static: Query is optimized prior to the execution; As a consequence it is difficult to estimate
the size of the intermediate results; Typically amortizes over many executions
– Dynamic: Optimization is done at run time; Provides exact information on the intermediate
relation sizes; Have to re-optimize for multiple executions
– Hybrid - First, the query is compiled using a static algorithm; Then, if the error in estimate
sizes greater than threshold, the query is re-optimized at run time
Statistics
– Relation/fragments: Cardinality; Size of a tuple; Fraction of tuples participating in a join with
another relation/fragment
– Attribute: Cardinality of domain; Actual number of distinct values; Distribution of attribute
values (e.g., histograms)
– Common assumptions: Independence between different attribute values; Uniform distribution
of attribute values within their domain
Decision sites
– Centralized: Single site determines the ”best” schedule; Simple; Knowledge about the entire
distributed database is needed
– Distributed: Cooperation among sites to determine the schedule; Only local information is
needed; Cooperation comes with an overhead cost
– Hybrid: One site determines the global schedule; Each site optimizes the local sub-queries
Network topology
– Wide area networks (WAN) point-to-point
Characteristics: Low bandwidth; Low speed; High protocol overhead
Communication cost dominates; all other cost factors are ignored
Global schedule to minimize communication cost
Local schedules according to centralized query optimization
– Local area networks (LAN)
Communication cost not that dominant
Total cost function should be considered
Broadcasting can be exploited (joins)
Special algorithms exist for star networks
Use of Semijoins
Reduce the size of the join operands by first computing semijoins
Particularly relevant when the main cost is the communication cost
Improves the processing of distributed join operations by reducing the size of data
exchange between sites
However, the number of messages as well as local processing time is increased
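The sketch below illustrates the semijoin idea with in-memory collections standing in for two sites (relation and attribute names are invented): only the join-attribute values of one relation travel to the other site, and only the matching tuples travel back, instead of shipping a whole relation.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SemijoinSketch {
    record Emp(int eno, String name) {}          // relation R at site 1
    record Asg(int eno, String project) {}       // relation S at site 2

    public static void main(String[] args) {
        List<Emp> empAtSite1 = List.of(new Emp(1, "Smith"), new Emp(2, "Jones"), new Emp(3, "Diaz"));
        List<Asg> asgAtSite2 = List.of(new Asg(1, "CAD"), new Asg(3, "Payroll"));

        // Step 1: site 2 ships only the join-attribute values (a small message).
        Set<Integer> joinKeys = new HashSet<>();
        for (Asg a : asgAtSite2) joinKeys.add(a.eno());

        // Step 2: site 1 computes the semijoin R ⋉ S, i.e. keeps only the matching tuples.
        List<Emp> reduced = new ArrayList<>();
        for (Emp e : empAtSite1) if (joinKeys.contains(e.eno())) reduced.add(e);

        // Step 3: only the reduced relation is shipped to site 2, where the full join is computed.
        System.out.println(reduced);
    }
}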
Distributed Query Processing Steps
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER SIX: QUERY DECOMPOSITION AND DATA
LOCALIZATION
Learning Objectives:
Query Decomposition
Query decomposition: Mapping of calculus query (SQL) to algebra operations (select, project,
join, rename). Both input and output queries refer to global relations, without knowledge of the
distribution of data. The output query is semantically correct and good in the sense that
redundant work is avoided. Query decomposition consists of 4 steps:
1. Normalization: Transform query to a normalized form
2. Analysis: Detect and reject ”incorrect” queries; possible only for a subset of relational
calculus
3. Elimination of redundancy: Eliminate redundant predicates
4. Rewriting: Transform query to RA and optimize query
Normalization: Transform the query to a normalized form to facilitate further processing.
Consists mainly of two steps.
1. Lexical and syntactic analysis
– Check validity (similar to compilers)
– Check for attributes and relations
– Type checking on the qualification
2. Put into normal form
– With SQL, the query qualification (WHERE clause) is the most difficult part, as it might be an
arbitrary combination of simple predicates; it is therefore transformed into conjunctive or
disjunctive normal form
– In the disjunctive normal form, the query can be processed as independent conjunctive
subqueries linked by unions (corresponding to the disjunction)
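For illustration, consider the invented qualification (TITLE = "Programmer" OR TITLE = "Elect. Eng.") AND DUR > 12. It is already in conjunctive normal form; rewriting it into disjunctive normal form gives (TITLE = "Programmer" AND DUR > 12) OR (TITLE = "Elect. Eng." AND DUR > 12), so the query can be processed as two independent conjunctive subqueries whose results are combined by a union.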
In general, the generic query obtained by simply substituting the relations by their fragments is
inefficient, since important restructurings and simplifications can still be done.
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER SEVEN: DISTRIBUTED DATABASES
Learning Objectives:
A distributed database system is the union of what appear to be two diametrically opposed
approaches to data processing: database systems and computer networks. Computer networks
promote a mode of work that goes against centralization, so the key to understanding this
combination is that the most important objective of database technology is integration, not
centralization. Integration is possible without centralization, i.e., the integration of databases and
networking does not imply centralization (in fact, quite the opposite). The goal of distributed
database systems is to achieve data integration and data distribution transparency.
A distributed computing system is a collection of autonomous processing elements that are
interconnected by a computer network. The elements cooperate in order to perform the assigned
task. The term “distributed” is very broadly used. The exact meaning of the word depends on the
context.
What can be distributed?
Processing logic
Functions
Data
Control
Classification of distributed systems with respect to various criteria
Degree of coupling, i.e., how closely the processing elements are connected e.g.,
measured as ratio of amount of data exchanged to amount of local processing;
weak coupling, strong coupling
Interconnection structure: point-to-point connection between processing elements;
common interconnection channel
Synchronization: synchronous; asynchronous
Parallel database systems (which exploit tightly coupled multiprocessor architectures) are
related to, but quite different from, distributed database systems.
Applications of Distributed Databases
• Manufacturing, especially multi-plant manufacturing
• Military command and control
• Airlines
• Hotel chains
• Any organization which has a decentralized organization structure
Promises of DDBSs
Distributed Database Systems deliver the following advantages:
• Higher reliability
• Improved performance
• Easier system expansion
• Transparency of distributed and replicated data
Higher reliability
• Replication of components
• No single points of failure
• e.g., a broken communication link or processing element does not bring down the entire system
• Distributed transaction processing guarantees the consistency of the database and concurrency
Improved performance
• Proximity of data to its points of use
– Reduces remote access delays
– Requires some support for fragmentation and replication
• Parallelism in execution
– Inter-query parallelism
– Intra-query parallelism
• Update and read-only queries influence the design of DDBSs substantially
– If mostly read-only access is required, as much as possible of the data should be
replicated
– Writing becomes more complicated with replicated data
Easier system expansion
• Issue is database scaling
• Emergence of microprocessor and workstation technologies
– Network of workstations much cheaper than a single mainframe computer
• Data communication cost versus telecommunication cost
• Increasing database size
Transparency
• Refers to the separation of the higher-level semantics of the system from the lower-level
implementation issues
• A transparent system “hides” the implementation details from the users.
• A fully transparent DBMS provides high-level support for the development of complex
applications.
Various forms of transparency can be distinguished for DDBMSs:
• Network transparency (also called distribution transparency)
– Location transparency
– Naming transparency
• Replication transparency
• Fragmentation transparency
• Transaction transparency
– Concurrency transparency
– Failure transparency
• Performance transparency
Network/Distribution transparency allows a user to perceive a DDBS as a single, logical
entity. The user is protected from the operational details of the network (or even does not know
about the existence of the network). The user does not need to know the location of data items
and a command used to perform a task is independent of the location of the data and of the site
at which the task is performed (location transparency). A unique name is provided for each object in the
database (naming transparency); In absence of this, users are required to embed the location
name as part of an identifier
Different ways to ensure naming transparency:
• Solution 1: Create a central name server; however, this results in
– loss of some local autonomy
– central site may become a bottleneck
– low availability (if the central site fails remaining sites cannot create new objects)
• Solution 2: Prefix object with identifier of site that created it
– e.g., branch created at site S1 might be named S1.BRANCH
– Also need to identify each fragment and its copies
– e.g., copy 2 of fragment 3 of Branch created at site S1 might be referred to as
S1.BRANCH.F3.C2
• An approach that resolves these problems uses aliases for each database object
– Thus, S1.BRANCH.F3.C2 might be known as local branch by user at site S1
– DDBMS has task of mapping an alias to appropriate database object
Replication transparency ensures that the user is not involved in the management of copies of
some data. The user should not even be aware of the existence of replicas, but rather should work
as if there exists a single copy of the data. Replication of data is needed for various reasons e.g.,
increased efficiency for read-only data access
Fragmentation transparency ensures that the user is not aware of and is not involved in the
fragmentation of the data. The user is not involved in finding query processing strategies over
fragments or formulating queries over fragments. The evaluation of a query that is specified over
an entire relation but now has to be performed on top of the fragments requires an appropriate
query evaluation strategy. Fragmentation is commonly done for reasons of performance,
availability, and reliability. There are two fragmentation alternatives: horizontal fragmentation divides a
relation into subsets of tuples; vertical fragmentation divides a relation by columns
Transaction transparency ensures that all distributed transactions maintain integrity and
consistency of the DDB and support concurrency. Each distributed transaction is divided into a
number of sub-transactions (a sub-transaction for each site that has relevant data) that
concurrently access data at different locations. DDBMS must ensure the indivisibility of both the
global transaction and each of the sub-transactions. Can be further divided into: -
– Concurrency transparency
– Failure transparency
Concurrency transparency guarantees that transactions must execute independently and are
logically consistent, i.e., executing a set of transactions in parallel gives the same result as if the
transactions were executed in some arbitrary serial order. Same fundamental principles as for
centralized DBMS, but more complicated to realize:
– DDBMS must ensure that global and local transactions do not interfere with each other
– DDBMS must ensure consistency of all sub-transactions of global transaction
Replication makes concurrency even more complicated. If a copy of a replicated data item is
updated, update must be propagated to all copies
– Option 1: Propagate changes as part of original transaction, making it an atomic
operation; however, if one site holding a copy is not reachable, then the transaction is
delayed until the site is reachable.
– Option 2: Limit update propagation to only those sites currently available; remaining
sites are updated when they become available again.
– Option 3: Allow updates to copies to happen asynchronously, sometime after the
original update; delay in regaining consistency may range from a few seconds to several
hours
Failure transparency: DDBMS must ensure atomicity and durability of the global transaction,
i.e., the sub-transactions of the global transaction either all commit or all abort. Thus, DDBMS
must synchronize global transaction to ensure that all sub-transactions have completed
successfully before recording a final COMMIT for the global transaction. The solution should be
robust in presence of site and network failures
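The standard mechanism for this is the two-phase commit (2PC) protocol listed among the critical concepts in Chapter Three. The coordinator sketch below is deliberately simplified (the Participant interface is invented, and timeouts and logging are omitted): the global transaction commits only if every site votes yes in the first phase.

import java.util.List;

public class TwoPhaseCommitCoordinator {
    // Each participating site must be able to vote on and then finish its sub-transaction.
    public interface Participant {
        boolean prepare();   // phase 1: "can you commit?" (vote)
        void commit();       // phase 2a: everyone voted yes
        void abort();        // phase 2b: at least one site voted no (or failed)
    }

    public static boolean run(List<Participant> participants) {
        // Phase 1: collect votes from every site.
        for (Participant p : participants) {
            if (!p.prepare()) {
                for (Participant q : participants) q.abort();   // global abort
                return false;
            }
        }
        // Phase 2: all sites voted yes, so the global decision is COMMIT.
        for (Participant p : participants) p.commit();
        return true;
    }
}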
Performance transparency: DDBMS must perform as if it were a centralized DBMS. DDBMS
should not suffer any performance degradation due to the distributed architecture. The DDBMS
should determine the most cost-effective strategy to execute a request.
Distributed Query Processor (DQP) maps data request into an ordered sequence of operations on
local databases. DQP must consider fragmentation, replication, and allocation schemas. DQP has
to decide:
– which fragment to access
– which copy of a fragment to use
– which location to use
DQP produces execution strategy optimized with respect to some cost function. Typically, costs
associated with a distributed request include: I/O cost, CPU cost, and communication cost
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER EIGHT: DISTRIBUTED DATABASE DESIGN
Learning Objectives:
Design Problem
Design problem of distributed systems: Making decisions about the placement of data and
programs across the sites of a computer network as well as possibly designing the network
itself. In DDBMS, the distribution of applications involves
– Distribution of the DDBMS software
– Distribution of applications that run on the database
Distribution of applications will not be considered in the following; instead, the distribution of
data is studied.
Framework of Distribution
Dimension for the analysis of distributed systems:
– Level of sharing: no sharing, data sharing, data and program sharing
– Behavior of access patterns: static, dynamic
– Level of knowledge on access pattern behavior: no information, partial information, complete
information
Design Strategies
• Top-down approach – Designing systems from scratch. Homogeneous systems
• Bottom-up approach – The databases already exist at a number of sites. The databases should
be connected to solve common tasks
Top-down design strategy
Distribution design is the central part of the design in DDBMSs (the other tasks are similar to
traditional databases). Objective: Design the local conceptual schemas (LCSs) by distributing the entities (relations) over
the sites. Two main aspects have to be designed carefully:
∗ Fragmentation - Relation may be divided into a number of sub-relations, which are
distributed
∗ Allocation and replication- Each fragment is stored at site with ”optimal” distribution. Copy
of fragment may be maintained at several sites
Distribution design issues:
– Why fragment at all?
– How to fragment?
– How much to fragment?
– How to test correctness?
– How to allocate?
Bottom-up design strategy
When databases already exist at a number of sites, the design problem is to integrate the existing
(possibly heterogeneous) local schemas into a global schema that supports the common tasks.
Fragmentation
What is a reasonable unit of distribution? Relation or fragment of relation?
• Relations as unit of distribution:
If the relation is not replicated, we get a high volume of remote data accesses. If the relation is
replicated, we get unnecessary replications, which cause problems in executing updates and
waste disk space. This might be an acceptable solution if queries need all the data in the relation
and the data stays only at the sites that use it.
• Fragments of relations as unit of distribution:
Application views are usually subsets of relations. Thus, locality of accesses of applications is
defined on subsets of relations. Permits a number of transactions to execute concurrently, since
they will access different portions of a relation. Parallel execution of a single query (intra-query
concurrency). However, semantic data control (especially integrity enforcement) is more
difficult. Fragments of relations are (usually) the appropriate unit of distribution. Fragmentation
aims to improve:
– Reliability
– Performance
– Balanced storage capacity and costs
– Communication costs
– Security
The following information is used to decide fragmentation:
– Quantitative information: frequency of queries, site, where query is run, selectivity of the
queries, etc.
– Qualitative information: types of access of data, read/write, etc.
Types of Fragmentation
– Horizontal: partitions a relation along its tuples
Computing horizontal fragmentation (idea): Compute the frequency of the individual queries of
the site q1, . . . , qQ. Rewrite the queries of the site in disjunctive normal form (a disjunction
of conjunctions); the conjunctions are called minterms. Compute the selectivity of the
minterms. Find the minimal and complete set of minterms (predicates)
∗ The set of predicates is complete if and only if any two tuples in the same fragment are
referenced with the same probability by any application
∗ The set of predicates is minimal if and only if there is at least one query that accesses the
fragment
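For illustration, consider an invented employee relation EMP(ENO, ENAME, TITLE, SAL). Using the minterm predicates SAL <= 30000 and SAL > 30000, EMP is horizontally fragmented into EMP1, holding the tuples with SAL <= 30000, and EMP2, holding the tuples with SAL > 30000; every tuple belongs to exactly one fragment, and the original relation is recovered as the union of EMP1 and EMP2.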
Vertical Fragmentation
Objective of vertical fragmentation is to partition a relation into a set of smaller relations so that
many of the applications will run on only one fragment. Vertical fragmentation of a relation R
produces fragments R1,R2, . . . , each of which contains a subset of R’s attributes. Vertical
fragmentation is defined using the projection operation of the relational algebra:
Example:
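As an illustration, the same EMP(ENO, ENAME, TITLE, SAL) relation used in the horizontal example above can be vertically fragmented into EMP1, containing the attributes ENO and ENAME, and EMP2, containing ENO, TITLE and SAL; the key ENO is repeated in both fragments so that EMP can be reconstructed by joining EMP1 and EMP2 on ENO.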
Fragment Allocation
The optimality of a fragment allocation depends on the following:
Minimal cost - Communication + storage + processing (read and update). Cost in terms of time
(usually)
Performance - Response time and/or throughput.
Constraints - Per site constraints (storage and processing)
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER NINE: DDBMS ARCHITECTURE
Learning Objectives:
Standardization
The standardization efforts in databases developed reference models of DBMS.
Reference Model: A conceptual framework whose purpose is to divide standardization work into
manageable pieces and to show at a general level how these pieces are related to each other. A
reference model can be thought of as an idealized architectural model of the system.
Commercial systems might deviate from the reference model, but it remains useful for the
standardization process. A reference model can be described according to 3 different approaches:
component-based
function-based
data-based
Components-based
Components of the system are defined together with the interrelationships between the
components. Good for design and implementation of the system. It might be difficult to
determine the functionality of the system from its components
Function-based
Classes of users are identified together with the functionality that the system will provide for
each class. Typically a hierarchical system with clearly defined interfaces between different
layers. The objectives of the system are clearly identified. Not clear how to achieve the
objectives. Example: ISO/OSI architecture of computer networks
Data-based
Identify the different types of the data and specify the functional units that will realize and/or use
data according to these views. Gives central importance to data (which is also the central
resource of any DBMS). Claimed to be the preferable choice for standardization of DBMS. The
full architecture of the system is not clear without the description of functional modules.
Example: ANSI/SPARC architecture of DBMS
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER TEN: SEMANTIC DATA CONTROL
Learning Objectives:
View Management
Views enable full logical data independence. Views are virtual relations that are defined as the
result of a query on base relations. Views are typically not materialized. Can be considered a
dynamic window that reflects all relevant updates to the database. Views are very useful for
ensuring data security in a simple way. By selecting a subset of the database, views hide some
data. Users cannot see the hidden data
View Management in Centralized Databases
A view is a relation that is derived from a base relation via a query. It can involve selection,
projection, aggregate functions, etc. Example: The view of system analysts derived from
relation EMP
CREATE VIEW SYSAN(ENO,ENAME) AS
SELECT ENO,ENAME
FROM EMP
WHERE TITLE="Syst. Anal."
Queries expressed on views are translated into queries expressed on base relations. Example:
“Find the names of all the system analysts with their project number and
responsibility?” – Involves the view SYSAN and the relation ASG(ENO,PNO,RESP,DUR)
SELECT ENAME, PNO, RESP
FROM SYSAN, ASG
WHERE SYSAN.ENO = ASG.ENO
is translated into
SELECT ENAME,PNO,RESP
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND TITLE = "Syst. Anal."
Automatic query modification is required, i.e., ANDing query qualification with view
qualification
All views can be queried as base relations, but not all views can be updated as such. Updates
through views can be handled automatically only if they can be propagated correctly to the base
relations. We classify views as updatable or non-updatable
• Updatable view: The updates to the view can be propagated to the base relations without
ambiguity.
CREATE VIEW SYSAN(ENO,ENAME) AS
SELECT ENO,ENAME
FROM EMP
WHERE TITLE="Syst. Anal."
– e.g, insertion of tuple (201,Smith) can be mapped into the insertion of a new employee (201,
Smith, “Syst. Anal.”)
– If attributes other than TITLE were hidden by the view, they would be assigned the value null
• Non-updatable view: The updates to the view cannot be propagated to the base relations
without ambiguity.
CREATE VIEW EG(ENAME,RESP) AS
SELECT ENAME,RESP
FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO
– e.g, deletion of (Smith, ”Syst. Anal.”) is ambiguous, i.e., since deletion of “Smith” in EMP and
deletion of “Syst. Anal.” in ASG are both meaningful, but the system cannot decide. Current
systems are very restrictive about supporting updates through views: views can be updated only
if they are derived from a single relation by selection and projection. However, it is theoretically
possible to automatically support updates of a larger class of views, e.g., joins
View Management in Distributed Databases
Definition of views in DDBMS is similar as in centralized DBMS. However, a view in a
DDBMS may be derived from fragmented relations stored at different sites. Views are
conceptually the same as the base relations, therefore we store them in the (possibly) distributed
directory/catalogue. Thus, views might be centralized at one site, partially replicated, fully
replicated. Queries on views are translated into queries on base relations, yielding distributed
queries due to possible fragmentation of data. Views derived from distributed relations may be
costly to evaluate. Optimizations are important, e.g., snapshots. A snapshot is a static view:
∗ does not reflect the updates to the base relations
∗ managed as temporary relations: the only access path is sequential scan
∗ typically used when selectivity is small (no indices can be used efficiently)
∗ is subject to periodic recalculation
Data Security
Data security protects data against unauthorized access and has two aspects:
– Data protection
– Authorization control
Data Protection
Data protection prevents unauthorized users from understanding the physical content of data.
Well established standards exist
– Data encryption standard
– Public-key encryption schemes
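As a small illustration of data protection, the sketch below encrypts a field value with Java's standard crypto API; AES is used purely as an example of a symmetric cipher, and key management as well as the choice of a proper cipher mode are omitted for brevity.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;

public class FieldEncryption {
    public static void main(String[] args) throws Exception {
        // Generate a symmetric key (in practice the key would come from a protected key store).
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        // Encrypt: anyone reading the raw bytes without the key sees only ciphertext.
        // (The provider's default mode is used here only for brevity.)
        Cipher enc = Cipher.getInstance("AES");
        enc.init(Cipher.ENCRYPT_MODE, key);
        byte[] ciphertext = enc.doFinal("Smith, Syst. Anal.".getBytes(StandardCharsets.UTF_8));

        // Decrypt with the same key.
        Cipher dec = Cipher.getInstance("AES");
        dec.init(Cipher.DECRYPT_MODE, key);
        System.out.println(new String(dec.doFinal(ciphertext), StandardCharsets.UTF_8));
    }
}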
Authorization Control
Authorization control must guarantee that only authorized users perform operations they are
allowed to perform on the database. Three actors are involved in authorization
– users, who trigger the execution of application programs
– operations, which are embedded in application programs
– database objects, on which the operations are performed
Authorization control can be viewed as a triple (user, operation type, object) which specifies that
the user has the right to perform an operation of operation type on an
object. Authentication of (groups of) users is typically done by username and password.
Authorization control in a (D)DBMS is more complicated than in operating systems:
– In a file system: data objects are files
– In a DBMS: Data objects are views, (fragments of) relations, tuples, attributes
GRANT and REVOKE statements are used to authorize triplets (user, operation, data object)
– GRANT <operations> ON <object> TO <users>
– REVOKE <operations> ON <object> FROM <users>
Typically, the creator of objects gets all permissions
– Might even have the permission to GRANT permissions
– This requires a recursive revoke process
Privileges are stored in the directory/catalogue, conceptually as a matrix
Different materializations of the matrix are possible (by row, by columns, by element), allowing
for different optimizations e.g., by row makes the enforcement of authorization efficient, since
all rights of a user are in a single tuple
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant B. Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER ELEVEN: DISTRIBUTED DBMS RELIABILITY
Learning Objectives:
Reliability
A reliable DDBMS is one that can continue to process user requests even when the underlying
system is unreliable, i.e., failures occur
Failures:
– Transaction failures
– System (site) failures, e.g., system crash, power supply failure
– Media failures, e.g., hard disk failures
– Communication failures, e.g., lost/undeliverable messages
Reliability is closely related to the problem of how to maintain the atomicity and durability
properties of transactions
Recovery system: Ensures atomicity and durability of transactions in the presence of failures
(and concurrent transactions). Recovery algorithms have two parts
1. Actions taken during normal transaction processing to ensure enough information
exists to recover from failures
2. Actions taken after a failure to recover the DB contents to a state that ensures
atomicity, consistency and durability
During normal processing, the DBMS records all changes made by transactions in a log file. With
the information in the log file, the recovery manager can restore the consistency of the DB in case
of a failure.
Assume that a system crash occurs after transaction T1 has committed but while transaction T2 is
still active. Upon recovery:
– All effects of transaction T1 should be reflected in the database (REDO)
– None of the effects of transaction T2 should be reflected in the database (UNDO)
REDO Protocol – REDO’ing an action means performing it again. The REDO operation uses the
log information and performs the action again, whether or not its effect reached the stable database
before the failure; after REDO the object holds the new (after) image.
UNDO Protocol – UNDO’ing an action means restoring the object to the image it had before the
transaction started. The UNDO operation uses the log information and restores the old (before)
value of the object.
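A minimal Python sketch of how a recovery manager could apply REDO and UNDO from the log;
the log record layout, the example values and the committed set are assumptions made for
illustration.

# Example log, oldest record first; each record stores both images of the object.
log = [
    {"tid": "T1", "obj": "P1", "before_image": 10, "after_image": 20},
    {"tid": "T2", "obj": "P2", "before_image": 5,  "after_image": 7},
]
committed = {"T1"}                 # T1 committed before the crash, T2 did not
stable_db = {"P1": 10, "P2": 7}    # state found on stable storage after the crash

def recover(log, committed, db):
    # REDO: scan forward and re-install the after images of committed transactions.
    for rec in log:
        if rec["tid"] in committed:
            db[rec["obj"]] = rec["after_image"]
    # UNDO: scan backward and restore the before images of uncommitted transactions.
    for rec in reversed(log):
        if rec["tid"] not in committed:
            db[rec["obj"]] = rec["before_image"]
    return db

# recover(log, committed, stable_db) yields {"P1": 20, "P2": 5}:
# T1's update is redone, T2's update is undone.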
Logging Interface
Log pages/buffers can be written to stable storage in two ways:
Synchronously - The addition of each log record requires that the log is written to stable storage.
When the log is written synchronously, the execution of the transaction is suspended until the
write is complete, which delays the response time.
Asynchronously - The log is moved to stable storage either at periodic intervals or when the buffer
fills up.
When to write log records into stable storage?
Assume a transaction T updates a page P
• Fortunate case
– System writes P in stable database
– System updates stable log for this update
– SYSTEM FAILURE OCCURS!... (before T commits)
– We can recover (undo) by restoring P to its old state by using the log
• Unfortunate case
– System writes P in stable database
– SYSTEM FAILURE OCCURS!... (before stable log is updated)
– We cannot recover from this failure because there is no log record to restore the old
value
If a system crashes before a transaction is committed, then all the operations must be undone. We
need only the before images (undo portion of the log). Once a transaction is committed, some of
its actions might have to be redone. We need the after images (redo portion of the log)
Write-Ahead-Log (WAL) Protocol – Before a stable database is updated, the undo portion of
the log should be written to the stable log. When a transaction commits, the redo portion of the
log must be written to stable log prior to the updating of the stable database
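A rough Python sketch of how a buffer manager could enforce the WAL rule; the class and method
names are illustrative assumptions, not a real DBMS interface.

class WALBufferManager:
    # Illustrative enforcement of the Write-Ahead-Log protocol.

    def __init__(self):
        self.log_buffer = []   # log records still in volatile memory
        self.stable_log = []   # log records already forced to stable storage

    def log_update(self, tid, page, before, after):
        # Undo/redo information is first appended to the volatile log buffer.
        self.log_buffer.append({"tid": tid, "page": page,
                                "before": before, "after": after})

    def flush_log(self):
        # Force the log buffer to stable storage (a synchronous log write).
        self.stable_log.extend(self.log_buffer)
        self.log_buffer.clear()

    def write_page(self, stable_db, page, value):
        # Rule 1: before the stable database is updated, the undo portion of the
        # log for this update must already be in the stable log.
        self.flush_log()
        stable_db[page] = value

    def commit(self, tid):
        # Rule 2: at commit, the redo portion (and the commit record) must reach
        # the stable log before the commit is acknowledged.
        self.log_buffer.append({"tid": tid, "commit": True})
        self.flush_log()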
Two out-of-place update strategies are shadowing and differential files.
Shadowing – When an update occurs, don’t change the old page, but create a shadow page with
the new values and write it into the stable database. Update the access paths so that subsequent
accesses are to the new shadow page. The old page is retained for recovery
Differential files – For each DB file F, maintain a read-only part FR and a differential file
consisting of an insertions part (DF+) and a deletions part (DF−). Thus, F = (FR ∪ DF+) − DF−.
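A small Python sketch of the differential-file organization F = (FR ∪ DF+) − DF−; the record
values are invented, and the variable names simply mirror the formula.

# Read-only part and differential parts of one database file F, each modelled
# as a set of records.
FR       = {("E1", "Smith"), ("E2", "Jones")}   # read-only base part
DF_plus  = {("E3", "Brown")}                    # insertions since FR was built
DF_minus = {("E2", "Jones")}                    # deletions since FR was built

def current_file():
    # Current contents of F = (FR ∪ DF+) − DF−
    return (FR | DF_plus) - DF_minus

def insert(record):
    DF_plus.add(record)
    DF_minus.discard(record)   # handles re-insertion of a previously deleted record

def delete(record):
    if record in DF_plus:
        DF_plus.discard(record)
    else:
        DF_minus.add(record)   # FR itself is never modified

# current_file() yields {("E1", "Smith"), ("E3", "Brown")}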
Distributed Reliability Protocols
As with local reliability protocols, the distributed versions aim to maintain the atomicity and
durability of distributed transactions. Most problematic issues in a distributed transaction are
commit, termination, and recovery
Commit protocols - How to execute a commit command for distributed transactions, and how to
ensure atomicity (and durability)?
Termination protocols - If a failure occurs at a site, how can the other operational sites deal
with it? Non-blocking: the occurrence of failures should not force the operational sites to wait
until the failure is repaired in order to terminate the transaction.
Recovery protocols - When a failure occurs, how do the sites where the failure occurred deal
with it? Independent: a failed site should be able to determine the outcome of a transaction
without having to obtain remote information.
Commit Protocols
The primary requirement of commit protocols is that they maintain the atomicity of distributed
transactions (atomic commitment), i.e., even though the execution of the distributed transaction
involves multiple sites, some of which might fail while executing, the effects of the transaction
on the distributed DB are all-or-nothing. In the following we distinguish two roles.
– Coordinator: The process at the site where the transaction originates and which controls the
execution
– Participant: The process at the other sites that participate in executing the transaction
Centralized Two Phase Commit Protocol (2PC)
Very simple protocol that ensures the atomic commitment of distributed transactions.
Phase 1: The coordinator gets the participants ready to write the results into the database
Phase 2: Everybody writes the results into the database
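A simplified Python sketch of centralized 2PC; message exchange is simulated with direct method
calls, and names such as Participant.prepare are assumptions made for illustration, not a standard
API.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INITIAL"

    def prepare(self):
        # Phase 1: get ready to write the results (force undo/redo records to the log).
        self.state = "READY" if self.can_commit else "ABORT"
        return "VOTE-COMMIT" if self.can_commit else "VOTE-ABORT"

    def decide(self, decision):
        # Phase 2: write (or undo) the results and acknowledge the decision.
        self.state = decision
        return "ACK"

def two_phase_commit(participants):
    # Phase 1: the coordinator sends "prepare" and collects the votes.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(v == "VOTE-COMMIT" for v in votes) else "ABORT"
    # Phase 2: the coordinator sends the global decision and waits for the acks.
    acks = [p.decide(decision) for p in participants]
    assert all(a == "ACK" for a in acks)
    return decision

# two_phase_commit([Participant("S1"), Participant("S2")])        yields "COMMIT"
# two_phase_commit([Participant("S1"), Participant("S2", False)]) yields "ABORT"

The global decision is COMMIT only if every participant votes to commit; a single VOTE-ABORT
forces a global ABORT, which is exactly the all-or-nothing requirement of atomic commitment.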
The actions to be taken after a recovery from a failure are specified in the recovery protocol.
Coordinator site failure: Upon recovery, the coordinator takes the following actions:
Failure in INITIAL - Start the commit process upon recovery (since the coordinator had not yet
sent anything to the participants)
Failure in WAIT - Restart the commit process upon recovery (by sending “prepare” again to the
participants)
Failure in ABORT or COMMIT - Nothing special needs to be done if all the acknowledgements have
been received from the participants; otherwise the termination protocol is invoked (the
acknowledgements are requested again)
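The recovery actions above can be summarized as a mapping from the coordinator state found in its
stable log to the action taken on restart; the following Python sketch is illustrative only, with
the state names taken from the text.

def coordinator_restart_action(logged_state, all_acks_received=False):
    # Action taken by a recovering coordinator based on its last logged state.
    if logged_state == "INITIAL":
        return "start the commit process (no message had been sent yet)"
    if logged_state == "WAIT":
        return "restart the commit process (resend 'prepare' to the participants)"
    if logged_state in ("COMMIT", "ABORT"):
        if all_acks_received:
            return "nothing special to do"
        return "invoke the termination protocol (re-request the missing acks)"
    raise ValueError("unknown coordinator state: " + logged_state)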
References
1. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database Systems:
The Complete Book. Prentice Hall, 2nd edition. ISBN: 978-0-13-187325-4
2. Ramez Elmasri and Shamkant B. Navathe. 2010. Fundamentals of Database Systems, 6th
edition, Addison Wesley. ISBN-13: 978-0136086208
CHAPTER TWELVE: SAMPLE PAPERS
Instructions
Answer question ONE and any other TWO questions. Time: 2 Hours
QUESTION 1
End of Exam