DBMS Notes
COURSE OUTLINE
Contact hours: 42
Pre-requisites: Database Systems BIT 2102
Purpose:
Expected Learning Outcomes of the Course:
By the end of the course, students should be able to:-
i. Evaluate various file organization methods, distribution strategies, concurrency
mechanisms, query optimization techniques, and database recovery and security
procedures.
ii. Critically examine and reflect on the theoretical aspect of the process of normalization
and its practical implementation.
iii. Evaluate and use various modern database management systems tools in building a
small but realistic database system.
iv. Describe advances, trends and emerging standards in database technology; understand the
comparative features and crucial aspects of data and knowledge management technology and
the impact of the Internet on database technology.
Course Content:
Course Assessment
Examination - 70%
Continuous Assessment Test (CATS) - 20%
Assignments - 10%
Total - 100%
Week One
Database
The data in the database is expected to be both integrated and shared, particularly on
multi-user systems.
Integration - The database may be thought of as a unification of several otherwise
distinct files, with any redundancy among these files eliminated
Shared - individual pieces of data in the database may be shared among several
different users
Hardware
These are the secondary storage devices on which the database physically resides, together
with the associated I/O devices, device controllers, etc.
Procedures
These are the instructions and rules that govern the design and use of the database, e.g.
starting and stopping the database, making backups, and handling failures.
Database Architecture
External Level – concerned with the way users perceive the database
Conceptual Level – concerned with abstract representation of the database in its entirety
Internal Level – concerned with the way data is actually stored
Data Independence
1. Logical Data Independence – users and user programs are independent of the logical
structure of the database
2. Physical Data Independence – the separation of structural information about the data
from the programs that manipulate and use the data i.e. the immunity of application
programs to changes in the storage structure and access strategy
Database System Development Lifecycle:
– System definition
– Requirement collection and analysis
– Database design
– DBMS selection
– Application design
– Prototyping
– Implementation
– Data conversion and loading
– Testing
– Operational maintenance
Database Design
Conceptual database design – is the process of constructing a model of the information used in
an organization, independent of all physical considerations.
Step 1 Build local conceptual data model for each user view
– Identify entity types
– Identify relationship types
– Identify and associate attributes with entity or relationship
– Determine attribute domains
– Determine candidate and primary key attributes
– Specialize/generalize entity types (optional step)
– Draw Entity-Relationship diagram
– Review local conceptual data model with user
Logical Database Design for the Relational Model: the process of constructing a model of the
information used in an organization based on a specific data model, but independent of a
particular DBMS and other physical considerations
Step 2 Build and validate local data model for each user view
– Map local conceptual data model to local logical data model
– Derive relations from local logical data model
– Validate model using normalization
– Validate model against user transactions
– Draw Entity-Relationship diagram
– Define integrity constraints
– Review local logical data model with user
Step 3 Build and validate global logical data model
– Merge local logical data models into global model
Data model
A data model is a collection of tools for describing
– Data
– Data relationships
– Data semantics
– Data constraints
Data models include the following
§ Relational model
§ Entity-Relationship data model (mainly for database design)
§ Object-based data models (Object-oriented and Object-relational)
§ Semi structured data model (XML)
§ Other older models:
Network model
Hierarchical model
Relational databases
A Relational Database is: -
– A database that represents data as a collection of tables in which all data
relationships are represented by common values in related tables
– Database technology involving tables (relations) representing entities and
primary/foreign keys representing relationships
xi. Excessive Program Maintenance – can take up to 80% of the information systems budget
SOLUTION:
The solution to these problems is the DATABASE Approach. The approach has the following
characteristics
i. Emphasizes the integration and sharing of data throughout the organization
ii. Central repository of shared data
iii. Data is managed by a controlling agent
iv. Stored in a standardized, convenient form
Databases can be very large. Databases touch all aspects of our lives
Workgroup databases
– Collaborative project within small team of people
– Database server and client are LAN connected
– Data security, data integrity and concurrent updating issues
Departmental/divisional databases
– Functional unit within an organization
– Most common type of database
Enterprise database
– Support organization-wide operation and decision making
Types of databases:
Databases can be classified by:
1. Location
Centralized:
– Supports data located at a single site
Distributed:
– Supports data distributed across several sites
2. Use:
Transactional (or production):
– Supports a company’s day-to-day operations
Data warehouse:
– Stores data used to generate information required to make tactical or
strategic decisions
– Often used to store historical data
– Structure is quite different
The DDL compiler generates a set of table templates stored in a data dictionary
The Data dictionary contains metadata (i.e., data about data). This consists of
o Database schema
o Integrity constraints
– Primary key (ID uniquely identifies instructors)
– Referential integrity (references constraint in SQL) e.g. dept_name value
in any instructor tuple must appear in department relation
o Authorization
1990s:
Large decision support and data-mining applications
Large multi-terabyte data warehouses
Emergence of Web commerce
Early 2000s:
XML and XQuery standards
Automated database administration
Later 2000s:
Giant data storage systems: Google BigTable, Yahoo! PNUTS, Amazon, ...
Review Questions
1. Define the following terms
Data
Meta data
End user data
Database
Database management system
Database system
2. Discuss the main characteristics of the database approach and how it differs from the
traditional file system.
3. Discuss the components of database systems
4. Compare any three common databases used in the corporate world.
5. Discuss the capabilities provided by a typical DBMS.
6. Outline the main components of a DBMS
Week Two
Data Models:
A database can be modeled as a collection of entities and relationships among entities.
§ An Entity Model (also called Entity-Relationship Model) is a representation of the data
used by a system. It shows how data is logically grouped together and the relationships
between these logical data groupings as defined by the business requirements of the
system.
§ An Entity Model comprises:
§ Entity Relationship Diagram (ERD)
§ Associated Documentation of Entities, Relationships and Attributes.
Entity
An entity is an object that exists and is distinguishable from other objects. Example: a specific
person, company, event, plant etc.
§ It is a logical grouping of data relevant to the application under investigation. An
identifiable object, concept or activity belonging to the application about which
descriptive data is stored.
§ An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays
An entity usually has attributes (i.e., data elements) that further describe it. Each
attribute is a characteristic of the entity. An entity must possess a set of one or more
attributes that uniquely identify it (called a primary key).
Attribute
Entities have attributes. Example: people have names and addresses
§ One-to-one - For any A there may only be one member of B and for any B there is only
one member of A associated with it at any time.
§ One to Many - For any A there may be many members of B and for any B there is only
one member of A associated with it.
Example Customer – Order
§ Many to Many - For any A there may be many members of B and for any B there may be
many members of A associated with it. Example Customer – Product
§ One-to-One relationships are, in general, discouraged and it is normal to merge the two
entities involved into one
§ One-to-Many relationships define the association between a master entity and a detail
entity
§ Many-to-Many relationships must be decomposed into One-to-Many relationships, as shown
in the sketch below
Constructing the Entity Model
Step 1 - Develop the Initial Entities
§ Identify the objects about which the system currently holds data. This may be done
using interviews, notes, document reviews, observation
Step 2 - Identify Initial Relationships
§ Successful identification of relationships between entities demands constant reference
to the business requirements and environments of the system.
Step 3 Validate against Current System
§ The initial ERD is validated with users. The ERD is logical and so does not show physical
aspects of the current system; instead of "Blue Enquiry Form", "Order Form", "Employee
Master Listing", use "Enquiry", "Order", "Employee".
Step 4 - Extend EM to Cover New Requirements
§ The EM is extended to cover entities needed to fulfil the new requirements of the
system.
Step 5 - Rationalize the Structure
§ A review of the ERD is required to ensure that it is the minimum logical model required
to support the system requirements.
§ If there are two access paths between two entities, the access logic may be duplicated
and redundant.
§ If link entities are found to contain similar data, they may be merged.
Step 6 - Revalidate the EM with Users
More on Relationships
§ Recursive Relationships - A recursive relationship is one in which the same entity
participates more than once in different roles
Example:
Employee married to Employee
Employee manages Employee
§ Optional/Mandatory Relationships
In an optional relationship, an entity occurrence in one entity need not have an
occurrence in the other entity associated with it.
§ Cardinality Constraints - describes the maximum number of possible relationships for
each participating entity
• It is a function of policies established by an enterprise
• The most common are one-to-one (1:1), one-to-many (1:M) and many-to-many (M:N)
§ Cardinality Constraints - Example
A Lecturer teaches at most three Classes per semester and each Class has at most one
Lecturer. A student is limited to six classes per semester and class size is limited to 35.
Participation Constraint (Existence – Dependent) - Determines whether the existence of
an entity depends on it being related to another entity through the relationship.
Example: Course generates Class
First Normal Form
§ Remove repeating groups and propagate higher-level primary keys by partitioning the
un-normalized relation
Second Normal Form
§ A relation is in SNF if and only if it is in FNF and every non-key attribute is fully
functionally dependent on the key attribute
* Remove any partial dependencies by partitioning the relation
* A relation that is in FNF and has no composite key is necessarily in SNF!
Third Normal Form
§ A relation is in TNF if and only if it is in SNF and every non-key attribute is independent
of all other non-key attributes
* Remove any non-key attributes that depend on other non-key attributes
(transitive dependencies)
Functional Dependency
§ Given a relation R, attribute Y of R is functionally dependent on attribute X of R, if and
only if each X-value in R has associated with it precisely one Y-value in R (at any one
time)
R.X → R.Y (R.X functionally determines R.Y)
Transitive Dependency
§ A → B → C
If B depends on A and C depends on B, then C transitively depends on A via B
Example: staffNo → name, salary, branchNo, address
branchNo → address
Normalization Steps
§ Represent the un-normalized relation
List Attributes
Identify Repeating Groups
Identify Key Attributes
§ Convert to FNF
• By removing Repeating Groups
§ Understand the Dependencies
• Functional Dependency Diagrams may be used
§ Convert to SNF
• by removing partial dependencies
§ Convert to TNF
• by removing transitive dependencies
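As a worked sketch of the Staff example above (the data types are assumed), removing the transitive dependency staffNo → branchNo → address means decomposing the relation into two TNF relations:

-- SNF relation with a transitive dependency:
--   Staff(staffNo, name, salary, branchNo, address), where branchNo -> address
-- TNF decomposition: the transitively dependent attribute moves to its own relation
CREATE TABLE Branch (
    branchNo  CHAR(4)       PRIMARY KEY,
    address   VARCHAR(100)
);
CREATE TABLE Staff (
    staffNo   CHAR(6)       PRIMARY KEY,
    name      VARCHAR(50),
    salary    DECIMAL(9,2),
    branchNo  CHAR(4)       REFERENCES Branch(branchNo)
);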
Review Questions
1. An engineering consultancy firm supplies temporary specialized staff to bigger companies in
the country to work on their projects for a certain amount of time. The table below lists the
time spent by each of the company’s employees at other companies to carry out projects.
The National Insurance Number (NIN) is unique for every member of staff.
Week Three
Properties of Relations
§ The relation has a distinct name
§ Each attribute has a distinct name
§ Each cell of the relation contains one atomic value
§ The values of an attribute are from the same domain
§ The order of attributes has no significance
§ No duplicate rows
§ The order of rows has no significance (theoretically)
Relational Keys
i. Candidate Key - an attribute, or set of attributes, that uniquely identifies each tuple of a relation
ii. Primary Key - The candidate key selected to be the primary key
iii. Composite Key - a candidate key that consists of two or more attributes
iv. Foreign Key - A foreign key is an attribute(s) in one relation whose values are
required to match those of the primary key of some relation
The Data Manipulation Language (DML) of relational databases is based on Relational Algebra.
Each operator of the relational algebra takes either one or two relations as its input and
produces a new relation as its output
A DML is a language which enables users to access and manipulate data. The goal is to provide
efficient human interaction with the system. There are two types of DML:
i. procedural: the user specifies what data is needed and how to get it
ii. nonprocedural: the user only specifies what data is needed
A query language is a portion of a DML involving information retrieval only. The terms DML and
query language are often used synonymously.
Data manipulation languages were initially only used by computer programs, but (with the
advent of SQL) have come to be used by people, as well.
Data Manipulation Languages have their functional capability organized by the initial word in a
statement, which is almost always a verb. In the case of SQL, these verbs are:
1. Select
2. Insert
3. Update
4. Delete
Relational Algebra
§ SELECT - Extracts specified tuples (rows/records) from a specified relation
§ PROJECT - Extracts specified attributes (columns/fields) from a specified relation
§ JOIN - Produces a relation from two specified relations consisting of all possible
concatenated pairs of tuples, one from each of the two specified relations, such that in
each pair the two tuples satisfy some specified condition
§ DIVIDE - Takes two relations, one binary and one unary, and builds a relation consisting
of all values of one attribute of the binary relation that match (in the other attribute) all
values in the unary relation
§ UNION - Produces a relation consisting of all tuples appearing in either or both of two
specified relations
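As a hedged illustration of how these operators surface in SQL (the Staff/Branch and Customers/Suppliers tables used elsewhere in these notes serve purely as examples), note that DIVIDE has no direct SQL keyword and is usually written with nested NOT EXISTS subqueries:

-- SELECT (restriction): extract specified tuples
SELECT * FROM Staff WHERE branchNo = 'B003';

-- PROJECT: extract specified attributes
SELECT staffNo, name FROM Staff;

-- JOIN: pair tuples from two relations that satisfy a condition
SELECT s.name, b.address
FROM Staff s JOIN Branch b ON s.branchNo = b.branchNo;

-- UNION: tuples appearing in either or both relations
SELECT City FROM Customers
UNION
SELECT City FROM Suppliers;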
A Data Definition Language (DDL) is a computer language for defining data structures. The
term was first introduced in relation to the Codasyl database model, where the schema* of the
database was written in a Data Definition Language describing the records, fields, and "sets"
making up the user Data Model. Initially it referred to a subset of SQL, but is now used in a
generic sense to refer to any formal language for describing data or information structures, like
XML schemas.
*Schema: The schema of a database system is its structure described in a formal language
supported by the database management system (DBMS). In a relational database, the schema
defines the tables, the fields in each table, and the relationships between fields and tables.
Schemas are generally stored in a data dictionary. Although a schema is defined in a text-based
database language, the term is often used to refer to a graphical depiction of the database
structure.
Review Questions
1. Define the following terms
Data model
Referential integrity
Primary candidates
Candidate key
Value set (domain)
Attribute
Attribute value
2. Differentiate between the following terms: -
Entity type and entity set
Relationship type, relationship instance and relationship set.
Weak entity types and strong (regular) entity types
3. Describe briefly the following data models
Relational data models
Network data models
Hierarchical data models
4. What is a participation role? When is it necessary to use role names in the
description of relationship types?
5. Define the following terms:-
a. Internal schema
b. Conceptual schema
c. External schema
d. Data independence
e. Data definition language
f. Query language
g. Data manipulation language
6. Differentiate between
Database state and database schema
Logical data independence and physical data independence
Procedural and non-procedural DMLs.
7. Discuss the weaknesses of a relational database model
Week Four
Query optimisation
Objectives
A query optimizer is essentially a program which is used for the efficient evaluation of relational
queries, making use of relevant statistic information.
Objective
The main objective of query optimization is to choose the most efficient strategy for
implementing a given relational query, and thereby improve system performance.
A database system based on either the hierarchical model or network model requires users to
navigate across the database to locate and retrieve the desired data. A relational database
system, however, allows users simply to state what data they require and leave the system to
locate that data in the database.
Although there are many different ways available to perform a single user query, this automatic
navigation can be optimized by choosing and executing the most efficient plan, selected on the
basis of system statistical information. By executing queries in the most cost-effective way, the
overall system performance can be improved /enhanced.
The speed discrepancy between CPU and I/O devices is still enormous despite rapid recent
advances in computer technology. As an I/O access is one of the most costly and frequently
performed operations in a typical database system, it is the main aim of the query optimiser to
minimize the I/O activities by choosing the cheapest query plan for a given query, usually the
one with a minimum number of tuple I/O operations.
Consider the following three relations: Student, Lending and Book. The highlighted attributes
are the keys for the relevant relations (Figure V4.1).
Effects of Optimization - 1
Retrieve the names of students who have borrowed the book B1.
The database contains 100 students and 10,000 lendings, of which only 50 are for book B1.
It is possible to hold up to 50 tuples in memory, without having to write back to disk.
Query Execution Plan A - No Optimization
Join-Select-Project
1. Join relations Student and Lending over Stud_No
2. Select the result of Step 1 for just the tuples for book B1
3. Project the result of Step 2 over Stud_Name to get the result (50 max)
Total tuple I/O: 1,020,000
In this first approach, the operations required by the query are performed in the sequence:
Join-Select-Project
We calculate the number of database accesses (tuple I/O operations) incurred by each
operation.
1. Join relations Student and Lending over Stud_No.
- For each Student row, every Lending row is retrieved, and the 10,000 joined tuples are
written back to disk.
The number of tuple I/Os: 1,000,000 + 10,000 = 1,010,000
2. Select the result of Step 1 for just the tuples for book B1.
- This involves reading the 10,000 joined tuples (obtained in Step 1) back into memory.
- The Select then produces a relation containing only 50 tuples, which can be kept in
memory (see assumption).
The number of tuple I/Os: 10,000 + 0 = 10,000
3. Project the result of Step 2 over Stud_Name to get the result (50 max).
- This involves reading a relation of 50 tuples (obtained in Step 2), which is already in
memory, and producing a final relation of no more than 50 tuples, again in memory.
The number of tuple I/Os: 0 + 0 = 0
Therefore, the total number of tuple I/Os for query plan A is 1,010,000 + 10,000 + 0 = 1,020,000.
Query Execution Plan B - With Optimization
Select-Join-Project
1. Select the relation Lending for just the tuples for book B1
2. Join the result of Step 1 to relation Student over Stud_No
3. Project the result of Step 2 over Stud_Name
Total tuple I/O: 10,100
In this approach, the sequence of the operations has been changed to the following:
Select-Join-Project
1. Select the relation Lending for just the tuples for book B1. This involves reading all
10,000 Lending tuples and produces a relation with 50 tuples, which will be kept in
memory (see assumption).
The number of tuple I/Os: 10,000 + 0 = 10,000
2. Join the result of Step 1 with relation Student over Stud_No. This involves reading the
100 Student tuples; the joined result (at most 50 tuples) remains in memory.
The number of tuple I/Os: 100 + 0 = 100
3. Project the result of Step 2 over Stud_Name, entirely in memory.
The number of tuple I/Os: 0 + 0 = 0
Therefore, the total number of tuple I/Os for query plan B is 10,000 + 100 = 10,100.
Comparison of A and B
Although plan A and plan B will produce the same end result for the given
query, the order in which the required relational operations are executed is
different. It is obvious that plan B requires far fewer tuple I/O accesses than
plan A, and therefore is far more efficient in terms of performance.
If, furthermore, use was made in Plan B of an index on Book_No for the Lending relation,
the number of tuples read in Step 1 would be further reduced from 10,000 to just 50.
Similarly, an index on Student.Stud_No would reduce the number of Student tuples read in
Step 2.
Four Stages of Optimization - 1
1. Convert the query into an internal form suitable for machine manipulation, e.g. a query
tree or relational algebra.
2. Further convert the internal form into an equivalent and more efficient canonical form,
using well-defined transformation rules (laws):
- Query tree;
- Relational algebra.
Four Stages of Optimization - 2
3. Choose a set of candidate low-level procedures, using statistics about the database:
– Low-level operation
– Implementation procedure
– Cost formula
4. Generate query plans and choose the best (cheapest) plan by evaluating the cost formulae.
Having completed steps 1 and 2, the query optimizer must now decide how to evaluate
the transformed query. The basic principle here is to consider the transformed query
representation as specifying a series of low-level operations, using statistical
information about the database.
The main results produced in this stage are:
Transformation Rules
Transformation Rules - 1
Rule 1
(A where Restrict-1) where Restrict-2 ≡ A where (Restrict-1 AND Restrict-2)
Rule 2
(A [Project]) where Restrict ≡ (A where Restrict) [Project]
Rule 3
(A [Project-1]) [Project-2] ≡ A [Project-2]
Rule 4
(A Join B) where Restrict-on-A AND Restrict-on-B
≡ (A where Restrict-on-A) Join (B where Restrict-on-B)
• Rule 1
Transform a sequence of restrictions against a given relation into a single restriction.
(A where Restrict-1) where Restrict-2 ≡ A where (Restrict-1 AND Restrict-2)
• Rule 2
Transform a restriction of a projection into a projection of a restriction.
• Rule 4
Distributivity (for restrictions and projections).
Transformation Rules - 2
Rule 5
where p OR (q AND r) ≡ where (p OR q) AND (p OR r)
Rule 6
(A Join B) Join C ≡ A Join (B Join C)
Rule 7
Perform projections as early as possible
Rule 8
Perform restrictions as early as possible
• Rule 5
Distributivity (for logical expressions).
where p OR (q AND r) ≡ where (p OR q) AND (p OR r)
• Rule 6
Choose an optimal ordering of the joins to keep the intermediate results low in size.
(A Join B) Join C ≡ A Join (B Join C)
• Rule 7
Perform projections as early as possible.
• Rule 8
Perform restrictions as early as possible.
Database Statistics
System catalogue, or data dictionary
Typical statistics maintained by it include:
• For each base table
– cardinality
– number of pages for this table
• For each column of each base table
– maximum, minimum, and average value
– actual values and their frequencies
• For each index
– whether a ‘clustering index’
– number of leaf pages
– number of levels
Various decisions which have to be made in the optimization process are based upon the
database statistics stored in the system, often in the form of a system catalogue or a data
dictionary.
Typical statistics maintained by the system include
For each base relation, for example:
- cardinality of the relation;
- number of pages for this relation.
Summary
The following key points should be emphasized for this topic:
Effects:
The dramatic effects of optimization are illustrated by the example. Students should be able
to compare different query execution plans by calculating the number of tuple I/Os for each
plan, as shown in the example.
Processes:
The main tasks involved in each of the four stages of optimization.
Information needed:
The critical role of database statistics in the form of a system catalogue or data dictionary.
Review Questions
1. Discuss the reasons for converting SQL queries into relational algebra before
optimization is done.
2. Briefly discuss the process of query optimization
3. Why are queries converted into canonical form during optimization?
4. What is meant by the term heuristic optimization? Discuss the main heuristics that are
applied during query optimization
Week Five
Introduction to SQL
What is SQL?
SQL (Structured Query Language) is the standard language for defining, querying and
manipulating the data held in a relational database.
To build a web site that shows data from a database, you will need:
Example:
Note that primary key declaration on an attribute automatically ensures not null
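A minimal sketch of such a table definition (the course table and its columns are assumed for illustration); because course_id is declared as the primary key, the DBMS also treats it as not null:

CREATE TABLE course (
    course_id  CHAR(8)     PRIMARY KEY,   -- primary key implies NOT NULL
    title      VARCHAR(50) NOT NULL,
    credits    INT
);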
Alter table
Alter table r add A D
– Where A is the name of the attribute to be added to relation r and D is the domain of A.
– All tuples in the relation are assigned null as the value for the new attribute.
Alter table r drop A
– where A is the name of an attribute of relation r
– Dropping of attributes not supported by many databases
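A hedged sketch of the two forms above, using the instructor relation from these notes and an assumed office column:

-- Add attribute A (office) with domain D (VARCHAR(20)) to relation r (instructor);
-- existing tuples get NULL for the new attribute
ALTER TABLE instructor ADD office VARCHAR(20);

-- Drop attribute A from relation r (not supported by every DBMS)
ALTER TABLE instructor DROP COLUMN office;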
The SQL data-manipulation language (DML) provides the ability to query information, and
insert, delete and update tuples
A typical SQL query has the form:
Select A1, A2, ..., An
From r1, r2, ..., rm
Where P
– Ai represents an attribute
– Ri represents a relation
– P is a predicate.
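For example, against the instructor relation used elsewhere in these notes, a query of this form might be (a sketch; the salary threshold is arbitrary):

-- A1, A2 = name, dept_name;  r1 = instructor;  P = the salary predicate
SELECT name, dept_name
FROM instructor
WHERE salary > 70000;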
Views
In some cases, it is not desirable for all users to see the entire logical model (that is, all the
actual relations stored in the database.)
Consider a person who needs to know an instructor's name and department, but not the salary.
This person should see a relation described, in SQL, by select ID, name, dept_name from
instructor
A view provides a mechanism to hide certain data from the view of certain users.
Any relation that is not of the conceptual model but is made visible to a user as a “virtual
relation” is called a view.
View Definition
A view is defined using the create view statement which has the form
Create view v as < query expression >
Where <query expression> is any legal SQL expression.
The view name is represented by v.
Once a view is defined, the view name can be used to refer to the virtual relation that the view
generates.
View definition is not the same as creating a new relation by evaluating the query expression
Rather, a view definition causes the saving of an expression; the expression is substituted into
queries using the view.
Example Views
A view of instructors without their salary
Create view faculty as
Select ID, name, dept_name
From instructor
Find all instructors in the Biology department
Select name
From faculty
Where dept_name = ‘Biology’
SQL Transactions
A transaction is a unit of work. An atomic transaction is either fully executed or rolled back
as if it never occurred, and is isolated from concurrent transactions.
Transactions begin implicitly and are ended by commit work or rollback work.
By default on most databases, each SQL statement commits automatically; auto commit can be
turned off for a session (e.g. using an API).
In SQL:1999 one can use begin atomic ... end, but this is not supported on most databases.
Integrity Constraints
Integrity constraints guard against accidental damage to the database, by ensuring that
authorized changes to the database do not result in a loss of data consistency.
Integrity Constraints on a Single Relation are
not null
primary key
unique
check (P), where P is a predicate
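A minimal sketch combining the four single-relation constraints above (the student table and its email column are assumed for illustration):

CREATE TABLE student (
    ID        CHAR(5)     PRIMARY KEY,            -- primary key
    name      VARCHAR(50) NOT NULL,               -- not null
    email     VARCHAR(50) UNIQUE,                 -- unique
    tot_cred  INT         CHECK (tot_cred >= 0)   -- check (P)
);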
Referential Integrity
Ensures that a value that appears in one relation for a given set of attributes also appears for a
certain set of attributes in another relation. Example: If “Biology” is a department name
appearing in one of the tuples in the instructor relation, then there exists a tuple in the
department relation for “Biology”.
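The dept_name example above can be declared as follows (a sketch; the exact column definitions are assumed):

CREATE TABLE department (
    dept_name  VARCHAR(20) PRIMARY KEY
);
CREATE TABLE instructor (
    ID         CHAR(5)     PRIMARY KEY,
    name       VARCHAR(50) NOT NULL,
    -- referential integrity: every dept_name here must appear in department
    dept_name  VARCHAR(20) REFERENCES department(dept_name)
);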
Authorization
Forms of authorization on parts of the database:
Read - allows reading, but not modification of data.
Insert - allows insertion of new data, but not modification of existing data.
Update - allows modification, but not deletion of data.
Delete - allows deletion of data.
Privileges in SQL
select: allows read access to a relation, or the ability to query using a view
Example: grant users U1, U2, and U3 select authorization on the instructor relation:
grant select on instructor to U1, U2, U3
insert: the ability to insert tuples
update: the ability to update using the SQL update statement
delete: the ability to delete tuples.
all privileges: used as a short form for all the allowable privileges
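Building on the grant example above, a short sketch of granting and revoking privileges (U1–U3 are the placeholder user names from the note):

-- read access for U1, U2 and U3
GRANT SELECT ON instructor TO U1, U2, U3;

-- allow U1 to insert and update tuples as well
GRANT INSERT, UPDATE ON instructor TO U1;

-- privileges can later be withdrawn
REVOKE SELECT ON instructor FROM U3;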
Joined Relations
Join operations take two relations and return as a result another relation.
A join operation is a Cartesian product which requires that tuples in the two relations match
(under some condition). It also specifies the attributes that are present in the result of the join
The join operations are typically used as sub query expressions in the from clause
Relation course
Relation prerequisites
Observe that
prerequisite information is missing for CS-315 and
course information is missing for CS-437
Outer Join
§ Is an extension of the join operation that avoids loss of information
§ Computes the join and then adds tuples from one relation that do not match tuples in
the other relation to the result of the join.
§ Uses null values.
Joined Relations
Join condition – defines which tuples in the two relations match, and what attributes are
present in the result of the join.
Join type – defines how tuples in each relation that do not match any tuple in the other relation
(based on the join condition) are treated.
The LEFT JOIN keyword returns all rows from the left table (table1), with the matching rows in
the right table (table2). The result is NULL in the right side when there is no match.
SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name=table2.column_name;
or:
SELECT column_name(s)
FROM table1
LEFT OUTER JOIN table2
ON table1.column_name=table2.column_name;
Demo Database
Orders table (excerpt):
OrderID CustomerID EmployeeID OrderDate ShipperID
10308 2 7 1996-09-18 3
10309 37 3 1996-09-19 1
10310 77 8 1996-09-20 2
The following SQL statement will return all customers, and any orders they might have:
Example
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
LEFT JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows
in the left table (table1). The result is NULL in the left side when there is no match.
SELECT column_name(s)
FROM table1
RIGHT JOIN table2
ON table1.column_name=table2.column_name;
or:
SELECT column_name(s)
FROM table1
RIGHT OUTER JOIN table2
ON table1.column_name=table2.column_name;
Demo Database
Orders table (excerpt):
OrderID CustomerID EmployeeID OrderDate ShipperID
10308 2 7 1996-09-18 3
10309 37 3 1996-09-19 1
10310 77 8 1996-09-20 2
The following SQL statement will return all orders, and any customers that might have placed
them:
Example
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
RIGHT JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
The FULL OUTER JOIN keyword returns all rows from the left table (table1) and from the right
table (table2).
The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins.
Demo Database
Orders table (excerpt):
OrderID CustomerID EmployeeID OrderDate ShipperID
10308 2 7 1996-09-18 3
10309 37 3 1996-09-19 1
10310 77 8 1996-09-20 2
The following SQL statement selects all customers, and all orders:
Example
SELECT Customers.CustomerName, Orders.OrderID
FROM Customers
FULL OUTER JOIN Orders
ON Customers.CustomerID=Orders.CustomerID
ORDER BY Customers.CustomerName;
A selection of the result-set:
CustomerName OrderID
Alfreds Futterkiste
10382
10351
Note: The FULL OUTER JOIN keyword returns all the rows from the left table (Customers), and
all the rows from the right table (Orders). If there are rows in "Customers" that do not have
matches in "Orders", or if there are rows in "Orders" that do not have matches in "Customers",
those rows will be listed as well.
The UNION operator is used to combine the result-set of two or more SELECT statements.
Notice that each SELECT statement within the UNION must have the same number of columns.
The columns must also have similar data types. Also, the columns in each SELECT statement
must be in the same order.
Note: The UNION operator selects only distinct values by default. To allow duplicate values, use
the ALL keyword with UNION.
Note: The column names in the result-set of a UNION are usually equal to the column names in
the first SELECT statement in the UNION.
Demo Database
Suppliers table (excerpt):
SupplierID SupplierName ContactName Address City PostalCode Country
1 Exotic Liquid Charlotte Cooper 49 Gilbert St. Londona EC1 4SD UK
The following SQL statement selects all the different cities (only distinct values) from the
"Customers" and the "Suppliers" tables:
Example
SELECT City FROM Customers
UNION
SELECT City FROM Suppliers
ORDER BY City;
Note: UNION cannot be used to list ALL cities from the two tables. If several customers and
suppliers share the same city, each city will only be listed once. UNION selects only distinct
values. Use UNION ALL to also select duplicate values!
The following SQL statement uses UNION ALL to select all (duplicate values also) cities from the
"Customers" and "Suppliers" tables:
Example
SELECT City FROM Customers
UNION ALL
SELECT City FROM Suppliers
ORDER BY City;
The following SQL statement uses UNION ALL to select all (duplicate values also) German cities
from the "Customers" and "Suppliers" tables:
Example
SELECT City, Country FROM Customers
WHERE Country='Germany'
UNION ALL
SELECT City, Country FROM Suppliers
WHERE Country='Germany'
ORDER BY City;
Advanced SQL
JDBC and ODBC
JDBC Code
import java.sql.*;  // Connection, DriverManager, Statement, SQLException

public static void JDBCexample(String dbid, String userid, String passwd)
{
    try {
        // load the Oracle JDBC driver class
        Class.forName("oracle.jdbc.driver.OracleDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:oracle:thin:@db.yale.edu:2000:univdb", userid, passwd);
        Statement stmt = conn.createStatement();
        // ..... do actual work ........
        stmt.close();
        conn.close();
    } catch (SQLException | ClassNotFoundException sqle) {
        // Class.forName may throw ClassNotFoundException, handled here as well
        System.out.println("SQLException : " + sqle);
    }
}
Update to database
try {
    stmt.executeUpdate(
        "insert into instructor values('77987', 'Kim', 'Physics', 98000)");
} catch (SQLException sqle) {
    System.out.println("Could not insert tuple. " + sqle);
}
SQL Injection
§ Suppose query is constructed using "select * from instructor where name = ’" + name +
"’"
§ Suppose the user, instead of entering a name, enters: X’ or ’Y’ = ’Y
§ then the resulting statement becomes:
– "select * from instructor where name = ’" + "X’ or ’Y’ = ’Y" + "’"
– which is: select * from instructor where name = ’X’ or ’Y’ = ’Y’
– User could have even used X’; update instructor set salary = salary + 10000; --
§ A prepared statement internally uses: "select * from instructor where name = ’X\’ or \’Y\’
= \’Y’". Always use prepared statements, with user inputs as parameters.
Embedded SQL
§ The SQL standard defines embeddings of SQL in a variety of programming languages
such as C, Java, and Cobol.
§ A language to which SQL queries are embedded is referred to as a host language, and
the SQL structures permitted in the host language comprise embedded SQL.
§ The basic form of these languages follows that of the System R embedding of SQL into
PL/I.
§ EXEC SQL statement is used to identify embedded SQL request to the preprocessor
Example Query
From within a host language, find the ID and name of students who have completed more than
the number of credits stored in variable credit_amount.
EXEC SQL
declare c cursor for
select ID, name
from student
where tot_cred > :credit_amount
END_EXEC
Review Questions
1. List the data types that are allowed in SQL
2. How does SQL allow implementation of the entity integrity and referential integrity
constraints?
3. What is a view in SQL and how is it defined?
4. Discuss the problems that may arise when one attempts to update a view
5. Describe the six clauses in the syntax of an SQL query and show what type of constructs
can be specified in each of the six clauses. Which of the six clauses are required
and which are optional?
6. Describe conceptually how an SQL query will be executed by specifying the conceptual
order of executing each of the six clauses. Create a Student table and a course table
using SQL.
7. Feed some student records and Course records using SQL.
8. Use SQL to retrieve data from both the tables to show the Student details and the
course they take.
Week Six
Transaction management
Objectives
Transaction Concept
Transaction State
Concurrent Executions
Recoverability
Implementation of Isolation
Transaction Definition in SQL
Transaction Concept
A transaction is a unit of program execution that accesses and possibly updates various data
items.
Two main issues to deal with:
i. Failures of various kinds, such as hardware failures and system crashes
ii. Concurrent execution of multiple transactions
Transaction Management
* What if the system fails?
* What if more than one user is concurrently updating the same data?
What is a transaction?
A transaction is a unit of program execution that accesses and possibly updates various data
items. To preserve the integrity of data, the database system must ensure:
Atomicity. Either all operations of the transaction are properly reflected in the database or
none are.
Consistency. Execution of a transaction in isolation preserves the consistency of the database.
Isolation. Although multiple transactions may execute concurrently, each transaction must be
unaware of other concurrently executing transactions. Intermediate transaction results must be
hidden from other concurrently executed transactions. That is, for every pair of transactions Ti
and Tj, it appears to Ti that either Tj, finished execution before Ti started, or Tj started
execution after Ti finished.
Durability. After a transaction completes successfully, the changes it has made to the database
persist, even if there are system failures
Transactions Management deals with the problem of always keeping the database in a
consistent state even when concurrent accesses and failures occur
The database can be (and usually is) temporarily inconsistent during the execution of a
transaction. The important point is that the database should be consistent when the
transaction terminates.
Transaction:
• A transaction is a series of actions, carried out by a single user or application
program, which must be treated as a logical unit of work.
• A transaction is a unit of consistent and reliable computation.
• Example: the transfer of funds from one account to another; either all of the
operations are carried out or none of them is.
• Transactions transform the database from one consistent state to another consistent
state.
Transaction State
Active – the initial state; the transaction stays in this state while it is executing
Partially committed – after the final statement has been executed.
Failed -- after the discovery that normal execution can no longer proceed.
Aborted – after the transaction has been rolled back and the database restored to its state
prior to the start of the transaction. There are two options after it has been aborted:
restart the transaction (possible only if there is no internal logical error) or kill the transaction.
Committed – after successful completion.
ACID Properties
A database transaction, by definition, must be atomic, consistent, isolated and durable.
Database practitioners often refer to these properties of database transactions using the
acronym ACID.
Concurrent Executions
Transactions in SQL
§ Consider the following SQL statement
UPDATE Project
SET Budget=Budget*1.1
WHERE Pname=‘Engineering’
§ This can be translated into a transaction as follows:
Begin_Transaction Budget_Update
Begin
EXEC SQL UPDATE Project
SET Budget=Budget*1.1
WHERE Pname=‘Engineering’
End
Note that
§ (Implicit beginning of transaction)
– SELECT …;
– UPDATE …;
– ……
– ROLLBACK | COMMIT;
§ ROLLBACK (transaction abort)
– Will undo the partial effects of the transaction
– May be initiated by the DBMS
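The funds-transfer example mentioned earlier can be written as such a transaction. The sketch below is only an illustration: the account table, its columns, and the exact BEGIN/START TRANSACTION keyword vary between DBMSs.

-- Transfer 100 from account A-101 to account A-215: both updates succeed
-- together (COMMIT) or neither takes effect (ROLLBACK)
BEGIN;   -- often implicit; some systems use START TRANSACTION

UPDATE account SET balance = balance - 100 WHERE account_no = 'A-101';
UPDATE account SET balance = balance + 100 WHERE account_no = 'A-215';

COMMIT;      -- make both changes permanent
-- ROLLBACK; -- alternatively, undo the partial effects of the transaction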
Concurrency Control
DBMS Components
Transactions Manager - It is the responsibility of the Transactions Manager of a DBMS
to ensure transactions do not interfere with one another and corrupt the database.
Scheduler - Responsible for implementing a particular strategy for concurrency control.
Also called the Lock Manager (for those systems that use locking as the means of
managing concurrency)
Recovery Manager- Ensures that the database is restored to the state it was in before
the start of the transaction.
Buffer Manager - Responsible for the transfer of data between disk storage and
memory.
Conflicting Operations
§ Two operations on the same data item conflict if at least one of the operations is a write
– r(X) and w(X) conflict
– w(X) and r(X) conflict
– w(X) and w(X) conflict
– r(X) and r(X) do not
– r/w(X) and r/w(Y) do not
§ Order of conflicting operations matters
– If T1.r(A) precedes T2.w(A), then conceptually, T1 should precede T2
Locking
• If a transaction wants to read an object, it must first request a shared lock (S mode) on
that object
• If a transaction wants to modify an object, it must first request an exclusive lock (X
mode) on that object
• Allow one exclusive lock, or multiple shared locks
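How these lock requests surface in SQL varies between systems; the sketch below assumes a PostgreSQL-style dialect and reuses the Project table from the transaction example above.

-- Transaction 1: request a shared (S) lock to read without interference
BEGIN;
LOCK TABLE Project IN SHARE MODE;   -- others may read, none may modify
SELECT SUM(Budget) FROM Project;
COMMIT;

-- Transaction 2: request exclusive (X) row locks before modifying
BEGIN;
SELECT Budget FROM Project WHERE Pname = 'Engineering' FOR UPDATE;
UPDATE Project SET Budget = Budget * 1.1 WHERE Pname = 'Engineering';
COMMIT;   -- locks are released when the transaction ends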
Deadlocks
System is deadlocked if there is a set of transactions such that every transaction in the set is
waiting for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock state.
Some prevention strategies : Require that each transaction locks all its data items before it
begins execution (predeclaration).
Impose partial ordering of all data items and require that a transaction can lock data items only
in the order specified by the partial order (graph-based protocol).
Starvation happens if the same transaction is always chosen as the victim. Include the number
of rollbacks in the cost factor to avoid starvation.
Review Questions
1. Discuss the meaning of concurrent execution of database transaction in a multi-user
system.
2. Discuss why concurrency control is necessary in transaction management.
3. State and explain using suitable diagram the states that a transaction goes through
during execution.
4. Discuss the ACID properties of a transaction.
5. Define the following terms as used in transactions
Schedule
Serial schedule
Serializable schedule
Week Seven
Database Security
Database security can be defined as the protection of the database against
• unauthorized access to or modification of the database,
• denial of service to authorized users and
• provision of service to unauthorized users
It also includes the measures necessary to detect, document, and counter threats
Threats
Browsing - accessing info
Misuse – malice, errors of omission etc.
Penetration – Unauthorized access
Systems Flaws – h/w and s/w errors
Component Failure – malfunctioning of h/w, s/w or media
Tampering – attacks to physical and logical components
Eavesdropping – passive surveillance of telecomm channel e.g. tapping, sniffing
Denial of Service – preventing or delaying performance e.g. jamming, traffic flooding
Journaling
Process of keeping and maintaining a log file (or journal) of all changes made to
database to enable recovery to be undertaken effectively in the event of failure.
Integrity
Prevents data from becoming invalid, and hence giving misleading or incorrect
results.
Encryption
Encoding the data by a special algorithm that renders the data unreadable by any
program without the decryption key.
Redundant array of independent disks (RAID)
Hardware that the DBMS runs on must be fault-tolerant, meaning that the DBMS
should continue to operate even if one of the hardware components fails.
Suggests having redundant components that can be seamlessly integrated into the
working system whenever there are failures.
The main hardware components that should be fault-tolerant include disk drives,
disk controllers, CPU, power supplies, and cooling fans.
Disk drives are the most vulnerable components, with the shortest mean time between
failures of any of the hardware components.
One solution is to provide a large disk array comprising an arrangement of several
independent disks that are organized to improve reliability and at the same time
increase performance.
Database recovery
Database Recovery is the process of restoring the database to a correct state in the event of a
failure
Causes of Failures:
Transaction-Local - The failure of an individual transaction can be caused in several ways:
* Transaction – Induced Abort e.g. insufficient funds
* Unforeseen Transaction Failure, arising from bugs in the application
programs
* System-Induced Abort – e.g. when Transaction Manager explicitly aborts
a transaction because it conflicts with another transaction or to break a
deadlock.
§ Site Failures - Occur due to failure of the local CPU and result in a system crash.
Recovery Procedures
Recovering from any type of system failure requires the following:
• Determining which data structures are intact and which ones need recovery.
• Following the appropriate recovery steps.
• Restarting the database so that it can resume normal operations.
• Ensuring that no work has been lost nor incorrect data entered in the database
Log File
§ All operations on the database carried out by all transactions are recorded in the log
file (journal).
§ Each log record contains
Transaction identifier
Type of log record e.g. begin, write, commit, abort etc.
Identifier of data object affected
Before-image of data object
After-image of data object
Log management information
§ The Log is also used for monitoring and audit purposes.
Recovery Techniques
§ Restart Procedures – No transactions are accepted until the database has been
repaired. Includes:
• Emergency Restart - follows when a system fails without warning e.g. due to
power failure.
• Cold Restart - system restarted from archive when the log and/or restart file has
been corrupted.
• Warm Restart - follows controlled shutdown of system
§ Backups - particularly useful if there has been extensive damage to database.
§ Mirroring - Two complete copies of the DB are maintained on-line on different stable
storage devices. Used in environments with non-stop, fault-tolerant operations.
§ Undo/Redo - Undoing and redoing a transaction after failure. The Transaction Manager
keeps an active-list, an abort-list and a commit-list: transactions that have begun,
transactions that have aborted, and committed transactions respectively.
Review Questions
1. Define the following terms:
i. Checkpoint
ii. Database security
iii. Threat
Week Eight
Concepts
Distributed Database: A logically interrelated collection of shared data (and a description of
this data), physically distributed over a computer network.
Distributed DBMS: Software system that permits the management of the distributed database
and makes the distribution transparent to users.
§ Collection of logically-related shared data.
§ Data split into fragments.
§ Fragments may be replicated.
§ Fragments/replicas allocated to sites.
§ Sites linked by a communications network.
§ Data at each site is under control of a DBMS.
§ DBMSs handle local applications autonomously.
§ Each DBMS participates in at least one global application
Distributed DBMS
Distributed Processing
A centralized database that can be accessed over a computer network.
Parallel DBMS
A DBMS running across multiple processors and disks designed to execute operations in
parallel, whenever possible, to improve performance.
§ Based on premise that single processor systems can no longer meet requirements for
cost-effective scalability, reliability, and performance.
§ Parallel DBMSs link multiple, smaller machines to achieve same throughput as single,
larger machine, with greater scalability and reliability.
Advantages of DDBMSs
§ Reflects organizational structure
§ Improved share ability and local autonomy
§ Improved availability
§ Improved reliability
§ Improved performance
§ Economics
§ Modular growth
Disadvantages of DDBMSs
§ Complexity
§ Cost
§ Security
§ Integrity control more difficult
§ Lack of standards
§ Lack of experience
§ Database design more complex
Types of DDBMS
1) Homogeneous DDBMS
2) Heterogeneous DDBMS
Homogeneous DDBMS
§ All sites use same DBMS product.
§ Much easier to design and manage.
§ Approach provides incremental growth and allows increased performance.
Heterogeneous DDBMS
§ Sites may run different DBMS products, with possibly different underlying data models.
§ Occurs when sites have implemented their own databases and integration is considered
later.
§ Translations required to allow for:
o Different hardware.
o Different DBMS products.
o Different hardware and different DBMS products.
§ Typical solution is to use gateways.
Multidatabase System (MDBS)
§ A DBMS that resides transparently on top of existing database and file systems and
presents a single database to its users.
§ Allows users to access and share data without requiring physical database integration.
§ Unfederated MDBS (no local users) and federated MDBS.
Overview of Networking
Network - Interconnected collection of autonomous computers, capable of exchanging
information.
Local Area Network (LAN) intended for connecting computers at same site.
Wide Area Network (WAN) used when computers or LANs need to be connected over
long distances.
WAN relatively slow and less reliable than LANs. DDBMS using LAN provides much faster
response time than one using WAN.
Functions of a DDBMS
§ Expect DDBMS to have at least the functionality of a DBMS.
§ Also to have following functionality:
– Extended communication services.
– Extended Data Dictionary.
– Distributed query processing.
– Extended concurrency control.
– Extended recovery services.
Components of a DDBMS
Fragmentation
Relation may be divided into a number of sub-relations, which are then distributed.
Definition and allocation of fragments carried out strategically to achieve:
– Locality of Reference.
Data Allocation
Each fragment is stored at site with “optimal” distribution.
Four alternative strategies regarding placement of data:
– Centralized,
– Partitioned (or Fragmented),
– Complete Replication,
– Selective Replication.
• Centralized: Consists of single database and DBMS stored at one site with users
distributed across the network.
• Partitioned: Database partitioned into disjoint fragments, each fragment
assigned to one site.
• Complete Replication: Consists of maintaining complete copy of database at
each site.
• Selective Replication: Combination of partitioning, replication, and
centralization.
Replication
Copy of fragment may be maintained at several sites.
Comparison of Strategies for Data Distribution
Why Fragment?
§ Usage
o Applications work with views rather than entire relations.
§ Efficiency
o Data is stored close to where it is most frequently used.
o Data that is not needed by local applications is not stored.
§ Parallelism
o With fragments as unit of distribution, transaction can be divided into several
subqueries that operate on fragments.
§ Security
o Data not required by local applications is not stored and so not available to
unauthorized users.
Disadvantages
o Performance,
o Integrity.
Correctness of Fragmentation
There are three correctness rules:
– Completeness,
– Reconstruction,
– Disjointness.
Completeness
If relation R is decomposed into fragments R1, R2, ... Rn, each data item that can be
found in R must appear in at least one fragment.
Reconstruction
• Must be possible to define a relational operation that will reconstruct R from the
fragments.
• Reconstruction for horizontal fragmentation is Union operation and Join for vertical .
Disjointness
• If data item di appears in fragment Ri, then it should not appear in any other
fragment.
• Exception: vertical fragmentation, where primary key attributes must be repeated to
allow reconstruction.
• For horizontal fragmentation, data item is a tuple.
• For vertical fragmentation, data item is an attribute.
Types of Fragmentation
Four types of fragmentation:
– Horizontal,
– Vertical,
– Mixed,
– Derived.
– Other possibility is no fragmentation:
– If relation is small and not updated frequently, may be better not to fragment
relation.
Horizontal Fragmentation
§ Consists of a subset of the tuples of a relation.
§ Defined using the Selection operation of relational algebra:
σp(R)
§ For example:
P1 = σtype=‘House’(PropertyForRent)
P2 = σtype=‘Flat’(PropertyForRent)
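A sketch of how the two fragments above could be materialised in SQL (the CREATE TABLE ... AS form is not supported identically in every DBMS); the union of the fragments reconstructs PropertyForRent:

-- P1: horizontal fragment holding the 'House' tuples
CREATE TABLE P1 AS
    SELECT * FROM PropertyForRent WHERE type = 'House';

-- P2: horizontal fragment holding the 'Flat' tuples
CREATE TABLE P2 AS
    SELECT * FROM PropertyForRent WHERE type = 'Flat';

-- Reconstruction of the original relation from its fragments
SELECT * FROM P1
UNION ALL
SELECT * FROM P2;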
Mixed Fragmentation
§ Consists of a horizontal fragment that is vertically fragmented, or a vertical fragment
that is horizontally fragmented.
§ Defined using the Selection and Projection operations of relational algebra:
o σp(Πa1, ..., an(R)) or
o Πa1, ..., an(σp(R))
Example - Mixed Fragmentation
S1 = ΠstaffNo, position, sex, DOB, salary(Staff)
S2 = ΠstaffNo, fName, lName, branchNo(Staff)
S21 = σbranchNo=‘B003’(S2)
S22 = σbranchNo=‘B005’(S2)
S23 = σbranchNo=‘B007’(S2)
Derived Horizontal Fragmentation
§ A horizontal fragment that is based on a horizontal fragmentation of a parent relation.
§ Ensures that fragments that are frequently joined together are at the same site.
§ Defined using the Semijoin operation of relational algebra:
Ri = R ⋉F Si, 1 ≤ i ≤ w
Example - Derived Horizontal Fragmentation
S3 = σbranchNo=‘B003’(Staff)
S4 = σbranchNo=‘B005’(Staff)
S5 = σbranchNo=‘B007’(Staff)
Could use derived fragmentation for Property:
Pi = PropertyForRent ⋉branchNo Si, 3 ≤ i ≤ 5
§ If relation contains more than one foreign key, need to select one as parent.
§ Choice can be based on fragmentation used most frequently or fragmentation with
better join characteristics.
Transparencies in a DDBMS
§ Distribution Transparency
– Fragmentation Transparency
– Location Transparency
– Replication Transparency
– Local Mapping Transparency
– Naming Transparency
§ Transaction Transparency
– Concurrency Transparency
– Failure Transparency
§ Performance Transparency
– DBMS Transparency
Distribution transparency allows the user to perceive the database as a single, logical entity.
If a DDBMS exhibits distribution transparency, the user does not need to know:
– that the data is fragmented (fragmentation transparency),
– the location of data items (location transparency);
– if the user must refer to fragments and their locations explicitly, only local mapping transparency is provided.
With replication transparency, the user is unaware of the replication of fragments.
Naming Transparency
§ Each item in a DDB must have a unique name.
§ DDBMS must ensure that no two sites create a database object with same name.
§ One solution is to create central name server. However, this results in:
– loss of some local autonomy;
– central site may become a bottleneck;
– low availability; if the central site fails, remaining sites cannot create any new
objects.
§ Alternative solution - prefix object with identifier of site that created it.
§ For example, Branch created at site S1 might be named S1.BRANCH.
§ Also need to identify each fragment and its copies.
§ Thus, copy 2 of fragment 3 of Branch created at site S1 might be referred to as S1.BRANCH.F3.C2.
§ However, this results in loss of distribution transparency.
§ An approach that resolves these problems uses aliases for each database object.
§ Thus, S1.BRANCH.F3.C2 might be known as Local Branch by user at site S1.
§ DDBMS has task of mapping an alias to appropriate database object.
Transaction Transparency
§ Ensures that all distributed transactions maintain distributed database’s integrity and
consistency.
§ Distributed transaction accesses data stored at more than one location.
§ Each transaction is divided into a number of sub-transactions, one for each site that has
to be accessed.
§ DDBMS must ensure the indivisibility of both the global transaction and each of the sub-
transactions.
Example - Distributed Transaction
A transaction T prints out the names of all staff, using the fragmentation schema defined above (S1, S2, S21, S22, and S23). We can define three sub-transactions TS3, TS5, and TS7 to represent the agents at sites 3, 5, and 7.
Concurrency Transparency
§ All transactions must execute independently and be logically consistent with results
obtained if transactions executed one at a time, in some arbitrary serial order.
§ Same fundamental principles as for centralized DBMS.
§ DDBMS must ensure both global and local transactions do not interfere with each other.
§ Similarly, DDBMS must ensure consistency of all subtransactions of global transaction.
Classification of Transactions
In IBM’s Distributed Relational Database Architecture (DRDA), four types of
transactions:
– Remote request
– Remote unit of work
– Distributed unit of work
– Distributed request.
Failure Transparency
§ DDBMS must ensure atomicity and durability of global transaction.
§ Means ensuring that subtransactions of global transaction either all commit or all abort.
§ Thus, DDBMS must synchronize global transaction to ensure that all subtransactions
have completed successfully before recording a final COMMIT for global transaction.
§ Must do this in presence of site and network failures.
Performance Transparency
§ DDBMS must perform as if it were a centralized DBMS.
– DDBMS should not suffer any performance degradation due to distributed
architecture.
– DDBMS should determine most cost-effective strategy to execute a request.
§ Distributed Query Processor (DQP) maps data request into ordered sequence of
operations on local databases.
§ Must consider fragmentation, replication, and allocation schemas.
§ DQP has to decide:
– which fragment to access;
– which copy of a fragment to use;
– which location to use.
§ DQP produces execution strategy optimized with respect to some cost function.
§ Typically, costs associated with a distributed request include:
– I/O cost;
– CPU cost;
– Communication cost.
Assume:
– each tuple in each relation is 100 characters long;
– there are 10 renters with a maximum price greater than £200,000;
– there are 100,000 viewings for properties in Aberdeen;
– computation time is negligible compared to communication time.
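A rough back-of-the-envelope comparison under these assumptions (the full worked example and network parameters are not reproduced here): shipping all 100,000 Aberdeen viewing tuples to another site means transferring 100,000 × 100 = 10,000,000 characters, whereas shipping the 10 qualifying renter tuples means transferring only 10 × 100 = 1,000 characters. Because communication cost dominates, a strategy that moves the small intermediate result rather than the large relation is about four orders of magnitude cheaper - exactly the kind of trade-off the DQP's cost function must capture.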
Review Questions
1. What are the reasons for and advantages of distributed databases?
2. What is fragmentation? What are the main types of fragments?
3. Why is the concept of fragmentation useful in distributed database design?
4. Outline the major advantages and disadvantages of:
- Fragmentation
- Replication
5. Why is data replication useful in DDBMSs? What typical units of data are replicated?
6. How is a horizontal partitioning of a relation specified? How can a relation be put back together from a complete horizontal partitioning?
7. How is a vertical partitioning of a relation specified? How can a relation be put back together from a complete vertical partitioning?
8. Discuss the main problems of distributed databases.
9. List and explain any five advantages of a distributed database management system.
10. With the help of an illustrative diagram, discuss the three main architectures of parallel DBMSs.
Week Nine
Object-Oriented Databases: Basic Definitions
In an object-oriented system everything is an object
§ Objects are encapsulated which means that there is no way to access an object except
through the public interface - a set of operations - specified for it
§ To emphasize object independence, objects communicate by message passing
§ A class is a template from which new objects can be created using special operations (such as new)
§ Inheritance makes it possible to declare a class as a specialization of another class, thus
supporting the management of (hierarchical) relationships among classes as well as the
reusability of software
§ The ability of different objects to respond differently to the same message is known as
polymorphism
§ The concepts of inheritance and polymorphism can best be exploited if the underlying
system supports run-time type checking and late binding - operation names are
translated into program addresses at run-time
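A minimal Java sketch of these ideas (the Animal and Dog classes are purely illustrative, not from the notes): encapsulation behind a public interface, a class used as a template for new objects, inheritance as specialization, and polymorphism resolved by late binding at run time.

// Hypothetical example: encapsulation, inheritance, and polymorphism.
class Animal {
    private String name;                        // state is encapsulated (private)
    Animal(String name) { this.name = name; }
    public String getName() { return name; }    // public interface (the only access route)
    public String speak() { return "..."; }     // overridable operation
}

class Dog extends Animal {                      // Dog is a specialization of Animal
    Dog(String name) { super(name); }
    @Override
    public String speak() { return "Woof"; }    // different response to the same message
}

public class OODemo {
    public static void main(String[] args) {
        Animal a = new Dog("Rex");              // 'new' creates an object from a class
        // The call below is bound at run time to Dog.speak() (late binding):
        System.out.println(a.getName() + " says " + a.speak());
    }
}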
O-O Databases: Top-Down & Bottom-Up Approach
§ Historical roots in
– database technology, including semantic data models
– abstraction-based programming languages, e.g., Simula-67, Smalltalk
– Knowledge-representation of AI technology
§ Bottom-Up approach - by database community
– Structurally object-oriented
§ Top-Down approach - by programming languages people
– behavioral object-orientation
– persistence - objects remain in existence even if an application program ends
– object sharing - independent applications may use the same data concurrently
2. Semantic Data Models
– Aggregation
– Generalization
3. Object-Oriented Programming
– Complex objects
– Object Identity
– Classes and Methods
– Encapsulation
– Inheritance
– Extensibility
4. Object-Oriented Data Model
Sets are critical because they are a natural way of representing collections from the real
world
Tuples are critical because they are a natural way of representing properties of an entity
Lists or arrays are important because they capture order, which occurs in the real world
The object constructors must be orthogonal: any constructor should apply to any object
Operations on a complex object must propagate transitively to all its components
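A small Java sketch (illustrative names only; Java 16+ records are used for brevity) of a complex object built from orthogonal constructors - a tuple-like record whose components include another record, a set, and an ordered list:

import java.util.List;
import java.util.Set;

// Tuple constructor: a record groups the properties of an entity.
record Address(String street, String city) { }

// Complex object: constructors applied orthogonally - a record holding another
// record, a Set (unordered collection), and a List (ordered collection).
record Customer(String name, Address address, Set<String> phoneNumbers, List<String> orderHistory) { }

public class ComplexObjectDemo {
    public static void main(String[] args) {
        Customer c = new Customer(
            "Alice",
            new Address("1 High St", "Nairobi"),
            Set.of("0700-000000", "0711-111111"),
            List.of("order-1", "order-2", "order-3"));   // list preserves order
        // An operation on the complex object can reach all of its components:
        System.out.println(c.name() + " lives in " + c.address().city()
            + " and has placed " + c.orderHistory().size() + " orders");
    }
}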
Extensibility (8)
The database system comes with a set of predefined types. These types can be used at
will by programmers to write their applications. This set of types must be extensible in
the following sense:
– there is a means to define new types and
– there is no distinction in usage between system defined and user defined types
Relational database systems rely on a simple data model. It offers a fixed number of
predefined atomic data types and very few type constructors.
In order to permit the integration of arbitrary data types into a data model, it must be
possible to define new basic data types, to use previously defined data types as building
blocks for other data types, or to nest data structures to arbitrary levels
Since object-oriented database systems are meant to offer an interface which integrates
a programming language and data management facilities they must allow the
programmer to define and execute any kind of operations
Persistence (9)
Persistence is the ability of the programmer to have her/his data survive the execution
of a process, in order to eventually reuse it in another process
Persistence should be orthogonal, i.e., each object, independent of its type, is allowed
to become persistent as such
Persistence should also be implicit: the user should not have to explicitly move or copy
data to make it persistent
Persistence can be achieved in several ways:
– a first solution is to define the persistence property in the methods, by providing operations which make an object persistent;
– another straightforward solution is to make everything persistent;
– a different approach is to let the system designer specify which classes are to be persistent, or to store only data which is explicitly identified by a unique name.
Secondary Storage Management (10)
Secondary storage management is a classical feature of database management systems.
It is usually supported through a set of mechanisms. These include index management,
data clustering, data buffering, access path selection and query optimization.
None of these is visible to the user: they are simply performance features
The application programmer should not have to write code to maintain indices, to
allocate disk storage, or to move data between disk and main memory
Since objects in object-oriented database systems may be rather complex as well as
extremely large it seems to be necessary to support access strategies within objects (e.g.
– high-level (descriptive)
– optimizable
– complete
– generic
– powerful
– adequate
– extensible
– efficient
SQL3
Most relevant to the extended relational approach.
ANSI and ISO draft standard.
“Work on SQL3 is well underway, but the final standard is several years away.”
Most important new features:
– Abstract data types with attributes and routines
– Inheritance of ADTs and tables
– Collections, parameterized types
– Procedures, functions: computational completeness
Review Questions
Week Ten
Internet
Worldwide collection of interconnected networks.
§ Began in the late ’60s as ARPANET, a US DoD project investigating how to build networks that could withstand partial outages.
§ Starting with a few nodes, Internet estimated to have over 945 million users by end
of 2004.
§ 2 billion users projected by 2010.
§ About 3.5 billion documents on Internet (550 billion if intranets/extranets included).
Intranet and Extranet
The Web
Hypermedia-based system that provides a simple ‘point and click’ means of browsing
information on the Internet using hyperlinks.
§ Information presented on Web pages, which can contain text, graphics, pictures,
sound, and video.
§ Can also contain hyperlinks to other Web pages, which allow users to navigate in a
non-sequential way through information.
§ Web documents written using HTML.
§ Web consists of network of computers that can act in two roles:
– as servers, providing information;
– as clients (browsers), requesting information.
§ Protocol that governs exchange of information between Web server and browser is HTTP, and document locations are identified by URLs.
§ Much of Web’s success is due to its simplicity and platform-independence.
Web Services
Collection of functions packaged as single entity and published to network for use by other
programs.
– Web services are important paradigm in building applications and business processes
for the integration of heterogeneous applications.
– Based on open standards and focus on communication and collaboration among people
and applications.
– Unlike other Web-based applications, Web services have no user interface and are not targeted at browsers. Instead, they consist of reusable software components designed to be consumed by other applications.
Web Services – Technologies & Standards
– eXtensible Markup Language (XML).
– SOAP (Simple Object Access Protocol) protocol, based on XML, used for communication
over Internet.
– WSDL (Web Services Description Language) protocol, again based on XML, used to
describe the Web service.
– UDDI (Universal Description, Discovery and Integration) protocol used to register the Web service for prospective users.
– Common example is stock quote facility, which receives a request for current price of a
specified stock and responds with requested price.
– Second example is Microsoft MapPoint Web service that allows high quality maps,
driving directions, and other location information to be integrated into a user
application, business process, or Web site.
CGI – Environment
– A CGI script is any script designed to accept and return data that conforms to the CGI
specification.
– Before the server launches the script, it prepares a number of environment variables representing the current state of the server, who is requesting the information, and so on.
– Script picks this up and reads STDIN.
– Then performs necessary processing and writes its output to STDOUT.
– Script responsible for sending MIME header, which allows browser to differentiate
between components.
– CGI scripts can be written in almost any language, provided it supports reading and
writing of an operating system’s environment variables.
Four primary methods for passing information from browser to a CGI script:
o Passing parameters on the command line.
o Passing environment variables to CGI programs.
o Passing data to CGI programs via standard input.
o Using extra path information.
CGI – Advantages
CGI is the de facto standard for interfacing Web servers with external applications.
Possibly most commonly used method for interfacing Web applications to data sources.
Advantages:
– simplicity,
– language independence,
– Web server independence,
– wide acceptance.
CGI – Disadvantages
– Communication between client and database server must always go through
Web server.
– Lack of efficiency and transaction support, and difficulty validating user input, all inherited from the statelessness of the HTTP protocol.
– HTTP never intended for long exchanges or interactivity.
– Server has to generate a new process or thread for each CGI script.
– Security.
HTTP Cookies
Cookies can make CGI scripts more interactive.
Cookies are small text files stored on Web client.
CGI script creates cookie and has Web server send it to client’s browser to store
on hard disk.
Later, when client revisits Web site and uses a CGI script that requests this
cookie, client’s browser sends information stored in the cookie.
Since almost every Web browser vendor has already licensed Java and implemented
an embedded JVM, Java applications can currently be deployed on most end-user
platforms.
Before a Java application can be executed, it must first be loaded into memory.
This is done by the Class Loader, which takes the '.class' file(s) containing the bytecodes and transfers them into memory.
A class file can be loaded from the local hard drive or downloaded from the network.
Finally, bytecodes must be verified to ensure that they are valid and do not violate
Java’s security restrictions.
Loosely speaking, Java is a ‘safe’ C++.
Safety features include strong static type checking, automatic garbage collection,
and absence of machine pointers at language level.
Safety is central design goal: ability to safely transmit Java code across Internet.
Security is also integral part of Java’s design - sandbox ensures untrusted application
cannot gain access to system resources.
Java 2 Platform
In mid-1999, Sun announced it would pursue a distinct and integrated Java enterprise
platform:
– J2ME: aimed at embedded and consumer-electronics platforms.
– J2SE: aimed at typical desktop and workstation environments. Serves as
foundation for J2EE and Web services.
– J2EE: aimed at robust, scalable, multiuser, and secure enterprise applications.
– J2EE was designed to simplify complex problems with development,
deployment, and management of multi-tier enterprise applications.
We discuss five ways to access a database from Java: JDBC, SQLJ, CMP (container-managed persistence), JDO, and JSP.
JDBC
Modeled after ODBC, JDBC API supports basic SQL functionality.
With JDBC, Java can be used as host language for writing database applications.
On top of JDBC, higher-level APIs can be built.
Currently, two types of higher-level APIs:
– An embedded SQL for Java (e.g. SQLJ).
– A direct mapping of relational database tables to Java classes (e.g. TopLink from
Oracle).
JDBC API consists of two main interfaces: an API for application writers, and a lower-
level driver API for driver writers.
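A minimal JDBC sketch of the basic API (the connection URL, credentials, table, and column names are illustrative only; the exact URL depends on the DBMS driver):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcDemo {
    public static void main(String[] args) throws SQLException {
        // Illustrative connection URL; a real one depends on the DBMS and driver.
        String url = "jdbc:postgresql://localhost:5432/dreamhome";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT fName, lName FROM Staff WHERE branchNo = ?")) {
            ps.setString(1, "B003");                 // bind the parameter at run time
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("fName") + " " + rs.getString("lName"));
                }
            }
        }
    }
}

Because the SQL text is just a string handed to the driver at run time, JDBC is dynamic SQL - the contrast drawn with SQLJ below.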
JDBC - Advantages/Disadvantages
Advantage of using ODBC drivers is that they are a de facto standard for PC database
access, and are available for many DBMSs, for very low price.
Disadvantages with this approach:
– a non-pure JDBC driver will not necessarily work with a Web browser;
– a downloaded applet can connect only to a database located on the host machine it was downloaded from;
– deployment costs increase.
SQLJ
Another JDBC-based approach uses Java with static embedded SQL.
SQLJ comprises a set of clauses that extend Java to include SQL constructs as statements
and expressions.
SQLJ translator transforms SQLJ clauses into standard Java code that accesses database
through a CLI.
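For comparison with the JDBC sketch above, a small illustrative Java fragment with an embedded SQLJ clause (table and variable names are assumptions; host variables are prefixed with a colon, and the fragment presumes a default SQLJ connection context has already been established):

// Fragment from inside a method that declares 'throws SQLException'.
String staffNo = "SG37";   // illustrative host variables
String name;
#sql { SELECT fName INTO :name FROM Staff WHERE staffNo = :staffNo };
System.out.println(name);

Because the SQL text is fixed when the SQLJ translator runs, syntax, types, and schema references can be checked before execution - the static-analysis advantage discussed next.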
Comparison of JDBC and SQLJ
SQLJ is based on static embedded SQL while JDBC is based on dynamic SQL.
Thus, SQLJ facilitates static analysis for syntax checking, type checking, and schema
checking, which may help produce more reliable programs at loss of some functionality.
It also potentially allows DBMS to generate an execution strategy for the query, thereby
improving performance of the query.
JDBC is low-level middleware tool with features to interface Java application with
RDBMS.
Developers need to design relational schema to which they will map Java objects, and
write code to map Java objects to rows of relations.
Problems:
– need to be aware of two different paradigms (object and relational);
– need to design relational schema to map onto object design;
– need to write mapping code
EJBs
EJBs have 3 elements in common:
– an indirection mechanism;
– a bean implementation;
– a deployment descriptor.
With indirection mechanism clients do not invoke EJB methods directly.
Session and entity beans provide access to their operations via interfaces.
The home interface defines methods that manage the lifecycle of a bean. The corresponding server-side implementation classes are generated at deployment time.
To provide access to other operations, bean can expose a local interface (if client and
bean are colocated), a remote interface, or both.
Local interfaces expose methods to clients running in same container or JVM.
Remote interfaces make methods available to clients no matter where deployed.
When a client invokes create() method (which returns an interface) on home interface,
EJB container calls ejbCreate() to instantiate bean, at which point client can access bean
through remote or local interface returned by create().
Bean implementation is a Java class that implements business logic defined in remote
interface.
Transactional semantics are described declaratively and captured in the deployment
descriptor.
Deployment descriptor, written in XML, lists a bean’s properties and elements, which
may include:
– home interface, remote interface, local interface;
– Web service endpoint interface,
– bean implementation class,
– JNDI name for bean, transaction attributes, security attributes, and per-method
descriptors.
<method-params></method-params>
</query-method>
<result-type-mapping>Local</result-type-mapping>
<ejb-ql><![CDATA[SELECT OBJECT(s) FROM Staff s]]></ejb-ql>
</query>
<query>
<query-method>
<method-name>findByStaffName</method-name>
<method-params>java.lang.String</method-params>
</query-method>
<result-type-mapping>Local</result-type-mapping>
<ejb-ql><![CDATA[SELECT OBJECT(s)
FROM Staff s WHERE s.name = ?1]]>
</ejb-ql>
</query>
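A hedged Java-side sketch of the local interfaces such a descriptor might correspond to (EJB 2.x style; the LocalStaff/LocalStaffHome names and the findAll method name are assumptions, since the first query's method name is not shown in the excerpt):

import java.util.Collection;
import javax.ejb.CreateException;
import javax.ejb.EJBLocalHome;
import javax.ejb.EJBLocalObject;
import javax.ejb.FinderException;

// Local component interface: exposes business methods to co-located clients.
interface LocalStaff extends EJBLocalObject {
    String getName();
}

// Local home interface: lifecycle and finder methods; the container generates
// the implementation and maps the finders to the EJB QL in the descriptor.
interface LocalStaffHome extends EJBLocalHome {
    LocalStaff create(String staffNo, String name) throws CreateException;
    Collection findAll() throws FinderException;                 // assumed name
    Collection findByStaffName(String name) throws FinderException;
}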
Allows developers to construct complex object graphs in memory and make them persistent simply by creating a reference to the graph from an already-persistent instance. Instances have to be deleted explicitly.
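This passage describes persistence by reachability as offered by JDO (one of the five access routes listed earlier). A minimal hedged sketch using the javax.jdo API - the Branch/Staff classes and the factory properties are illustrative assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;

// Illustrative persistence-capable classes (a real JDO setup would list them
// in the metadata and byte-code enhance them).
class Staff {
    String staffNo, name;
    Staff(String staffNo, String name) { this.staffNo = staffNo; this.name = name; }
}

class Branch {
    String branchNo;
    List<Staff> staff = new ArrayList<>();
    Branch(String branchNo) { this.branchNo = branchNo; }
    void addStaff(Staff s) { staff.add(s); }
}

public class JdoDemo {
    public static void main(String[] args) {
        Properties props = new Properties();     // illustrative configuration
        props.setProperty("javax.jdo.PersistenceManagerFactoryClass",
                          "com.example.SomeJdoImplementation");   // hypothetical vendor class
        PersistenceManagerFactory pmf = JDOHelper.getPersistenceManagerFactory(props);
        PersistenceManager pm = pmf.getPersistenceManager();
        pm.currentTransaction().begin();
        Branch b003 = new Branch("B003");
        b003.addStaff(new Staff("SG37", "Ann Beech"));
        // Making the root object persistent makes the whole reachable graph persistent:
        pm.makePersistent(b003);
        pm.currentTransaction().commit();
        // Deletion is not automatic; obsolete instances must be removed explicitly,
        // e.g. pm.deletePersistent(someInstance);
        pm.close();
    }
}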
Java Servlets
Servlets are programs that run on Java-enabled Web server and build Web pages,
analogous to CGI.
Have a number of advantages over CGI:
– improved performance;
– portability;
– extensibility;
– simpler session management;
– improved security and reliability.
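A minimal servlet sketch (illustrative class name; assumes the standard javax.servlet API provided by the Web container):

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Runs inside the Web container's JVM, so requests reuse the same process.
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/html");          // analogous to the CGI MIME header
        PrintWriter out = resp.getWriter();
        out.println("<html><body><h1>Hello from a servlet</h1></body></html>");
    }
}

Unlike a CGI script, the servlet instance stays loaded and handles each request on a lightweight thread rather than a fresh process, which is where much of the performance advantage comes from.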
§ Java API for XML-based RPC (JAX-RPC), sends SOAP method calls to remote clients over
Internet and receives results.
§ Client written in language other than Java can access a Web service developed and
deployed on Java platform.
§ Also, client written in Java can communicate with service developed and deployed using
some other platform.
§ Java API for XML Registries (JAXR), provides standard way to access business registries
and share information.
§ JAXR gives Java developers a uniform way to use business registries based on open
standards (such as ebXML) or industry consortium-led specifications (such as UDDI).
OLE DB Architecture
ASP Architecture
Microsoft .NET
§ A number of limitations with Microsoft’s earlier Web platform motivated the move to .NET:
– a number of languages supported with different programming models (J2EE
composed solely of Java);
– no automatic state management;
– relatively simple user interfaces for Web compared to traditional Windows user
interfaces;
– need to abstract operating system (Windows API difficult to program).
§ Next, and current, evolution in Microsoft’s Web solution strategy was development of
.NET.
§ Various tools, services, technologies in .NET:
– Windows Server,
– BizTalk Server (to build XML-based business processes across applications and
organizations),
– Commerce Server (to build scalable e-Commerce solutions),
– Application Center (to deploy and manage scalable Web applications),
– Mobile Information Server (to support handheld devices),
– SQL Server,
– Microsoft Visual Studio .NET
– Microsoft .NET Framework (CLR + Class Library).
.NET Framework
ADO.NET Architecture
§ In addition, .NET Framework provides set of classes that conform to all the underlying
communication standards, such as SOAP, WSDL, and XML.
§ Microsoft UDDI SDK enables developers to add UDDI functionality to development tools,
installation programs, and any other software that needs to register or locate and bind
remote Web services.
Business Intelligence
Functions to track, extract, and analyze business intelligence to support strategic
decision-making:
– Oracle Reports Services enable users to run Oracle Reports over Internet.
– Oracle Discoverer allows users to produce queries, reports, and analysis of
information from databases, OLTP systems, and data warehouses using a Web
browser.
– Oracle Clickstream provides services to capture and analyze aggregate
information about Web site usage.
– Oracle Personalization enables users to track activity of specific user and
personalize information for that user.
Review Questions
1. Discuss and critically review the technologies for enabling Web database integration.
2. How are XHTML, HTML, and SGML related?
3. Explain the following terms in your own words: HTML, HTTP, URL, search engine, XML, XML Schema, XQuery, SGML, web crawler.
4. How might we consider storing XML data in a database? What are the advantages and disadvantages of this approach?
5. It is sometimes said that XML data is “schemaless”. How do you go about querying data that has no schema? How would you design a query language for such data? Discuss.
Week Eleven
Three-Tier Architecture
1) Warehouse database server
– Almost always a relational DBMS; rarely flat files
2) OLAP servers
– Relational OLAP (ROLAP): extended relational DBMS that maps operations
on multidimensional data to standard relational operations.
– Multidimensional OLAP (MOLAP): special purpose server that directly
implements multidimensional data and operations.
3) Clients
– Query and reporting tools.
– Analysis tools
– Data mining tools (e.g., trend analysis, prediction)
Data Mart
Data Mart is a subset of data warehouse that supports the requirements of a particular
department or business function.
Objectives
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining
Solution
Data warehousing and data mining
Data warehousing and on-line analytical processing
Mining interesting knowledge (rules, regularities, patterns, constraints) from
data in large databases
Review Questions
1. Define a data warehouse. What are the major components of a data warehouse?
2. Discuss the importance of data mining in databases
3. Discuss the new trends and challenges of data warehousing technology
Week Twelve
Multi-Dimensional Databases
Learning Objectives
1. Multidimensional Databases
– What are they? Why are they needed?
2. Contrast MDD and Relational Databases
3. When is MDD (in)appropriate?
4. MDD Features
– Rotation, Ranging, Roll-Up and Drill-Down, Computations
5. Benefits of MDD
Organization of Concepts
[Concept map, not reproduced: MDDB definition and functions; MDDB vs. relational databases; data requirements; transaction processing vs. data warehousing; data structures; rotation, ranging, association, roll-up, drill-down, and computations; benefits; availability bias, sequential patterns, classifying, and clustering; hybrid architectures; tools; emergent applications; data mapping; example architecture of an IBM PowerPlay data warehouse.]
Relational Structure
2. Perspectives are embedded directly into the structure in the multidimensional model
All possible combinations of perspectives containing a specific attribute
(the color BLUE, for example) line up along the dimension position for
that attribute.
3. In the relational model, perspectives are placed in fields - the structure itself tells us nothing about the field contents.
4. MDD makes data browsing and manipulation intuitive to the end-user
5. Any data manipulation action possible with a MDD is also possible using relational
technology
6. Substantial cognitive advantages in query formulation
7. Substantial computational performance advantages in query processing when using
MDD
[Figure, not reproduced: a data cube with dimensions MODEL, DEALERSHIP, and COLOR.]
• Assume that each dimension has 10 positions, as shown in the cube above
• How many records would there be in the corresponding relational table?
• What are the implications for viewing the data from an end-user standpoint?
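A quick worked answer (assuming one measure value per cell and a fully populated cube): with three dimensions of 10 positions each there are 10 × 10 × 10 = 1,000 cells, so the equivalent relational table would need up to 1,000 rows of the form (model, dealership, color, value). The end user would then have to query or scan those rows to recover a view that the cube presents directly as a 10 × 10 slice.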
Benefits of MDD
• Cognitive Advantages for the User
• Ease of Data Presentation and Navigation
– Obtaining the same views in a relational world requires the end user to either
write complex SQL queries or use an SQL generator against the relational
database to convert the table outputs into a more intuitive format.
• Ease of Maintenance
– Because data is stored in the same way as it is viewed (i.e. according to its
fundamental attributes), no additional overhead is required to translate user
queries into requests for data
• Performance
– Multidimensional databases achieve performance levels that are difficult to
match in a relational environment.
Review Questions
1. Explain what multi-dimensional databases are and their usefulness.
2. Differentiate between MDD and relational databases.
3. Discuss the features of MDD.
4. What are the benefits of MDD?
References
1. Connolly, T. & Begg, C. Database Systems, 5th Edition.
2. Elmasri, R. & Navathe, S. Fundamentals of Database Systems, 5th Edition.
3. Date, C. J. (1990) An Introduction to Database Systems, Volume 1, 5th Edition. Addison-Wesley.
4. Korth, H. F. & Silberschatz, A. (1991) Database System Concepts, 2nd Edition. McGraw-Hill.
5. Atkinson, M., DeWitt, D., Maier, D., Bancilhon, F., Dittrich, K. & Zdonik, S. (1990) Object Data Management: OO & Extended Relational Database Systems. Addison-Wesley.
6. Won Kim (1991) Object-Oriented Database Systems. Morgan Kaufmann.
7. Unland, R. & Schlagetter, G. Introduction to Object-Oriented Databases. The MIT Press, Cambridge (Massachusetts).
8. Object-Oriented Database Systems: State of the Art and Research Problems. In: Expert Database Systems (1992). Academic Press.