Advanced Database
Course Objectives
At the end of this course you will be able to:
Understand database query processing and optimization
Know the basics of transaction management
Understand database security
Use different recovery methods when there is a database failure
Design a distributed database system in homogeneous and heterogeneous environments
Understand how to grant privileges to users of the database
Advanced Database Management
Table of Contents
CHAPTER ONE
QUERY PROCESSING AND OPTIMIZATION
Relational Algebra
TRANSLATING SQL QUERIES INTO RELATIONAL ALGEBRA
QUERY PROCESSING
Heuristic Query Tree Optimization
Summary of Heuristics for Algebraic Optimization
Cost-based Query Optimization
Semantic Query Optimization
CHAPTER TWO
DATABASE SECURITY AND AUTHORIZATION
Introduction to Database Security Issues
Authentication
Authorization/Privilege
Database Security and the DBA
Comparing DAC and MAC
Introduction to Statistical Database Security
Types of Cryptosystems
CHAPTER THREE
TRANSACTION PROCESSING CONCEPTS
Introduction to Transaction Processing
ACID Properties of Transactions
Simple Model of a Database
Transaction Atomicity and Durability
Serializability
Transactions as SQL Statements
Summary
CHAPTER FOUR
CONCURRENCY CONTROL TECHNIQUES
Introduction to Concurrency Control Techniques
The Two-Phase Locking Protocol
Deadlock Detection and Recovery
Recovery from Deadlock
Timestamp-Based Protocols
The Timestamp-Ordering Protocol
Purpose of Concurrency Control
CHAPTER FIVE
DATABASE RECOVERY TECHNIQUES
Recovery Outline and Categorization of Recovery Algorithms
Write-Ahead Logging, Steal/No-Steal, and Force/No-Force
CHAPTER ONE
QUERY PROCESSING AND OPTIMIZATION
Relational Algebra
The basic set of operations for the relational model is known as the relational algebra. These
operations enable a user to specify basic retrieval requests.
The result of retrieval is a new relation, which may have been formed from one or more
relations.
The algebra operations thus produce new relations, which can be further manipulated using
operations of the same algebra.
A sequence of relational algebra operations forms a relational algebra expression, whose result
will also be a relation that represents the result of a database query (or retrieval request).
B) PROJECT Operation
This operation selects certain columns from the table and discards the other columns.
The PROJECT operation creates a vertical partitioning of the relation: one part contains
the needed columns (attributes) and holds the result of the operation, while the other
contains the discarded columns.
Example: To list each employee’s first and last name and salary, the following is used:
π LNAME, FNAME, SALARY (EMPLOYEE)
The general form of the project operation is π<attribute list>(R), where π (pi) is the
symbol used to represent the project operation and <attribute list> is the desired list of
attributes from relation R.
The project operation removes any duplicate tuples. This is because the result of the project
operation must be a set of tuples. Mathematical sets do not allow duplicate elements.
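The duplicate-elimination behavior described above can be sketched with Python sets; the EMPLOYEE sample tuples below are invented for illustration.

```python
# A minimal sketch of the PROJECT operation: representing the result as a
# set of tuples makes duplicate elimination automatic, mirroring the
# mathematical definition.

def project(relation, attributes):
    """Keep only the named attributes; the set removes duplicate tuples."""
    return {tuple(row[a] for a in attributes) for row in relation}

# Illustrative sample data (not from a specific schema).
EMPLOYEE = [
    {"FNAME": "John", "LNAME": "Smith", "SALARY": 30000, "DNO": 5},
    {"FNAME": "Anna", "LNAME": "Jones", "SALARY": 30000, "DNO": 4},
]

# Projecting on SALARY alone collapses the two equal-salary tuples into one.
result = project(EMPLOYEE, ["SALARY"])
```

Projecting on a key (or a list containing a key) would instead preserve all tuples, as the properties below note.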
PROJECT Operation Properties
The number of tuples in the result of the projection π<attribute list>(R) is always less than
or equal to the number of tuples in R. If the list of attributes includes a key of R, then the
number of tuples is equal to the number of tuples in R.
π<list1>(π<list2>(R)) = π<list1>(R) as long as <list2> contains the attributes in <list1>.
2. Relational Algebra Operations from Set Theory
A. UNION OPERATION
The result of this operation, denoted by R U S, is a relation that includes all tuples that are either in R
or in S or in both R and S. Duplicate tuples are eliminated.
Example: To retrieve the social security numbers of all employees who either work in department 5
or directly supervise an employee who works in department 5, we can use the union operation as
follows:
DEP5_EMPS ← σ DNO=5(EMPLOYEE)
RESULT1 ← π SSN(DEP5_EMPS)
RESULT2(SSN) ← π SUPERSSN(DEP5_EMPS)
RESULT ← RESULT1 U RESULT2
The union operation produces the tuples that are in either RESULT1 or RESULT2 or both.
The two operands must be “type compatible”.
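The union example above can be sketched with Python sets; the EMPLOYEE sample data is invented for illustration.

```python
# A sketch of the union query above: select department 5 employees, project
# out their SSNs and their supervisors' SSNs, then union the two results.

# Illustrative sample tuples (not from a specific schema).
EMPLOYEE = [
    {"SSN": "111", "SUPERSSN": "333", "DNO": 5},
    {"SSN": "222", "SUPERSSN": "111", "DNO": 5},
    {"SSN": "333", "SUPERSSN": None, "DNO": 1},
]

dep5_emps = [t for t in EMPLOYEE if t["DNO"] == 5]   # sigma DNO=5(EMPLOYEE)
result1 = {t["SSN"] for t in dep5_emps}              # pi SSN(DEP5_EMPS)
result2 = {t["SUPERSSN"] for t in dep5_emps}         # pi SUPERSSN(DEP5_EMPS)
result = result1 | result2                           # union; duplicates removed
```

Note that "111" appears in both operands but only once in the result, since the union eliminates duplicates.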
TYPE COMPATIBILITY
Two relations are type compatible if they have the same number of attributes and each pair of
corresponding attributes has compatible domains. Examples: Instructor U Student,
Customers U Customer.
Requires (1000+50) disk accesses to read from Staff and Branch relations
Creates temporary relation of Cartesian Product (1000*50) tuples
Requires (1000*50) disk access to read in temporary relation and test predicate
Total Work = (1000+50) + 2*(1000*50) = 101,050 I/O operations
Query 2 (Better)
Again requires (1000+50) disk accesses to read from Staff and Branch
Joins Staff and Branch on branchNo with 1000 tuples (1 employee : 1 branch )
Requires (1000) disk access to read in joined relation and check predicate
Total Work = (1000+50) + 2*(1000) = 3050 I/O operations
3300% Improvement over Query 1
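The I/O arithmetic above can be reproduced directly, assuming 1000 Staff tuples and 50 Branch tuples as stated.

```python
# Reproducing the disk-access cost arithmetic for the two query strategies.

staff, branch = 1000, 50

read_inputs = staff + branch          # scan both relations once: 1050 reads

# Query 1: materialize the full Cartesian product, then scan it to test
# the predicate (the temporary relation is written once and read once).
cartesian = staff * branch
query1 = read_inputs + 2 * cartesian

# Query 2: join on branchNo first; with one branch per staff tuple the
# joined relation has only 1000 tuples.
joined = staff
query2 = read_inputs + 2 * joined

improvement = query1 / query2         # roughly a 33x (3300%) improvement
```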
Query 3 (Best)
Query processing can be divided into four main categories: decomposition, optimization,
execution, and code generation.
1. Query Decomposition
It is the process of transforming a high-level query into a relational algebra query and
checking that the query is syntactically and semantically correct.
It consists of parsing and validation.
Typical stages in query decomposition are:
5. Commutativity of THETA JOIN/Cartesian product: R X S is equivalent to S X R. This also
holds for equi-join and natural join.
b. If the predicate is of the form c1 ^ c2, where c1 involves only attributes of R and c2
involves only attributes of S, then the selection and theta join operations commute.
8. Commutativity of the set operations: UNION and INTERSECTION are commutative, but SET
DIFFERENCE is not.
The heuristic approach applies the above transformation rules in the following sequence of
steps.
Sequence for applying the transformation rules:
1. Use Rule 1, the cascade of SELECTION.
2. Use:
Rule 2, commutativity of SELECTION
Rule 4, commuting SELECTION with PROJECTION
Rule 6, commuting SELECTION with JOIN and CARTESIAN PRODUCT
Rule 10, commuting SELECTION with SET OPERATIONS
Query graph:
A graph data structure that corresponds to a relational calculus expression. Nodes represent
relations; ovals represent constant nodes; edges represent join and selection conditions; and
attributes to be retrieved from relations are shown in square brackets.
Drawback: a query graph does not indicate an order in which operations are to be performed;
there is only a single graph corresponding to each query.
Fig 2: move SELECT down the tree using the cascade and commutativity rules of the select operation.
Fig 3: rearrange leaf nodes using the commutativity and associativity of binary operations.
Fig 5: break up and move PROJECT using the cascade and commuting rules of project operations.
Query Execution Plans
An execution plan for a relational algebra query consists of a combination of the relational
algebra query tree and information about the access methods to be used for each relation as well
as the methods to be used in computing the relational operators stored in the tree.
Materialized evaluation: the result of an operation is stored as a temporary relation.
Pipelined evaluation: as the result of an operator is produced, it is forwarded to the next operator
in sequence.
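The contrast between materialized and pipelined evaluation can be sketched with Python iterators; the selection-then-projection pipeline and sample rows below are illustrative.

```python
# Materialized vs. pipelined evaluation of sigma(dno >= 3) then pi(salary).

rows = [{"dno": d, "salary": 1000 * d} for d in range(1, 6)]

# Materialized: each operator fully builds a temporary result before the
# next operator starts.
selected = [r for r in rows if r["dno"] >= 3]     # temporary relation 1
projected = [r["salary"] for r in selected]        # temporary relation 2

# Pipelined: each tuple flows through both operators as it is produced,
# with no temporary relations.
def select(source, predicate):
    for r in source:
        if predicate(r):
            yield r

def project(source, attr):
    for r in source:
        yield r[attr]

pipelined = list(project(select(rows, lambda r: r["dno"] >= 3), "salary"))
```

Both strategies produce the same answer; pipelining simply avoids storing the intermediate relations.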
Summary of Heuristics for Algebraic Optimization:
1) The main heuristic is to apply first the operations that reduce the size of intermediate results.
2) Perform select operations as early as possible to reduce the number of tuples and perform project
operations as early as possible to reduce the number of attributes. (This is done by moving select
and project operations as far down the tree as possible.)
3) The select and join operations that are most restrictive should be executed before other similar
operations. (This is done by reordering the leaf nodes of the tree among themselves and adjusting
the rest of the tree appropriately.)
2. The basic set of operations for the relational model is known as the _________________
4. _____________ operation is used to select a subset of the tuples from a relation that satisfy a
selection condition.
A) Delete C) Update
B) Select D) Project
5. Which one is Binary Relational Operations?
A) Selection C) Join
B) Deletion D) Projection
8. Which one is not catalog information used in cost function used to exhibit relational algebra?
9. Which one is not part of information about indexes and indexing attributes of a file?
10. The disk access cost can again be analyzed in terms of:
A) Searching C) Writing
B) Reading D) All of the above
CHAPTER TWO
DATABASE SECURITY AND AUTHORIZATION
Introduction to Database Security Issues
In today's society, some information is so important that it needs to be protected.
For example, disclosure or modification of military information could endanger
national security.
A good database security management system has to handle the possible database threats.
A threat may be any situation or event, whether intentional (planned) or accidental, that may
adversely affect a system and consequently the organization.
Threats to databases may result in the degradation of some or all security goals, such as:
Loss of Integrity
Only authorized users should be allowed to modify data.
For example, students may be allowed to see their grades, but not allowed to modify
them.
Loss of Availability: the database is not available to users who have a legal right to
use the data.
Authorized users should not be denied access.
For example, an instructor who wishes to change a grade should be allowed to do so.
Loss of Confidentiality
Information should not be disclosed to unauthorized users.
For example, a student should not be allowed to examine other students' grades.
Authentication
All users of the database will have different access levels and permission for different data
objects.
Authentication is the process of checking whether the user is the one with the privilege for the
access level. Thus, the system will check whether the user with a specific username and
password is trying to use the resource.
Authorization/Privilege
Authorization refers to the process that determines the mode in which a particular (previously
authenticated) client is allowed to access a specific resource controlled by a server.
Any database access request will have the following three major components.
1. Requested Operation: what kind of operation is requested by a specific query?
2. Requested Object: on which resource or data of the database is the operation sought to
be applied?
3. Requesting User: who is the user requesting the operation on the specified object?
Forms of user authorization
There are different forms of user authorization on the resource of the database. These include:
1. Read Authorization: the user with this privilege is allowed only to read the content of the
data object.
2. Insert Authorization: the user with this privilege is allowed only to insert new records or
items to the data object.
3. Update Authorization: users with this privilege are allowed to modify content of attributes
but are not authorized to delete the records.
4. Delete Authorization: users with this privilege are only allowed to delete a record and not
anything else.
Two restrictions are enforced on data access based on the subject/object classifications:
A subject S is not allowed read access to an object O unless class(S) ≥ class (O).
A subject S is not allowed to write an object O unless class(S) ≤ class (O).
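The two restrictions above (no read up, no write down) can be sketched as a small Python check; the level names TS > S > C > U follow the usual mandatory-access-control ordering and are assumed here.

```python
# A sketch of the two multilevel-security access rules: a subject may
# read only at or below its classification, and write only at or above it.

LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}   # Unclassified < Confidential < Secret < Top Secret

def can_read(subject_class, object_class):
    return LEVELS[subject_class] >= LEVELS[object_class]   # class(S) >= class(O)

def can_write(subject_class, object_class):
    return LEVELS[subject_class] <= LEVELS[object_class]   # class(S) <= class(O)
```

For example, a Secret subject may read a Confidential object but not write it, which prevents information from flowing down to lower classifications.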
To incorporate multilevel security notions into the relational database model, it is common to
consider attribute values and rows as data objects. Hence, each attribute A is associated with a
classification attribute C in the schema.
In addition, in some models, a tuple classification attribute TC is added to the relation attributes
to provide a classification for each tuple as a whole.
Hence, a multilevel relation schema R with n attributes would be represented as
R(A1,C1,A2,C2, …, An,Cn,TC) where each Ci represents the classification attribute
associated with attribute Ai.
The value of the TC attribute in each tuple t – which is the highest of all attribute classification
values within t – provides a general classification for the tuple itself.
Whereas, each Ci provides a finer security classification for each attribute value within the tuple.
A multilevel relation will appear to contain different data to subjects (users) with different
clearance levels.
A user with a security clearance S would see the same relation shown above (a) since all row
classification are less than or equal to S as shown in (a).
However, a user with security clearance C would not be allowed to see the salary of Brown or
the job performance of Smith, since they have a higher classification, as shown in (b).
For a user with security clearance U, filtering introduces null values for attribute values whose
security classification is higher than the user’s security clearance, as shown in (c).
A user with security clearance C may request an update of the value of Smith’s job performance
to ‘Excellent’, and the view will allow him to do so. However, the user shouldn’t be allowed
to overwrite the existing value at the higher classification level.
Solution: create a polyinstantiation of the Smith row at the lower classification level C, as
shown in (d).
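The clearance-based filtering described above can be sketched in Python; the sample rows and attribute classifications are invented to resemble the Smith/Brown example.

```python
# A sketch of filtering a multilevel relation for a user's clearance:
# attribute values classified above the clearance are replaced by nulls.

LEVELS = {"U": 0, "C": 1, "S": 2}

# Each attribute value is paired with its classification attribute Ci;
# TC is the tuple classification (the highest Ci in the tuple).
rows = [
    {"Name": ("Smith", "U"), "Salary": (40000, "C"), "JobPerf": ("Fair", "S"), "TC": "S"},
    {"Name": ("Brown", "C"), "Salary": (80000, "S"), "JobPerf": ("Good", "C"), "TC": "S"},
]

def filter_for(clearance, rows):
    out = []
    for row in rows:
        filtered = {}
        for attr, value in row.items():
            if attr == "TC":
                continue
            v, c = value
            filtered[attr] = v if LEVELS[c] <= LEVELS[clearance] else None
        out.append(filtered)
    return out
```

A U-cleared user thus sees Smith's name but nulls for his salary and job performance, matching the filtering behavior described for view (c).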
CHAPTER THREE
TRANSACTION PROCESSING CONCEPTS
Introduction to Transaction Processing
Single-User System:
At most one user at a time can use the database management system, e.g. a personal computer
system.
Multi-user System:
Many users can access the DBMS concurrently, e.g. airline reservation and bank systems,
which are operated by many users who submit transactions concurrently to the system.
This is achieved by multiprogramming, which allows the computer to execute multiple
programs/processes at the same time.
Concurrency
Interleaved processing:
Concurrent execution of processes is interleaved in a single CPU using for example, round robin
algorithm
Advantages:
Keeps the CPU busy when a process requires I/O by switching to execute another process rather
than remaining idle during I/O time; this increases system throughput (the average number of
transactions completed within a given time).
Prevents a long process from delaying other processes (minimizes unpredictable delay in the
response time).
Parallel processing:
Processes are executed concurrently on multiple CPUs.
A Transaction
Logical unit of database processing that includes one or more access operations (read -
retrieval, write - insert or update, delete). Examples include ATM transactions, credit
card approvals, flight reservations, hotel check-in, phone calls, supermarket scanning,
academic registration and billing.
Collections of operations that form a single logical unit of work are called transactions.
A transaction is a unit of program execution that accesses and possibly updates various
data items. Usually, a transaction is initiated by a user program written in a high-level
data-manipulation language or programming language.
Let us consider a schedule S in which there are two consecutive instructions, I and J, of
transactions Ti and Tj, respectively (i ≠ j).
If I and J refer to different data items, then we can swap (exchange) I and J without affecting
the results of any instruction in the schedule.
If I and J refer to the same data item Q, then the order of the two steps may matter.
1. I = read (Q), J = read (Q). The order of I and J does not matter.
2. I = read (Q), J = write (Q). If I come before J, then Ti does not read the value of Q
that is written by Tj in instruction J. If J comes before I, then Ti reads the value of
Q that is written by Tj. Thus, the order of I and J matters.
T1              T2
read(A)
write(A)
                read(A)
                write(A)
read(B)
write(B)
                read(B)
                write(B)

T1              T2
read(A)
write(A)
read(B)
write(B)
                read(A)
                write(A)
                read(B)
                write(B)

Figure 9: Schedule 7.
Let I and J be consecutive instructions of a schedule S. If I and J are instructions of different
transactions and I and J do not conflict, then we can swap the order of I and J to produce a new
schedule S′.
S′ is equivalent to S, since all instructions appear in the same order in both schedules except for I
and J, whose order does not matter.
Since the write (A) instruction of T2 in schedule 3 does not conflict with the read (B)
instruction of T1, we can swap these instructions to generate an equivalent schedule, schedule
5, in Figure 7. Regardless of the initial system state, schedules 3 and 5 both produce the same
final system state.
We continue to swap non conflicting instructions:
Swap the read (B) instruction of T1 with the read (A) instruction of T2.
Swap the write (B) instruction of T1 with the write (A) instruction of T2.
Swap the write (B) instruction of T1 with the read (A) instruction of T2.
The final result of these swaps, schedule 6 of Figure 8, is a serial schedule.
Transaction Isolation and Atomicity
Effect of transaction failures during concurrent execution.
T1              T5
read(A)
write(A)
                read(A)
                commit
read(B)
There are several different notions of equivalence leading to the concepts of conflict
serializability and view serializability.
Serializability of schedules generated by concurrently executing transactions can be ensured
through one of a variety of mechanisms called concurrency-control policies.
We can test a given schedule for conflict serializability by constructing a precedence graph for the
schedule, and by searching for absence of cycles in the graph. However, there are more efficient
concurrency-control policies for ensuring serializability.
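The precedence-graph test mentioned above can be sketched in Python; the schedule representation (transaction, action, item) is an assumption for illustration.

```python
# A sketch of the conflict-serializability test: build a precedence graph
# from a schedule and check it for cycles. An edge Ti -> Tj means an
# instruction of Ti conflicts with, and precedes, an instruction of Tj.

def precedence_graph(schedule):
    edges = set()
    for i, (ti, act_i, item_i) in enumerate(schedule):
        for tj, act_j, item_j in schedule[i + 1:]:
            # Two operations conflict if they are from different transactions,
            # access the same item, and at least one is a write.
            if ti != tj and item_i == item_j and "write" in (act_i, act_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(start, goal, seen):
        for nxt in graph.get(start, ()):
            if nxt == goal or (nxt not in seen and reachable(nxt, goal, seen | {nxt})):
                return True
        return False
    return any(reachable(t, t, set()) for t in graph)

# A schedule with edges T1 -> T2 and T2 -> T1: not conflict serializable.
s = [("T1", "read", "A"), ("T2", "write", "A"), ("T1", "write", "A")]
```

A schedule is conflict serializable exactly when its precedence graph is acyclic, which is the absence-of-cycles condition stated above.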
Schedules must be recoverable, to make sure that if transaction a sees the effects of transaction b,
and b then aborts, then a also gets aborted.
Schedules should preferably be cascadeless, so that the abort of one transaction does not result in
cascading aborts of other transactions. Cascadelessness is ensured by allowing transactions to
only read committed data.
The concurrency-control–management component of the database is responsible for handling the
concurrency-control policies.
CHAPTER FOUR
CONCURRENCY CONTROL TECHNIQUES
Introduction to Concurrency control techniques
When several transactions execute concurrently in the database, however, the isolation property
may no longer be preserved. To ensure that it is, the system must control the interaction among
the concurrent transactions; this control is achieved by the mechanisms called concurrency-
control schemes.
There are a variety of concurrency-control schemes. The most frequently used schemes are two-
phase locking and snapshot isolation.
Lock-Based Protocols
One way to ensure isolation is to require that data items be accessed in a mutually exclusive
manner; that is, while one transaction is accessing a data item, no other transaction can
modify that data item.
The most common method used to implement this requirement is to allow a transaction to access
a data item only if it is currently holding a lock on that item.
Locks
There are various modes in which a data item may be locked. Two modes of locks mostly focused
are:
1. Shared. If a transaction Ti has obtained a shared-mode lock (denoted by S) on item Q, then
Ti can read, but cannot write, Q.
2. Exclusive. If a transaction Ti has obtained an exclusive-mode lock (denoted by X) on item
Q, then Ti can both read and write Q.
Every transaction requests a lock in an appropriate mode on data item Q, depending on the types
of operations that it will perform on Q. The transaction makes the request to the concurrency-
control manager.
The transaction can proceed with the operation only after the concurrency-control manager
grants the lock to the transaction. The use of these two lock modes allows multiple transactions
to read a data item but limits write access to just one transaction at a time.
Examples: Let A and B represent arbitrary lock modes. Suppose that a transaction Ti requests a
lock of mode A on item Q on which transaction Tj (Ti ≠ Tj) currently holds a lock of mode B. If
transaction Ti can be granted a lock on Q immediately, in spite of the presence of the mode B
lock, then we say mode A is compatible with mode B. Such a function can be represented
conveniently by a matrix.
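The compatibility function described above can be sketched as a small matrix for the two modes introduced earlier: shared is compatible only with shared, and exclusive is compatible with nothing.

```python
# A sketch of the lock-compatibility matrix for shared (S) and
# exclusive (X) lock modes.

COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

def can_grant(requested_mode, held_modes):
    """Grant the lock only if the requested mode is compatible with every
    mode currently held by other transactions on the item."""
    return all(COMPATIBLE[(requested_mode, held)] for held in held_modes)
```

This mirrors the behavior described above: many transactions may hold shared locks on an item at once, but an exclusive lock excludes all others.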
Granting of Locks
When a transaction requests a lock on a data item in a particular mode, and no other transaction
has a lock on the same data item in a conflicting mode, the lock can be granted. However, care
must be taken to avoid the following scenario.
Suppose a transaction T2 has a shared-mode lock on a data item, and transaction T1
requests an exclusive-mode lock on the data item. Clearly, T1 has to wait for T2 to release the
shared-mode lock.
Timestamp-Based Protocols
The locking protocols that we have described thus far determine the order between every pair of
conflicting transactions at execution time by the first lock that both members of the pair request
that involves incompatible modes.
Another method for determining the serializability order is to select an ordering among
transactions in advance. The most common method for doing so is to use a timestamp-ordering
scheme.
Timestamps
With each transaction Ti in the system, we associate a unique fixed timestamp, denoted by
TS(Ti). This timestamp is assigned by the database system before the transaction Ti starts
execution.
If a transaction Ti has been assigned timestamp TS(Ti), and a new transaction Tj enters the
system, then TS(Ti) < TS(Tj).
There are two simple methods for implementing this scheme:
1. Use the value of the system clock as the timestamp; that is, a transaction’s timestamp is
equal to the value of the clock when the transaction enters the system.
2. Use a logical counter that is incremented after a new timestamp has been assigned; that is,
a transaction’s timestamp is equal to the value of the counter when the transaction enters
the system.
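The logical-counter method (option 2 above) can be sketched as a small timestamp manager; the class and method names are illustrative.

```python
# A sketch of timestamp assignment with a logical counter: each new
# transaction receives a unique, monotonically increasing timestamp,
# fixed before it starts execution.

import itertools

class TimestampManager:
    def __init__(self):
        self._counter = itertools.count(1)   # incremented per assignment
        self.ts = {}

    def begin(self, txn):
        self.ts[txn] = next(self._counter)   # TS(Ti) is fixed at start
        return self.ts[txn]
```

Because the counter only increases, any transaction Tj entering after Ti is guaranteed TS(Ti) < TS(Tj), which is the ordering property the protocol relies on.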
Categories of concurrency control techniques include: locking, timestamp ordering,
multi-version, optimistic, and lock granularity.
Locking and timestamp ordering are conservative/traditional approaches.
Starvation
Starvation occurs when a particular transaction consistently waits or is restarted and
never gets a chance to proceed, while other transactions continue normally.
This may occur if the waiting scheme for locked items:
Gives priority to some transactions over others
Has a problem in the victim selection algorithm: the same transaction may
consistently be selected as the victim and rolled back, for example in the
wound-wait scheme.
Solutions:
Use FIFO ordering
Give priority to transactions that have waited longer
Give higher priority to transactions that have been aborted many times
Deferred Update
The deferred update techniques do not physically update the database on disk until after a
transaction reaches its commit point; then the updates are recorded in the database.
Before reaching commit, all transaction updates are recorded in the local transaction
workspace or in the main memory buffers that the DBMS maintains.
Before commit, the updates are recorded persistently in the log, and then after commit,
the updates are written to the database on disk.
If a transaction fails before reaching its commit point, it will not have changed the database in
any way, so UNDO is not needed.
It may be necessary to REDO the effect of the operations of a committed transaction from the
log, because their effect may not yet have been recorded in the database on disk. Hence, deferred
update is also known as the NO-UNDO/REDO algorithm.
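The NO-UNDO/REDO behavior described above can be sketched in Python; the in-memory dicts stand in for the disk-resident database, log, and transaction workspace, and the function names are illustrative.

```python
# A sketch of deferred update (NO-UNDO/REDO): writes go to a local
# workspace and the log; the database itself is touched only after commit.

database = {"A": 100}        # stands in for the database on disk
log = []                     # force-written before commit in a real system
workspace = {}               # local transaction workspace / buffers

def write_item(txn, item, value):
    workspace[item] = value              # buffered; the database is untouched
    log.append((txn, item, value))       # recorded for REDO after commit

def commit(txn):
    for t, item, value in log:
        if t == txn:
            database[item] = value       # REDO from the log if needed

def abort(txn):
    workspace.clear()                    # nothing reached the DB, so no UNDO
```

If the transaction fails before commit, the database was never modified, so no UNDO is needed; after commit, the log allows the updates to be redone.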
Immediate Update
In the immediate update techniques, the database may be updated by some operations of a
transaction before the transaction reaches its commit point.
However, these operations must also be recorded in the log on disk by force-writing
before they are applied to the database on disk, making recovery still possible.
Shadow Paging
This recovery scheme does not require the use of a log in a single-user environment.
In a multiuser environment, a log may be needed for the concurrency control method.
Shadow paging considers the database to be made up of a number of fixed size disk
pages (or disk blocks)—say, n—for recovery purposes.
A directory with n entries is constructed, where the ith entry points to the ith database page on
disk.
The directory is kept in main memory if it is not too large, and all references—reads or writes—to
database pages on disk go through it.
When a transaction begins executing, the current directory—whose entries point to the most
recent or current database pages on disk—is copied into a shadow directory.
The shadow directory is then saved on disk while the current directory is used by the transaction.
During transaction execution, the shadow directory is never modified. When a write item
operation is performed, a new copy of the modified database page is created, but the old copy of
that page is not overwritten.
The database thus is returned to its state prior to the transaction that was executing when the crash
occurred, and any modified pages are discarded. Committing a transaction corresponds to
discarding the previous shadow directory. Since recovery involves neither undoing nor redoing
data items, this technique can be categorized as a NO-UNDO/ NO-REDO technique for
recovery.
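The shadow-paging scheme described above can be sketched with in-memory dicts standing in for the page store and directories; the variable and function names are illustrative.

```python
# A sketch of shadow paging: at transaction start the current directory is
# copied to a shadow directory; writes create new page copies and update
# only the current directory. Recovery reinstates the shadow directory.

pages = {0: "p0-v1", 1: "p1-v1"}   # page store: page id -> contents
current = {0: 0, 1: 1}             # current directory: page no -> page id
next_id = 2

shadow = dict(current)             # saved on disk when the transaction begins

def write_page(page_no, contents):
    global next_id
    pages[next_id] = contents      # new copy; the old page is not overwritten
    current[page_no] = next_id
    next_id += 1

def recover():
    """Crash before commit: discard the current directory and reinstate
    the shadow, returning the database to its pre-transaction state."""
    return dict(shadow)
```

Since the old pages and the shadow directory are never modified, recovery needs neither UNDO nor REDO, which is why the technique is classed as NO-UNDO/NO-REDO.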
Generally
Distributed database (DDB) as a collection of multiple logically interrelated databases
distributed over a computer network.
Distributed database management system (DDBMS) as a software system that manages a
distributed database while making the distribution transparent to the user.
A collection of files stored at different nodes of a network, with the interrelationships among
them maintained via hyperlinks, has become a common organization on the Internet, with files
of Web pages.
Examples of DDBMS databases: operational databases, analytical databases, hypermedia databases.
3. Improved performance:
Techniques that are used to break up the database into logical units, called fragments, which
may be assigned for storage at the various sites.
Data replication, which permits certain data to be stored in more than one site, and the process
of allocating fragments—or replicas of fragments—for storage at the various sites.
These techniques are used during the process of distributed database design.
The information concerning data fragmentation, allocation, and replication is stored in a global
directory that is accessed by the DDBS applications as needed.
Data Fragmentation
There are two approaches to store the relation in the distributed database:
Horizontal Fragmentation
A horizontal fragment of a relation is a subset of the tuples in that relation.
The tuples that belong to the horizontal fragment are specified by a condition on one or more
attributes of the relation. Often, only a single attribute is involved.
For example, we may define three horizontal fragments on the EMPLOYEE relation: (DNO= 5),
(DNO= 4), and (DNO= 1)—each fragment contains the EMPLOYEE tuples working for a
particular department.
Similarly, we may define three horizontal fragments for the PROJECT relation, with the
conditions (DNUM= 5), (DNUM= 4), and (DNUM= 1)—each fragment contains the PROJECT
tuples controlled by a particular department.
Horizontal fragmentation divides a relation "horizontally" by grouping rows to create subsets
of tuples, where each subset has a certain logical meaning. These fragments can then be
assigned to different sites in the distributed system.
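The DNO-based horizontal fragmentation above can be sketched in Python; the EMPLOYEE sample tuples are invented for illustration.

```python
# A sketch of horizontal fragmentation: EMPLOYEE tuples are grouped by a
# selection condition on DNO into fragments that could be assigned to
# different sites.

# Illustrative sample tuples (not from a specific schema).
EMPLOYEE = [
    {"SSN": "1", "NAME": "Smith", "DNO": 5},
    {"SSN": "2", "NAME": "Wong", "DNO": 5},
    {"SSN": "3", "NAME": "Zelaya", "DNO": 4},
    {"SSN": "4", "NAME": "Borg", "DNO": 1},
]

def horizontal_fragment(relation, attr, value):
    """One fragment per selection condition, e.g. sigma DNO=5(EMPLOYEE)."""
    return [t for t in relation if t[attr] == value]

fragments = {d: horizontal_fragment(EMPLOYEE, "DNO", d) for d in (5, 4, 1)}
```

Because the conditions are disjoint and cover every tuple, the original relation can be reconstructed as the union of the fragments.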
Derived horizontal fragmentation applies the partitioning of a primary relation (DEPARTMENT
in our example) to other secondary relations (EMPLOYEE and PROJECT in our example), which
are related to the primary via a foreign key. This way, related data between the primary and the
secondary relations gets fragmented in the same way.
Vertical Fragmentation
Each site may not need all the attributes of a relation, which would indicate the need for a
different type of fragmentation.
Vertical fragmentation divides a relation "vertically" by columns.
A vertical fragment of a relation keeps only certain attributes of the relation.
For example, we may want to fragment the EMPLOYEE relation into two vertical fragments.
The first fragment includes personal information—NAME, BDATE, ADDRESS, and SEX—and
the second includes work-related information—SSN, SALARY, SUPERSSN, DNO.
This vertical fragmentation is not quite proper because, if the two fragments are stored
separately, we cannot put the original employee tuples back together, since there is no common
attribute between the two fragments.
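As the text notes, the fragmentation above cannot be reversed because the fragments share no attribute; the standard remedy is to repeat the key (SSN) in every vertical fragment. A minimal SQLite sketch (table contents are illustrative):

```python
import sqlite3

# Hypothetical one-row EMPLOYEE table (values are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE EMPLOYEE
    (SSN TEXT PRIMARY KEY, NAME TEXT, BDATE TEXT, ADDRESS TEXT,
     SEX TEXT, SALARY REAL, SUPERSSN TEXT, DNO INTEGER)""")
conn.execute("INSERT INTO EMPLOYEE VALUES "
             "('111', 'Abebe', '1990-01-01', 'Addis Ababa', 'M', 4000, NULL, 5)")

# Proper vertical fragments: each projection repeats the key SSN.
conn.execute("CREATE TABLE EMP_PERSONAL AS "
             "SELECT SSN, NAME, BDATE, ADDRESS, SEX FROM EMPLOYEE")
conn.execute("CREATE TABLE EMP_WORK AS "
             "SELECT SSN, SALARY, SUPERSSN, DNO FROM EMPLOYEE")

# Because SSN is common to both fragments, the original tuples
# can be rebuilt by a join on the key.
row = conn.execute("""SELECT p.SSN, p.NAME, w.SALARY, w.DNO
                      FROM EMP_PERSONAL p
                      JOIN EMP_WORK w ON p.SSN = w.SSN""").fetchone()
assert row[0] == '111' and row[1] == 'Abebe'
```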
Semantic Heterogeneity
Semantic heterogeneity occurs when there are differences in the meaning, interpretation, and
intended use of the same or related data.
Semantic heterogeneity among component database systems (DBSs) creates the biggest hurdle in
designing global schemas of heterogeneous databases.
The design autonomy of component DBSs refers to their freedom of choosing the following
design parameters, which in turn affect the eventual complexity of the FDBS:
A. The universe of discourse from which the data is drawn: For example, two customer
accounts databases in the federation may be from the United States and Japan, with entirely
different sets of attributes about customer accounts required by their respective accounting practices.
Currency rate fluctuations would also present a problem. Hence, relations in these two
databases which have identical names—CUSTOMER or ACCOUNT—may have some
common and some entirely distinct information.
B. Representation and naming: The representation and naming of data elements and the
structure of the data model may be prespecified for each local database.
C. The understanding, meaning, and subjective interpretation of data: This is a chief
contributor to semantic heterogeneity.
D. Transaction and policy constraints: These deal with serializability criteria, compensating
transactions, and other transaction policies.
E. Derivation of summaries: Aggregation, summarization, and other data-processing features
and operations supported by the system.
Communication autonomy of a component DBS refers to its ability to decide whether to
communicate with other component DBSs.
Execution autonomy refers to the ability of a component DBS to execute local operations
without interference from external operations by other component DBSs and its ability to decide
the order in which to execute them.
The association autonomy of a component DBS implies that it has the ability to decide whether
and how much to share its functionality (operations it supports) and resources (data it manages)
with other component DBSs.
The major challenge of designing FDBSs is to let component DBSs interoperate while still
providing the above types of autonomies to them.
The result of this query will include 10,000 records, assuming that every employee is related to a
department.
Suppose that each record in the query result is 40 bytes long.
The query is submitted at a distinct site 3, which is called the result site because the query result
is needed there.
Neither the EMPLOYEE nor the DEPARTMENT relations reside at site 3.
There are three simple strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site, and
perform the join at site 3. In this case a total of 1,000,000 + 3,500 = 1,003,500 bytes must be
transferred.
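The transfer costs can be checked with simple arithmetic. The sizes of EMPLOYEE and DEPARTMENT are taken from the totals in strategy 1; strategies 2 and 3 (ship one relation to the other's site, join there, and ship the 400,000-byte result to site 3) are the usual companions of this example and are included here as an assumption, not quoted text:

```python
# Byte counts from the text: EMPLOYEE totals 1,000,000 bytes, DEPARTMENT
# 3,500 bytes, and the result is 10,000 records of 40 bytes each.
EMPLOYEE_BYTES = 1_000_000
DEPARTMENT_BYTES = 3_500
RESULT_BYTES = 10_000 * 40  # 400,000 bytes

strategy1 = EMPLOYEE_BYTES + DEPARTMENT_BYTES  # ship both relations to site 3
strategy2 = EMPLOYEE_BYTES + RESULT_BYTES      # ship EMPLOYEE to DEPARTMENT's site, join, ship result
strategy3 = DEPARTMENT_BYTES + RESULT_BYTES    # ship DEPARTMENT to EMPLOYEE's site, join, ship result

print(strategy1, strategy2, strategy3)  # 1003500 1400000 403500
```

Under these assumptions, strategy 3 moves by far the least data and would be preferred.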
One possible function of the client is to hide the details of data distribution from the user; that is, it
enables the user to write global queries and transactions as though the database were centralized,
without having to specify the sites at which the data referenced in the query or transaction resides. This
property is called distribution transparency.
Spatial Relationship
Topological relationships:
adjacent, inside, disjoint, etc.
Direction relationships:
above, below, north of, etc.
Metric relationships: “distance < 100”
EXAMPLE
A database:
Relation states (sname: string, area: region, spop: int)
Relation cities (cname: string, center: point, ext: region)
Relation rivers (rname: string, route:line)
SELECT * FROM rivers WHERE route intersects Ethiopia
SELECT cname, sname FROM cities, states WHERE center inside area
SELECT rname, length (intersection (route, Oromia)) FROM rivers WHERE route intersects
Oromia
Spatial Queries
Selection queries:
“Find all objects inside query region q”; inside may be replaced by intersects, north of, etc.
Nearest-neighbor queries:
“Find the closest object to a query point q”; this generalizes to the k closest objects.
Spatial join queries: given two spatial relations S1 and S2, find all pairs {x in S1, y in S2
such that x rel y = true}, where rel = intersects, inside, etc.
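The query types above can be sketched with a brute-force scan over a hypothetical table of 2-D points; a real spatial DBMS would use one of the access methods described next rather than scanning:

```python
from math import hypot

# Hypothetical 2-D point objects as (name, x, y) tuples (illustrative data).
cities = [("A", 0.0, 0.0), ("B", 3.0, 4.0), ("C", 10.0, 1.0)]

def nearest(q, objects, k=1):
    """k-nearest-neighbor query: the k objects closest to query point q."""
    return sorted(objects, key=lambda o: hypot(o[1] - q[0], o[2] - q[1]))[:k]

def within(q, objects, d):
    """Metric selection: all objects with distance(q, o) < d."""
    return [o for o in objects if hypot(o[1] - q[0], o[2] - q[1]) < d]

print(nearest((2.0, 2.0), cities))      # [('B', 3.0, 4.0)]
print(within((0.0, 0.0), cities, 6.0))  # A (distance 0) and B (distance 5)
```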
Access Methods
Point Access Methods (PAMs):
Index methods for 2- or 3-dimensional points (k-d trees, Z-ordering, grid files)
Spatial Access Methods (SAMs):
Index methods for 2- or 3-dimensional regions and points (R-trees)
Mobile Database
A mobile database is either a stationary database that can be connected to by a mobile computing device,
such as a smartphone or PDA, over a mobile network, or a database that is actually carried by the
mobile device. This could be a list of contacts, price information, distance travelled, or any other
information.
Mobile databases are highly concentrated in the retail and logistics industries. They are
increasingly being used in the aviation and transportation industries.
Home Directory
An example of this is a mobile workforce. In this scenario, a user would require access to update
information from files in the home directories on a server or customer records from a database.
A home directory is a file system directory on a multi-user operating system containing files for a
given user of the system.
A cache could be maintained to hold recently accessed data and transactions so that they are not
lost due to connection failure.
Users might not require access to truly live data, only recently modified data, so the uploading of
changes might be deferred until reconnection.
Bandwidth must be conserved (a common requirement on wireless networks that charge per
megabyte).
Mobile computing devices tend to have slower CPUs and limited battery life.
Users with multiple devices (e.g., a smartphone and a tablet) may need to synchronize their devices to
a centralized data store. This may require application-specific automation features.
Users may change location geographically and on the network. Dealing with this is usually left to
the operating system, which is responsible for maintaining the wireless network connection.
[Storage hierarchy diagram] From top to bottom: main memory (RAM); on-line devices
(magnetic disks, optical disks); near-line devices (optical storage); off-line devices
(magnetic tapes, optical storage). Moving down the hierarchy, storage capacity, permanence,
and access time increase; moving up, probability of access, cost per byte, and performance increase.
In Web technology, a basic client-server architecture underlies all activities. Information is stored on
computers designated as Web servers in publicly accessible shared files encoded using Hypertext
Markup Language (HTML).
A number of tools enable users to create Web pages formatted with HTML tags, freely mixed with
multimedia content from graphics to audio and even video. A page may contain many interspersed
hyperlinks; a hyperlink enables a user to "browse", or move from one page to another, across the
Internet. This ability has given tremendous power to end users in searching and navigating related
information, often across different continents.
Information on the Web is organized by Uniform Resource Locator (URL), something similar to an
address that provides the complete pathname of a file. The pathname consists of a string of
machine and directory names separated by slashes and ends in a filename. For example, the table of
contents of this book is currently at the following URL:
A URL begins with the name of a protocol, typically the Hypertext Transfer Protocol (HTTP), which is
used by a Web browser, a program that communicates with the Web server, and vice versa. Web
browsers interpret and present HTML documents to users. Popular Web browsers include Microsoft's
Internet Explorer and Netscape Navigator. A collection of HTML documents and other files accessible
via a URL on a Web server is called a Web site. In the above URL, "www.awl.com" may be called the
Web site of Addison Wesley Publishing.
1. Access using CGI scripts: The database server can be made to interact with the Web server via
CGI. The main disadvantage of this approach is that for each user request, the Web server must
start a new CGI process: each process makes a new connection with the DBMS and the Web
server must wait until the results are delivered to it. No efficiency is achieved by any grouping of
multiple users’ requests; moreover, the developer must keep the scripts in the CGI-bin
subdirectories only, which opens it to a possible breach of security. The fact that CGI has no
language associated with it but requires database developers to learn Perl or Tcl is also a
drawback. Manageability of scripts is another problem if the scripts are scattered everywhere.
2. Access using JDBC: JDBC is a set of Java classes developed by Sun Microsystems to allow
access to relational databases through the execution of SQL statements. It is a way of connecting
with databases, without any additional processes for each client request. Note that JDBC is a
name trademarked by Sun; it does not stand for Java Database Connectivity, as many believe.
JDBC has the capabilities to connect to a database, send SQL statements to a database, and to
retrieve the results of a query using the Java classes Connection, Statement, and ResultSet,
respectively. With Java’s claimed platform independence, an application may run on any Java-
capable browser, which loads the Java code from the server and runs it on the client’s browser.
The Java code is DBMS transparent; the JDBC drivers for individual DBMSs on the server end
carry the task of interacting with that DBMS. If the JDBC driver is on the client, the application
runs on the client and its requests are communicated to the DBMS directly by the driver. For
standard SQL requests, many RDBMSs can be accessed this way. The drawback of using JDBC
is the prospect of executing Java through virtual machines with inherent inefficiency. The JDBC
bridge to Open Database Connectivity (ODBC) remains another way of getting to the
RDBMSs.
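JDBC's Connection/Statement/ResultSet pattern has close analogues in other database APIs. As a rough illustration (not JDBC itself), Python's DB-API follows the same connect-execute-fetch shape, shown here with an in-memory SQLite database:

```python
import sqlite3

# Rough analogue of JDBC's Connection / Statement / ResultSet pattern,
# using Python's DB-API (illustration only; JDBC itself is a set of
# Java classes, as described above).
conn = sqlite3.connect(":memory:")   # ~ java.sql.Connection
cur = conn.cursor()                  # ~ java.sql.Statement
cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
cur.execute("INSERT INTO t VALUES (1, 'alpha')")
cur.execute("SELECT name FROM t WHERE id = ?", (1,))
rows = cur.fetchall()                # ~ iterating a java.sql.ResultSet
print(rows)  # [('alpha',)]
conn.close()
```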
Besides CGI, Web server vendors are launching their own middleware products for providing
multiple database connectivity. These include the Internet Server API (ISAPI) from Microsoft and
the Netscape API (NSAPI) from Netscape.
The Web Integration Option (WIO) of Informix eliminates the need for scripts. Developers use tools
to create intelligent HTML pages called Application Pages (or App Pages) directly within the database.
They execute SQL statements dynamically, format the results inside HTML, and return the resulting
Web page to the end users.
The Web driver, a lightweight CGI process, is invoked when a URL request is received by the Web
server. A unique session identifier is generated for each request, but the WIO application is persistent
and does not terminate after each request. When the WIO application receives a request from the Web
driver, it connects to the database and executes Web Explode, a function that executes queries within
Web pages and formats the results as a Web page that goes back to the browser via the Web driver.
Informix HTML tag extensions allow Web authors to create applications that can dynamically construct
Web page templates from the Informix Dynamic Server and present them to the end users.
WIO also lets users create their own customized tags to perform specialized tasks. Thus, without resorting
to any programming or script development, powerful applications can be designed.
WIO supports applications developed in C, C++, and Java. This flexibility lets developers port existing
applications to the Web or develop new applications in these languages. The WIO is integrated with
Web server software and utilizes the native security mechanism of the Informix Dynamic Server. The
open architecture of WIO allows the use of various Web browsers and servers.
There is an HTTP daemon (a process that runs continuously) called Web Listener running on the server
that listens for requests originating in the clients. A static file (document) is retrieved from the file
system of the server and displayed on the Web browser at the client. A request for a dynamic page is
passed by the listener to a Web Request Broker (WRB), a multi-threaded dispatcher that adheres
Currently, cartridges are provided for PL/SQL, Java, and Live HTML; customized cartridges may be
provided as well. The Web Server has been fully integrated with PL/SQL, making it efficient and
scalable. The cartridges give it additional flexibility, making it possible to work with other languages
and software packages. An advanced secure sockets layer may be used for secure communication over
the Internet. The Designer 2000 development tool has a Web generator that enables previous
applications developed for LANs to be ported to the Internet and intranet environments.
Among the prominent applications of the intranet and the WWW are databases to support electronic
storefronts, parts and product catalogs, directories and schedules, newsstands, and bookstores.
Electronic commerce, the purchasing of products and services electronically on the Internet, is likely
to become a major application supported by such databases.
The future challenges of managing databases on the Web will be many, among them the following:
Web technology needs to be integrated with the object technology. Currently, the web can be
viewed as a distributed object system, with HTML pages functioning as objects identified by the
URL.
HTML functionality is too simple to support complex application requirements. As we saw, the
Web Integration Option of Informix adds further tags to HTML. In general, additional facilities
will be needed:
1. To make Web clients function as application front ends, integrating data from multiple
heterogeneous databases;
2. To make Web clients present different views of the same data to different users; and
3. To make Web clients "intelligent" by providing additional data-mining functionality.
Web page content can be made more dynamic by adding more "behavior" to it as an object. In this
respect, XML defines a subset of SGML (the Standard Generalized Markup Language), allowing
customization of markup languages with application-specific tags. XML is rapidly gaining ground due
to its extensibility in defining new tags.
W3C’s Document Object Model (DOM) defines an object-oriented API for HTML or XML documents
presented by a Web client. W3C is also defining metadata modeling standards for describing Internet
resources.
The technology to model information using the standards discussed above and to find information on the
Web is undergoing a major evolution. Overall, the Web servers have to gain robustness as a reliable
technology to handle production-level databases for supporting 24x7 applications—24 hours a day, 7
days a week.
Security remains a critical problem for supporting applications involving financial and medical databases.
Moreover, transferring from existing database application environments to those on the Web will need
adequate support that will enable users to continue their current mode of operation, and an inexpensive
infrastructure for handling migration of data among systems without introducing inconsistencies. The
traditional database functionality of querying and transaction processing must undergo appropriate
modifications to support Web-based applications. One such area is mobile databases.
In the above image, you can see that the data comes from multiple heterogeneous
data sources to a Data Warehouse. Common data sources for a data warehouse include:
Operational databases
Flat files
SAP and non-SAP applications
What is Aggregation?
We store tables with aggregated data: yearly (1 row), quarterly (4 rows), monthly (12 rows),
and so on. If someone has to do a year-to-year comparison, only one row per year will be
processed, whereas in an unaggregated table all the detail rows would have to be compared.
This is called aggregation.
There are various aggregation functions that can be used in an OLAP system, such as Sum,
Avg, Max, and Min.
For example: SELECT Avg(salary) FROM employee WHERE title = 'Programmer';
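The saving can be sketched with SQLite: a detail table of monthly rows versus a pre-aggregated yearly table (table names and figures are illustrative assumptions):

```python
import sqlite3

# Detail table: one row per month (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, month INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(2023, m, 100.0) for m in range(1, 13)] +
                 [(2024, m, 110.0) for m in range(1, 13)])

# Pre-aggregate: one row per year instead of twelve.
conn.execute("""CREATE TABLE sales_yearly AS
                SELECT year, SUM(amount) AS total FROM sales GROUP BY year""")

# A year-to-year comparison now touches 2 rows instead of 24.
rows = conn.execute("SELECT year, total FROM sales_yearly ORDER BY year").fetchall()
print(rows)  # [(2023, 1200.0), (2024, 1320.0)]
```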
Key Differences
These are the major differences between an OLAP and an OLTP system.
Indexes − An OLTP system has only a few indexes, while an OLAP system has many
indexes for performance optimization.
Joins − In an OLTP system there are a large number of joins and the data is normalized;
in an OLAP system there are fewer joins and the data is de-normalized.
Aggregation − In an OLTP system, data is not aggregated, while in an OLAP
database more aggregations are used.
Normalization − An OLTP system contains normalized data, whereas data is not
normalized in an OLAP system.