Ad Bms Notes
Query processing is the process by which a declarative query is translated into low-level data
manipulation operations. SQL is the standard query language that is supported in current
DBMSs.
Query Processing steps:
Non-leaf nodes = operations of relational algebra (with parameters); Leaf nodes = relations
Query optimization refers to the process by which the best execution strategy for a given query is
found from among a set of alternatives.
The process typically involves two steps:
Query Decomposition: Query decomposition takes an SQL query and translates it into relational algebra.
In the process, the query is analyzed semantically so that incorrect queries are detected and rejected as
early as possible, and correct queries are simplified. Simplification involves eliminating redundant
predicates, which may be introduced as a result of query modification to deal with views, security
enforcement and semantic integrity control. The simplified query is then restructured as an algebraic
query.
Query Optimization: For a given SQL query, there is usually more than one possible relational algebra
expression. Some of these algebraic expressions are better than others. The quality of an algebraic
expression is defined in terms of expected performance.
The traditional procedure is to obtain an initial algebraic expression by translating the predicates and the
target statement into relational operations as they appear in the query. This initial algebraic query is then
transformed, using algebraic transformation rules, into other algebraic queries until the best one is
found.
The best algebraic expression is determined according to a cost function which calculates the cost of
executing the query according to that algebraic specification. This is the process of query optimization.
Heuristic Optimization
Cost Based Optimization
In Heuristic Optimization, the query execution is refined based on heuristic rules for reordering
the individual operations.
With Cost Based Optimization, the overall cost of executing the query is systematically
reduced by estimating the costs of executing several different execution plans.
A query can be represented as a tree data structure. Operations are at the interior nodes
and data items (tables, columns) are at the leaves.
The query is evaluated in a depth-first pattern.
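The tree representation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (Relation, Select, Project, CrossProduct) and the evaluate function are hypothetical, and the sample rows are a small subset of the PROJECT and DEPARTMENT columns used in these notes. Each interior node evaluates its children first (depth-first) and then applies its own operation.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, List


@dataclass
class Relation:            # leaf node: a stored relation
    rows: List[dict]


@dataclass
class Select:              # interior node: keep rows that satisfy a predicate
    predicate: Callable[[dict], bool]
    child: Any


@dataclass
class Project:             # interior node: keep only the listed columns
    columns: List[str]
    child: Any


@dataclass
class CrossProduct:        # interior node: pair every left row with every right row
    left: Any
    right: Any


def evaluate(node):
    """Evaluate the children first (depth-first), then apply this node's operation."""
    if isinstance(node, Relation):
        return node.rows
    if isinstance(node, Select):
        return [row for row in evaluate(node.child) if node.predicate(row)]
    if isinstance(node, Project):
        return [{c: row[c] for c in node.columns} for row in evaluate(node.child)]
    if isinstance(node, CrossProduct):
        pairs = product(evaluate(node.left), evaluate(node.right))
        return [{**left, **right} for left, right in pairs]
    raise TypeError(f"unknown node type: {type(node).__name__}")


project = Relation([
    {"PNUMBER": 1, "PLOCATION": "Bellaire", "DNUM": 5},
    {"PNUMBER": 10, "PLOCATION": "Stafford", "DNUM": 4},
])
department = Relation([
    {"DNUMBER": 4, "MGRSSN": "987654321"},
    {"DNUMBER": 5, "MGRSSN": "333445555"},
])

tree = Project(
    ["PNUMBER", "DNUM", "MGRSSN"],
    Select(
        lambda r: r["DNUM"] == r["DNUMBER"] and r["PLOCATION"] == "Stafford",
        CrossProduct(project, department),
    ),
)
result = evaluate(tree)
```

Here the cross product sits at the bottom of the tree, so it is built in full before the selection filters it.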
For Example:
SELECT PNUMBER, DNUM, LNAME
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM = DNUMBER AND MGRSSN = SSN AND PLOCATION = 'Stafford';
EMPLOYEE TABLE:
MI  LNAME    SSN        BDATE      ADDRESS
--  -------  ---------  ---------  ------------------------
B   SMITH    123456789  09-JAN-55  731 FONDREN, HOUSTON, TX
T   WONG     333445555  08-DEC-45  638 VOSS, HOUSTON, TX
J   ZELAYA   999887777  19-JUL-58  3321 CASTLE, SPRING, TX
S   WALLACE  987654321  20-JUN-31  291 BERRY, BELLAIRE, TX
K   NARAYAN  666884444  15-SEP-52  975 FIRE OAK, HUMBLE, TX
A   ENGLISH  453453453  31-JUL-62  5631 RICE, HOUSTON, TX
Two further employee rows survive only in part: AHMAD V JABBAR (M, 25000, 987654321, 4) and
JAMES E BORG (M, 55000, 1).

DEPARTMENT TABLE:
DNAME            DNUMBER  MGRSSN     MGRSTARTD
---------------  -------  ---------  ---------
HEADQUARTERS     1        888665555  19-JUN-71
ADMINISTRATION   4        987654321  01-JAN-85
RESEARCH         5        333445555  22-MAY-78

PROJECT TABLE:
PNAME            PNUMBER  PLOCATION  DNUM
---------------  -------  ---------  ----
ProductX         1        Bellaire   5
ProductY         2        Sugarland  5
ProductZ         3        Houston    5
Computerization  10       Stafford   4
Reorganization   20       Houston    1
NewBenefits      30       Stafford   4

WORKS_ON TABLE:
ESSN       PNO  HOURS
---------  ---  -----
123456789  1    32.5
123456789  2    7.5
666884444  3    40.0
453453453  1    20.0
453453453  2    20.0
333445555  2    10.0
333445555  3    10.0
333445555  10   10.0
333445555  20   10.0
999887777  30   30.0
999887777  10   10.0
987987987  10   35.0
987987987  30   5.0
987654321  30   20.0
987654321  20   15.0
888665555  20   null
Note the two cross product operations. These require lots of space and time (nested
loops) to build.
After the two cross products, we have a temporary table with 144 records (6 projects * 3
departments * 8 employees).
An overall rule for heuristic query optimization is to perform as many select and project
operations as possible before doing any joins.
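The effect of this rule on intermediate result sizes can be checked with a short Python sketch. The PROJECT and DEPARTMENT rows below are copied from the tables in these notes, and the employee count of 8 is the figure quoted above; the variable names are illustrative only.

```python
# (PNAME, PNUMBER, PLOCATION, DNUM)
projects = [
    ("ProductX", 1, "Bellaire", 5),
    ("ProductY", 2, "Sugarland", 5),
    ("ProductZ", 3, "Houston", 5),
    ("Computerization", 10, "Stafford", 4),
    ("Reorganization", 20, "Houston", 1),
    ("NewBenefits", 30, "Stafford", 4),
]
# (DNAME, DNUMBER, MGRSSN)
departments = [
    ("HEADQUARTERS", 1, "888665555"),
    ("ADMINISTRATION", 4, "987654321"),
    ("RESEARCH", 5, "333445555"),
]
EMPLOYEE_COUNT = 8  # as stated in the notes

# Naive plan: two cross products, then filter.
naive_size = len(projects) * len(departments) * EMPLOYEE_COUNT

# Heuristic plan, step 1: apply the selection PLOCATION = 'Stafford' first.
stafford = [p for p in projects if p[2] == "Stafford"]
selected_first_size = len(stafford) * len(departments) * EMPLOYEE_COUNT

# Heuristic plan, step 2: replace the cross product by the join DNUM = DNUMBER.
joined = [(p, d) for p in stafford for d in departments if p[3] == d[1]]

print(naive_size, selected_first_size, len(joined))
```

Selecting first shrinks the temporary table from 144 rows to 48, and turning the first cross product into a join leaves only 2 rows to combine with EMPLOYEE.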
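There are a number of transformation rules that can be used to transform a query:
1. Cascading selections. A list of conjunctive conditions can be broken up into
separate individual conditions.
$\sigma_{c_1}(\sigma_{c_2}(E)) = \sigma_{c_1 \wedge c_2}(E)$
2. Commutativity of the selection operation.
$\sigma_{c_1}(\sigma_{c_2}(E)) = \sigma_{c_2}(\sigma_{c_1}(E))$
3. Cascading projections. All but the last projection can be ignored.
Assume that attributes $A_1, \ldots, A_n$ are among $B_1, \ldots, B_m$. Then
$\pi_{A_1,\ldots,A_n}(\pi_{B_1,\ldots,B_m}(E)) = \pi_{A_1,\ldots,A_n}(E)$
4. Commuting selection and projection. If a selection condition c only involves
attributes contained in a projection clause, the two can be commuted.
$\pi_{A_1,\ldots,A_n}(\sigma_{c}(E)) = \sigma_{c}(\pi_{A_1,\ldots,A_n}(E))$
5. Commutativity of Join and Cross Product.
$E_1 \bowtie E_2 = E_2 \bowtie E_1$ and $E_1 \times E_2 = E_2 \times E_1$
6. Commuting selection with Join.
If c only involves attributes from $E_1$, then
$\sigma_{c}(E_1 \bowtie E_2) = \sigma_{c}(E_1) \bowtie E_2$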
Just looking at the Syntax of the query may not give the whole picture - need to look at
the data as well.
Several Cost components to consider:
1. Access cost to secondary storage (hard disk)
2. Storage Cost for intermediate result sets
3. Computation costs: CPU, memory transfers, etc. for performing in-memory
operations.
4. Communications Costs to ship data around a network. e.g., in a distributed or
client/server database.
Of these, Access cost is the most crucial in a centralized DBMS. The more work we can
do with data in cache or in memory, the better.
Access Routines are algorithms that are used to access and aggregate data in a database.
An RDBMS may have a collection of general purpose access routines that can be
combined to implement a query execution plan.
We are interested in access routines for selection, projection, join and set operations such
as union, intersection, set difference, cartesian product, etc.
As with heuristic optimization, there can be many different plans that lead to the same
result.
In general, if a query contains n operations, there will be n! possible plans.
However, not all plans will make sense. We should consider:
Perform all simple selections first
Perform joins next
Perform projection last
Overview of the Cost Based optimization process
1. Enumerate all of the legitimate plans (call these P1...Pn) where each plan contains
a set of operations O1...Ok
2. Select a plan
3. For each operation Oi in the plan, enumerate the access routines
4. For each possible Access routine for Oi, estimate the cost
Select the access routine with the lowest cost
5. Repeat previous 2 steps until an efficient access routine has been selected for each
operation
Sum up the costs of each access routine to determine a total cost for the plan
6. Repeat steps 2 through 5 for each plan and choose the plan with the lowest total
cost.
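The steps above can be sketched in Python as follows. The plans, operations, access routines and cost figures are all invented for illustration; a real optimizer would derive the estimates from catalog statistics.

```python
# plan -> operation -> {access routine: estimated cost}; every number is hypothetical.
plans = {
    "P1": {
        "select PLOCATION = 'Stafford'": {"linear scan": 100, "index lookup": 12},
        "join PROJECT with DEPARTMENT": {"nested loop": 300, "hash join": 90},
    },
    "P2": {
        "join PROJECT with DEPARTMENT": {"nested loop": 1800, "hash join": 400},
        "select PLOCATION = 'Stafford'": {"linear scan": 50},
    },
}


def cheapest_routines(plan):
    """Steps 3-5: for each operation, keep the access routine with the lowest cost."""
    return {op: min(routines, key=routines.get) for op, routines in plan.items()}


def plan_cost(plan):
    """Sum the cost of the cheapest routine of every operation in the plan."""
    return sum(min(routines.values()) for routines in plan.values())


# Step 6: repeat for every plan and keep the one with the lowest total cost.
costs = {name: plan_cost(plan) for name, plan in plans.items()}
best_plan = min(costs, key=costs.get)

print(costs, best_plan, cheapest_routines(plans[best_plan]))
```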
There are many possible ways to estimate cost, e.g., based on disk accesses, CPU
time, or communication overhead.
Disk access is the predominant cost (in terms of time); relatively easy to estimate;
therefore, number of block transfers from/to disk is typically used as measure.
Simplifying assumption: each block transfer has the same cost.
Cost of algorithm (e.g., for join or selection) depends on database buffer size; more
memory for DB buffer reduces disk accesses. Thus DB buffer size is a parameter for
estimating cost.
We refer to the cost estimate of algorithm S as cost(S). We do not consider cost of
writing output to disk.
UNIT-2
Disadvantages of RDBMS
RDBMSs are not suitable for applications with complex data structures or new data types
for large, unstructured objects, such as CAD/CAM, Geographic information systems,
multimedia databases, imaging and graphics.
The RDBMSs typically do not allow users to extend the type system by adding new data
types.
They also only support first-normal-form relations in which the type of every column
must be atomic, i.e., no sets, lists, or tables are allowed inside a column.
Recursive queries are difficult to write.
MOTIVATING EXAMPLE
As a specific example of the need for object-relational systems, we focus on a new business data
processing problem that is both harder and (in our view) more entertaining than the dollars and
cents bookkeeping of previous decades. Today, companies in industries such as entertainment
are in the business of selling bits; their basic corporate assets are not tangible products, but rather
software artifacts such as video and audio.
We consider the fictional Dinky Entertainment Company, a large Hollywood conglomerate
whose main assets are a collection of cartoon characters, especially the cuddly and
internationally beloved Herbert the Worm. Dinky has a number of Herbert the Worm films,
many of which are being shown in theaters around the world at any given time. Dinky also
makes a good deal of money licensing Herbert's image, voice, and video footage for various
purposes: action figures, video games, product endorsements, and so on. Dinky's database is used
to manage the sales and leasing records for the various Herbert-related products, as well as the
video and audio data that make up Herbert's many films.
Traditional database systems, such as RDBMS, have been quite successful in developing the
database technology required for many traditional business database applications. However, they
have certain shortcomings when more complex database applications must be designed and
implemented, for example, databases for engineering design and manufacturing (CAD/CAM),
scientific experiments, telecommunications, geographic information systems, and multimedia.
These newer applications have requirements and characteristics that differ from those of
traditional business applications, such as more complex structures for objects, longer-duration
transactions, new data types for storing images or large textual items, and the need to define
nonstandard application-specific operations.
Object-oriented databases were proposed to meet the needs of these more complex applications.
The object-oriented approach offers the flexibility to handle some of these requirements without
being limited by the data types and query languages available in traditional database systems. A
key feature of object-oriented databases is the power they give the designer to specify both the
structure of complex objects and the operations that can be applied to these objects.
Object database systems combine the classical capabilities of relational database management
systems (RDBMS) with new functionality that comes from object orientation. The object-oriented
capabilities include:
Complex objects
Object identities
User-defined types
Encapsulation
Type/class hierarchy with inheritance
Overloading, overriding, late binding, polymorphism
A complex object mechanism allows an object to contain attributes that can themselves be
objects. In other words, the schema of an object is not in first-normal-form. Examples of
attributes that can comprise a complex object include lists, bags, and embedded objects.
Object identity
Every instance in the database has a unique identifier (OID), which is a property of an object
that distinguishes it from all other objects and remains for the lifetime of the object. In
object-oriented systems, an object has an existence (identity) independent of its value.
Each database object has an identity, i.e., a unique internal identifier (OID) with no meaning in the
problem domain. Each object also has one or more external names that the programmer can use to
identify the object.
Properties of OID:
It is unique
It is system generated
It is invisible to the user; that is, it cannot be modified by the user.
It is immutable. That is, once generated, it is never regenerated.
It is a long integer value
Encapsulation
Object-oriented models enforce encapsulation and information hiding. This means, the state of
objects can be manipulated and read only by invoking operations that are specified within the
type definition and made visible through the public clause.
In an object-oriented database system encapsulation is achieved if only the operations are
visible to the programmer and both the data and the implementation are hidden.
Support for types or classes
Type: in an object-oriented system, summarizes the common features of a set of objects
with the same characteristics. In programming languages types can be used at
compilation time to check the correctness of programs.
Class: The concept is similar to type but associated with run-time execution. The term
class refers to a collection of all objects with the same internal structure (attributes) and
methods. These objects are called instances of the class.
Both of these two features can be used to group similar objects together, but it is normal
for a system to support either classes or types and not both.
Class or type hierarchies
Any subclass or subtype will inherit attributes and methods from its superclass or supertype.
Overriding, Overloading and Late Binding
Overloading: Several methods share the same name but differ in the number or type of their
parameters.
Overriding: A subclass redefines a method inherited from its superclass, so the implementation of the
operation depends on the type of the object it is applied to.
Late binding: The implementation code to execute is not selected until run time.
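Overriding and late binding can be illustrated with a minimal Python sketch. The classes Person and Employee and the method describe are hypothetical names chosen for the example. Python does not select methods by parameter types, so overloading in the sense defined above is not shown.

```python
class Person:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a person"


class Employee(Person):
    # Overriding: same method name as in the superclass, new implementation.
    def describe(self):
        return f"{self.name} is an employee"


people = [Person("Ann"), Employee("Bob")]

# Late binding: which describe() runs is decided at run time,
# from the class of the object that receives the call.
descriptions = [p.describe() for p in people]
print(descriptions)
```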
Computational Completeness
SQL does not have the full power of a conventional programming language. Languages such as
Pascal or C are said to be computationally complete because they can exploit the full
capabilities of a computer. SQL is only relationally complete, that is, it has the full power of
relational algebra. Whilst any SQL code could be rewritten as a C++ program, not all C++
programs could be rewritten in SQL.
Mandatory features of database systems
A database is a collection of data that is organized so that its contents can easily be accessed,
managed, and updated. Thus, a database system contains the five following features:
Persistence
As in a conventional database, data must remain after the process that created it has
terminated. For this purpose data has to be stored permanently on secondary storage.
Secondary Storage Management
Traditional databases employ techniques, which manage secondary storage in order to improve
the performance of the system. These are usually invisible to the user of the system.
Concurrency
The system should provide a concurrency mechanism, which is similar to the concurrency
mechanisms in conventional databases.
Recovery
The system should provide a recovery mechanism similar to recovery mechanisms in
conventional databases.
Ad hoc query facility
The database should provide a high-level, efficient, application independent query facility.
This need not be a query language; it could instead be some type of graphical
interface.
Name
    FName
    LName
Age
Salary
Address
    street
    City
    province
    Postal_code
Complex objects such as the one outlined above can have nested attributes and an arbitrary number of
hierarchy levels. Objects can be aggregates of (sub-) objects.
An object typically has two components: state (value) and behavior (operations). Hence, it is somewhat
similar to a program variable in a programming language, except that it will typically have a complex
data structure as well as specific operations defined by the programmer.
Types of objects:
Transient objects: Objects in an OOPL exist only during program execution and are hence called
transient objects.
Persistent objects: An OO database can extend the existence of objects so that they are stored
permanently, and hence the objects persist beyond program termination and can be retrieved later and
shared by other programs. In other words, OO databases store persistent objects permanently on
secondary storage, and allow the sharing of these objects among multiple programs and applications.
This requires the incorporation of other well-known features of database management systems, such as
indexing mechanisms, concurrency control, and recovery. An OO database system interfaces with one or
more OO programming languages to provide persistent and shared object capabilities.
Relationships, associations, links. Objects are connected by conceptual links. For instance, the
Employee and Department objects can be connected by a link worksFor. In the data structure links
are implemented as logical pointers (bi-directional or uni-directional).
Encapsulation and information hiding. The internal properties of an object are subdivided into two
parts: public (visible from the outside) and private (invisible from the outside). The user of an object
can refer to public properties only.
Classes, types, interfaces. Each object is an instance of one or more classes. The class is understood
as a blueprint for objects; i.e. objects are instantiated according to information presented in the class
and the class contains the properties that are common for some collection of objects (objects
invariants). Each object is assigned a type. Objects are accessible through their interfaces, which
specify all the information that is necessary for using objects.
Abstract data types (ADTs): a kind of class in which any access to an object is limited to its
operations (methods). The object performs the operation after receiving a message with the name of
the operation to be performed (and the parameters of this operation).
Inheritance. Classes are organized in a hierarchy reflecting the hierarchy of real world concepts. For
instance, the class Person is a super class of the classes Employee and Student. Properties of more
abstract classes are inherited by more specific classes. Multi-inheritance means that a specific class
inherits from several independent classes.
Polymorphism, late binding, overriding. The operation to be executed on an object is chosen
dynamically, after the object receives the message with the operation name. The same message sent
to different objects can invoke different operations.
Persistence. Database objects are persistent, i.e., they live as long as necessary. They can outlive
the programs that created them.
In this section we examine a few of the key challenges that arise in implementing an efficient,
fully functional ORDBMS. Many more issues are involved than those discussed here.
Structured objects can also be large, but unlike ADT objects they often vary in size during the
lifetime of a database. For example, consider the stars attribute of the films table. As the years
pass, some of the bit actors in an old movie may become famous. When a bit actor becomes
famous, we might want to advertise his or her presence in the earlier films. This involves an
insertion into the stars attribute of an individual tuple in films. Because these bulk attributes can
grow arbitrarily, flexible disk layout mechanisms are required. An additional complication arises
with array types. Traditionally, array elements are stored sequentially on disk in a row-by-row
fashion, for example
$A_{11}, \ldots, A_{1n}, A_{21}, \ldots, A_{2n}, \ldots, A_{m1}, \ldots, A_{mn}$
However, queries may often request subarrays that are not stored contiguously on disk (e.g.,
$A_{11}, A_{21}, \ldots, A_{m1}$). Such requests can result in a very high I/O cost for retrieving the
subarray. In order to reduce the number of I/Os required in general, arrays are often broken into
contiguous chunks, which are then stored in some order on disk. Although each chunk is some contiguous
region of the array, chunks need not be row-by-row or column-by-column. For example, a chunk
of size 4 might be $A_{11}, A_{12}, A_{21}, A_{22}$, which is a square region if we think of the array as
being arranged row-by-row in two dimensions.
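The I/O difference between the two layouts can be estimated with a small Python sketch. It assumes a 4 x 4 array, 0-based indices, disk blocks that hold 4 elements, and one chunk per block; the function names are hypothetical.

```python
def row_major_block(i, j, n_cols, block_size):
    """Block number of element (i, j) when rows are stored one after another."""
    return (i * n_cols + j) // block_size


def chunked_block(i, j, n_cols, chunk_rows, chunk_cols):
    """Block number of element (i, j) when each rectangular chunk fills one block."""
    chunks_per_row = -(-n_cols // chunk_cols)  # ceiling division
    return (i // chunk_rows) * chunks_per_row + (j // chunk_cols)


def blocks_touched(cells, block_of):
    """Number of distinct blocks that must be read to fetch the given cells."""
    return len({block_of(i, j) for i, j in cells})


M, N = 4, 4
first_column = [(i, 0) for i in range(M)]
first_row = [(0, j) for j in range(N)]


def row_major(i, j):
    return row_major_block(i, j, N, block_size=4)


def chunked(i, j):
    return chunked_block(i, j, N, chunk_rows=2, chunk_cols=2)


print(blocks_touched(first_column, row_major), blocks_touched(first_column, chunked))
print(blocks_touched(first_row, row_major), blocks_touched(first_row, chunked))
```

Under these assumptions a column read touches 4 blocks in the row-by-row layout but only 2 with square chunks, while a row read goes from 1 block to 2: chunking trades a little on the best case to avoid the worst case.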
(e.g., the R-tree, which matches conditions such as "Find me all theaters within 100 miles of
Andorra").
One way to make the set of index structures extensible is to publish an access method interface
that lets users implement an index structure outside of the DBMS. The index and data can be
stored in a file system, and the DBMS simply issues the open, next, and close iterator requests to
the user's external index code. Such functionality makes it possible for a user to connect a
DBMS to a Web search engine, for example. A main drawback of this approach is that data in an
external index is not protected by the DBMS's support for concurrency and recovery. An
alternative is for the ORDBMS to provide a generic template index structure that is sufficiently
general to encompass most index structures that users might invent. Because such a structure is
implemented within the DBMS, it can support high concurrency and recovery. The Generalized
Search Tree (GiST) is such a structure. It is a template index structure based on B+trees, which
allows most of the tree index structures invented so far to be implemented with only a few lines
of user-defined ADT code.
Query Processing
ADTs and structured types call for new functionality in processing queries in ORDBMSs. They
also change a number of assumptions that affect the efficiency of queries. In this section we look
at two functionality issues (user-defined aggregates and security) and two efficiency issues
(method caching and pointer swizzling).
Method Security
ADTs give users the power to add code to the DBMS; this power can be abused. A buggy or
malicious ADT method can bring down the database server or even corrupt the database. The
DBMS must have mechanisms to prevent buggy or malicious user code from causing problems.
It may make sense to override these mechanisms for efficiency in production environments with
Method Caching
User-defined ADT methods can be very expensive to execute and can account for the bulk of the
time spent in processing a query. During query processing it may make sense to cache the results
of methods, in case they are invoked multiple times with the same argument. Within the scope of
a single query, one can avoid calling a method twice on duplicate values in a column by either
sorting the table on that column or using a hash-based scheme much like that used for
aggregation. An alternative is to maintain a cache of method inputs and matching outputs as a
table in the database. Then to find the value of a method on particular inputs, we essentially join
the input tuples with the cache table. These two approaches can also be combined.
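The hash-based variant can be sketched in Python with a dictionary that maps method inputs to outputs. The function expensive_method is a stand-in for a costly user-defined ADT method, and all names here are hypothetical.

```python
calls = {"count": 0}


def expensive_method(value):
    """Stand-in for a costly ADT method; counts how often it is really executed."""
    calls["count"] += 1
    return value * value


def apply_with_cache(values, method, cache=None):
    """Apply method to every value, executing it only once per distinct input."""
    cache = {} if cache is None else cache
    results = []
    for value in values:
        if value not in cache:
            cache[value] = method(value)
        results.append(cache[value])
    return results


column = [3, 5, 3, 3, 5, 7]
results = apply_with_cache(column, expensive_method)
print(results, calls["count"])
```

Passing the same cache dictionary to later calls plays the role of the persistent cache table: inputs seen in earlier queries are looked up instead of recomputed.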
Pointer Swizzling
In some applications, objects are retrieved into memory and accessed frequently through their
oids, so dereferencing must be implemented very efficiently. Some systems maintain a table of oids
of objects that are (currently) in memory. When an object O is brought into memory, they check
each oid contained in O and replace oids of in-memory objects by in-memory pointers to those
objects. This technique is called pointer swizzling and makes references to in-memory objects
very fast. The downside is that when an object is paged out, in-memory references to it must
somehow be invalidated and replaced with its oid.
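A minimal Python sketch of the idea follows. The DISK dictionary stands in for the object store, resident is the table of in-memory objects, and the names Obj, load and evict are hypothetical; a real system would work with memory addresses rather than Python references.

```python
class Obj:
    def __init__(self, oid, refs):
        self.oid = oid
        self.refs = list(refs)  # each entry is an oid (int) or a swizzled Obj


DISK = {1: [2, 3], 2: [1], 3: []}  # oid -> oids referenced by that object
resident = {}                      # table of objects currently in memory


def load(oid):
    """Bring an object into memory and swizzle its references to resident objects."""
    obj = Obj(oid, DISK[oid])
    resident[oid] = obj
    obj.refs = [resident.get(ref, ref) for ref in obj.refs]
    return obj


def evict(oid):
    """Page an object out: in-memory references to it are replaced by its oid."""
    victim = resident.pop(oid)
    for other in resident.values():
        other.refs = [victim.oid if ref is victim else ref for ref in other.refs]


a = load(2)   # object 1 is not resident yet, so a.refs stays [1]
b = load(1)   # object 2 is resident, so b.refs[0] becomes a direct reference
swizzled = b.refs[0] is a
evict(2)      # the reference in b is turned back into oid 2
print(swizzled, b.refs)
```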
Query Optimization
New indexes and query processing techniques widen the choices available to a query optimizer.
In order to handle the new query processing functionality, an optimizer must know about the new
functionality and use it appropriately. In this section we discuss two issues in exposing
information to the optimizer (new indexes and ADT method estimation) and an issue in query
planning that was ignored in relational systems (expensive selection optimization).
Comparison of RDBMS, OODBMS and ORDBMS

Identification of data:
  RDBMS: Every record is uniquely identified by a primary key.
  OODBMS: Every object is uniquely identified by a system-generated Object ID.
  ORDBMS: Every object is uniquely identified by a system-generated Object ID.

Suitable applications:
  RDBMS: Small database management systems such as hotel management, university management,
  shop management, etc.
  OODBMS: Advanced applications such as Computer Integrated Manufacturing (CIM), advanced office
  automation systems, hospital patient care tracking systems, etc. All of these applications are
  characterized by having to manage complex, highly interrelated information, which is a strength
  of object-oriented database systems.
  ORDBMS: Applications such as complex data analysis, digital asset management, geographic data,
  and bio-medical applications.

Examples:
  RDBMS: Oracle, SQL Server, MySQL, etc.
  OODBMS: ObjectStore, Versant, Gemstone, etc.
  ORDBMS: Postgres, SQL 92

Standard query language:
  RDBMS: A standard query language is present, i.e., SQL.
  OODBMS: Lack of a standard query language.
UNIT- 3
Parallel and Distributed Databases
A parallel database system is one that seeks to improve performance through parallel
implementation of various operations such as loading data, building indexes, and evaluating
queries.
Parallel Database Systems
A parallel database system tries to improve performance through parallelization of various
operations such as loading data, evaluating queries, etc.; its main goal is better
performance. In a distributed database system, by contrast, data distribution is the
governing factor, and the main goal is to increase availability and reliability.
Some terms that define a system's performance:
Throughput: Number of tasks (transactions) that can be completed in a given time interval.
Response Time: Amount of time taken to complete a single task from the time it is
submitted.
A system that processes a large number of small transactions can improve throughput by
processing many transactions in parallel.
A system that processes large transactions can improve response time and throughput by
dividing each transaction into a number of sub-transactions that can be executed in parallel.
Speed-Up: Running a given task in less time by increasing the degree of parallelism is called
speed-up.
Speed Up = Ts/Tl
where Ts= Time required on small system
Tl= time required on large system with more resources.
A parallel system is said to demonstrate linear speed up if the speed up is N, when
resources are increased N times.
Scale-Up: Handling larger tasks in same amount of time by increasing the degree of
parallelism is called scale up.
Scale-Up= Ts/Tl
where Ts= time required to execute task of size Q
Tl= time required to execute task of size Q*N
The parallel system is said to demonstrate linear scale up on task of size Q if Ts=Tl when
resources are increased N times.
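The two ratios can be computed directly, as in the Python sketch below. The timings are invented for illustration and the function names are hypothetical; Ts and Tl are as defined above.

```python
def speed_up(t_small, t_large):
    """Ts / Tl for the same task on the small and on the large system."""
    return t_small / t_large


def scale_up(t_small, t_large):
    """Ts / Tl where the large system runs a task N times bigger."""
    return t_small / t_large


def is_linear_speed_up(t_small, t_large, n, tol=1e-9):
    """Linear speed-up: the ratio equals N when resources grow N times."""
    return abs(speed_up(t_small, t_large) - n) <= tol


def is_linear_scale_up(t_small, t_large, tol=1e-9):
    """Linear scale-up: Ts equals Tl, i.e. the ratio is 1."""
    return abs(scale_up(t_small, t_large) - 1.0) <= tol


# Hypothetical measurements with N = 4 times the resources.
print(speed_up(120, 30), is_linear_speed_up(120, 30, n=4))   # same task, 120 s vs 30 s
print(speed_up(120, 40), is_linear_speed_up(120, 40, n=4))   # same task, 120 s vs 40 s
print(scale_up(60, 60), is_linear_scale_up(60, 60))          # task Q vs task 4Q
print(scale_up(60, 75), is_linear_scale_up(60, 75))          # task Q vs task 4Q
```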
Parallel Database architectures:
Three main architectures are proposed for building parallel databases:
1. Shared-memory: all processors share a common memory. Multiple CPUs are
attached to an interconnection network and can access a common region of main
memory.
In the shared-memory architecture, the processors and disks have access to a common
memory via a bus or through an interconnection network.
A processor can send messages to other processors using memory writes, which is
much faster than sending messages through a communication mechanism.
a. Round Robin Partitioning: If there are n processors, the ith tuple is assigned to
processor i mod n.
b. Hash Partitioning: A hash function is applied to (selected fields of) a tuple to determine
its processor.
Hash partitioning has the additional virtue that it keeps data evenly distributed even if the
data grows and shrinks over time.
c. Range Partitioning : Tuples are sorted (conceptually), and n ranges are chosen for the
sort key values so that each range contains roughly the same number of tuples; tuples in
range i are assigned to processor i.
Range partitioning can lead to data skew; that is, partitions with widely varying numbers of
tuples across partitions or disks. Skew causes processors dealing with large partitions to
become performance bottlenecks.
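The three schemes can be sketched as small Python functions. The use of CRC32 as the hash function, the sample ages, and the range boundaries are assumptions made for the example; processors are numbered from 0.

```python
import zlib
from bisect import bisect_left


def round_robin(i, n):
    """The i-th tuple goes to processor i mod n."""
    return i % n


def hash_partition(key, n):
    """Hash the partitioning field; CRC32 is used only as a deterministic example hash."""
    return zlib.crc32(str(key).encode("utf-8")) % n


def range_partition(key, boundaries):
    """boundaries[k] is the largest key assigned to processor k (sorted ascending)."""
    return bisect_left(boundaries, key)


def partition_sizes(assignments, n):
    sizes = [0] * n
    for processor in assignments:
        sizes[processor] += 1
    return sizes


ages = [21, 22, 23, 24, 25, 60]
n = 3

rr_sizes = partition_sizes([round_robin(i, n) for i in range(len(ages))], n)
range_sizes = partition_sizes([range_partition(a, [30, 50]) for a in ages], n)
hash_sizes = partition_sizes([hash_partition(a, n) for a in ages], n)

print(rr_sizes, range_sizes, hash_sizes)
```

With these poorly chosen boundaries, range partitioning puts five of the six tuples on one processor, which is the data skew described above.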
Sorting:
Sorting can be done by redistributing all tuples in the relation using range partitioning.
Example: sorting a collection of employee tuples by salary, whose values lie in a certain
range.
With N processors, each processor gets the tuples that lie in the range assigned to it. For
example, processor 1 holds all tuples in the range 10 to 20, and so on.
Each processor then sorts its tuples locally, and the sorted runs are combined by
visiting the processors in the order of their assigned ranges and collecting the tuples.
The problem with range partitioning is data skew which limits the scalability of the
parallel sort. One good approach to range partitioning is to obtain a sample of the entire
relation by taking samples at each processor that initially contains part of the relation. The
(relatively small) sample is sorted and used to identify ranges with equal numbers of tuples.
This set of range values, called a splitting vector, is then distributed to all processors and
used to range partition the entire relation.
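The sampling approach can be sketched in Python, simulating the processors with lists. The salary values, the sample size, and the function names are hypothetical, and the sketch assumes every fragment is non-empty so that the sample holds at least one value per processor.

```python
import random
from bisect import bisect_left


def splitting_vector(sample, n):
    """Pick n - 1 boundaries that cut the sorted sample into n roughly equal ranges."""
    ordered = sorted(sample)
    return [ordered[(k * len(ordered)) // n - 1] for k in range(1, n)]


def parallel_sort(fragments, sample_size=2, seed=0):
    """Sample each fragment, range partition all tuples, then sort each range locally."""
    rng = random.Random(seed)
    n = len(fragments)
    sample = [x for frag in fragments
              for x in rng.sample(frag, min(sample_size, len(frag)))]
    bounds = splitting_vector(sample, n)
    ranges = [[] for _ in range(n)]
    for frag in fragments:                 # redistribution step
        for x in frag:
            ranges[bisect_left(bounds, x)].append(x)
    return [sorted(r) for r in ranges]     # local sort at each processor


salaries = [
    [52000, 31000, 47000, 25000],   # tuples initially at processor 0
    [61000, 38000, 29000, 55000],   # processor 1
    [43000, 30000, 58000, 40000],   # processor 2
]
sorted_ranges = parallel_sort(salaries)
combined = [x for r in sorted_ranges for x in r]
print(sorted_ranges)
```

Reading the ranges in processor order gives the fully sorted relation; how evenly the tuples are spread depends on how representative the sample is.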
Joins:
Here we consider how the join operation can be parallelized.
Consider two relations A and B to be joined on the age attribute. A and B are initially
distributed across several disks in a way that is not useful for the join operation.
So we decompose the join into a collection of k smaller joins by partitioning both
A and B into a collection of k logical partitions.
If the same partitioning function, applied to the join attribute, is used for both A and B,
then the union of the k smaller joins is equal to the join of A and B.
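A Python sketch of this decomposition is given below. It assumes integer ages, uses age mod k as the partitioning function, and joins each pair of partitions with a simple in-memory hash join; the relation contents and function names are invented for the example.

```python
def partition(rows, key, k):
    """Split rows into k logical partitions using the same function for A and B."""
    parts = [[] for _ in range(k)]
    for row in rows:
        parts[row[key] % k].append(row)
    return parts


def local_join(a_part, b_part, key):
    """Join one pair of partitions with an in-memory hash table on the join key."""
    index = {}
    for b in b_part:
        index.setdefault(b[key], []).append(b)
    return [(a, b) for a in a_part for b in index.get(a[key], [])]


def partitioned_join(A, B, key, k):
    """Union of the k smaller joins; each one could run on its own processor."""
    a_parts, b_parts = partition(A, key, k), partition(B, key, k)
    result = []
    for i in range(k):
        result.extend(local_join(a_parts[i], b_parts[i], key))
    return result


A = [{"name": "a1", "age": 30}, {"name": "a2", "age": 41}, {"name": "a3", "age": 30}]
B = [{"name": "b1", "age": 30}, {"name": "b2", "age": 52}, {"name": "b3", "age": 41}]

parallel = sorted((a["name"], b["name"]) for a, b in partitioned_join(A, B, "age", k=3))
reference = sorted((a["name"], b["name"]) for a in A for b in B if a["age"] == b["age"])
print(parallel)
```

Tuples with equal ages always land in partitions with the same index, so no matching pair is lost and none is produced twice.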
DISTRIBUTED DATABASES
The idea of a distributed database is that the data should be physically stored at different
locations but its distribution and access should be transparent to the user.
Introduction to DBMS:
A Distributed Database should exhibit the following properties:
1) Distributed Data Independence: The user should be able to access the database
without needing to know the location of the data.
2) Distributed Transaction Atomicity: Atomicity should extend across sites: the effects of a
transaction that operates at several sites either persist at all of them or at none.
Types of Distributed Databases are:
a) Homogeneous Distributed Database is where the data stored across multiple sites is
managed by the same DBMS software at all the sites.
b) Heterogeneous Distributed Database is where multiple sites, which may be autonomous,
are under the control of different DBMS software.
Architecture of DDBs :
There are 3 architectures:
Client-Server:
A Client-Server system has one or more client processes and one or more server
processes, and a client process can send a query to any one server process. Clients are
responsible for user-interface issues, and servers manage data and execute transactions.
Thus, a client process could run on a personal computer and send queries to a server
running on a mainframe.
Advantages:
1. Simple to implement because of the centralized server and separation of functionality.
2. Expensive server machines are not underutilized with simple user interactions, which are
now pushed on to inexpensive client machines.
3. The users can have a familiar and friendly client-side user interface rather than an
unfamiliar and unfriendly server interface.
Collaborating Server:
In the client-server architecture, a single query cannot be split and executed across
multiple servers, because the client process would have to be complex and intelligent
enough to break a query into subqueries to be executed at different sites and then piece
their results together. The client's capabilities would then overlap with the server's, which
makes it hard to distinguish between the client and the server.
Although data transfer costs may not be very high if the sites are connected via a high-speed
local network, they can become quite significant in other types of network.
Hence, DDBMS query optimization algorithms consider the goal of reducing the amount
of data transfer as an optimization criterion in choosing a distributed query execution
strategy.
Consider the EMPLOYEE and DEPARTMENT relations.
EMPLOYEE (stored at site 1):
10,000 records
Each record is 100 bytes long
Fname field is 15 bytes long
SSN field is 9 bytes long
Lname field is 15 bytes long
Dnum field is 4 bytes long
The size of the EMPLOYEE relation is 100 * 10,000 = 10^6 bytes
DEPARTMENT (stored at site 2):
100 records
Each record is 35 bytes long
Dnumber field is 4 bytes long
Dname field is 10 bytes long
MGRSSN field is 9 bytes long
The size of the DEPARTMENT relation is 35 * 100 = 3500 bytes
Now consider the following query:
For each employee, retrieve the employee name and the name of the department for which
the employee works.
Now suppose that each record in the query result is 40 bytes long and that the query is
submitted at a distinct site, site 3, which is the result site.
There are then three strategies for executing this distributed query:
1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the result site (site 3) and
perform the join at that site. In this case a total of 1,000,000 + 3,500 =
1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2 (the site that holds the DEPARTMENT relation), execute the
join there, and send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so
400,000 + 1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1 (the site that holds the EMPLOYEE relation), execute the
join there, and send the result to site 3. In this case 400,000 + 3,500 = 403,500 bytes must be transferred.
If minimizing the amount of data transfer is the optimization criterion, strategy 3 should be chosen.
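The arithmetic behind the three strategies can be checked with a few lines of Python. This is only a sketch of the byte counts given above; the names (EMPLOYEE_BYTES, transfer_costs, and so on) are illustrative and not part of any DBMS.

```python
# Sizes taken from the example above.
EMPLOYEE_BYTES = 10_000 * 100    # 10,000 records of 100 bytes
DEPARTMENT_BYTES = 100 * 35      # 100 records of 35 bytes
RESULT_BYTES = 10_000 * 40       # one 40-byte result record per employee


def transfer_costs():
    """Return bytes shipped over the network for each strategy number."""
    return {
        1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,  # ship both relations to site 3
        2: EMPLOYEE_BYTES + RESULT_BYTES,      # ship EMPLOYEE to site 2, result to site 3
        3: DEPARTMENT_BYTES + RESULT_BYTES,    # ship DEPARTMENT to site 1, result to site 3
    }


costs = transfer_costs()
best = min(costs, key=costs.get)  # strategy with the fewest bytes transferred

for strategy, nbytes in costs.items():
    print(f"strategy {strategy}: {nbytes:,} bytes")
print("cheapest strategy:", best)
```

A real optimizer would also weigh local processing cost, but with data transfer as the only criterion the comparison reduces to these three sums.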
Nonjoin Queries in a Distributed DBMS:
Consider the following two relations:
Sailors (sid: integer, sname:string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
Now consider the following query:
SELECT S.age FROM Sailors S WHERE S.rating > 3 AND S.rating < 7
Now suppose that the Sailors relation is horizontally fragmented, with all the tuples having a rating
less than 5 stored at Shanghai and all the tuples having a rating greater than 5 stored at Tokyo.
The DBMS answers this query by evaluating it at both sites and then taking the union of the
answers.
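The evaluation can be sketched as follows. The two fragments and their tuples are made-up sample data, and each "site" is just a Python list; the point is only that the same selection runs on every fragment and the partial answers are combined.

```python
# Hypothetical fragments of Sailors (sample tuples invented for illustration).
shanghai = [  # tuples with rating < 5
    {"sid": 1, "sname": "Dustin", "rating": 2, "age": 30.0},
    {"sid": 2, "sname": "Lubber", "rating": 4, "age": 25.5},
]
tokyo = [     # tuples with rating > 5
    {"sid": 3, "sname": "Rusty", "rating": 6, "age": 41.0},
    {"sid": 4, "sname": "Horatio", "rating": 9, "age": 35.0},
]


def local_query(fragment):
    """SELECT S.age FROM fragment S WHERE S.rating > 3 AND S.rating < 7."""
    return [t["age"] for t in fragment if 3 < t["rating"] < 7]


def distributed_query(fragments):
    """Run the query at every site and combine the partial answers.

    SQL without DISTINCT returns a multiset, so the answers are simply
    concatenated rather than de-duplicated.
    """
    answer = []
    for fragment in fragments:
        answer.extend(local_query(fragment))
    return answer


ages = distributed_query([shanghai, tokyo])
print(ages)
```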
Joins in a Distributed DBMS:
Joins of relations stored at different sites can be very expensive, so we now consider the
evaluation options that are available in a distributed environment.
Suppose that the Sailors relation is stored at London and the Reserves relation is stored at
Paris. We will consider the following strategies for computing the join of Sailors and
Reserves.
In the next example the time taken to read one page from disk (or to write one page to
disk) is denoted as td and the time taken to ship one page (from any site to another site) as
ts.
Distributed Recovery
When a transaction commits, all its actions across all the sites at which it executes must
persist.
When a transaction aborts, none of its actions must be allowed to persist.
Concurrency Control and Recovery in Distributed Databases: For concurrency control and recovery
purposes, numerous problems arise in a distributed DBMS environment that are not encountered in a
centralized DBMS environment.
These include the following:
Dealing with multiple copies of the data items: The concurrency control method is responsible for
maintaining consistency among these copies. The recovery method is responsible for making a copy
consistent with other copies if the site on which the copy is stored fails and recovers later.
Failure of individual sites: The DBMS should continue to operate with its running sites, if possible,
when one or more individual sites fail. When a site recovers, its local database must be brought
up to date with the rest of the sites before it rejoins the system.
Failure of communication links: The system must be able to deal with failure of one or more of the
communication links that connect the sites. An extreme case of this problem is that network
partitioning may occur. This breaks up the sites into two or more partitions where the sites within
each partition can communicate only with one another and not with sites in other partitions.
Distributed Commit: Problems can arise with committing a transaction that is accessing databases
stored on multiple sites if some sites fail during the commit process. The two-phase commit
protocol is often used to deal with this problem.
Distributed Deadlock: Deadlock may occur among several sites, so techniques for dealing with
deadlocks must be extended to take this into account.
Lock management can be distributed across sites in many ways:
Centralized: A single site is in charge of handling lock and unlock requests for all
objects.
Primary copy: One copy of each object is designated as the primary copy. All requests
to lock or unlock a copy of this object are handled by the lock manager at the site where
the primary copy is stored, regardless of where the copy itself is stored.
Fully Distributed: Requests to lock or unlock a copy of an object stored at a site are
handled by the lock manager at the site where the copy is stored.
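The difference between the three schemes comes down to which site's lock manager receives a request. A minimal sketch, assuming sites are identified by simple names and that the function and parameter names below are purely illustrative:

```python
def lock_manager_site(scheme, copy_site, primary_site=None, central_site=None):
    """Return the site whose lock manager handles a lock/unlock request.

    scheme        -- "centralized", "primary_copy", or "fully_distributed"
    copy_site     -- site storing the copy being locked
    primary_site  -- site storing the primary copy of the object
    central_site  -- the single site in charge under centralized locking
    """
    if scheme == "centralized":
        return central_site    # one site handles every request
    if scheme == "primary_copy":
        return primary_site    # wherever the primary copy lives
    if scheme == "fully_distributed":
        return copy_site       # the site that stores this particular copy
    raise ValueError(f"unknown scheme: {scheme}")


# A copy stored at Paris, whose primary copy is at London, with Tokyo as
# the central lock site (all three site names are made up).
for scheme in ("centralized", "primary_copy", "fully_distributed"):
    site = lock_manager_site(scheme, "Paris", primary_site="London", central_site="Tokyo")
    print(scheme, "->", site)
```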
Distributed Deadlock
One issue that requires special attention when using either primary copy or fully
distributed locking is deadlock detection.
Each site maintains a local waits-for graph, and a cycle in a local graph indicates a
deadlock. However, a deadlock can exist even if no local graph contains a cycle.
For example:
Suppose that we have two sites, A and B, that both contain copies of objects O1 and O2, and
that the read-any write-all technique is used.
To detect such deadlocks, a distributed deadlock detection algorithm must be used. There
are three types of algorithms:
1. Centralized Algorithm:
It consists of periodically sending all local waits-for graphs to one site that is
responsible for global deadlock detection.
At this site, the global waits-for graph is generated by combining all the local graphs: the
set of nodes is the union of the nodes in the local graphs, and there is an edge
from one node to another if there is such an edge in any of the local graphs.
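The construction of the global graph can be sketched in a few lines. The graphs below are hypothetical: it is assumed that at site A transaction T1 waits for T2, and at site B transaction T2 waits for T1, so neither local graph has a cycle but the combined graph does. The function names are illustrative only.

```python
def merge_graphs(local_graphs):
    """Combine local waits-for graphs (dict: node -> set of nodes waited for).

    Nodes are the union of all local nodes; an edge exists in the global
    graph if it exists in any local graph.
    """
    merged = {}
    for graph in local_graphs:
        for node, waits_for in graph.items():
            merged.setdefault(node, set()).update(waits_for)
            for target in waits_for:
                merged.setdefault(target, set())
    return merged


def has_cycle(graph):
    """Depth-first search; a back edge to a node still being visited is a cycle."""
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        state[node] = "visiting"
        for nxt in graph.get(node, ()):
            if state.get(nxt) == "visiting":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)


site_a = {"T1": {"T2"}}   # assumed: at site A, T1 waits for T2
site_b = {"T2": {"T1"}}   # assumed: at site B, T2 waits for T1

print(has_cycle(site_a), has_cycle(site_b))        # no local deadlock
print(has_cycle(merge_graphs([site_a, site_b])))   # global deadlock
```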
2. Hierarchical Algorithm:
This algorithm groups the sites into a hierarchy. For example, the sites might be grouped by state,
then by country, and finally into a single group that contains all sites.
Every node in this hierarchy constructs a waits-for graph that reveals deadlocks involving
only sites contained in (the subtree rooted at) this node.
Thus, all sites periodically (e.g., every 10 seconds) send their local waits-for graph to the
site constructing the waits-for graph for their country.
The sites constructing the waits-for graphs at the country level periodically (e.g., every 10
minutes) send the country waits-for graph to the site constructing the global waits-for graph.
3. Simple Algorithm:
The transaction manager at the site where the transaction originated is called the
coordinator for the transaction, and the transaction managers at the sites where its subtransactions
execute are called subordinates.
Two Phase Commit Protocol:
When the user decides to commit the transaction, the commit command is sent to
the coordinator for the transaction.
This initiates the 2PC protocol:
The coordinator sends a Prepare message to each subordinate.
When a subordinate receives a Prepare message, it decides whether to abort or
commit its subtransaction. It force-writes an abort or prepare log record and then sends
a No or Yes message to the coordinator.
Here we can have two conditions:
o If the coordinator receives Yes messages from all subordinates, it force-writes a
commit log record and then sends a Commit message to all the subordinates.
o If it receives even one No message, or no response from some subordinate within a
specified time-out period, it force-writes an abort log record and then sends an Abort
message to all subordinates.
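The coordinator's decision rule can be sketched as below. This is a simplified model, not a working protocol implementation: votes are passed in as a dictionary, a missing reply (time-out) is represented by None, and a Python list stands in for the force-written log. All names are illustrative.

```python
def two_phase_commit(votes):
    """Decide the outcome of a transaction from the subordinates' votes.

    votes -- dict mapping subordinate name to "yes", "no", or None
             (None models no response within the time-out period).
    Returns (decision, coordinator_log, messages_sent).
    """
    coordinator_log = []

    # Phase 1: a Prepare message went to every subordinate; `votes`
    # holds the replies that came back.
    if all(vote == "yes" for vote in votes.values()):
        decision = "commit"
    else:
        decision = "abort"   # at least one No, or a time-out

    # Phase 2: record the decision first (stand-in for force-writing the
    # log record), then notify every subordinate.
    coordinator_log.append(decision)
    messages = {subordinate: decision for subordinate in votes}
    return decision, coordinator_log, messages


print(two_phase_commit({"site1": "yes", "site2": "yes"})[0])
print(two_phase_commit({"site1": "yes", "site2": "no"})[0])
print(two_phase_commit({"site1": "yes", "site2": None})[0])
```

Writing the log record before sending any message matters: if the coordinator crashes after deciding, the decision can be recovered from the log.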
UNIT IV
INTRODUCTION TO DATABASE SECURITY
There are three main objectives to consider while designing a secure database application:
1. Secrecy: Information should not be disclosed to unauthorized users. For example, a student should
not be allowed to examine other students' grades.
2. Integrity: Only authorized users should be allowed to modify data. For example, students may be
allowed to see their grades, yet not allowed (obviously!) to modify them.
3. Availability: Authorized users should not be denied access. For example, an instructor who wishes to
change a grade should be allowed to do so.
A DBMS typically includes a database security and authorization subsystem that is responsible for
ensuring the security of portions of a database against unauthorized access. It is now customary to refer
to two types of database security mechanisms:
Discretionary security mechanisms: These are used to grant privileges to users, including the capability to
access specific data files, records, or fields in a specified mode (such as read, insert, delete, or update).
Mandatory security mechanisms: These are used to enforce multilevel security by classifying the data and
users into various security classes (or levels) and then implementing the appropriate security policy of the
organization. For example, a typical policy is to permit users at a certain classification level to see only
data items classified at the user's own level. An extension of this is role-based security, which enforces
policies and privileges based on the concept of roles.
ACCESS CONTROL
A DBMS should provide mechanisms to control access to data. A DBMS offers two main approaches to
access control.
Discretionary access control
Mandatory access control
Discretionary access control: It is based on the concept of access rights, or privileges, and
mechanisms for giving users such privileges. A privilege allows a user to access some data object in a
certain manner (e.g., to read or to modify). A user who creates a database object such as a table or a view
automatically gets all applicable privileges on that object. SQL-92 supports discretionary access control
through the GRANT and REVOKE commands.
The GRANT command gives users privileges on base tables and views. The syntax of this command is as
follows:
GRANT privileges ON object TO users [WITH GRANT OPTION]
Here object is either a base table or a view.
Several privileges can be specified including:
SELECT: The right to access (read) all columns of the table specified as object, including columns added
later through ALTER TABLE commands.
INSERT(column-name): The right to insert rows with (non-null or non-default) values in the named
column of the table named as object. The privileges UPDATE(column-name) and UPDATE are similar to
INSERT.
DELETE: The right to delete rows from the table named as object.
REFERENCES(column-name): The right to define foreign keys (in other tables) that refer to the
specified column of the table object. REFERENCES without a column name specified denotes this right
with respect to all columns.
For Example:
Suppose that user joe has created the tables BOATS, RESERVES, and SAILORS. Some examples of
GRANT command that joe can now execute are:
GRANT INSERT, DELETE ON RESERVES TO Yuppy WITH GRANT OPTION
GRANT SELECT ON RESERVES TO Michael
GRANT SELECT ON SAILORS TO Michael WITH GRANT OPTION
GRANT UPDATE (rating) ON SAILORS TO Leah
GRANT REFERENCES (bid) ON BOATS TO Bill
Adding WITH GRANT OPTION at the end of the GRANT command allows the user who has been granted
the privilege to pass that privilege on to other users.
In the above examples, Yuppy can insert or delete Reserves rows and can authorize someone else to do
the same. Michael can execute SELECT queries on Sailors and Reserves, and he can pass this privilege on to
others for Sailors, but not for Reserves.
The REVOKE command takes away privileges.
It is the complementary command to GRANT and allows the withdrawal of privileges.
The syntax of the REVOKE command is as follows:
REVOKE [GRANT OPTION FOR] privileges
ON object FROM users {RESTRICT | CASCADE}
The command can be used to revoke either a privilege or just the grant option on a privilege (by using the
optional GRANT OPTION FOR clause).
A user who has granted a privilege to another user may change his or her mind and want to withdraw the
granted privilege. The intuition behind exactly what effect a REVOKE command has is complicated by the fact
that a user may be granted the same privilege multiple times, possibly by different users.
When a user executes a REVOKE command with the CASCADE keyword, the effect is to withdraw the
named privileges or grant option from all users who currently hold these privileges solely through a
GRANT command that was previously executed by the same user who is now executing the REVOKE
command. If these users received the privileges with the grant option and passed them along, those
recipients will also lose their privileges as a consequence of the REVOKE command unless they received
these privileges independently.
For Example:
GRANT SELECT ON Sailors TO Art WITH GRANT OPTION    (executed by Joe)
GRANT SELECT ON Sailors TO Bob WITH GRANT OPTION    (executed by Art)
REVOKE SELECT ON Sailors FROM Art CASCADE           (executed by Joe)
Art loses the SELECT privilege on Sailors, of course. Then Bob, who received this privilege from Art, and
only Art, also loses this privilege.
If the RESTRICT keyword is specified in the REVOKE command, the command is rejected if revoking the
privileges just from the users specified in the command would result in other privileges becoming
abandoned.
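The cascading effect can be illustrated with a small model in which each GRANT is an edge from grantor to grantee. This is a simplified sketch, not how any particular DBMS implements REVOKE: it ignores timestamps and the RESTRICT option, and the Grant tuple and function names are made up for the illustration. A grant survives only if its grantor is the owner or still holds the privilege with the grant option through some surviving chain of grants.

```python
from collections import namedtuple

Grant = namedtuple("Grant", "grantor grantee privilege grant_option")


def revoke_cascade(grants, owner, revoker, revokee, privilege):
    """Return the grants that remain after REVOKE ... CASCADE."""
    # Step 1: drop the grants the revoker gave directly to the revokee.
    remaining = [g for g in grants
                 if not (g.grantor == revoker and g.grantee == revokee
                         and g.privilege == privilege)]
    others = [g for g in remaining if g.privilege != privilege]
    candidates = [g for g in remaining if g.privilege == privilege]

    # Step 2: keep only grants still supported by a chain starting at the owner.
    can_grant = {owner}
    supported = []
    changed = True
    while changed:
        changed = False
        for g in candidates:
            if g not in supported and g.grantor in can_grant:
                supported.append(g)
                if g.grant_option:
                    can_grant.add(g.grantee)
                changed = True
    return others + supported


def holders(grants, privilege):
    """Users who hold the privilege through some grant."""
    return {g.grantee for g in grants if g.privilege == privilege}


# Joe owns Sailors; Joe grants to Art, Art grants to Bob, Joe revokes from Art.
grants = [Grant("Joe", "Art", "SELECT", True), Grant("Art", "Bob", "SELECT", True)]
after = revoke_cascade(grants, "Joe", "Joe", "Art", "SELECT")
print(holders(after, "SELECT"))

# If Joe had also granted SELECT to Bob directly, Bob would keep it.
grants2 = grants + [Grant("Joe", "Bob", "SELECT", True)]
after2 = revoke_cascade(grants2, "Joe", "Joe", "Art", "SELECT")
print(holders(after2, "SELECT"))
```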
Mandatory access control: It is based on system wide policies that cannot be changed by individual
users. In this approach each database object is assigned a security class, each user is assigned
clearance for a security class, and rules are imposed on reading and writing of database objects by users.
The DBMS determines whether a given user can read or write a given object based on certain rules that
involve the security level of the object and the clearance of the user.
The popular model for mandatory access control, called the Bell-LaPadula model, is described in terms of
objects (e.g., tables, views, rows, columns), subjects (e.g., users, programs), security classes, and
clearances. Each database object is assigned a security class, and each subject is assigned clearance
for a security class; we will denote the class of an object or subject A as class(A). The security classes in
a system are organized according to a partial order, with a most secure class and a least secure class.
For simplicity, we will assume that there are four classes: top secret (TS), secret (S), confidential (C), and
unclassified (U). In this system, TS > S > C > U, where A > B means that class A data is more sensitive
than class B data.
The Bell-LaPadula model imposes two restrictions on all reads and writes of database objects:
1. Simple Security Property: Subject S is allowed to read object O only if class(S) ≥ class(O). For
example, a user with TS clearance can read a table with C classification, but a user with C clearance is not
allowed to read a table with TS classification.
2. *-Property: Subject S is allowed to write object O only if class(S) ≤ class(O). For example, a user with
S clearance can write only objects with S or TS classification.
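The two rules translate directly into two comparisons. In the sketch below the four classes are mapped to integers so that TS > S > C > U; the names LEVELS, can_read, and can_write are illustrative.

```python
# Higher number = more sensitive, matching TS > S > C > U.
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}


def can_read(subject_class, object_class):
    """Simple Security Property: read only at or below your own class."""
    return LEVELS[subject_class] >= LEVELS[object_class]


def can_write(subject_class, object_class):
    """*-Property: write only at or above your own class."""
    return LEVELS[subject_class] <= LEVELS[object_class]


print(can_read("TS", "C"))   # TS user reading a C table
print(can_read("C", "TS"))   # C user reading a TS table
print(can_write("S", "TS"))  # S user writing a TS object
print(can_write("S", "C"))   # S user writing a C object
```

Together the rules let information flow only upward: a subject can never copy data it has read into an object of a lower class.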
bid    bname     color     Security class
101    Salsa     Red       S
102    Pinto     Brown     C
The Boats table is defined to have bid as the primary key. Suppose that a user with clearance C wishes
to enter the row <101, Picante, Scarlet, C>. We have a dilemma:
If the insertion is permitted, two distinct rows in the table will have key 101.
If the insertion is not permitted because the primary key constraint is violated, the user trying to insert the
new row, who has clearance C, can infer that there is a boat with bid=101 whose security class is higher
than C. This situation compromises the principle that users should not be able to infer any information
about objects that have a higher security classification.
This dilemma is resolved by effectively treating the security classification as part of the key. Thus, the
insertion is allowed to continue, and the table instance is modified as shown in the table below.
bid    bname     color     Security class
101    Salsa     Red       S
101    Picante   Scarlet   C
102    Pinto     Brown     C
Users with clearance C or U see just the rows for Picante and Pinto, but users with clearance S or TS see
all three rows. The two rows with bid=101 can be interpreted in one of two ways: only the row with the
higher classification (Salsa, with classification S) actually exists, or both exist and their presence is
revealed to users according to their clearance level. The choice of interpretation is up to application
developers and users.
developing a security policy. The DBA has a special account, which we will call the system account, and
is responsible for the overall security of the system. In particular the DBA deals with the following:
1. Creating new accounts: Each new user or group of users must be assigned an authorization id and a
password. Note that application programs that access the database have the same authorization id as the
user executing the program.
2. Mandatory control issues: If the DBMS supports mandatory control (some customized systems for
applications with very high security requirements, for example military data, provide such support), the
DBA must assign security classes to each database object and assign security clearances to each
authorization id in accordance with the chosen security policy.
3. Audit trail: The DBA is also responsible for maintaining the audit trail, which is essentially the log of
updates with the authorization id (of the user who is executing the transaction) added to each log entry.
This log is just a minor extension of the log mechanism used to recover from crashes. Additionally, the
DBA may choose to maintain a log of all actions, including reads, performed by a user. Analyzing such
histories of how the DBMS was accessed can help prevent security violations by identifying suspicious
patterns before an intruder finally succeeds in breaking in, or it can help track down an intruder after a
violation has been detected.
Encryption
A DBMS can use encryption to protect information in certain situations where the normal security
mechanisms of the DBMS are not adequate. For example, an intruder may steal tapes containing some
data or tap a communication line. By storing and transmitting data in an encrypted form, the DBMS
ensures that such stolen data is not intelligible to the intruder.
Encryption is done by applying an encryption algorithm to the original data and an encryption key. The
output of the algorithm is the encrypted version of the data. There is also a decryption algorithm, which
takes the encrypted data and the encryption key as input and returns the original data. The Data Encryption
Standard (DES) is an example of this approach. The main weakness of this approach is that authorized users
must be told the encryption key, and the mechanism for communicating this information is vulnerable to
clever intruders.
Another approach is called public-key encryption. The encryption scheme proposed by Rivest, Shamir,
and Adleman, called RSA, is a well-known example of public-key encryption. In this scheme each authorized
user has a public encryption key, known to everyone, and a private decryption key, chosen by the user and
known only to him or her.
For example: Consider a user called Sam. Anyone can send Sam a secret message by encrypting the
message using Sam's publicly known encryption key. Only Sam can decrypt this secret message because
the decryption algorithm requires Sam's decryption key, known only to Sam. Since users choose their own
decryption keys, the weakness of DES is avoided.
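The public/private key idea can be seen in a toy RSA computation. The numbers below (primes 61 and 53, public exponent 17) are a common classroom choice and are far too small to be secure; the helper names are illustrative, and real systems use a vetted cryptographic library with keys of 2048 bits or more. The sketch assumes Python 3.8 or later for the modular inverse via pow.

```python
from math import gcd


def make_keys(p, q, e):
    """Build a toy RSA key pair from two primes and a public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)
    assert gcd(e, phi) == 1, "e must be coprime to (p-1)(q-1)"
    d = pow(e, -1, phi)       # private exponent: inverse of e modulo phi
    return (e, n), (d, n)     # (public key, private key)


def encrypt(message, public_key):
    e, n = public_key
    return pow(message, e, n)


def decrypt(ciphertext, private_key):
    d, n = private_key
    return pow(ciphertext, d, n)


# Sam publishes public_key and keeps private_key secret.
public_key, private_key = make_keys(61, 53, 17)

message = 65                                   # must be smaller than n
ciphertext = encrypt(message, public_key)      # anyone can do this
recovered = decrypt(ciphertext, private_key)   # only Sam can do this
print(public_key, ciphertext, recovered)
```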
UNIT V
What is Postgres?
Traditional relational database management systems (DBMSs) support a data model consisting of a
collection of named relations, containing attributes of a specific type. In current commercial systems,
possible types include floating point numbers, integers, character strings, money, and dates. It is
commonly recognized that this model is inadequate for future data processing applications. The relational
model successfully replaced previous models in part because of its "Spartan simplicity". However, as
mentioned, this simplicity often makes the implementation of certain applications very difficult. Postgres
offers substantial additional power by incorporating the following four additional basic concepts in such a
way that users can easily extend the system:
classes
inheritance
types
functions
Other features provide additional power and flexibility:
constraints
triggers
rules
transaction integrity
These features put Postgres into the category of databases referred to as object-relational. Postgres is a
client/server application. As a user, you only need access to the client portions of the installation.
POSTGRES ARCHITECTURE
Postgres uses a simple "process per-user" client/server model. A Postgres session consists of the
following cooperating UNIX processes (programs):
A supervisory daemon process (postmaster),
The user's frontend application (e.g., the psql program), and
One or more backend database servers (the postgres process itself).
A single postmaster manages a given collection of databases on a single host. Such a collection of
databases is called an installation or site. Frontend applications that wish to access a given database
within an installation make calls to the library. The library sends user requests over the network to the
postmaster (How a connection is established), which in turn starts a new backend server process and
connects the frontend process to the new server. From that point on, the frontend process and the
backend server communicate without intervention by the postmaster. Hence, the postmaster is always
running, waiting for requests, whereas frontend and backend processes come and go.
Transactions in POSTGRES
Transactions are a fundamental concept of all database systems. The essential point of a
transaction is that it bundles multiple steps into a single, all-or-nothing operation. The
intermediate states between the steps are not visible to other concurrent transactions, and if some
failure occurs that prevents the transaction from completing, then none of the steps affect the
database at all.
For example, consider a bank database that contains balances for various customer accounts, as
well as total deposit balances for branches. Suppose that we want to record a payment of $100.00
from Alice's account to Bob's account. Simplifying outrageously, the SQL commands for this
might look like
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00
WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00
WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00
WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');
The details of these commands are not important here; the important point is that there are
several separate updates involved to accomplish this rather simple operation. Our bank's officers
will want to be assured that either all these updates happen, or none of them happen. It would
certainly not do for a system failure to result in Bob receiving $100.00 that was not debited from
Alice. Nor would Alice long remain a happy customer if she was debited without Bob being
credited. We need a guarantee that if something goes wrong partway through the operation, none
of the steps executed so far will take effect. Grouping the updates into a transaction gives us this
guarantee. A transaction is said to be atomic: from the point of view of other transactions, it
either happens completely or not at all.
In PostgreSQL, a transaction is set up by surrounding the SQL commands of the transaction with
BEGIN and COMMIT commands. So our banking transaction would actually look like
BEGIN;
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
-- etc etc
COMMIT;
If, partway through the transaction, we decide we do not want to commit (perhaps we just
noticed that Alice's balance went negative), we can issue the command ROLLBACK instead of
COMMIT, and all our updates so far will be canceled.
PostgreSQL actually treats every SQL statement as being executed within a transaction. If you
do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if
successful) COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT
is sometimes called a transaction block.
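The all-or-nothing behaviour can be tried out without a PostgreSQL server. The sketch below uses SQLite through Python's standard sqlite3 module as a stand-in, so it illustrates the transaction concept rather than PostgreSQL itself; the table layout, the starting balances, and the transfer function are assumptions made for the example. The `with conn:` block commits if it finishes normally and rolls back if an exception is raised.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on commit."""
    try:
        with conn:  # COMMIT on success, ROLLBACK on exception
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                # Source balance went negative: abandon the transaction.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("Alice", 150.0), ("Bob", 50.0)])
conn.commit()

first = transfer(conn, "Alice", "Bob", 100.0)    # succeeds and commits
after_first = balances(conn)
second = transfer(conn, "Alice", "Bob", 100.0)   # would overdraw: rolled back
after_second = balances(conn)
print(first, after_first)
print(second, after_second)
```

After the failed second transfer the debit from Alice's account is undone as well, so the balances are exactly what they were after the first transfer.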
XML stands for the eXtensible Markup Language. It is a new markup language, developed by
the W3C (World Wide Web Consortium).
Some of the areas where XML will be useful in the near-term include:
large Web site maintenance. XML would work behind the scene to simplify the creation of
HTML documents
exchange of information between organizations
off loading and reloading of databases
syndicated content, where content is being made available to different Web sites
electronic commerce applications where different organizations collaborate to serve a customer
scientific applications with new markup languages for mathematical and chemical formulas
electronic books with new markup languages to express rights and ownership
handheld devices and smart phones with new markup languages optimized for these
alternative devices
XML makes essentially two changes to HTML:
It predefines no tags.
It is stricter.
No Predefined Tags
Because there are no predefined tags in XML, you, the author, can create the tags that you need.
Example:
<price currency="usd">499.00</price>
<toc xlink:href="/newsletter">Pineapplesoft Link</toc>
Stricter
HTML has a very forgiving syntax. This is great for authors who can be as lazy as they want, but
it also makes Web browsers more complex. According to some estimates, more than 50% of the
code in a browser handles errors or sloppiness on the author's part.
XML Example:
A List of Products in XML
<?xml version="1.0"?>
<products>
  <product id="p1">
    <name>XML Editor</name>
    <price>499.00</price>
  </product>
  <product id="p2">
    <name>DTD Editor</name>
    <price>199.00</price>
  </product>
  <product id="p3">
    <name>XML Book</name>
    <price>19.99</price>
  </product>
  <product id="p4">
    <name>XML Training</name>
    <price>699.00</price>
  </product>
</products>
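The stricter syntax is easy to observe with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module on a shortened version of the product list; the helper is_well_formed is an illustrative name. A document whose attribute value is not quoted is rejected outright, where an HTML browser would try to recover.

```python
import xml.etree.ElementTree as ET

WELL_FORMED = """<?xml version="1.0"?>
<products>
  <product id="p1"><name>XML Editor</name><price>499.00</price></product>
  <product id="p2"><name>DTD Editor</name><price>199.00</price></product>
</products>"""

# Same structure, but the attribute value is not quoted.
NOT_WELL_FORMED = "<products><product id=p1><name>XML Editor</name></product></products>"


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(WELL_FORMED)
# Collect (id, name, price) for every product element.
products = [(p.get("id"), p.findtext("name"), float(p.findtext("price")))
            for p in root.findall("product")]

print(products)
print(is_well_formed(WELL_FORMED), is_well_formed(NOT_WELL_FORMED))
```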
XML Schemas
The DTD is the original modeling language or schema for XML.
The syntax for DTDs is different from the syntax for XML documents.
The purpose of a DTD is to define the structure of an XML document. It defines the structure
with a list of legal elements:
Example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
XML Schema
XML NAMESPACES
XSL
XSL stands for EXtensible Stylesheet Language.
The World Wide Web Consortium (W3C) started to develop XSL because there was a need for
an XML-based Stylesheet Language.
What is XSLT?
XSLT is a language for transforming XML documents into XHTML documents or into other XML documents.
XPath is a language for navigating through elements and attributes in XML documents. XSLT uses XPath to
find information in an XML document.
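A feel for XPath navigation can be had from Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath (a full XSLT processor supports much more). The document and the helper names_matching below are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """<products>
  <product id="p1"><name>XML Editor</name><price>499.00</price></product>
  <product id="p2"><name>DTD Editor</name><price>199.00</price></product>
</products>"""

root = ET.fromstring(DOC)


def names_matching(path):
    """Return the text of every element selected by the path expression."""
    return [element.text for element in root.findall(path)]


# Child steps: every <name> inside a <product> under the root.
all_names = names_matching("./product/name")

# Attribute predicate: only the product whose id attribute is "p2".
p2_name = names_matching("./product[@id='p2']/name")

# Descendant step: every <price> anywhere below the root.
all_prices = names_matching(".//price")

print(all_names, p2_name, all_prices)
```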
What is XSL-FO?