Unit 2

- Relational Algebra defines ways to manipulate relations (tables) through unary and binary operations. Join and semi-join are binary operations.
- A join combines records from two tables based on common values. A semi-join returns rows from the first table that match rows in the second table, but returns each row from the first table at most once.
- Semi-joins are useful in distributed databases to reduce data transfer between sites. They can improve the performance of certain queries by sending less data over the network than regular joins.


Relational Algebra:

• Relational algebra defines the ways in which relations (tables) can be operated on to manipulate their data.
• It is composed of unary operations (involving a single table) and binary operations (involving multiple tables).
• Join and semi-join are binary operations in relational algebra.

Join
• Join is a binary operation in Relational Algebra.
• It combines records from two or more tables in a database.
• A join is a means for combining fields from two tables by using values
common to each.

Semi-Join
• A join where the result only contains the columns from one of the joined tables.
• Useful in distributed databases, so we don't have to send as much data over the network.
• Can dramatically speed up certain classes of queries.

What is a “Semi-Join”?
Semi-join strategies are techniques used for query processing in distributed database systems, mainly to reduce communication cost.
A semi-join between two tables returns rows from the first table where one or more matches are found in the second table.

The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be returned.

Semi-joins are written using EXISTS or IN.

A Simple Semi-Join Example
“Give a list of departments with at least one employee.”

Query written with a conventional join:

SELECT D.deptno, D.dname
FROM dept D, emp E
WHERE E.deptno = D.deptno
ORDER BY D.deptno;

A department with N employees will appear in the list N times.
◦ We could use a DISTINCT keyword to get each department to appear only once.

A Simple Semi-Join Example
“Give a list of departments with at least one employee.”

Query written with a semi-join:

SELECT D.deptno, D.dname
FROM dept D
WHERE EXISTS (SELECT 1 FROM emp E WHERE E.deptno = D.deptno)
ORDER BY D.deptno;

◦ No department appears more than once.
◦ Oracle stops processing each department as soon as the first employee in that department is found.
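Since a semi-join can be written using EXISTS or IN, the same query can also be expressed with IN. A minimal sketch using the same dept and emp tables:

SELECT D.deptno, D.dname
FROM dept D
WHERE D.deptno IN (SELECT E.deptno FROM emp E)
ORDER BY D.deptno;

Either form lets the optimizer evaluate the query as a semi-join, so each department is still returned at most once.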



Semi-join was introduced in Oracle 8.0. It provides an efficient method of performing a WHERE EXISTS subquery, as in the following example.

Departments table

CREATE TABLE "DEPARTMENTS"
(
  "DEPARTMENT_ID"   NUMBER(10,0) NOT NULL ENABLE,
  "DEPARTMENT_NAME" VARCHAR2(50) NOT NULL ENABLE,
  CONSTRAINT "DEPARTMENTS_PK" PRIMARY KEY ("DEPARTMENT_ID") ENABLE
)
/

Customer table

CREATE TABLE "CUSTOMER"
(
  "CUSTOMER_ID"   NUMBER,
  "FIRST_NAME"    VARCHAR2(4000),
  "LAST_NAME"     VARCHAR2(4000),
  "DEPARTMENT_ID" NUMBER
)
/

Execute this query

SELECT departments.department_id, departments.department_name
FROM departments
WHERE EXISTS
(
  SELECT 1
  FROM customer
  WHERE customer.department_id = departments.department_id
)
ORDER BY departments.department_id;

Output

Layers of Query Processing

Query Decomposition
The first layer decomposes the calculus query into an algebraic query on global
relations. The information needed for this transformation is found in the global
conceptual schema describing the global relations. However, the information about
data distribution is not used here but in the next layer. Thus the techniques used by
this layer are those of a centralized DBMS.

Query decomposition can be viewed as four successive steps. First, the calculus query
is rewritten in a normalized form that is suitable for subsequent manipulation.
Normalization of a query generally involves the manipulation of the query quantifiers
and of the query qualification by applying logical operator priority.

Second, the normalized query is analyzed semantically so that incorrect queries are
detected and rejected as early as possible. Techniques to detect incorrect queries exist
only for a subset of relational calculus. Typically, they use some sort of graph that
captures the semantics of the query.

Third, the correct query (still expressed in relational calculus) is simplified. One way
to simplify a query is to eliminate redundant predicates. Note that redundant queries
are likely to arise when a query is the result of system transformations applied to the
user query. Such transformations are used for performing semantic data control
(views, protection, and semantic integrity control).

Fourth, the calculus query is restructured as an algebraic query. It is well known that several algebraic queries can be derived from the same calculus query, and that some algebraic queries
are “better” than others. The quality of an algebraic query is defined in terms of
expected performance. The traditional way to do this transformation toward a “better”
algebraic specification is to start with an initial algebraic query and transform it in
order to find a “good” one. The initial algebraic query is derived immediately from
the calculus query by translating the predicates and the target statement into relational
operators as they appear in the query. This directly translated algebra query is then
restructured through transformation rules. The algebraic query generated by this layer
is good in the sense that the worst executions are typically avoided. For instance, a
relation will be accessed only once, even if there are several select predicates.
However, this query is generally far from providing an optimal execution, since
information about data distribution and fragment allocation is not used at this layer.

Data Localization
The input to the second layer is an algebraic query on global relations. The main role
of the second layer is to localize the query’s data using data distribution information
in the fragment schema. We saw that relations are fragmented and stored in disjoint
subsets, called fragments, each being stored at a different site. This layer determines
which fragments are involved in the query and transforms the distributed query into a
query on fragments. Fragmentation is defined by fragmentation predicates that can be
expressed through relational operators. A global relation can be reconstructed by
applying the fragmentation rules, and then deriving a program, called a localization
program, of relational algebra operators, which then act on fragments. Generating a
query on fragments is done in two steps. First, the query is mapped into a fragment
query by substituting each relation by its reconstruction program (also
called materialization program). Second, the fragment query is simplified and
restructured to produce another “good” query. Simplification and restructuring may
be done according to the same rules used in the decomposition layer. As in the
decomposition layer, the final fragment query is generally far from optimal because
information regarding fragments is not utilized.
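As an illustration (a sketch, assuming the global relation EMP is horizontally fragmented into EMP1, EMP2 and EMP3 stored at three different sites), the localization program reconstructs EMP as the union of its fragments, and the first step replaces EMP in the query by that program:

$$EMP = EMP_1 \cup EMP_2 \cup EMP_3$$

$$\sigma_{TITLE="Programmer"}(EMP) \;\Rightarrow\; \sigma_{TITLE="Programmer"}(EMP_1 \cup EMP_2 \cup EMP_3)$$

The second step would then push the selection onto each fragment and drop any fragment whose defining predicate contradicts it.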

Global Query Optimization


The input to the third layer is an algebraic query on fragments. The goal of query optimization is
to find an execution strategy for the query which is close to optimal. Remember that finding the
optimal solution is computationally intractable. An execution strategy for a distributed query can
be described with relational algebra operators and communication primitives (send/receive
operators) for transferring data between sites. The previous layers have already optimized the
query, for example, by eliminating redundant expressions. However, this optimization is
independent of fragment characteristics such as fragment allocation and cardinalities. In addition,
communication operators are not yet specified. By permuting the ordering of operators within one
query on fragments, many equivalent queries may be found.

Query optimization consists of finding the “best” ordering of operators in the query, including
communication operators that minimize a cost function. The cost function, often defined in terms
of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost,
communication cost, and so on. Generally, it is a weighted combination of I/O, CPU, and
communication costs. Nevertheless, a typical simplification made by the early distributed DBMSs,
as we mentioned before, was to consider communication cost as the most significant factor. This
used to be valid for wide area networks, where the limited bandwidth made communication much
more costly than local processing. This is not true anymore today and communication cost can be
lower than I/O cost. To select the ordering of operators it is necessary to predict execution costs of
alternative candidate orderings. Determining execution costs before query execution (i.e., static
optimization) is based on fragment statistics and the formulas for estimating the cardinalities of
results of relational operators. Thus the optimization decisions depend on the allocation of
fragments and available statistics on fragments which are recorded in the allocation schema.
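One common way to write such a weighted cost function is sketched below; the unit-cost coefficients are system-specific and would be calibrated per installation:

$$Total\_cost = c_{CPU} \cdot \#insts + c_{I/O} \cdot \#I/Os + c_{MSG} \cdot \#msgs + c_{TR} \cdot \#bytes$$

where $c_{CPU}$, $c_{I/O}$, $c_{MSG}$ and $c_{TR}$ are the unit costs of a CPU instruction, a disk I/O, initiating a message, and transmitting one byte, respectively. The early distributed DBMSs mentioned above effectively kept only the communication terms; modern optimizers weight all four.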

An important aspect of query optimization is join ordering, since permutations of the joins within
the query may lead to improvements of orders of magnitude. One basic technique for optimizing a
sequence of distributed join operators is through the semijoin operator. The main value of the
semijoin in a distributed system is to reduce the size of the join operands and then the
communication cost. However, techniques which consider local processing costs as well as
communication costs may not use semijoins because they might increase local processing costs.
The output of the query optimization layer is an optimized algebraic query with communication
operators included on fragments. It is typically represented and saved (for future executions) as
a distributed query execution plan.
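To make the semijoin-based reduction mentioned above concrete, suppose relations R (stored at site 1) and S (stored at site 2) are to be joined on attribute A. A sketch of the standard identity the strategy relies on:

$$R \bowtie_A S = (R \ltimes_A S) \bowtie_A S, \qquad \text{where } R \ltimes_A S = R \bowtie_A \pi_A(S)$$

Only $\pi_A(S)$ is shipped from site 2 to site 1, reducing R to the tuples that will actually participate in the join; the reduced R is then shipped to site 2. This pays off when producing and shipping $\pi_A(S)$ plus the reduced R costs less than shipping all of R.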

Distributed Query Execution


The last layer is performed by all the sites having fragments involved in the query. Each subquery
executing at one site, called a local query, is then optimized using the local schema of the site and
executed. At this time, the algorithms to perform the relational operators may be chosen. Local
optimization uses the algorithms of centralized systems.

The goal of distributed query processing may be summarized as follows: given a calculus query
on a distributed database, find a corresponding execution strategy that minimizes a system cost
function that includes I/O, CPU, and communication costs. An execution strategy is specified in
terms of relational algebra operators and communication primitives (send/receive) applied to the
local databases (i.e., the relation fragments). Therefore, the complexity of relational operators that
affect the performance of query execution is of major importance in the design of a query
processor.

Localization of Distributed Data

➢ The general techniques for decomposing and restructuring queries are expressed in relational calculus. These techniques apply to both centralized and distributed DBMSs.

➢ The role of the localization layer is to translate an algebraic query on global relations into an algebraic query expressed on physical fragments.

➢ Localization uses information stored in the fragment schema.

➢ Fragmentation is defined through fragmentation rules, which can be expressed as relational queries. A naive way to localize a distributed query is to generate a query where each global relation is substituted by its localization program.

1 Reduction for Primary Horizontal Fragmentation:

The horizontal fragmentation function distributes a relation based on selection predicates. The reduction of queries on horizontally fragmented relations consists primarily of determining, after restructuring the subtrees, those that will produce empty relations, and removing them.
Horizontal fragmentation can be exploited to simplify both selection and join operations.
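For example (a sketch, assuming EMP is fragmented on ENO by the predicates shown; the fragment names are illustrative), a selection that contradicts a fragment's predicate makes that fragment's subtree produce an empty relation, so the subtree can be removed:

$$EMP_1 = \sigma_{ENO \le "E3"}(EMP) \qquad EMP_2 = \sigma_{ENO > "E3"}(EMP)$$

$$\sigma_{ENO="E5"}(EMP_1 \cup EMP_2) \;\Rightarrow\; \sigma_{ENO="E5"}(EMP_2)$$

since the predicate ENO = "E5" contradicts the predicate ENO ≤ "E3" that defines EMP1.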

2 Reduction for Vertical Fragmentation

The vertical fragmentation function distributes a relation based on projection attributes. Since the reconstruction operator for vertical fragmentation is the join, the localization program for a vertically fragmented relation consists of the join of the fragments on the common attribute. Similar to horizontal fragmentation, queries on vertical fragments can be reduced by determining the useless intermediate relations and removing the subtrees that produce them.
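For example (a sketch, assuming the two vertical fragments of EMP shown below, both keeping the key ENO), a query that projects only ENAME does not need the fragment carrying TITLE; the join with it is a useless intermediate relation and its subtree can be removed:

$$EMP_1 = \pi_{ENO,\,ENAME}(EMP) \qquad EMP_2 = \pi_{ENO,\,TITLE}(EMP)$$

$$\pi_{ENAME}(EMP_1 \bowtie_{ENO} EMP_2) \;\Rightarrow\; \pi_{ENAME}(EMP_1)$$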

3 Reduction for Derived Fragmentation:

The join operation, which is probably the most important operation because it is both frequent and expensive, can be optimized by using primary horizontal fragmentation when the joined relations are fragmented according to the join attributes. In this case the join of two relations is implemented as a union of partial joins. However, this method precludes one of the relations from being fragmented on a different attribute used for selection. Derived horizontal fragmentation is another way of distributing two relations so that the joint processing of select and join is improved.
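As a sketch, if ASG is fragmented by derivation from the fragments of EMP on the join attribute ENO (i.e. $ASG_i = ASG \ltimes_{ENO} EMP_i$), the distributed join reduces to a union of partial joins between matching fragments only:

$$EMP \bowtie_{ENO} ASG = (EMP_1 \bowtie_{ENO} ASG_1) \cup (EMP_2 \bowtie_{ENO} ASG_2)$$

because tuples of $ASG_i$ can join only with tuples of $EMP_i$, so the cross-fragment pairs can be dropped.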

4 Reduction for Hybrid Fragmentation

Hybrid fragmentation is obtained by combining the fragmentation functions discussed above. The goal of hybrid fragmentation is to efficiently support queries involving projection, selection, and join. Note that the optimization of an operation or of a combination of operations is always done at the expense of other operations.

Query Decomposition in details

The first layer decomposes the calculus query into an algebraic query on global relations. The information needed for this transformation is found in the global conceptual schema describing the global relations. Both input and output queries refer to global relations, without knowledge of the distribution of data. Therefore, query decomposition is the same for centralized and distributed systems.

Query decomposition can be viewed in four successive steps:

1) Normalization
2) Analysis
3) Elimination of redundancy
4) Rewriting.

First, the calculus query is rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of a query generally involves the manipulation of the query quantifiers and of the query qualification by applying logical operator priority.
Second, the normalized query is analyzed semantically so that incorrect queries are detected and rejected as early as possible. Typically, this uses some sort of graph that captures the semantics of the query.

Normalization

The input query may be arbitrarily complex. It is the goal of normalization to transform the query into a normalized form to facilitate further processing. With relational languages such as SQL, the most important transformation is that of the query qualification (the WHERE clause), which may be an arbitrarily complex, quantifier-free predicate, preceded by all necessary quantifiers (∀ or ∃).

There are two possible normal forms for the predicate:
one giving precedence to the AND (^),
the other to the OR (˅).

The conjunctive normal form is a conjunction (^ of predicates) of disjunctions (˅ of predicates) as follows:

(p11 ˅ p12 ˅ . . . ˅ p1n) ^ . . . ^ (pm1 ˅ pm2 ˅ . . . ˅ pmn)

where pij is a simple predicate.

A qualification in disjunctive normal form, on the other hand, is as follows:

(p11 ^ p12 ^ . . . ^ p1n) ˅ . . . ˅ (pm1 ^ pm2 ^ . . . ^ pmn)


The transformation is straightforward using the well-known equivalence rules for logical operations (^, ˅ and ¬):

1. p1 ^ p2 ⇔ p2 ^ p1
2. p1 ˅ p2 ⇔ p2 ˅ p1
3. p1 ^ (p2 ^ p3) ⇔ (p1 ^ p2) ^ p3
4. p1 ˅ (p2 ˅ p3) ⇔ (p1 ˅ p2) ˅ p3
5. p1 ^ (p2 ˅ p3) ⇔ (p1 ^ p2) ˅ (p1 ^ p3)
6. p1 ˅ (p2 ^ p3) ⇔ (p1 ˅ p2) ^ (p1 ˅ p3)
7. ¬(p1 ^ p2) ⇔ ¬p1 ˅ ¬p2
8. ¬(p1 ˅ p2) ⇔ ¬p1 ^ ¬p2
9. ¬(¬p) ⇔ p

Example:
Let us consider the following query on the engineering database that we have been referring to:

“Find the names of employees who have been working on project P1 for 12 or 24 months”

Engineering Database:

EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET)
SAL(TITLE, AMT)              ; SAL = Salary, AMT = Amount
ASG(ENO, PNO, RESP, DUR)     ; Employees assigned to projects; RESP = Responsibility, DUR = Duration

The corresponding SQL query is:

SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = "P1"
AND DUR = 12 OR DUR = 24

The qualification in conjunctive normal form is:

EMP.ENO = ASG.ENO ^ ASG.PNO = “P1” ^ (DUR = 12 ˅ DUR = 24)

The qualification in disjunctive normal form is:

(EMP.ENO = ASG.ENO ^ ASG.PNO = “P1” ^ DUR = 12) ˅
(EMP.ENO = ASG.ENO ^ ASG.PNO = “P1” ^ DUR = 24)

Analysis

Query analysis enables rejection of normalized queries for which further processing is either impossible or unnecessary. The main reasons for rejection are that the query is type incorrect or semantically incorrect.

A query is type incorrect if any of its attribute or relation names are not defined in the global schema, or if operations are being applied to attributes of the wrong type.

The technique used to detect type incorrect queries is similar to type checking for programming languages.

Example: The following SQL query on the engineering database is type incorrect for two reasons. First, attribute E# is not declared in the schema. Second, the operation “> 200” is incompatible with the type string of ENAME.

SELECT E# FROM EMP WHERE ENAME > 200

A query is semantically incorrect if its components do not contribute in any way to the generation of the result. This is based on the representation of the query as a graph, called a query graph or connection graph. In a query graph, one node indicates the result relation, and any other node indicates an operand relation. An edge between two nodes, one of which does not correspond to the result, represents a join, whereas an edge whose destination node is the result represents a projection. An important subgraph of the query graph is the join graph, in which only the joins are considered.

Example: Let us consider the following query:

“Find the names and responsibilities of programmers who have been working on the CAD/CAM project for more than 3 years.”

SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PNAME = "CAD/CAM"
AND DUR ≥ 36
AND TITLE = "Programmer"

Fig.: Query graph — nodes EMP, ASG, PROJ and RESULT; join edges EMP.ENO = ASG.ENO and ASG.PNO = PROJ.PNO; selection predicates TITLE = “Programmer” (on EMP) and PNAME = “CAD/CAM” (on PROJ); projection edge to RESULT carrying ENAME and RESP.

Elimination of Redundancy:

Simplify the query by eliminating redundancies, e.g., redundant predicates. Redundancies are often due to semantic integrity constraints expressed in the query language. Transformation rules are used.

Example:

SELECT TITLE
FROM EMP
WHERE (NOT (TITLE = "Programmer")
AND (TITLE = "Programmer" OR TITLE = "Elect. Eng.")
AND NOT (TITLE = "Elect. Eng."))
OR ENAME = "J. Doe"

This can be simplified using the previous rules to become:

SELECT TITLE FROM EMP WHERE ENAME = "J. Doe"

The simplification proceeds as follows:

Let p1 be TITLE = “Programmer”, p2 be TITLE = “Elect. Eng.”, and p3 be ENAME = “J. Doe”. The query qualification is:

(¬p1 ^ (p1 ˅ p2) ^ ¬p2) ˅ p3

The disjunctive normal form for this qualification is obtained by applying rule 5, which yields:

(¬p1 ^ ((p1 ^ ¬p2) ˅ (p2 ^ ¬p2))) ˅ p3

and then rule 3 yields:

(¬p1 ^ p1 ^ ¬p2) ˅ (¬p1 ^ p2 ^ ¬p2) ˅ p3

By applying rule 7, we obtain:

(false ^ ¬p2) ˅ (¬p1 ^ false) ˅ p3

By applying the same rule, we get:

(false ˅ false) ˅ p3

which is equivalent to p3 by rule 4.

Rewriting

The last step of query decomposition rewrites the query in relational algebra. For the sake of clarity it is customary to represent the relational algebra query graphically by an operator tree. An operator tree is a tree in which a leaf node is a relation stored in the database, and a non-leaf node is an intermediate relation produced by a relational algebra operator. The sequence of operations is directed from the leaves to the root, which represents the answer to the query.

The transformation of a tuple relational calculus query into an operator tree can easily be achieved as follows:

First, a different leaf is created for each different tuple variable (corresponding to a relation). In SQL, the leaves are immediately available in the FROM clause.

Second, the root node is created as a project operation involving the result attributes. These are found in the SELECT clause in SQL.

Third, the qualification (SQL WHERE clause) is translated into the appropriate sequence of relational operations (select, join, union, etc.) going from the leaves to the root. The sequence can be given directly by the order of appearance of the predicates and operators.

Example:
“Find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years”

SELECT ENAME
FROM PROJ, ASG, EMP
WHERE ASG.ENO = EMP.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME != "J. Doe"
AND PROJ.PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
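For this query, a direct translation (a sketch of one possible initial operator tree, written as an algebra expression before any restructuring) is:

$$\pi_{ENAME}\Big(\sigma_{ENAME \ne "J.\,Doe" \;\wedge\; PNAME="CAD/CAM" \;\wedge\; (DUR=12 \,\vee\, DUR=24)}\big((PROJ \bowtie_{PNO} ASG) \bowtie_{ENO} EMP\big)\Big)$$

Transformation rules would then push the selections and projections down toward the leaves PROJ, ASG and EMP.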

Localization of Distributed Data

The main role of this layer is to localize the query’s data using data distribution information. The output of the first layer, an algebraic query on distributed relations, is the input to the second layer. We know that relations are fragmented and stored in disjoint subsets, called fragments, each stored at a different site.

This layer determines which fragments are involved in the query and transforms the distributed query into a fragment query. A naive way to localize a distributed query is to generate a query where each global relation is substituted by its localization program.

This can be viewed as replacing the leaves of the operator tree of the distributed query with subtrees corresponding to the localization programs. We call the query obtained this way the localized query.

3 Global Query Optimization

The input to the third layer is a fragment algebraic query. The goal of this layer is to find an execution strategy for the algebraic fragment query which is close to optimal. Query optimization consists of i) finding the best ordering of operations in the fragment query, and ii) finding the communication operations which minimize a cost function.

The cost function refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost, and so on. Query optimization is often achieved through the semi-join operator instead of join operators.

4 Distributed Execution

The last layer is performed by all the sites having fragments involved in the query. Each subquery, called a local query, executes at one site. It is then optimized using the local schema of the site.
Query optimization in details
Query optimization is of great importance for the performance of a relational
database, especially for the execution of complex SQL statements. A query optimizer
decides the best methods for implementing each query.
Query Optimization: A single query can be executed through different algorithms
or re-written in different forms and structures. Hence, the question of query
optimization comes into the picture – Which of these forms or pathways is the most
optimal? The query optimizer attempts to determine the most efficient way to
execute a given query by considering the possible query plans.
Importance: The goal of query optimization is to reduce the system resources
required to fulfill a query, and ultimately provide the user with the correct result set
faster.

There are various principles of query optimization, as follows −

Understand how your database is executing your query − The first phase of query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use the “EXPLAIN [SQL Query]” keyword to see the query plan. In Oracle, one can use “EXPLAIN PLAN FOR [SQL Query]” to see the query plan (a short SQL sketch follows this list).

Retrieve as little data as possible − The more data returned by the query, the more resources the database has to expend to process and ship those records. For example, if you only need to fetch one column from a table, do not use ‘SELECT *’.

Store intermediate results − Sometimes the logic for a query can be quite complex. It is possible to produce the desired outcome through the use of subqueries, inline views, and UNION-type statements. With those methods, the intermediate results are not saved in the database but are used directly within the query. This can lead to performance issues, particularly when the intermediate results have a huge number of rows.
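A minimal SQL sketch of the first two principles, using the EMP table from the earlier examples (the exact plan output depends on the DBMS):

-- Inspect how the database executes the query
EXPLAIN SELECT ENAME FROM EMP WHERE TITLE = 'Programmer';            -- MySQL
EXPLAIN PLAN FOR SELECT ENAME FROM EMP WHERE TITLE = 'Programmer';   -- Oracle
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);                             -- Oracle: display the plan

-- Retrieve as little data as possible: name only the columns you need instead of SELECT *
SELECT ENAME FROM EMP WHERE ENO = 'E1';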

There are various query optimization strategies, as follows −
Use Index − Using an index is the first strategy one should use to speed up a query (see the sketch after this list).
Aggregate Table − Pre-populate tables at higher levels of aggregation so that less data needs to be parsed.
Vertical Partitioning − Partition the table by columns. This method reduces the amount of data a SQL query needs to process.
Horizontal Partitioning − Partition the table by data value, most often time. This method reduces the amount of data a SQL query needs to process.
De-normalization − The process of de-normalization combines multiple tables into a single table. This speeds up query execution because fewer table joins are required.
Server Tuning − Each server has its own parameters, and tuning them so that the server can take full advantage of the hardware resources can significantly speed up query execution.
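For instance, the “Use Index” strategy can be as simple as creating an index on the column that frequent queries filter on; a sketch on the EMP table (the index name is illustrative):

CREATE INDEX EMP_TITLE_IDX ON EMP (TITLE);

-- Queries filtering on TITLE can now use the index instead of a full table scan
SELECT ENO, ENAME FROM EMP WHERE TITLE = 'Elect. Eng.';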

Query Optimization in Centralized Systems

In a centralized system, query processing is done with the following aims −
Minimize the response time of a query (time taken to produce the results to the user’s query).
Maximize system throughput (the number of requests that are processed in a given amount of time).
Reduce the amount of memory and storage required for processing.
Increase parallelism.

How ?

Query Parsing and Translation


Initially, the SQL query is scanned. Then it is parsed to look for syntactical errors and
correctness of data types. If the query passes this step, the query is decomposed into
smaller query blocks. Each block is then translated to equivalent relational algebra
expression.
Steps for Query Optimization
Query optimization involves three steps, namely query tree generation, plan
generation, and query plan code generation.
Step 1 − Query Tree Generation
A query tree is a tree data structure representing a relational algebra expression. The
tables of the query are represented as leaf nodes. The relational algebra operations are
represented as the internal nodes. The root represents the query as a whole.
During execution, an internal node is executed whenever its operand tables are
available. The node is then replaced by the result table. This process continues for all
internal nodes until the root node is executed and replaced by the result table.
For example, let us consider the following schemas −
EMPLOYEE

EmpID EName Salary DeptNo DateOfJoining

DEPARTMENT

DNo DName Location

Example 1
Let us consider the query as the following.
$$\pi_{EmpID} (\sigma_{EName = \small "ArunKumar"} {(EMPLOYEE)})$$
The corresponding query tree will be −

Example 2
Let us consider another query involving a join.
$$\pi_{EName, Salary} (\sigma_{DName = \small "Marketing"} (DEPARTMENT) \bowtie_{DNo=DeptNo} EMPLOYEE)$$
Following is the query tree for the above query.

Step 2 − Query Plan Generation


After the query tree is generated, a query plan is made. A query plan is an extended
query tree that includes access paths for all operations in the query tree. Access paths
specify how the relational operations in the tree should be performed. For example, a
selection operation can have an access path that gives details about the use of B+ tree
index for selection.
Besides, a query plan also states how the intermediate tables should be passed from
one operator to the next, how temporary tables should be used and how operations
should be pipelined/combined.
Step 3− Code Generation
Code generation is the final step in query optimization. It is the executable form of the
query, whose form depends upon the type of the underlying operating system. Once
the query code is generated, the Execution Manager runs it and produces the results.

What are the Approaches to Query Optimization


Among the approaches for query optimization, exhaustive search and heuristics-based
algorithms are mostly used.
Exhaustive Search Optimization
In these techniques, for a query, all possible query plans are initially generated and then the best plan is selected. Though these techniques provide the best solution, they have exponential time and space complexity owing to the large solution space. An example is the dynamic programming technique.
Heuristic Based Optimization
Heuristic based optimization uses rule-based optimization approaches for query
optimization. These algorithms have polynomial time and space complexity, which is
lower than the exponential complexity of exhaustive search-based algorithms.
However, these algorithms do not necessarily produce the best query plan.
Some of the common heuristic rules are −
Perform select and project operations before join operations. This is done by
moving the select and project operations down the query tree. This reduces the
number of tuples available for join.
Perform the most restrictive select/project operations at first before the other
operations.
Avoid cross-product operation since they result in very large-sized
intermediate tables.

Distributed Query Processing Architecture


In a distributed database system, processing a query comprises optimization at both the global and the local level. The query enters the database system at the client or controlling site. Here, the user is validated, and the query is checked, translated, and optimized at a global level.
The architecture can be represented as −
Mapping Global Queries into Local Queries
The process of mapping global queries to local ones can be realized as follows −
The tables required in a global query have fragments distributed across
multiple sites. The local databases have information only about local data. The
controlling site uses the global data dictionary to gather information about the
distribution and reconstructs the global view from the fragments.
If there is no replication, the global optimizer runs local queries at the sites
where the fragments are stored. If there is replication, the global optimizer
selects the site based upon communication cost, workload, and server speed.

The global optimizer generates a distributed execution plan so that the least amount of data transfer occurs across the sites. The plan states the location of the fragments, the order in which query steps need to be executed, and the processes involved in transferring intermediate results.

The local queries are optimized by the local database servers. Finally, the local
query results are merged together through union operation in case of horizontal
fragments and join operation for vertical fragments.

For example, let us consider that the following Project schema is horizontally
fragmented according to City, the cities being New Delhi, Kolkata and Hyderabad.
PROJECT

PId City Department Status


Suppose there is a query to retrieve details of all projects whose status is “Ongoing”.
The global query will be −
$$\sigma_{Status = \small "Ongoing"} (PROJECT)$$
Query in New Delhi’s server will be −
$$\sigma_{Status = \small "Ongoing"} (NewD\_PROJECT)$$
Query in Kolkata’s server will be −
$$\sigma_{Status = \small "Ongoing"} (Kol\_PROJECT)$$
Query in Hyderabad’s server will be −
$$\sigma_{Status = \small "Ongoing"} (Hyd\_PROJECT)$$
In order to get the overall result, we need to union the results of the three queries as follows −
$$\sigma_{Status = \small "Ongoing"} (NewD\_PROJECT) \cup \sigma_{Status = \small "Ongoing"} (Kol\_PROJECT) \cup \sigma_{Status = \small "Ongoing"} (Hyd\_PROJECT)$$
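In SQL terms, the merged query at the controlling site is simply a union of the three local queries (a sketch; the fragment table names follow the notation above):

SELECT * FROM NewD_PROJECT WHERE Status = 'Ongoing'
UNION ALL
SELECT * FROM Kol_PROJECT WHERE Status = 'Ongoing'
UNION ALL
SELECT * FROM Hyd_PROJECT WHERE Status = 'Ongoing';

UNION ALL is enough here because horizontal fragments are disjoint, so no duplicate elimination is needed.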

Distributed Query Optimization


Distributed query optimization requires evaluation of a large number of query trees, each of which produces the required results of a query. This is primarily due to the presence of a large amount of replicated and fragmented data. Hence, the target is to find an optimal solution instead of the best solution.
The main issues for distributed query optimization are −
• Optimal utilization of resources in the distributed system.
• Query trading.
• Reduction of solution space of the query.

Optimal Utilization of Resources in the Distributed System


A distributed system has a number of database servers in the various sites to perform
the operations pertaining to a query.
Following are the approaches for optimal resource utilization −
Operation Shipping − In operation shipping, the operation is run at the site where the
data is stored and not at the client site. The results are then transferred to the client site.
This is appropriate for operations where the operands are available at the same site.
Example: Select and Project operations.
Data Shipping − In data shipping, the data fragments are transferred to the database
server, where the operations are executed. This is used in operations where the
operands are distributed at different sites. This is also appropriate in systems where
the communication costs are low, and local processors are much slower than the client
server.
Hybrid Shipping − This is a combination of data and operation shipping. Here, data
fragments are transferred to the high-speed processors, where the operation runs. The
results are then sent to the client site.

Query Trading
In query trading algorithm for distributed database systems, the controlling/client site
for a distributed query is called the buyer and the sites where the local queries execute
are called sellers.
The buyer formulates a number of alternatives for choosing sellers and for
reconstructing the global results. The target of the buyer is to achieve the optimal cost.
The algorithm starts with the buyer assigning sub-queries to the seller sites. The
optimal plan is created from local optimized query plans proposed by the sellers
combined with the communication cost for reconstructing the final result. Once the
global optimal plan is formulated, the query is executed.

Reduction of Solution Space of the Query

Finding an optimal solution generally involves reducing the solution space so that the cost of the query and of data transfer is reduced. This can be achieved through a set of heuristic rules, just as with heuristics in centralized systems.
Following are some of the rules −

Perform selection and projection operations as early as possible. This reduces the data flow over the communication network.

Simplify operations on horizontal fragments by eliminating selection conditions which are not relevant to a particular site.

In case of join and union operations comprising fragments located at multiple sites, transfer the fragmented data to the site where most of the data is present and perform the operation there.

Use the semi-join operation to qualify tuples that are to be joined. This reduces the amount of data transfer, which in turn reduces communication cost.

Merge the common leaves and sub-trees in a distributed query tree.

Data Allocation

Data Allocation is an intelligent distribution of your data pieces (called data fragments) to improve database performance and data availability for end-users. It aims to reduce the overall costs of transaction processing while also providing accurate data rapidly in your DDBMS systems.

Data Allocation is one of the key steps in building your Distributed Database Systems. There are two common strategies used in optimal Data Allocation:
• Data Fragmentation
• Data Replication

Fragmentation and Replication in Distributed Database

Data Fragmentation

Fragmentation is a process of disintegrating relations or tables into several partitions in multiple sites. It divides a database into various sub-tables and sub-relations so that data can be distributed and stored efficiently.

Database Fragmentation can be of two types: horizontal or vertical. In horizontal fragmentation, each tuple of a relation r is assigned to one or more fragments. In vertical fragmentation, the schema for a relation r is split into numerous smaller schemas with a common candidate key and a special attribute. More details on horizontal and vertical fragmentation will be discussed in the next section.

Methods of Data Fragmentation of a Table

In this section of our fragmentation and replication in distributed database guide, we discuss the two fundamental fragmentation strategies: horizontal and vertical. In addition to these, Distributed Database Management Systems also allow the nesting of fragments in a hybrid fashion, called Hybrid Fragmentation. This will be discussed separately as our third fragmentation strategy.

• Horizontal Fragmentation
• Vertical Fragmentation
• Hybrid Fragmentation

Horizontal Fragmentation (or Sharding)

A Horizontal Fragmentation strategy divides a table horizontally by selecting a subset of rows in accordance with the values of one or more fields. After partitioning, these data fragments are assigned to different sites of a Distributed Database System. When a user requests the complete table, the fragments are combined using a union operation.

There are two versions of Horizontal Fragmentation: Primary Horizontal Fragmentation, which uses predicates of the relation itself to perform fragmentation, and Derived Horizontal Fragmentation, which uses predicates defined on another relation to partition a relation.

Horizontal Fragmentation allows parallel processing of a relation. You can also split a global table into tuples and allocate them to the places where they are most frequently accessed, for efficient data storage and better access.
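A sketch of primary horizontal fragmentation, using the PROJECT table from the query-processing section and fragmenting on City (the fragment table names are illustrative):

CREATE TABLE NewD_PROJECT AS SELECT * FROM PROJECT WHERE City = 'New Delhi';
CREATE TABLE Kol_PROJECT  AS SELECT * FROM PROJECT WHERE City = 'Kolkata';
CREATE TABLE Hyd_PROJECT  AS SELECT * FROM PROJECT WHERE City = 'Hyderabad';

-- Reconstruct the complete table with a union of the fragments
SELECT * FROM NewD_PROJECT
UNION ALL
SELECT * FROM Kol_PROJECT
UNION ALL
SELECT * FROM Hyd_PROJECT;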

Vertical Fragmentation

Vertical Fragmentation splits a table vertically by attributes or columns. In this case, data fragments keep only certain attributes of the original table. They are then assigned to different sites of a DDBMS.

Every data fragment keeps the primary key, which is required when restoring the original table. The fragmentation is done in such a way that reconstructing the table from its fragments only requires a normal JOIN operation. To make this possible, a specific attribute called Tuple-id is added to the schema.

Vertical Fragmentation is highly useful for cases when you want to enforce data privacy.
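A sketch of vertical fragmentation using the EMPLOYEE schema from the centralized-optimization example, keeping the key EmpID in every fragment so the original table can be rebuilt with a join (the fragment names are illustrative):

CREATE TABLE EMPLOYEE_PAY  AS SELECT EmpID, Salary FROM EMPLOYEE;
CREATE TABLE EMPLOYEE_INFO AS SELECT EmpID, EName, DeptNo, DateOfJoining FROM EMPLOYEE;

-- Reconstruct the original relation with a join on the common key
SELECT i.EmpID, i.EName, p.Salary, i.DeptNo, i.DateOfJoining
FROM EMPLOYEE_INFO i
JOIN EMPLOYEE_PAY p ON p.EmpID = i.EmpID;

Isolating Salary in its own fragment is also one simple way to support the data-privacy use case mentioned above.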

Hybrid Fragmentation

Of the strategies discussed so far in fragmentation and replication in distributed databases, Hybrid Fragmentation takes a different approach: it comprises a combination of both Horizontal and Vertical Fragmentation.

Here the tables are initially fragmented in either form (horizontal or vertical) and then these fragments are partially replicated across different sites according to the frequency of accessing the database fragments. In this case, the original table can be reconstructed by applying union and natural JOIN operations in the appropriate order.

Advantages and Disadvantages of Fragmentation

Database Fragmentation improves Data Accessibility and provides faster transaction processing for user queries. Using fragmentation, you can decompose a relation into multiple independent units so that your users can perform a number of transactions and retrieve data concurrently without any noticeable lag.

However, Data Fragmentation raises some difficulties as well. Fragmentation and replication in distributed databases must ensure fault tolerance and zero data loss while reconstructing your original table from its fragments. This must happen correctly and at all times whenever your users pass a request.

Moreover, your database fragments must be split up “sensibly” so that users with a
high demand volume can request and receive data from fragmented tables quickly. In
other words, your Database Fragmentation should ensure high query
performance and concurrent user processing. Additionally, you must be mindful of
the need to reduce dispersed joins throughout the process, which can inevitably add to
your costs.

Here, in this section of our fragmentation and replication in distributed database guide, we discuss the pros and cons of Database Fragmentation. Let’s have a closer look at those.

Advantages

Using Database Fragmentation, you and your teams can:

• Concurrently execute a number of transactions.


• Capitalize on parallel processing of a single query.
• Take advantage of increased system throughput.
• Store data efficiently, by saving frequently used data close to the site of usage.
• Use local query optimization.
• Preserve the security and privacy of your database systems.
• Benefit from fault-tolerance architecture with better disaster recovery
mechanisms.

Disadvantages

Database Fragmentation falls short in the following scenarios:

• When application views are defined on more than one fragment, they can develop conflicting requirements.
• When doing recurrent fragmentation, the reconstruction task might become rather large.
• Simple operations, like checking for dependencies, might result in chasing data across several sites.
• When data from several fragments is required, access times can be very high.

Data Replication

Distributed Database Replication is the process of creating and maintaining multiple copies (redundancy) of data in different sites. The main benefit it brings to the table is that duplication of data ensures faster retrieval. This eliminates single points of failure and data loss issues if one site fails to deliver user requests, and hence provides you and your teams with a fault-tolerant system.

However, Distributed Database Replication also has some disadvantages. To ensure accurate and correct responses to user queries, data must be constantly updated and synchronized at all times. Failure to do so will create inconsistencies in data, which can hamper business goals and decisions for other teams.

Advantages of Data Replication

• Data Reliability: Your databases continue to work even in case of a site failure. Using Distributed Database Replication, you can request and receive the same copy from a different site.
• Scalability: As your systems grow geographically and in terms of the number
of locations (and hence the number of access requests), replication provides a
seamless way to handle this expansion without compromising on response
times.
• Quicker Response: Data Replication enables copies of data to be available
close to their access sites. This method of localization delivers quick query
processing and consequently fast response times.
• Simpler Transactions: With Data Replication, user transactions become
simple since they require fewer table joins and minimal coordination across
the network.

Disadvantages of Data Replication

• High Storage Requirements: If your databases are of a gigantic scale, creating and maintaining copies of those databases will demand a high storage capacity.
• Increased Costs and Complexity: More copies mean more storage costs. And with every update, your DDBMS system must ensure that new changes are reflected in all the copies of the data at all sites.
• Undesirable Application–Database Coupling: Inherent to data update mechanisms are possibilities of data inconsistency. Eliminating them typically requires update-handling logic in the application, which couples the application tightly to the database.

What is a Query?

A query is a statement requesting the retrieval of information. The portion of a DML that involves information retrieval is called a query language.
What is a Query Processor?

In relational database technology, users perform the tasks of data processing and data manipulation with the help of a high-level, non-procedural language (e.g. SQL). This high-level query hides the low-level details about the physical organization of the data from the user, and presents an environment in which the user can handle even complex query tasks in an easy, concise and simple fashion. The inner procedure of the query task is performed by a DBMS module called a Query Processor. Hence users are relieved from query optimization, a time-consuming task that is actually performed by the query processor.

What is query processing?

➢ Query processing refers to the process of answering a query to a database or an information system, which usually involves interpreting the query, searching through the space storing the data, and retrieving the results satisfying the query.

➢ Query Processing is the translation of high-level queries into low-level expressions. It is a step-wise process that spans the physical level of the file system, query optimization and actual execution of the query to get the result.

Query processing in a distributed database environment is much harder than in a centralized database, because there are many elements or parameters involved that affect the overall performance of distributed queries. Moreover, in a distributed environment, the query processor may have to access data at many sites, so query response time may become very high. That is why the query processing problem is divided into several sub-problems/steps which are easier to solve individually.

What are the Main Functions/Objectives of a Query Processor?

The main function of a query processor is to transform a high-level query (also called a calculus query) into an equivalent lower-level query (also called an algebraic query). The conversion must be correct and efficient. The conversion is correct if the low-level query has the same semantics as the original query, i.e. if both queries produce the same result.

What are the layers of query processing? Discuss briefly.

Four main layers are involved in distributed query processing. The first three layers map the input query into an optimized distributed query execution plan. They perform the functions of query decomposition, data localization, and global query optimization. The last layer performs distributed query execution.

What are the components of query processing?

The query processor consists of four sub-components, each of which corresponds to a different stage in the lifecycle of a query. The sub-components are the query parser, the query rewriter, the query optimizer and the query executor [3].

What are the typical phases of query processing?

There are four phases in typical query processing:
• Parsing and Translation.
• Query Optimization.
• Evaluation or query code generation.
• Execution in the DB's runtime processor.

Which is the third layer of query processing?

Global Query Optimization. The input to the third layer is an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query which is close to optimal. Remember that finding the optimal solution is computationally intractable.

Is there a generic layering scheme for query processing?


A generic layering scheme for query processing is shown where each layer solves a
well-defined subproblem. To simplify the discussion, let us assume a static and
semicentralized query processor that does not exploit replicated fragments.

How is the problem of query processing decomposed?

The problem of query processing can itself be decomposed into several sub-problems,
corresponding to various layers. A generic layering scheme for query processing is
shown where each layer solves a well-defined sub-problem. The input is a query on
global data expressed in relational calculus.
