Unit 2
Join
• Join is a binary operation in Relational Algebra.
• It combines records from two or more tables in a database.
• A join is a means for combining fields from two tables by using values
common to each.
Semi-Join
• A join where the result contains only the columns from one of the joined
tables.
• Useful in distributed databases, because less data has to be sent over the
network.
• Can dramatically speed up certain classes of queries.
What is a “Semi-Join”?
Semi-join strategies are techniques used for query processing in
distributed database systems. They are used to reduce communication cost.
A semi-join between two tables returns rows from the first table where
one or more matches are found in the second table.
➢ Relational algebra defines the ways in which relations (tables) can be
operated on to manipulate their data. The algebra is composed of unary
operations (involving a single table) and binary operations (involving
two tables). Join and semi-join are both binary operations in relational
algebra.
➢ The difference between a semi-join and a conventional join is that rows in
the first table will be returned at most once. Even if the second table contains
two matches for a row in the first table, only one copy of the row will be
returned.
➢ Semi-joins are written using EXISTS or IN.
Example (figure): Departments table, Customer table, and Output.
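The difference between a conventional join and a semi-join written with EXISTS or IN can be sketched with Python's built-in sqlite3 module. The table layouts, column names and rows below are hypothetical sample data, not the tables from the original example.

```python
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, cust_name TEXT,
                            dept_id INTEGER);
    INSERT INTO departments VALUES (10, 'Sales'), (20, 'Support'), (30, 'Research');
    INSERT INTO customers VALUES (1, 'Asha', 10), (2, 'Ben', 10), (3, 'Chen', 20);
""")

# Conventional join: a department is returned once per matching customer.
join_rows = conn.execute("""
    SELECT d.dept_name
    FROM departments d JOIN customers c ON c.dept_id = d.dept_id
    ORDER BY d.dept_id
""").fetchall()

# Semi-join with EXISTS: each department is returned at most once,
# and only columns of the first table appear in the result.
exists_rows = conn.execute("""
    SELECT d.dept_name
    FROM departments d
    WHERE EXISTS (SELECT 1 FROM customers c WHERE c.dept_id = d.dept_id)
    ORDER BY d.dept_id
""").fetchall()

# The same semi-join written with IN.
in_rows = conn.execute("""
    SELECT d.dept_name
    FROM departments d
    WHERE d.dept_id IN (SELECT c.dept_id FROM customers c)
    ORDER BY d.dept_id
""").fetchall()

print(join_rows)
print(exists_rows)
print(in_rows)
```

With this sample data, 'Sales' has two matching customers, so it appears twice in the join result but only once in either semi-join form.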
Layers of Query processing
Query Decomposition
The first layer decomposes the calculus query into an algebraic query on global
relations. The information needed for this transformation is found in the global
conceptual schema describing the global relations. However, the information about
data distribution is not used here but in the next layer. Thus the techniques used by
this layer are those of a centralized DBMS.
Query decomposition can be viewed as four successive steps. First, the calculus query
is rewritten in a normalized form that is suitable for subsequent manipulation.
Normalization of a query generally involves the manipulation of the query quantifiers
and of the query qualification by applying logical operator priority.
Second, the normalized query is analyzed semantically so that incorrect queries are
detected and rejected as early as possible. Techniques to detect incorrect queries exist
only for a subset of relational calculus. Typically, they use some sort of graph that
captures the semantics of the query.
Third, the correct query (still expressed in relational calculus) is simplified. One way
to simplify a query is to eliminate redundant predicates. Note that redundant predicates
are likely to arise when a query is the result of system transformations applied to the
user query. Such transformations are used for performing semantic data control
(views, protection, and semantic integrity control).
Fourth, the calculus query is restructured as an algebraic query. Several algebraic
queries can be derived from the same calculus query, and some algebraic queries
are “better” than others. The quality of an algebraic query is defined in terms of
expected performance. The traditional way to do this transformation toward a “better”
algebraic specification is to start with an initial algebraic query and transform it in
order to find a “good” one. The initial algebraic query is derived immediately from
the calculus query by translating the predicates and the target statement into relational
operators as they appear in the query. This directly translated algebra query is then
restructured through transformation rules. The algebraic query generated by this layer
is good in the sense that the worst executions are typically avoided. For instance, a
relation will be accessed only once, even if there are several select predicates.
However, this query is generally far from providing an optimal execution, since
information about data distribution and fragment allocation is not used at this layer.
Data Localization
The input to the second layer is an algebraic query on global relations. The main role
of the second layer is to localize the query’s data using data distribution information
in the fragment schema. We saw that relations are fragmented and stored in disjoint
subsets, called fragments, each being stored at a different site. This layer determines
which fragments are involved in the query and transforms the distributed query into a
query on fragments. Fragmentation is defined by fragmentation predicates that can be
expressed through relational operators. A global relation can be reconstructed by
applying the fragmentation rules, and then deriving a program, called a localization
program, of relational algebra operators, which then act on fragments. Generating a
query on fragments is done in two steps. First, the query is mapped into a fragment
query by substituting each relation by its reconstruction program (also
called materialization program). Second, the fragment query is simplified and
restructured to produce another “good” query. Simplification and restructuring may
be done according to the same rules used in the decomposition layer. As in the
decomposition layer, the final fragment query is generally far from optimal because
information regarding fragments is not utilized.
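The two localization steps can be sketched as follows, assuming a hypothetical EMP relation that is horizontally fragmented by ranges of an integer ENO. The fragment names, site names and ranges are illustrative assumptions. Step one substitutes the relation by all of its fragments; step two drops fragments whose fragmentation predicate contradicts the query predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    site: str
    lo: int  # fragmentation predicate: lo <= ENO < hi
    hi: int


# Hypothetical fragmentation of EMP; names, sites and ranges are assumptions.
EMP_FRAGMENTS = [
    Fragment("EMP1", "site_A", 0, 100),
    Fragment("EMP2", "site_B", 100, 200),
    Fragment("EMP3", "site_C", 200, 300),
]


def localize(fragments, q_lo, q_hi):
    """Return the fragments needed for a selection q_lo <= ENO < q_hi."""
    # Step 1: replace the global relation by its reconstruction program,
    # here the union of all horizontal fragments.
    fragment_query = list(fragments)
    # Step 2: simplify by removing fragments whose range cannot overlap
    # the query range (their contribution would be empty).
    return [f for f in fragment_query if f.lo < q_hi and q_lo < f.hi]


point_query = localize(EMP_FRAGMENTS, 150, 151)  # ENO = 150
range_query = localize(EMP_FRAGMENTS, 90, 121)   # 90 <= ENO <= 120

print([f.name for f in point_query])
print([f.name for f in range_query])
```

Only interval predicates are handled here; a real localization layer works with arbitrary fragmentation predicates and with vertical and hybrid fragments as well.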
Global Query Optimization
Query optimization consists of finding the “best” ordering of operators in the query, including
communication operators that minimize a cost function. The cost function, often defined in terms
of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost,
communication cost, and so on. Generally, it is a weighted combination of I/O, CPU, and
communication costs. Nevertheless, a typical simplification made by the early distributed DBMSs,
as we mentioned before, was to consider communication cost as the most significant factor. This
used to be valid for wide area networks, where the limited bandwidth made communication much
more costly than local processing. This is not true anymore today and communication cost can be
lower than I/O cost. To select the ordering of operators it is necessary to predict execution costs of
alternative candidate orderings. Determining execution costs before query execution (i.e., static
optimization) is based on fragment statistics and the formulas for estimating the cardinalities of
results of relational operators. Thus the optimization decisions depend on the allocation of
fragments and available statistics on fragments, which are recorded in the allocation schema.
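The weighted cost function can be sketched as below. The plan names, cost estimates and weights are invented for illustration. The sketch shows how the chosen plan changes when communication is weighted heavily (as in early wide area networks) versus lightly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    name: str
    io: float    # estimated disk I/O cost
    cpu: float   # estimated CPU cost
    comm: float  # estimated communication cost


def total_cost(plan, w_io=1.0, w_cpu=1.0, w_comm=1.0):
    """Weighted combination of I/O, CPU and communication costs."""
    return w_io * plan.io + w_cpu * plan.cpu + w_comm * plan.comm


def best_plan(plans, **weights):
    """Pick the candidate ordering with the lowest estimated cost."""
    return min(plans, key=lambda p: total_cost(p, **weights))


# Hypothetical candidate plans with made-up estimates.
candidates = [
    PlanEstimate("ship_whole_relation", io=100, cpu=20, comm=500),
    PlanEstimate("reduce_before_shipping", io=180, cpu=60, comm=120),
]

slow_network_choice = best_plan(candidates, w_comm=10.0)
fast_network_choice = best_plan(candidates, w_comm=0.1)

print(slow_network_choice.name)
print(fast_network_choice.name)
```

In a real optimizer the estimates come from fragment statistics and cardinality formulas rather than fixed numbers.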
An important aspect of query optimization is join ordering, since permutations of the joins within
the query may lead to improvements of orders of magnitude. One basic technique for optimizing a
sequence of distributed join operators is through the semijoin operator. The main value of the
semijoin in a distributed system is to reduce the size of the join operands and then the
communication cost. However, techniques which consider local processing costs as well as
communication costs may not use semijoins because they might increase local processing costs.
The output of the query optimization layer is an optimized algebraic query with communication
operators included on fragments. It is typically represented and saved (for future executions) as
a distributed query execution plan.
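How a semijoin reduces the size of a join operand before it is shipped can be sketched with two small in-memory relations. The relations R and S, their attributes, and the cost measure (number of attribute values sent) are assumptions made for illustration.

```python
def project(rel, attr):
    """Distinct values of one attribute."""
    return {row[attr] for row in rel}


def semijoin(r, s, attr):
    """Rows of r that have at least one match in s on attr."""
    keys = project(s, attr)
    return [row for row in r if row[attr] in keys]


def join(r, s, attr):
    """Equi-join of r and s on a shared attribute name."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]


def values_shipped(rel):
    """Crude transfer cost: number of attribute values sent."""
    return sum(len(row) for row in rel)


# Hypothetical data: R is stored at site 1, S at site 2, joined on A.
R = [{"A": k, "x": k * 10, "y": k * 100} for k in range(1, 7)]
S = [{"A": 2, "z": "p"}, {"A": 5, "z": "q"}]

# Strategy 1: ship all of R to site 2 and join there.
direct_cost = values_shipped(R)

# Strategy 2: ship the join column of S to site 1, reduce R with a
# semijoin, then ship only the reduced R to site 2.
join_keys = project(S, "A")
reduced_R = semijoin(R, S, "A")
semijoin_cost = len(join_keys) + values_shipped(reduced_R)

print(direct_cost, semijoin_cost)
print(join(R, S, "A") == join(reduced_R, S, "A"))
```

The semijoin pays off here because few rows of R have a match in S. When most rows match, the extra transfer and local processing can make it the worse choice.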
The goal of distributed query processing may be summarized as follows: given a calculus query
on a distributed database, find a corresponding execution strategy that minimizes a system cost
function that includes I/O, CPU, and communication costs. An execution strategy is specified in
terms of relational algebra operators and communication primitives (send/receive) applied to the
local databases (i.e., the relation fragments). Therefore, the complexity of relational operators that
affect the performance of query execution is of major importance in the design of a query
processor.
1) Normalization
2) Analysis
3) Elimination of redundancy
4) Rewriting.
Example:
Let us consider the following query on the engineering database that
we have been referring to:
Engineering Database:
Example:
Analysis
Elimination of Redundancy :
Simplify the query by eliminating redundancies, e.g., redundant
predicates.
Example:
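A minimal sketch of why a redundant predicate can be dropped, assuming three hypothetical atomic predicates p1, p2 and p3 (they are placeholders, not taken from the engineering database query). Two qualifications are equivalent when they agree on every truth assignment, so a brute-force truth table is enough to check a simplification.

```python
from itertools import product


def equivalent(f, g, n_vars):
    """True if f and g agree on every assignment of n_vars booleans."""
    return all(
        f(*vals) == g(*vals)
        for vals in product([False, True], repeat=n_vars)
    )


# Absorption: p1 OR (p1 AND p2) simplifies to p1.
absorption_ok = equivalent(
    lambda p1, p2: p1 or (p1 and p2),
    lambda p1, p2: p1,
    2,
)

# A contradictory conjunct contributes nothing to a disjunction:
# (NOT p1 AND (p1 OR p2) AND NOT p2) OR p3 simplifies to p3.
contradiction_ok = equivalent(
    lambda p1, p2, p3: ((not p1) and (p1 or p2) and (not p2)) or p3,
    lambda p1, p2, p3: p3,
    3,
)

# A non-equivalence for contrast: p1 AND p2 is not the same as p1.
not_equivalent = equivalent(
    lambda p1, p2: p1 and p2,
    lambda p1, p2: p1,
    2,
)

print(absorption_ok, contradiction_ok, not_equivalent)
```

A query processor applies such identities symbolically (idempotency rules) instead of enumerating truth tables; the enumeration only serves to confirm that the rewritten qualification is correct.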
Rewriting
The last step of query decomposition rewrites the query in relational
algebra. For the sake of clarity it is customary to represent the
relational algebra query graphically by an operator tree. An operator
tree is a tree in which a leaf node is a relation stored in the database,
and a non-leaf node is an intermediate relation produced by a
relational algebra operator. The sequence of operations is directed
from the leaves to the root, which represents the answer to the query.
The transformation of a tuple relational calculus query into
an operator tree can easily be achieved as follows:
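First, a different leaf is created for each different tuple variable
(corresponding to a relation). In SQL, the leaves are immediately
available in the FROM clause.
Second, the root node is created as a project operation involving the
result attributes, which are found in the SELECT clause in SQL.
Third, the qualification (the SQL WHERE clause) is translated into the
appropriate sequence of relational operations (select, join, union, etc.)
going from the leaves to the root.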
Example:
“Find the names of employees other than J. Doe who worked on the
CAD/CAM project for either one or two years”
SELECT ENAME
FROM PROJ, ASG, EMP
WHERE ASG.ENO = EMP.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME != 'J. Doe'
AND PROJ.PNAME = 'CAD/CAM'
AND (DUR = 12 OR DUR = 24)
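One possible operator tree for this query can be sketched as a small Python data structure. The Node class and the particular ordering of selections and joins are assumptions chosen for illustration; other equivalent trees exist. Leaves are the relations of the FROM clause, the root is the projection on ENAME, and the WHERE clause supplies the operators in between.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    op: str                                   # relation name or operator label
    children: list = field(default_factory=list)


def leaves(node):
    """Relations at the leaves, left to right."""
    if not node.children:
        return [node.op]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def render(node, depth=0):
    """Indented text lines, root first."""
    lines = ["  " * depth + node.op]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Direct translation: joins at the bottom, selections above, projection on top.
tree = Node("PROJECT ENAME", [
    Node("SELECT DUR = 12 OR DUR = 24", [
        Node("SELECT PNAME = 'CAD/CAM'", [
            Node("SELECT ENAME != 'J. Doe'", [
                Node("JOIN ASG.ENO = EMP.ENO", [
                    Node("JOIN ASG.PNO = PROJ.PNO", [
                        Node("PROJ"),
                        Node("ASG"),
                    ]),
                    Node("EMP"),
                ]),
            ]),
        ]),
    ]),
])

print("\n".join(render(tree)))
print(leaves(tree))
```

Restructuring would typically push the selections below the joins so that smaller intermediate relations are joined.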
This can be viewed as replacing the leaves of the operator tree of the
distributed query with subtrees corresponding to the localization
programs. We call the query obtained this way the localized query.
How ?
DEPARTMENT
Example 1
Let us consider the query as the following.
$$\pi_{EmpID} (\sigma_{EName = \text{"ArunKumar"}} (EMPLOYEE))$$
The corresponding query tree will be −
Example 2
Let us consider another query involving a join.
$$\pi_{EName, Salary} (\sigma_{DName = \text{"Marketing"}} (DEPARTMENT) \bowtie_{DNo = DeptNo} EMPLOYEE)$$
Following is the query tree for the above query.
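Both example queries can be evaluated bottom-up, from the leaves of the query tree to the root, with a few helper functions. The attribute names come from the expressions above; the rows are hypothetical sample data, and Example 2 is read as a projection applied to the result of the join.

```python
def select(rel, pred):
    """Selection: keep rows satisfying the predicate."""
    return [row for row in rel if pred(row)]


def project(rel, attrs):
    """Projection with duplicate elimination, preserving first-seen order."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def join(left, right, l_attr, r_attr):
    """Equi-join on left.l_attr = right.r_attr."""
    return [{**l, **r} for l in left for r in right if l[l_attr] == r[r_attr]]


# Hypothetical rows for the two relations.
DEPARTMENT = [
    {"DNo": 1, "DName": "Marketing"},
    {"DNo": 2, "DName": "Finance"},
]
EMPLOYEE = [
    {"EmpID": 101, "EName": "ArunKumar", "Salary": 50000, "DeptNo": 1},
    {"EmpID": 102, "EName": "Meera", "Salary": 62000, "DeptNo": 2},
    {"EmpID": 103, "EName": "Ravi", "Salary": 48000, "DeptNo": 1},
]

# Example 1: leaf EMPLOYEE -> selection -> projection (root).
example1 = project(
    select(EMPLOYEE, lambda r: r["EName"] == "ArunKumar"), ["EmpID"]
)

# Example 2: selection on DEPARTMENT, join with EMPLOYEE, projection (root).
marketing = select(DEPARTMENT, lambda r: r["DName"] == "Marketing")
example2 = project(
    join(marketing, EMPLOYEE, "DNo", "DeptNo"), ["EName", "Salary"]
)

print(example1)
print(example2)
```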
The local queries are optimized by the local database servers. Finally, the local
query results are merged together through union operation in case of horizontal
fragments and join operation for vertical fragments.
For example, let us consider that the following Project schema is horizontally
fragmented according to City, the cities being New Delhi, Kolkata and Hyderabad.
PROJECT
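Merging local results of horizontal fragments through a union can be sketched as follows. The attribute names PId and Status and all rows are hypothetical, since only the fragmentation attribute City is given above.

```python
# Hypothetical horizontal fragments of PROJECT, one per city.
FRAGMENTS = {
    "New Delhi": [
        {"PId": 1, "City": "New Delhi", "Status": "Ongoing"},
        {"PId": 2, "City": "New Delhi", "Status": "Completed"},
    ],
    "Kolkata": [
        {"PId": 3, "City": "Kolkata", "Status": "Ongoing"},
    ],
    "Hyderabad": [
        {"PId": 4, "City": "Hyderabad", "Status": "Completed"},
    ],
}


def local_query(fragment, status):
    """Query executed locally at the site holding one fragment."""
    return [row for row in fragment if row["Status"] == status]


def union(results):
    """Merge local results; horizontal fragments are disjoint."""
    merged = []
    for rows in results:
        merged.extend(rows)
    return merged


# A query on Status has to visit every fragment and union the results.
ongoing = union(local_query(f, "Ongoing") for f in FRAGMENTS.values())
ongoing_ids = sorted(row["PId"] for row in ongoing)

# A query on City = 'Kolkata' only needs the Kolkata fragment.
kolkata_ids = [row["PId"] for row in FRAGMENTS["Kolkata"]]

# The global relation is the union of all fragments.
project_global = union(FRAGMENTS.values())

print(ongoing_ids, kolkata_ids, len(project_global))
```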
Query Trading
In query trading algorithm for distributed database systems, the controlling/client site
for a distributed query is called the buyer and the sites where the local queries execute
are called sellers.
The buyer formulates a number of alternatives for choosing sellers and for
reconstructing the global results. The target of the buyer is to achieve the optimal cost.
The algorithm starts with the buyer assigning sub-queries to the seller sites. The
optimal plan is created from local optimized query plans proposed by the sellers
combined with the communication cost for reconstructing the final result. Once the
global optimal plan is formulated, the query is executed.
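The buyer's selection step can be sketched in a much simplified form. It assumes every seller returns a single cost bid per sub-query and that the cost of shipping a result to the buyer is known per site. The site names, sub-query names and numbers are hypothetical, and the negotiation rounds of a full query trading algorithm are left out.

```python
def choose_sellers(offers, comm_cost):
    """For each sub-query pick the seller with the lowest combined cost.

    offers:    {subquery: {site: local execution cost bid}}
    comm_cost: {site: cost of shipping that site's result to the buyer}
    """
    plan = {}
    for subquery, bids in offers.items():
        seller = min(bids, key=lambda site: bids[site] + comm_cost[site])
        plan[subquery] = (seller, bids[seller] + comm_cost[seller])
    return plan


# Hypothetical bids returned by seller sites for two sub-queries.
offers = {
    "q_emp": {"site_A": 40, "site_B": 25},
    "q_proj": {"site_B": 30, "site_C": 10},
}
comm_cost = {"site_A": 5, "site_B": 30, "site_C": 15}

plan = choose_sellers(offers, comm_cost)
plan_cost = sum(cost for _, cost in plan.values())

print(plan)
print(plan_cost)
```

Here site_B offers the cheapest local execution of q_emp, but site_A wins once the communication cost of reconstructing the result at the buyer is added.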
Use a semi-join operation to qualify the tuples that are to be joined. This reduces the
amount of data transferred, which in turn reduces communication cost.
Data Allocation
Data allocation is one of the key steps in building a distributed database system.
There are two common strategies used in data allocation:
• Data Fragmentation
• Data Replication
Data fragmentation takes one of three forms:
• Horizontal Fragmentation
• Vertical Fragmentation
• Hybrid Fragmentation
Vertical Fragmentation
Every data fragment gets a primary key that is required while restoring the original
table. The fragmentation is done in such a way that reconstructing a table from
fragments only requires a normal JOIN operation. To do so, a specific property
called Tuple-id is added to the schema.
Vertical Fragmentation is highly useful for cases when you want to enforce data
privacy.
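Vertical fragmentation with a tuple-id, and reconstruction by joining the fragments on it, can be sketched as follows. The STUDENT rows, the attribute grouping and the key name tuple_id are hypothetical.

```python
def vertical_fragment(rows, attr_groups):
    """Split rows into one fragment per attribute group.

    Each fragment row carries a tuple_id (the row position, an assumption)
    so that the original rows can be rebuilt with a join.
    """
    return [
        [{"tuple_id": tid, **{a: row[a] for a in attrs}}
         for tid, row in enumerate(rows)]
        for attrs in attr_groups
    ]


def reconstruct(fragments):
    """Join the fragments on tuple_id and drop the helper column."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row["tuple_id"], {}).update(row)
    return [
        {k: v for k, v in merged[tid].items() if k != "tuple_id"}
        for tid in sorted(merged)
    ]


# Hypothetical relation; fee data is kept apart from contact data.
STUDENT = [
    {"Name": "Asha", "Email": "asha@example.org", "Fees": 1200},
    {"Name": "Ben", "Email": "ben@example.org", "Fees": 900},
]

public_part, private_part = vertical_fragment(
    STUDENT, [["Name", "Email"], ["Fees"]]
)
rebuilt = reconstruct([public_part, private_part])

print(public_part)
print(private_part)
print(rebuilt == STUDENT)
```

Storing private_part at a more restricted site is one way the privacy benefit mentioned above can be realized.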
Hybrid Fragmentation
Here the tables are initially fragmented in any form (horizontal or vertical) and then
these fragments are partially replicated across different sites according to the
frequency of accessing the database fragments. In this case, the original table can be
reconstructed by applying union and natural JOIN operations in the appropriate order.
Moreover, your database fragments must be split up sensibly so that users with a
high demand volume can request and receive data from fragmented tables quickly. In
other words, your database fragmentation should ensure high query
performance and concurrent user processing. Additionally, you must be mindful of
the need to reduce distributed joins throughout the process, since they add to
your costs.
Advantages
Disadvantages
• When application views are defined on more than one fragment, they can
develop conflicting requirements.
• When fragmentation is applied recursively, the reconstruction task might become
rather large.
• Even simple operations, such as checking for dependencies, might result in
chasing data across several sites.
• When data from several fragments is required, access times can be very
high.
Data Replication
What is a Query?
A query is a statement requesting the retrieval of information. The
portion of a DML that involves information retrieval is called a query
language.
What is a Query Processor?
The main function of a query processor is to transform a high-level query (also called a calculus
query) into an equivalent lower-level query (also called an algebraic query). The conversion
must be correct and efficient. The conversion is correct if the low-level query has the same
semantics as the original query, i.e. if both queries produce the same result.
The problem of query processing can itself be decomposed into several sub-problems,
corresponding to various layers. A generic layering scheme for query processing is
shown where each layer solves a well-defined sub-problem. The input is a query on
global data expressed in relational calculus.