Rdbms Assignment
SYSTEM
QUERY OPTIMIZATION
TOPICS COVERED:-
- INTRODUCTION TO QUERY PROCESSING
- HEURISTIC APPROACH TO QUERY OPTIMIZATION
- COST ESTIMATION
- PIPELINING
INTRODUCTION TO QUERY PROCESSING
Definition
Query processing denotes the compilation and execution of a query specification usually
expressed in a declarative database query language such as the structured query language
(SQL). Query processing consists of a compile-time phase and a runtime phase.
Query processing is the set of activities involved in extracting data from the database. It
consists of three main steps: parsing and translation, optimization, and evaluation.
Query processing includes several activities for data retrieval. The user first writes the
query in a high-level database language such as SQL. The system then translates it into
expressions that can be used at the physical level of the file system. After this, a variety of
query-optimizing transformations and the actual evaluation of the query take place. SQL
(Structured Query Language) is well suited to humans because it is readable and
understandable, but it is not well suited to the internal representation of a query within the
system. Relational algebra is better suited for that purpose, so before processing a query the
system must translate it into a relational algebra form. This translation is the job of the
parser. When a user executes a query, the parser checks the syntax of the query and verifies
that the relation names and attribute names used in it exist in the database. The parser then
builds a tree of the query, known as the parse tree, and translates it into relational algebra.
During this step it also replaces every view used in the query with the expression that defines
the view.
Suppose a user wants to fetch the records of the employees whose salary is greater than or
equal to 10000 and writes the corresponding SQL query. After the query has been translated
into relational algebra, each relational algebra operation can be executed using one of
several different algorithms. This is how query processing begins its work.
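As a rough illustration of the example above, the following sketch assumes a hypothetical Employee relation with emp_name and salary attributes (the schema, the rows, and the SQL text in the comment are assumptions, not taken from the original). It evaluates two equivalent relational algebra expressions for the same request in plain Python.

```python
# Hypothetical Employee relation; schema and rows are assumptions for illustration.
# One possible SQL form of the request:
#   SELECT emp_name FROM Employee WHERE salary >= 10000;
employee = [
    {"emp_name": "Asha", "salary": 12000},
    {"emp_name": "Ben", "salary": 8000},
    {"emp_name": "Chen", "salary": 10000},
]


def select(relation, predicate):
    """Relational algebra selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Relational algebra projection: keep only the named attributes."""
    return [{a: row[a] for a in attributes} for row in relation]


def high_salary(row):
    return row["salary"] >= 10000


# Expression A: select first, then project the name.
plan_a = project(select(employee, high_salary), ["emp_name"])

# Expression B: project the needed attributes first, select, then project the name.
plan_b = project(
    select(project(employee, ["emp_name", "salary"]), high_salary),
    ["emp_name"],
)

print(plan_a)
print(plan_a == plan_b)
```

Both expressions return the same tuples, which is why the system is free to choose between them when it builds an evaluation plan.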
Evaluation
In addition to translating the query into relational algebra, the system must annotate the
translated expression with instructions that specify how each operation is to be evaluated.
After translating the user query, the system therefore executes a query evaluation plan.
o In order to fully evaluate a query, the system needs to construct a query evaluation
plan.
o The annotations in the evaluation plan may specify the algorithm to use for a
particular operation or the specific index to use.
o Such relational algebra with annotations is referred to as Evaluation Primitives. The
evaluation primitives carry the instructions needed for the evaluation of the operation.
o Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query
execution plan.
o A query execution engine is responsible for generating the output of the given query.
It takes the query execution plan, executes it, and returns the result of the user
query.
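To make the idea of an evaluation plan concrete, the sketch below uses SQLite (through Python's standard sqlite3 module) as a stand-in engine. The employee table, its rows, and the idx_salary index are assumptions made for illustration. EXPLAIN QUERY PLAN asks SQLite to describe the plan it chose; the exact wording of the plan differs between SQLite versions, so no output is shown here.

```python
import sqlite3

# In-memory database with a hypothetical employee table (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Asha", 12000), ("Ben", 8000), ("Chen", 10000)],
)
conn.execute("CREATE INDEX idx_salary ON employee (salary)")

query = "SELECT emp_name FROM employee WHERE salary >= 10000"

# Ask the engine which plan it would use; the last column holds the description.
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for row in plan_rows:
    print(row[-1])

# The execution engine then runs the plan and returns the result.
result = sorted(name for (name,) in conn.execute(query))
print(result)

conn.close()
```

The plan description typically states whether the table is scanned or searched through an index, which corresponds to the annotations discussed above.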
Optimization
o The cost of query evaluation can vary for different types of queries and for different
evaluation plans. Because the system is responsible for constructing the evaluation
plan, the user does not need to write the query efficiently.
o Usually, a database system generates an efficient query evaluation plan that
minimizes the cost. This task, performed by the database system, is known as
Query Optimization.
o To optimize a query, the query optimizer needs an estimated cost for each operation,
because the overall cost depends on factors such as the memory allocated to the
operations, their execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
HEURISTIC APPROACH TO QUERY OPTIMIZATION
Definition
Heuristic approaches to query optimization involve using rules, guidelines, and estimation
techniques to find a reasonably efficient query execution plan without exhaustively exploring
all possible plan options. These methods are based on heuristics, which are practical and
often intuitive strategies for decision-making.
Heuristic optimization transforms the expression-tree by using a set of rules which improve
the performance.
Rules:
o Perform SELECTION operations as early as possible in the query. Applying the
selection first reduces the number of records carried through the rest of the query,
rather than processing every record of every table.
o Perform PROJECTION operations as early as possible in the query. This is similar to
early selection, but it reduces the number of columns carried through the query.
o Perform the most restrictive join and selection operations first. That is, start with
the tables and/or views that produce the fewest records and are actually needed by
the query. A query generally executes faster when the tables being joined contain few
records.
Some systems use only heuristics and the others combine heuristics with partial cost-based
optimization.
The steps involved in heuristic optimization are explained below:
Step 1: Deconstruct the conjunctive selections into a sequence of single selection operations.
Step 2: Move the selection operations down the query tree for the earliest possible execution.
Step 3: First execute those selection and join operations which will produce the smallest
relations.
Step 4: Replace a cartesian product operation followed by a selection operation with a join
operation.
Step 5: Deconstruct the lists of projection attributes and move them as far down the tree as
possible.
Step 6: Identify those subtrees whose operations can be pipelined.
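The effect of these steps can be sketched with a small example. The relations, their contents, and the function names below are hypothetical. The unoptimized plan builds a cartesian product and then applies one conjunctive selection, while the optimized plan pushes the salary selection down and replaces the product plus join condition with a join.

```python
from itertools import product

# Hypothetical relations: (emp_id, emp_name, salary, dept_id) and (dept_id, dept_name).
employee = [
    (1, "Asha", 12000, 10),
    (2, "Ben", 8000, 20),
    (3, "Chen", 10000, 10),
    (4, "Dina", 7000, 30),
]
department = [(10, "Sales"), (20, "HR"), (30, "IT")]


def unoptimized():
    # Cartesian product first, then a single conjunctive selection.
    intermediate = list(product(employee, department))
    result = [
        (e[1], d[1])
        for e, d in intermediate
        if e[3] == d[0] and e[2] >= 10000
    ]
    return result, len(intermediate)


def optimized():
    # Selection on salary is pushed below the join.
    high_paid = [e for e in employee if e[2] >= 10000]
    # The product followed by the join condition becomes a (hash) join.
    dept_by_id = {d[0]: d for d in department}
    result = [
        (e[1], dept_by_id[e[3]][1]) for e in high_paid if e[3] in dept_by_id
    ]
    return result, len(high_paid)


slow_result, slow_size = unoptimized()
fast_result, fast_size = optimized()
print(slow_result == fast_result, slow_size, fast_size)
```

Both plans return the same answer, but the unoptimized plan builds 12 intermediate tuples while the optimized plan carries only the 2 tuples that survive the selection.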
1. Rule-Based Optimization:
- Rule-based optimization relies on a set of predefined rules or heuristics to guide the
query optimization process.
2. Cost-Based Optimization:
- Cost-based optimization combines heuristics with cost estimation.
3. Join Ordering Heuristics:
- Join ordering heuristics decide the order in which tables are joined in a multi-table query.
4. Index Selection Heuristics:
- Heuristics can guide the selection of appropriate indexes to improve query
performance.
Heuristic approaches to query optimization are widely used in practice because they are
efficient and can handle complex queries in a reasonable amount of time. However, they may
not always guarantee the absolute optimal query execution plan.
COST ESTIMATION
The main aim of query optimization is to choose the most efficient way of implementing the
relational algebra operations, at the lowest possible cost.
The query optimizer should not depend solely on heuristic rules. It should also estimate the
cost of executing the different strategies and choose the strategy with the minimum cost
estimate.
The cost functions used in query optimization are estimates and not exact cost functions. The
cost of an operation is heavily dependent on its selectivity, that is, the proportion of the input
tuples that form the output. In general, different algorithms are suitable for low or high
selectivity queries. For the query optimizer to choose a suitable algorithm for an operation,
an estimate of the cost of executing that algorithm must be provided. The cost of an
algorithm depends on the cardinality of its input.
To estimate the cost of different query execution strategies, the query tree is viewed as
containing a series of basic operations which are linked in order to perform the query. It is
also important to know the expected cardinality of an operation's output because this forms
the input to the next operation.
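The estimated costs, in block accesses, of some common selection strategies are listed below.
Here nBlocks(R) is the number of blocks occupied by relation R, bFactor(R) is the number of
tuples of R that fit in one block, SCA(R) is the selection cardinality of attribute A in R, and
nLevelA(I) is the number of levels in index I on attribute A.
Linear Search:
- [nBlocks(R)/2], if the record is found.
- [nBlocks(R)], if no record satisfies the condition.
Binary Search:
- [log2(nBlocks(R))], if the equality condition is on a key attribute, because SCA(R) = 1 in
this case.
- [log2(nBlocks(R))] + [SCA(R)/bFactor(R)] - 1, otherwise.
Equality condition on Primary key:
- [nLevelA(I) + 1]
Equality condition on Non-Primary key:
- [nLevelA(I) + 1] + [nBlocks(R)/2]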
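The linear search, binary search, and primary key formulas above can be turned into a small calculator. This is a minimal sketch: the function and parameter names are made up for illustration, the square brackets are assumed to mean rounding up, and the sample statistics (1000 blocks, selection cardinality 50, blocking factor 10, a 2-level index) are hypothetical.

```python
import math


def linear_search_cost(n_blocks, found=True):
    """Half the blocks on average if the record is found, all blocks otherwise."""
    return math.ceil(n_blocks / 2) if found else n_blocks


def binary_search_cost(n_blocks, selection_cardinality=1, b_factor=1, on_key=True):
    """Binary search on an ordered file; extra blocks are read for non-key matches."""
    base = math.ceil(math.log2(n_blocks))
    if on_key:
        return base
    return base + math.ceil(selection_cardinality / b_factor) - 1


def primary_index_equality_cost(n_levels):
    """Traverse the index levels, then read one data block."""
    return n_levels + 1


# Hypothetical statistics for a relation R.
n_blocks = 1000
costs = {
    "linear (found)": linear_search_cost(n_blocks),
    "linear (not found)": linear_search_cost(n_blocks, found=False),
    "binary (key)": binary_search_cost(n_blocks),
    "binary (non-key)": binary_search_cost(
        n_blocks, selection_cardinality=50, b_factor=10, on_key=False
    ),
    "primary index": primary_index_equality_cost(n_levels=2),
}
for strategy, cost in costs.items():
    print(strategy, cost)

cheapest = min(costs, key=costs.get)
print("cheapest:", cheapest)
```

With these numbers the index lookup is the cheapest strategy, which is the kind of comparison the optimizer makes when choosing an algorithm for an operation.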
PIPELINING
In the context of a Relational Database Management System (RDBMS), pipelining refers to a
technique used to optimize query processing and improve the efficiency of data retrieval.
Pipelining is particularly relevant in the context of complex queries that involve multiple
operations, such as joins, filtering, and aggregation. It aims to reduce the time it takes to
complete a query by processing and passing the data through a series of stages, each of which
performs a specific operation on the data.
1. Query Parsing: The first step is to parse and analyze the SQL query to understand the
operations it requires. This involves identifying the tables, columns, conditions, joins, and
sorting requirements specified in the query.
2. Query Optimization: Once the query is parsed, the RDBMS's query optimizer generates an
execution plan that outlines the most efficient way to retrieve and process the data. This plan
considers factors such as indexing, table access methods, and join strategies.
3. Pipeline Stages: The query execution plan is then broken down into a series of pipeline
stages. Each stage is responsible for a specific operation, such as table scanning, filtering,
joining, or aggregation. For example, in a query with multiple joins, there may be separate
stages for each join operation.
4. Data Flow: Data flows through these pipeline stages in a stream-like fashion. As data is
processed at each stage, it is passed to the next stage in the pipeline without waiting for the
entire dataset to be processed.
5. Parallel Execution: Pipelining allows for parallel execution of these stages, which can
significantly improve query performance. Multiple stages can run concurrently on different
parts of the dataset, reducing the overall query execution time.
6. Resource Management: The RDBMS manages resources and ensures that the data flows
efficiently through the pipeline stages. It allocates memory and processing power
appropriately to optimize the query's performance.
7. Output Generation: Once all pipeline stages have been executed, the final results are
generated and returned to the user as the query's output.
Pipelining is especially effective for queries that involve large datasets and complex
operations because it minimizes the need to store intermediate results and allows for efficient
use of system resources. However, not all RDBMS systems support pipelining, and its
effectiveness can depend on the specific query and the database engine's capabilities. It's
important to note that the degree of parallelism, the efficiency of the execution plan, and the
database schema design all play significant roles in determining the performance benefits of
pipelining.
Advantages of Pipeline
o It reduces the cost of query evaluation by eliminating the cost of reading and writing
the temporary relations, unlike the materialization process.
o If the root operator of a query evaluation plan is combined in a pipeline with its
inputs, query results start to be produced quickly. This benefits users because they
can view results as soon as the outputs are generated. Otherwise, users must wait a
long time before they can view any query results.
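The difference between materialization and pipelining can be sketched with Python generators. The rows and function names are hypothetical. The materialized version stores a temporary relation between the selection and the projection, while the pipelined version passes each tuple straight to the next operation.

```python
# Hypothetical input relation: (emp_name, salary).
rows = [("Asha", 12000), ("Ben", 8000), ("Chen", 10000), ("Dina", 7000)]


def materialized(relation):
    # The selection result is stored in full before the projection starts.
    temp_selected = [r for r in relation if r[1] >= 10000]
    names = [r[0] for r in temp_selected]
    return names, len(temp_selected)  # result and number of temporary tuples stored


def pipelined(relation):
    # Generator expressions pass one tuple at a time; nothing is stored in between.
    selected = (r for r in relation if r[1] >= 10000)
    return (r[0] for r in selected)


names, temp_tuples = materialized(rows)
print(names, temp_tuples)

pipeline = pipelined(rows)
first = next(pipeline)  # available after reading only the first input tuple
rest = list(pipeline)
print(first, rest)
```

The pipelined version produces its first result without building any temporary relation, which is the saving described in the first advantage above.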
Implementation of Pipelining
To implement a pipeline that evaluates multiple operations of a user query, we could construct
a single, complex operation that merges those operations. However, such an approach is
feasible and efficient only for some frequently occurring situations.
The system can use any of the following ways for executing a pipeline:
Demand-driven Pipeline
In the demand-driven pipeline, the system repeatedly requests tuples from the operation at the
top of the pipeline. Each time the operation receives a request, it computes the next tuple or
tuples to be returned and then returns them. If the inputs of the operation are not pipelined,
the next tuples are computed directly from the input relations, while the system keeps track of
what has been returned so far. If the operation has pipelined inputs, it in turn requests tuples
from those inputs. After receiving tuples from its pipelined inputs, the operation uses them to
compute tuples for its own output and passes them up to its parent. Thus, in the
demand-driven pipeline, tuples are produced on the basis of the demand, or requests, made
by the system.
Each operation in a demand-driven pipeline can be implemented as an iterator that provides
the functions open(), next(), and close():
o After the open() function is invoked, each call to next() returns the next output tuple
of the operation.
o In turn, the implementation of the operation invokes open() and next() on its inputs,
so that input tuples are available when needed.
o Once no more tuples are required, the close() function tells the iterator so.
o Between calls, the iterator maintains its state of execution. As a result, successive
next() calls return successive result tuples.
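The iterator interface described above can be sketched as follows. The class names Scan and Select and the sample rows are hypothetical, and returning None to signal the end of the input is a convention chosen for this sketch.

```python
class Scan:
    """Leaf operation: returns the tuples of a stored relation one by one."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0  # state kept between calls to next()

    def next(self):
        if self.pos >= len(self.rows):
            return None  # no more tuples
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Select:
    """Selection that pulls tuples from its pipelined input on demand."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


rows = [("Asha", 12000), ("Ben", 8000), ("Chen", 10000)]
top = Select(Scan(rows), lambda r: r[1] >= 10000)

top.open()
results = []
row = top.next()
while row is not None:  # the system keeps requesting tuples from the top
    results.append(row)
    row = top.next()
top.close()
print(results)
```

Each call to next() on the top operation triggers only as many next() calls on its input as are needed to produce one more output tuple.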
Producer-driven Pipeline
The producer-driven pipeline differs from the demand-driven pipeline. In the producer-driven
pipeline, the operations do not wait for requests from the system before producing tuples.
Instead, they produce tuples eagerly. Each operation is modeled as a separate thread or
process within the system, which takes a stream of tuples from its pipelined inputs and
generates a stream of tuples for its output.
The producer-driven pipeline is also implemented differently from the demand-driven pipeline.
The implementation proceeds in the following steps:
o For each pair of adjacent operations, the system creates a buffer that holds the
tuples being passed from one operation to the next.
o After the buffers are created, the processes corresponding to the different
operations execute concurrently.
o The operations at the bottom of the pipeline continually produce output tuples and
put them in their output buffer until the buffer becomes full.
o As soon as an operation uses a tuple from a pipelined input, it removes that tuple
from its input buffer.
o If the output buffer becomes full, the operation waits until the buffer has space for
more tuples. The parent operation is responsible for removing tuples from the
buffer, so in effect the operation waits for its parent to do so.
o When the buffer has space again, the operation resumes producing tuples and
continues until the buffer becomes full again.
o The operation repeats this process until all of its output tuples have been generated.
Note: It becomes necessary for the system to switch operations if an input buffer is empty,
the output buffer is full, or when it needs more input tuples for generating more output
tuples.
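The producer-driven approach can be sketched with threads and bounded queues from Python's standard library. The function names, the DONE end-of-stream marker, and the buffer size of 2 are assumptions made for this sketch. A put() on a full buffer blocks until the consumer removes a tuple, and a get() on an empty buffer blocks until the producer adds one.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (a convention chosen for this sketch)


def scan(rows, out_buf):
    # Bottom operation: eagerly produces tuples until its output buffer is full.
    for row in rows:
        out_buf.put(row)  # blocks while the buffer is full
    out_buf.put(DONE)


def select(in_buf, out_buf, predicate):
    # Middle operation: consumes from its input buffer, produces to its output buffer.
    while True:
        row = in_buf.get()  # blocks while the buffer is empty
        if row is DONE:
            out_buf.put(DONE)
            return
        if predicate(row):
            out_buf.put(row)


rows = [("Asha", 12000), ("Ben", 8000), ("Chen", 10000), ("Dina", 7000)]

# One bounded buffer for each pair of adjacent operations.
scan_to_select = queue.Queue(maxsize=2)
select_to_output = queue.Queue(maxsize=2)

threads = [
    threading.Thread(target=scan, args=(rows, scan_to_select)),
    threading.Thread(
        target=select,
        args=(scan_to_select, select_to_output, lambda r: r[1] >= 10000),
    ),
]
for t in threads:
    t.start()

results = []
while True:
    row = select_to_output.get()
    if row is DONE:
        break
    results.append(row)

for t in threads:
    t.join()
print(results)
```

Each operation runs independently and stalls only when its input buffer is empty or its output buffer is full, which matches the switching conditions in the note above.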
The demand-driven pipeline and the producer-driven pipeline differ in the following ways:
o The demand-driven pipeline is similar to pulling data up from the top of an operation
tree. The producer-driven pipeline is similar to pushing data up from the bottom of an
operation tree.
o The demand-driven pipeline is the approach most commonly used for evaluating an
expression. The producer-driven pipeline is rarely used in typical systems, but it is
good for parallel processing systems.