Chapter 2 Query Processing
QUERY PROCESSING
Query processing and optimization are crucial components of a database management system
(DBMS). They work together to ensure that database queries are executed efficiently,
accurately, and within the expected time frame.
Query Processing
1. Parsing: The query is broken down into smaller components, such as tables, columns, and
conditions.
2. Optimization: The query is analyzed and optimized by considering various factors, such as
the available indexes, statistics, and system resources.
3. Query Rewriting: The query is rewritten to improve performance, such as by reordering
joins or eliminating unnecessary operations.
4. Execution: The optimized query is executed, and the results are returned to the user.
Query Optimization
1. Access Path Selection: Choosing the most efficient method to access the required data, such
as using an index or a full table scan.
2. Join Order Optimization: Determining the optimal order in which to join tables, taking into
account factors such as data distribution and indexing.
3. Query Transformation: Modifying the query to reduce overhead, improve parallelism, or
optimize for specific data distributions.
4. Index Selection: Choosing the most effective index for a query, considering factors such as
data distribution, indexing overhead, and query frequency.
Types of Query Optimizers
1. Rule-Based Optimizers: Use a set of pre-defined rules to choose a query plan.
2. Cost-Based Optimizers: Estimate the execution cost of candidate plans, often enumerating them
with dynamic programming, and choose the plan with the lowest estimated cost.
3. Hybrid Optimizers: Combine rule-based and cost-based approaches to improve optimization
performance.
Query Optimization Challenges
The activities involved in parsing, validating, optimizing, and executing a query are together called
query processing. It is the activity performed to extract data from the database, and it proceeds in
several steps:
1. Parsing and translation
2. Evaluation
3. Optimization
Parsing and Translation
Query processing includes several activities for data retrieval. A user writes a query in a high-level
database language such as SQL. Before the query can be processed, the system must translate it into
an internal form that can be used at the physical level of the file system. After this, a variety of
query-optimizing transformations and the actual evaluation of the query take place. SQL, or
Structured Query Language, is well suited to humans, but it is not well suited for the internal
representation of a query inside the system. Relational algebra is well suited for that internal
representation. The translation step works much like the parser of a compiler. When a user executes
a query, the parser checks the syntax of the query and verifies that the relation names and
attribute names it mentions exist in the database. The parser then builds a tree of the query, known
as a 'parse tree', and translates it into relational algebra. During this step it also replaces every
view used in the query with the expression that defines the view. The working of query processing
is summarized in the diagram below:
Suppose a user executes a query. As we have seen, there are various ways of extracting data from
the database. Suppose the user wants to fetch the names of the employees whose salary is greater
than 10000. The following SQL query does this:
SELECT emp_name FROM Employee WHERE salary > 10000;
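As a rough illustration of the translation step, the query above corresponds to a selection followed by a projection in relational algebra: π emp_name (σ salary > 10000 (Employee)). The sketch below uses Python's standard sqlite3 module to run the SQL, then evaluates the same algebra expression by hand. The sample rows are invented for illustration; only the table and column names come from the query.

```python
import sqlite3

# Hypothetical sample rows: (emp_name, salary).
rows = [("Abel", 9000), ("Sara", 12000), ("Yonas", 10000), ("Hana", 25000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (emp_name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", rows)

# What the user writes (ORDER BY added only to make the output deterministic).
sql_result = [r[0] for r in conn.execute(
    "SELECT emp_name FROM Employee WHERE salary > 10000 ORDER BY emp_name")]

# The internal form, evaluated by hand: selection first, then projection.
selected = [r for r in rows if r[1] > 10000]            # sigma salary > 10000 (Employee)
algebra_result = sorted(name for name, _ in selected)   # pi emp_name (...)

print(sql_result)
print(algebra_result)
```

Both lists contain the same names, which is the point of the translation: the algebra expression is just another way of stating what the SQL asks for.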
2. Selection
The select operation chooses the subset of tuples from a relation that satisfy a given selection
condition. It is also known as horizontal partitioning, since it partitions the table or relation
horizontally.
Notation: σ c(R), where the symbol 'σ' (sigma) denotes the select operator, 'c' is the selection
condition (a boolean expression), and R is a relational algebra expression whose result is a relation.
The condition can be a single comparison such as Id_No = 2, or a combination of comparisons such as
Course = "OOP" AND Name = "Hagos Haile". Each comparison in 'c' has one of the following forms:
<attribute name> <comparison operator> <constant value> or
<attribute name> <comparison operator> <attribute name>
where <attribute name> is the name of an attribute of the relation, <comparison operator> is one of
{<, >, =, <=, >=, !=}, and <constant value> is a constant taken from the domain of the attribute.
Example-1:
The query immediately above is called a nested expression. As usual, we evaluate the inner expression
first (which results in a relation, say Manager1), then we evaluate the outer expression on Manager1
(the relation obtained from the inner expression). The result is again a relation, with the same
schema as the relation we started from.
Example-2:
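Given a relation Student (Roll, Name, Department, Fees, Team) with the following tuples:
| Roll | Name   | Department | Fees  | Team |
|------|--------|------------|-------|------|
| 1    | Abebe  | CS         | 22000 | A    |
| 2    | Kebede | CS         | 34000 | A    |
| 3    | Lwam   | IT         | 36000 | C    |
| 4    | Aster  | IT         | 56000 | D    |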
Select all the students of Team A:
σ Team = 'A' (Student)
This results in:
| Roll | Name   | Department | Fees  | Team |
|------|--------|------------|-------|------|
| 1    | Abebe  | CS         | 22000 | A    |
| 2    | Kebede | CS         | 34000 | A    |
Select all the students of department IT whose fees are greater than or equal to 10000 and who belong
to a team other than A:
σ Department = 'IT' AND Fees >= 10000 (σ Team != 'A' (Student))
This results in:
| Roll | Name  | Department | Fees  | Team |
|------|-------|------------|-------|------|
| 3    | Lwam  | IT         | 36000 | C    |
| 4    | Aster | IT         | 56000 | D    |
Important points about the select operation: The select operator is unary, which means it is applied
to a single relation only. The selection operation is commutative, that is,
σ c1 (σ c2 (R)) = σ c2 (σ c1 (R))
The degree (number of attributes) of the relation resulting from a selection is the same as the
degree of the given relation. The cardinality (number of tuples) of the resulting relation satisfies
0 <= |σ c (R)| <= |R|
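The properties above can be checked on the Student relation from Example-2. The following is a minimal sketch in Python; the helper name `select` and the dictionary representation of tuples are choices made for this illustration, not part of any particular DBMS.

```python
# Student relation from Example-2, one dict per tuple.
student = [
    {"Roll": 1, "Name": "Abebe",  "Department": "CS", "Fees": 22000, "Team": "A"},
    {"Roll": 2, "Name": "Kebede", "Department": "CS", "Fees": 34000, "Team": "A"},
    {"Roll": 3, "Name": "Lwam",   "Department": "IT", "Fees": 36000, "Team": "C"},
    {"Roll": 4, "Name": "Aster",  "Department": "IT", "Fees": 56000, "Team": "D"},
]

def select(relation, condition):
    """sigma_condition(relation): keep only the tuples that satisfy the condition."""
    return [t for t in relation if condition(t)]

not_team_a = lambda t: t["Team"] != "A"
fees_ok = lambda t: t["Fees"] >= 10000

team_a = select(student, lambda t: t["Team"] == "A")

# Commutativity: applying the two selections in either order gives the same relation.
order_1 = select(select(student, not_team_a), fees_ok)
order_2 = select(select(student, fees_ok), not_team_a)

print([t["Name"] for t in team_a])
print(order_1 == order_2)
# Degree is unchanged and cardinality never exceeds that of the input.
print(len(team_a[0]) == len(student[0]), 0 <= len(team_a) <= len(student))
```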
Difference between Selection and Projection in DBMS
2. Use: Selection is used to choose the subset of tuples from the relation that satisfies the given
condition mentioned in the syntax of selection. Projection is used to select certain required
attributes, while discarding other attributes.
3. Set Union:
Notation: R ∪ S
Example:
Output:
CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes
4. Set Intersection:
Suppose there are two relations R and S. The set intersection operation contains all tuples that are
in both R and S. It is denoted by ∩.
Notation: R ∩ S
Output:
CUSTOMER_NAME
Smith
Jones
5. Set Difference:
Suppose there are two relations R and S. The set difference operation contains all tuples that are
in R but not in S. It is denoted by minus (-).
Notation: R - S
Output:
CUSTOMER_NAME
Jackson
Hayes
Williams
Curry
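The three set operations map directly onto Python's built-in set type. In the sketch below, R and S are assumed relations of customer names, reconstructed so that they reproduce the three outputs listed above; the original input tables are not shown in this text.

```python
# Assumed single-attribute relations (CUSTOMER_NAME), chosen to match the outputs above.
R = {"Smith", "Jones", "Jackson", "Hayes", "Williams", "Curry"}
S = {"Johnson", "Smith", "Mayes", "Turner", "Jones", "Lindsay"}

union = R | S          # R ∪ S: tuples in R, in S, or in both (duplicates removed)
intersection = R & S   # R ∩ S: tuples in both R and S
difference = R - S     # R - S: tuples in R but not in S

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
```

Note that difference is not symmetric: S - R would return the names that appear only in S.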
6. Cartesian product
The Cartesian product is used to combine each row in one table with each row in the other table.
It is also known as a cross product. It is denoted by X.
Notation: E X D
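A small sketch of the Cartesian product using Python's `itertools.product`. The two tables and their rows are hypothetical and are kept tiny so that every combination can be printed.

```python
from itertools import product

# Hypothetical relations: E(EMP_ID, EMP_NAME) and D(DEPT_NO, DEPT_NAME).
employee = [(1, "Smith"), (2, "Harry")]
department = [("A", "Marketing"), ("B", "Sales"), ("C", "Legal")]

# E X D: every row of E is concatenated with every row of D.
cross = [e + d for e, d in product(employee, department)]

for row in cross:
    print(row)

# The result has |E| * |D| rows and (degree of E + degree of D) columns.
print(len(cross), len(cross[0]))
```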
Example: EMPLOYEE X DEPARTMENT
Output:
7. Rename operation:
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Example: We can use the rename operator to rename STUDENT relation to STUDENT1.
ρ(STUDENT1, STUDENT)
8. Join operations:
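EMPLOYEE
| EMP_CODE | EMP_NAME |
|----------|----------|
| 101      | Stephan  |
| 102      | Jack     |
| 103      | Harry    |
SALARY
| EMP_CODE | SALARY |
|----------|--------|
| 101      | 50000  |
| 102      | 30000  |
| 103      | 25000  |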
Result:
A. Natural Join:
Example: Let's use the above EMPLOYEE table and SALARY table:
Output:
EMP_NAME SALARY
Stephan 50000
Jack 30000
Harry 25000
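The natural join above can be sketched in a few lines of Python. The helper `natural_join` is written for this illustration, and the SALARY row for EMP_CODE 103 is inferred from the output shown above rather than taken from a listed input row.

```python
employee = [
    {"EMP_CODE": 101, "EMP_NAME": "Stephan"},
    {"EMP_CODE": 102, "EMP_NAME": "Jack"},
    {"EMP_CODE": 103, "EMP_NAME": "Harry"},
]
salary = [
    {"EMP_CODE": 101, "SALARY": 50000},
    {"EMP_CODE": 102, "SALARY": 30000},
    {"EMP_CODE": 103, "SALARY": 25000},  # inferred from the output above
]

def natural_join(r, s):
    """Join two relations on every attribute name they have in common."""
    if not r or not s:
        return []
    common = set(r[0]) & set(s[0])
    return [{**a, **b} for a in r for b in s
            if all(a[c] == b[c] for c in common)]

joined = natural_join(employee, salary)
# Project EMP_NAME and SALARY, as in the output table.
result = [(t["EMP_NAME"], t["SALARY"]) for t in joined]
print(result)
```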
B. Outer Join:
The outer join operation is an extension of the join operation. It is used to deal with missing
information.
Example:
EMPLOYEE FACT_WORKERS
Left outer join contains the set of tuples of all combinations in R and S that are equal on their
common attribute names. In the left outer join, tuples in R that have no matching tuples in S are
also kept. It is denoted by ⟕.
Right outer join contains the set of tuples of all combinations in R and S that are equal on their
common attribute names. In the right outer join, tuples in S that have no matching tuples in R are
also kept. It is denoted by ⟖.
Output:
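Full outer join is like a left or right outer join except that it contains all rows from both tables.
In a full outer join, tuples in R that have no matching tuples in S and tuples in S that have no
matching tuples in R are both kept. It is denoted by ⟗.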
Output:
C. Equi join:
It is a form of inner join and is the most common join. It matches rows according to an equality
condition, so the only comparison operator it uses is (=).
Example:
CUSTOMER PRODUCT
Output:
Translating SQL queries into relational algebra involves expressing the query semantics using
operations that align with the relational algebra formalism. Below, I will go through a few SQL
queries and translate them into relational algebra with suitable examples.
Example Schema
Let's consider a simplified database consisting of two tables: Employees and Departments.
Employees Table:
Departments Table:
| DeptID | DeptName |
|--------|---------------|
| 101 | Engineering |
| 102 | Sales |
| 103 | HR |
Example 1: Selection
SQL Query:
Relational Algebra:
σ(DeptID = 101)(Employees)
Explanation: The selection operator σ is used to filter rows based on the condition DeptID =
101.
Example 2: Projection
SQL Query:
Relational Algebra:
π(EmpName)(Employees)
Explanation: The projection operator π selects only the EmpName column from the Employees
table.
Example 3: Join
SQL Query:
Relational Algebra:
Explanation: The join operator ⨝ combines rows from both tables based on the matching
DeptID. The projection selects only EmpName and DeptName.
Example 4: Grouping and Aggregation
SQL Query:
Explanation: The grouping operator γ is used here to group records by DeptID and count the
number of employees in each department.
Example 5: Union
SQL Query:
Relational Algebra:
Explanation: The union operator ∪ combines the results from the two selection operations
where DeptID is either 101 or 102, projecting the EmpName.
Example 6: Intersection
SQL Query:
Relational Algebra:
Explanation: The intersection operator ∩ returns only the employee names that appear in both
sets (Department 101 and names starting with 'A').
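The six translations can be tried out with Python's standard sqlite3 module. The SQL statements below are reconstructions based on the explanations above, not the original queries. The Employees columns (EmpID, EmpName, DeptID) and all Employees rows are assumptions made for this sketch; the Departments rows come from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (DeptID INTEGER, DeptName TEXT);
INSERT INTO Departments VALUES (101, 'Engineering'), (102, 'Sales'), (103, 'HR');
-- Hypothetical Employees schema and rows.
CREATE TABLE Employees (EmpID INTEGER, EmpName TEXT, DeptID INTEGER);
INSERT INTO Employees VALUES
  (1, 'Alice', 101), (2, 'Bob', 102), (3, 'Amanuel', 101), (4, 'Chaltu', 103);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# 1. Selection: sigma(DeptID = 101)(Employees)
selection = run("SELECT * FROM Employees WHERE DeptID = 101 ORDER BY EmpID")
# 2. Projection: pi(EmpName)(Employees); DISTINCT mirrors the set semantics of pi
projection = run("SELECT DISTINCT EmpName FROM Employees ORDER BY EmpName")
# 3. Join: pi(EmpName, DeptName)(Employees JOIN Departments on DeptID)
join = run("""SELECT e.EmpName, d.DeptName FROM Employees e
              JOIN Departments d ON e.DeptID = d.DeptID ORDER BY e.EmpName""")
# 4. Grouping: group by DeptID and count the employees in each group
grouping = run("SELECT DeptID, COUNT(*) FROM Employees GROUP BY DeptID ORDER BY DeptID")
# 5. Union: pi(EmpName)(sigma(DeptID=101)) UNION pi(EmpName)(sigma(DeptID=102))
union = run("""SELECT EmpName FROM Employees WHERE DeptID = 101
               UNION SELECT EmpName FROM Employees WHERE DeptID = 102
               ORDER BY EmpName""")
# 6. Intersection: department 101 INTERSECT names starting with 'A'
intersection = run("""SELECT EmpName FROM Employees WHERE DeptID = 101
                      INTERSECT SELECT EmpName FROM Employees WHERE EmpName LIKE 'A%'
                      ORDER BY EmpName""")

for name, rows in [("selection", selection), ("projection", projection), ("join", join),
                   ("grouping", grouping), ("union", union), ("intersection", intersection)]:
    print(name, rows)
```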
Basic Algorithms for Executing Query Operations
Executing query operations in databases generally involves a series of basic algorithms that help
retrieve or manipulate data based on specific conditions. Below, I will outline some of the basic
algorithms commonly used for executing query operations, along with suitable examples.
1. Sequential Search
This is the simplest form of searching in a dataset, where each element is checked until the
desired value is found.
Example:
You have a table of users, and you want to find a user with the name "Alice".
Internally, a sequential search will iterate through each row in the users table until it finds the
row where the name is "Alice".
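A minimal sketch of a sequential (linear) search over an in-memory list of rows. The `users` rows and the function name are hypothetical; a real engine scans disk pages rather than a Python list, but the control flow is the same.

```python
# Hypothetical users table; rows are in no particular order.
users = [
    {"user_id": 3, "name": "Bob"},
    {"user_id": 7, "name": "Alice"},
    {"user_id": 9, "name": "Carol"},
]

def sequential_search(rows, column, value):
    """Check rows one by one; return (first matching row or None, comparisons made)."""
    comparisons = 0
    for row in rows:
        comparisons += 1
        if row[column] == value:
            return row, comparisons
    return None, comparisons

match, comparisons = sequential_search(users, "name", "Alice")
print(match, comparisons)
```

In the worst case (no match, or a match in the last row) every row is examined, so the cost grows linearly with the table size.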
2. Binary Search
Binary search is a more efficient search method, but it requires that the data is sorted. It works by
repeatedly dividing the search interval in half.
Example:
Consider a sorted list of user IDs:
To find the user with a specific ID (say 7), the binary search would check the middle of the list.
If the middle ID is less than 7, it would search the upper half; if greater, it would search the
lower half.
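The halving procedure described above can be sketched as follows. The list of user IDs is a hypothetical example; the only requirement is that it is sorted.

```python
def binary_search(sorted_ids, target):
    """Return the index of target in sorted_ids, or -1 if it is not present."""
    low, high = 0, len(sorted_ids) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_ids[mid] == target:
            return mid
        if sorted_ids[mid] < target:
            low = mid + 1    # target can only be in the upper half
        else:
            high = mid - 1   # target can only be in the lower half
    return -1

user_ids = [1, 3, 5, 7, 9, 11, 13]   # hypothetical sorted IDs
position = binary_search(user_ids, 7)
missing = binary_search(user_ids, 4)
print(position, missing)
```

Each iteration discards half of the remaining interval, so about log2(n) comparisons suffice for n sorted values.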
3. Indexing
Indexing is a technique that allows fast retrieval of rows in a database. An index is a data
structure (like a B-tree or hash table) that improves the speed of data retrieval operations.
Example:
If you have an index on the name column in the users table, the database engine can directly
locate "Alice" without scanning every row.
With indexing, the database could quickly navigate to the index and retrieve the relevant rows.
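One way to observe the effect of an index is to ask the engine for its plan before and after creating one. The sketch below does this with SQLite through Python's sqlite3 module; the table contents and the index name `idx_users_name` are assumptions, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("Alice",), ("Bob",), ("Carol",)])

query = "SELECT * FROM users WHERE name = 'Alice'"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)                      # expected to describe a full scan
conn.execute("CREATE INDEX idx_users_name ON users (name)")
plan_after = plan(query)                       # expected to mention the new index

print(plan_before)
print(plan_after)
```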
4. Hashing
Hashing involves using a hash function to compute the address of a data element, leading to fast
access times.
Example:
A hash index on a table of products could allow a quick lookup for a product by its SKU (Stock
Keeping Unit).
Instead of searching linearly or using a binary search, the database uses the hash function to find
the correct location of 'SKU1234' in the index.
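A toy hash index can make the idea concrete. The class below is a simplified sketch written for this section: it keeps buckets in memory and stores row positions, whereas a real hash index stores page addresses on disk. The product rows are hypothetical.

```python
class HashIndex:
    """Toy hash index: each bucket holds (key, row position) pairs."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash function maps a key straight to one bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, row_pos):
        self._bucket(key).append((key, row_pos))

    def lookup(self, key):
        # Only one bucket is inspected; colliding keys are filtered out.
        return [pos for k, pos in self._bucket(key) if k == key]

# Hypothetical products table: (sku, name).
products = [("SKU1000", "Pen"), ("SKU1234", "Notebook"), ("SKU2000", "Stapler")]

index = HashIndex()
for pos, (sku, _) in enumerate(products):
    index.insert(sku, pos)

hits = [products[pos] for pos in index.lookup("SKU1234")]
print(hits)
```

Hash indexes suit equality lookups like this one; they do not help with range conditions such as price < 100, because hashing does not preserve order.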
5. Join Operations
Joining tables is a common operation in relational databases, and there are various algorithms for
this, including nested loop joins, hash joins, and merge joins.
Nested Loop Join: Go through each row of the first table and for each row, search the
second table.
Example:
SELECT *
FROM users u
JOIN orders o ON u.user_id = o.user_id;
In a nested loop join, for each user, the database will check all orders to find matching user IDs.
Hash Join: Build a hash table from one of the tables and then probe it for matches.
Merge Join: Requires both tables to be sorted; it merges them based on matching keys.
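The three join algorithms can be compared on the users/orders query above. This is a simplified in-memory sketch with hypothetical rows; real implementations work block by block and spill to disk when the inputs do not fit in memory.

```python
users = [(1, "Alice"), (2, "Bob"), (3, "Carol")]   # (user_id, name), hypothetical
orders = [(10, 1), (11, 3), (12, 1)]               # (order_id, user_id), hypothetical

def nested_loop_join(users, orders):
    # Compare every user with every order: |users| * |orders| comparisons.
    return [(u, o) for u in users for o in orders if u[0] == o[1]]

def hash_join(users, orders):
    table = {}
    for u in users:                                   # build phase
        table.setdefault(u[0], []).append(u)
    return [(u, o) for o in orders                    # probe phase
            for u in table.get(o[1], [])]

def merge_join(users, orders):
    left = sorted(users, key=lambda u: u[0])          # both inputs sorted on the key
    right = sorted(orders, key=lambda o: o[1])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][1]:
            i += 1
        elif left[i][0] > right[j][1]:
            j += 1
        else:
            k = j                                     # emit the whole run of equal keys
            while k < len(right) and right[k][1] == left[i][0]:
                out.append((left[i], right[k]))
                k += 1
            i += 1
    return out

results = [sorted(f(users, orders)) for f in (nested_loop_join, hash_join, merge_join)]
print(results[0])
print(results[0] == results[1] == results[2])
```

All three return the same rows; they differ in how much work they do and in what they require of the inputs (nothing, enough memory for a hash table, or sorted order).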
6. Projection and Selection
These operations involve filtering and selecting specific columns (projection) or rows (selection)
of the data.
Example:
To get only the names of users:
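A hedged sketch of what such queries could look like, run through Python's sqlite3 module. The `users` columns and rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "Alice", 30), (2, "Bob", 17), (3, "Carol", 45)])

# Projection: keep one column, all rows.
names = [r[0] for r in conn.execute("SELECT name FROM users ORDER BY user_id")]

# Selection: keep all columns, only the rows that satisfy the condition.
adults = conn.execute("SELECT * FROM users WHERE age >= 18 ORDER BY user_id").fetchall()

print(names)
print(adults)
```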
7. Aggregation
Aggregation functions like SUM, COUNT, AVG, MAX, and MIN are used to summarize data.
Example:
To count how many users are in the database:
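In SQL this would typically be written as SELECT COUNT(*) FROM users. Internally, the simple aggregates can all be computed in a single pass over the rows, as the following sketch shows; the function name and the sample ages are hypothetical.

```python
def aggregate(values):
    """Compute COUNT, SUM, AVG, MIN and MAX in one pass over the input."""
    count, total, lowest, highest = 0, 0, None, None
    for v in values:
        count += 1
        total += v
        lowest = v if lowest is None or v < lowest else lowest
        highest = v if highest is None or v > highest else highest
    average = total / count if count else None   # AVG of no rows is undefined (NULL)
    return {"COUNT": count, "SUM": total, "AVG": average, "MIN": lowest, "MAX": highest}

ages = [30, 17, 45]   # hypothetical ages of three users
summary = aggregate(ages)
print(summary)
```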
Heuristic query optimization involves employing rules of thumb or practical approaches that help
in generating efficient execution plans for SQL queries without exhaustive searching. In contrast
to cost-based optimization, heuristics are typically faster and less resource-intensive. Here are
some examples of heuristic strategies:
1. Predicate Pushdown: Moving selection operations as close to the data source as possible
(e.g., filtering rows before joining). For example, in a query that selects products with a
price below a certain threshold from a certain category:
SELECT * FROM Products WHERE category_id = 3 AND price < 100;
A heuristic might first filter the Products table by category_id, reducing the number of rows
in memory for subsequent operations.
2. Join Order Optimization: Generally, smaller tables are joined first to reduce the size of
intermediate results. For example, if you have three tables A, B, and C, and B is significantly
smaller than A or C, a heuristic might suggest joining B with one of the larger tables first and
then joining that result with the remaining table.
3. Reusing Intermediate Results: A subquery whose result is needed more than once can be computed
once and reused. For example:
WITH Temp AS (
SELECT customer_id FROM Orders WHERE order_date > '2023-01-01'
)
SELECT * FROM Customers WHERE id IN (SELECT customer_id FROM Temp);
A heuristic might determine that it's more efficient to store Temp and reuse it.
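The benefit of predicate pushdown (the first heuristic above) shows up in the size of the intermediate result. The sketch below compares "join, then filter" with "filter, then join" on generated data. The Categories table, the row counts, and the helper `join` are all hypothetical.

```python
# Hypothetical data: Products(product_id, category_id, price), Categories(category_id, name).
products = [(i, i % 5, 10 * i) for i in range(1, 51)]
categories = [(c, f"cat{c}") for c in range(5)]

def join(products, categories):
    return [(p, c) for p in products for c in categories if p[1] == c[0]]

def wanted(p):
    return p[1] == 3 and p[2] < 100        # category_id = 3 AND price < 100

# Plan A: join everything first, filter afterwards.
joined = join(products, categories)
plan_a = [(p, c) for p, c in joined if wanted(p)]

# Plan B: push the selection below the join.
filtered = [p for p in products if wanted(p)]
plan_b = join(filtered, categories)

print(plan_a == plan_b)                    # same answer either way
print(len(joined), len(filtered))          # rows fed through the join in each plan
```

Both plans return the same rows, but plan B sends only the filtered products into the join, which is why optimizers apply selections as early as possible.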
1. Selectivity Calculation:
For a table with 10,000 rows, if a specific condition (like salary > 80,000) returns
around 500 rows, the selectivity for that predicate is 500 / 10,000 = 0.05 (or 5%).
2. Cost Estimates:
Cost estimates involve evaluating the resources that will be consumed to execute a query,
such as CPU time, I/O operations, and memory usage. Cost models may consider:
o The number of rows to process based on selectivity.
o Data distribution (e.g., how indexed or clustered the data is).
o Join costs based on how tables are joined (nested loop, hash join, etc.).
Using selectivity, the database optimizer can estimate how many rows will likely be processed
under different predicates and choose the most efficient plan. For example, if choosing between:
If it’s known through statistics that department_id = 2 returns 20% of the data but salary >
100000 returns only 5%, the optimizer might choose the first option for better performance if
index scans are utilized.
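The arithmetic behind these estimates is simple enough to sketch. The code below computes the selectivity from the example above and estimates the size of a conjunction of two predicates. Multiplying selectivities assumes the predicates are statistically independent, which is a common textbook simplification and not something stated in this text.

```python
def selectivity(matching_rows, total_rows):
    """Fraction of the table's rows that satisfy a predicate."""
    return matching_rows / total_rows

total_rows = 10_000
sel_salary_80k = selectivity(500, total_rows)     # the salary > 80,000 example

# Selectivities quoted above for the two competing predicates.
predicates = {"department_id = 2": 0.20, "salary > 100000": 0.05}

# A simple heuristic: evaluate the most selective predicate (smallest fraction) first.
most_selective = min(predicates, key=predicates.get)

# Estimated result size of the conjunction, assuming independent predicates.
combined = predicates["department_id = 2"] * predicates["salary > 100000"]
estimated_rows = round(total_rows * combined)

print(sel_salary_80k)
print(most_selective)
print(estimated_rows)
```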
Semantic query optimization aims to improve the execution of a query by leveraging the
semantics or meaning of the data involved, often taking advantage of additional knowledge
beyond what is expressed in the SQL statements. This can include rules about the data
(functional dependencies, constraints) or application-specific knowledge.
If you know that every customer with an order over 100 must exist (due to business logic or
constraints), it might be possible to restructure this:
This can reduce the size of data being processed if enforced correctly.
Eliminating Unreachable Data: Suppose a table of Employees has a rule that an employee
cannot work in multiple departments:
Knowing the semantic rules could allow the optimizer to identify that if an employee is in
department 1 or 2, they can't be in 3. This means the second condition is always true if
the first condition is true, allowing that part to be dropped or re-evaluated.
4. Utilizing Functional Dependencies: If the functional dependency employee_id ->
employee_name holds in the database, the optimizer can recognize that each employee_id
determines exactly one employee_name. Once the names are known from earlier joins, and
provided the data has not changed, they can be reused without additional lookups.