Module - 1
• Query processing comprises the various steps taken to fetch data from the
database.
• It is a methodical procedure that can be applied at the physical level of the file
system, during query optimization, and when the query is actually executed to
obtain the result.
• Query processing includes several activities for data retrieval. Initially, the
user's query is expressed in a high-level database language such as
SQL.
• It gets translated into expressions that can be further used at the physical level
of the file system.
• After this, the actual evaluation of the query takes place, together with a
variety of query-optimizing transformations. Thus, before processing a query,
the system needs to translate it from a human-readable language such as SQL
into an internal representation it can work with.
• The parser creates a tree of the query, known as a 'parse tree', and then
translates it into relational algebra.
• During this translation, every use of a view in the query is replaced by the
view's definition.
• Suppose a user executes a query. As we have seen, there are various
methods of extracting the data from the database.
• In SQL, suppose a user wants to fetch the names of the employees whose salary is
greater than 10000. The following query does this:
• select emp_name from Employee where salary>10000;
• Thus, to make the system understand the user query, it needs to be translated in
the form of relational algebra. We can bring this query in the relational algebra
form as:
• πemp_name (σsalary>10000 (Employee))
• πemp_name (σsalary>10000 (πemp_name, salary (Employee)))
• After translating the given query, we can execute each relational algebra
operation by using different algorithms.
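• As a rough illustration, selection and projection can be sketched over a list of dictionaries. The `select` and `project` helpers and the sample rows are hypothetical stand-ins, not part of any DBMS API, and duplicates are kept as in SQL without DISTINCT. The sketch evaluates two equivalent plans for the SQL query above and arrives at the same result.

```python
# Hypothetical sample rows standing in for the Employee relation.
employee = [
    {"emp_name": "Asha", "salary": 12000},
    {"emp_name": "Ben", "salary": 9000},
    {"emp_name": "Chen", "salary": 15000},
]

def select(rows, predicate):
    """Selection (sigma): keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def project(rows, attributes):
    """Projection (pi): keep only the named attributes of each row."""
    return [{a: row[a] for a in attributes} for row in rows]

# Plan A: select first, then project onto emp_name.
plan_a = project(select(employee, lambda r: r["salary"] > 10000), ["emp_name"])

# Plan B: drop unused attributes first, then select, then project.
narrow = project(employee, ["emp_name", "salary"])
plan_b = project(select(narrow, lambda r: r["salary"] > 10000), ["emp_name"])
```

Both plans return the same rows; an optimizer would choose between such equivalent forms based on estimated cost.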
• Typically, SQL queries are decomposed into query blocks, which form the
basic units that can be translated into the algebraic operators and optimized.
• A query block contains a single SELECT-FROM-WHERE
expression, as well as GROUP BY and HAVING clauses if these are
part of the block.
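• Consider, for example, the following nested query:
SELECT Lname, Fname
FROM EMPLOYEE
WHERE Salary > ( SELECT MAX (Salary)
FROM EMPLOYEE
WHERE Dno=5 );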
• The inner block, which computes the maximum salary in department 5, could be
translated into the following extended relational algebra expression:
ℑMAX Salary (σDno=5 (EMPLOYEE))
• The query optimizer would then choose an execution plan for each query
block.
• Notice that in the above example, the inner block needs to be evaluated only
once to produce the maximum salary of employees in department 5, which is
then used—as the constant c—by the outer block.
• This is called a nested query without correlation with the outer query (an
uncorrelated nested query).
Basic Algorithms for executing query operations
• Executing query operations draws on relational algebra operations, join
algorithms, and sorting algorithms.
• Relational algebra
• A formal framework for defining operations on relational databases
• Used to manipulate and retrieve data based on user queries
• Common operations include selection, projection, join, and aggregation
• Join algorithms
• Used to combine two tables; the optimizer chooses the most efficient one for a given join
• Examples of join algorithms include nested loop joins, hash joins, and merge joins.
• Sorting algorithms
• Used to sort a dataset, for example to implement ORDER BY
• Sort-merge algorithms divide the dataset into smaller parts, sort them individually, and then merge
them back together.
• Query decomposition
• Involves rewriting a calculus query in a normalized form
• Normalization involves manipulating the query quantifiers and qualification by applying logical
operators
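• In a nested-loop join, each tuple of one relation is compared with each tuple
of the other; all pairs of tuples which satisfy the join condition are added to
the result of the join.
• This algorithm is called a nested-loop join because it consists of nested for loops.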
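• The nested for loops can be sketched as follows. This is a minimal in-memory illustration; the function name `nested_loop_join` and the sample relations are hypothetical, and a real DBMS works block by block on disk pages rather than on Python lists.

```python
def nested_loop_join(outer, inner, condition):
    """Return the combined rows for every pair that satisfies the condition."""
    result = []
    for r in outer:          # outer loop: one pass over the outer relation
        for s in inner:      # inner loop: full scan of the inner relation per outer row
            if condition(r, s):
                result.append({**r, **s})
    return result

# Hypothetical sample relations sharing a dno attribute.
employees = [{"name": "Asha", "dno": 5}, {"name": "Ben", "dno": 4}]
departments = [{"dno": 5, "dname": "Research"}, {"dno": 1, "dname": "HQ"}]

joined = nested_loop_join(employees, departments, lambda r, s: r["dno"] == s["dno"])
```

The number of comparisons is the product of the two relation sizes, which is why this method is usually reserved for small inputs or joins without a usable index.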
• Join operation:
A join operation combines data from two or more tables based on a common
column or columns. The join operation is performed using the JOIN keyword in
SQL, and it returns a single result set that contains columns from all the tables
involved in the join.
• For example, let’s say we have two tables, Table1 and Table2, with the following
data:
• Table1
ID | Name
1 | John
2 | Sarah
3 | David
• Table2
ID | Address
1 | 123 Main St.
2 | 456 Elm St.
4 | 789 Oak St.
• If we want to combine the data from these two tables based on the ID column, we can
perform an inner join using the following SQL query:
• SELECT Table1.ID, Table1.Name, Table2.Address
FROM Table1
INNER JOIN Table2
ON Table1.ID = Table2.ID
• Result:
ID | Name | Address
1 | John | 123 Main St.
2 | Sarah | 456 Elm St.
• If we want to retrieve the names of the people who have an address
in Table2, we can use a nested query as follows:
• SELECT Name
FROM Table1
WHERE ID IN (SELECT ID FROM Table2)
• Result:
Name
John
Sarah
• In this case, the nested query is executed first, and it returns the ID
values of the rows in Table2. These ID values are then used to
evaluate the outer query, which retrieves the names of the people
who have those ID values in Table1.
• The choice between using a join operation or a nested query depends on
the specific requirements of the task at hand.
• Joins are often faster and more efficient for large datasets, but nested
queries can be more flexible and allow for more complex conditions to be
evaluated.
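• Both forms can be tried with Python's built-in `sqlite3` module and an in-memory database. This is only a sketch using the Table1 and Table2 data from above; the ORDER BY clauses are an addition to make the row order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER, Name TEXT);
    CREATE TABLE Table2 (ID INTEGER, Address TEXT);
    INSERT INTO Table1 VALUES (1, 'John'), (2, 'Sarah'), (3, 'David');
    INSERT INTO Table2 VALUES (1, '123 Main St.'), (2, '456 Elm St.'), (4, '789 Oak St.');
""")

# Inner join: one row per matching pair of IDs.
join_rows = conn.execute(
    "SELECT Table1.ID, Table1.Name, Table2.Address "
    "FROM Table1 INNER JOIN Table2 ON Table1.ID = Table2.ID "
    "ORDER BY Table1.ID"
).fetchall()

# Nested query: names whose ID appears in Table2.
nested_rows = conn.execute(
    "SELECT Name FROM Table1 WHERE ID IN (SELECT ID FROM Table2) ORDER BY ID"
).fetchall()
conn.close()
```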
Linear search (brute force)
• Linear search is a brute-force approach where elements in the list or array are
sequentially checked from the beginning to the end until the desired
element is found. The algorithm compares each element with the target value
until a match is found or the entire list has been traversed.
• Linear search can be used to search for the smallest or largest value in an
unsorted list rather than searching for a match. It can do so by keeping track of
the largest (or smallest) value and updating as necessary as the algorithm
iterates through the dataset.
• Working of Linear Search Algorithm in Data Structures
• Suppose we need to find the element 6 in a given array or list.
• Start from the first element and compare key = 6 with each element x in turn
until a match is found or the end of the list is reached.
Implementation of Linear Search Algorithm in Different Programming
Languages
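• A minimal Python sketch of the search described above, together with the largest-value variant. The array `values` is a hypothetical example, since the original figure is not available, and the function names are illustrative.

```python
def linear_search(items, key):
    """Return the index of the first element equal to key, or -1 if absent."""
    for index, x in enumerate(items):
        if x == key:
            return index
    return -1

def linear_max(items):
    """Find the largest value by tracking the best candidate seen so far."""
    largest = items[0]  # assumes a non-empty list
    for x in items[1:]:
        if x > largest:
            largest = x
    return largest

values = [4, 9, 2, 6, 7]  # hypothetical array
position = linear_search(values, 6)
```

In the worst case every element is examined once, so the running time grows linearly with the length of the list.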
Query Tree and Query Graph
• Once the alternative access paths for computing a relational algebra
expression are derived, the optimal access path is determined.
• The optimizer typically aims to:
• Minimize the response time of the query (the time taken to produce the results of the user's query).
• Maximize system throughput (the number of requests that are processed in a given amount of time).
• Increase parallelism.
Query Parsing and Translation -
• Initially, the SQL query is scanned. Then it is parsed to look for syntactical
errors and correctness of data types.
• If the query passes this step, the query is decomposed into smaller query
blocks. Each block is then translated to equivalent relational algebra
expression.
Steps for Query Optimization -
• Step 1 − Query Tree Generation
• Step 1: Write the relations involved in the query as the tree's leaf nodes. Here
R and S are the relations.
• Step 2: Add the join condition (here R.P = S.P) with its relational algebra operator
as an internal node (the parent node of these two leaf nodes).
• Step 3: Add the root node, which on execution gives the output of the query.
• Example 2: Suppose we have a query:
• For every project located at ‘Stanford’, list the project number (Pnumber), the
controlling department number (Dnum), and the department manager’s last
name (Lname), address (Address), and birth date (Bdate).
• Step 1: We will begin with executing the first leaf node PROJECT, and the
corresponding internal node σPlocation = ‘Stanford’ as we need these resulting tuples
to execute the next operation.
• Step 2: Similarly, we will execute the leaf node DEPARTMENT and the
intermediate/internal node ⋈ Dnum=Dnumber so that we can move to the next
operation.
• Step 3: We execute the next operation with the leaf node EMPLOYEE and
intermediate node ⋈ Mgr_ssn=Ssn.
• Step 4: Now add the root node i.e., πPnumber, Dnum, Lname, Address,
Bdate to get the output of the query on execution.
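• The tree from Example 2 can be sketched as a small data structure. The `Node` class and `execution_order` function are hypothetical names for illustration only. A post-order traversal visits each operator after its inputs, which matches the execution order of Steps 1 to 4.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                   # relation name or algebra operator
    children: list = field(default_factory=list)

def execution_order(node):
    """Post-order traversal: children (inputs) run before their parent."""
    order = []
    for child in node.children:
        order.extend(execution_order(child))
    order.append(node.label)
    return order

select_project = Node("σ Plocation='Stanford'", [Node("PROJECT")])
join_dept = Node("⋈ Dnum=Dnumber", [select_project, Node("DEPARTMENT")])
join_emp = Node("⋈ Mgr_ssn=Ssn", [join_dept, Node("EMPLOYEE")])
root = Node("π Pnumber, Dnum, Lname, Address, Bdate", [join_emp])

order = execution_order(root)
```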
• Features of Query Tree in Relational Algebra –
2. Modularity: Query trees break complex queries into smaller and more
manageable components, which can facilitate the optimization process and
make it easier to reason about query execution.
• Limitations of Query Tree in Relational Algebra –
1. Complexity: Query trees can be complex, especially for large and complex
queries. Managing and optimizing these trees can require significant
computational resources and sophisticated optimization algorithms.
2. Overhead: Building and traversing the query tree imposes overhead on query
processing. Although this overhead is usually negligible for small queries, it
can be significant for larger queries with many operations.
3. Limited Optimizations: Despite their ability to optimize, query trees may not
always yield significant performance improvements. In some cases, the
overhead associated with optimization may exceed the performance gain
achieved through query optimization.
• The performance of a query plan is determined largely by the order in
which the tables are joined.
• For example, when joining 3 tables A, B, C of size 10 rows, 10,000 rows, and
1,000,000 rows, respectively, a query plan that joins B and C first can take
several orders-of-magnitude more time to execute than one that joins A and C
first.
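• A back-of-the-envelope sketch of why join order matters. It assumes a plain nested-loop join that examines every pair of tuples, and it counts only the first join of each plan; indexes, hash joins, and the cost of the second join are ignored.

```python
sizes = {"A": 10, "B": 10_000, "C": 1_000_000}

def first_join_comparisons(left, right):
    """Tuple pairs a plain nested-loop join examines for its first join."""
    return sizes[left] * sizes[right]

b_c_first = first_join_comparisons("B", "C")   # plan that joins B and C first
a_c_first = first_join_comparisons("A", "C")   # plan that joins A and C first
ratio = b_c_first // a_c_first
```

Under these assumptions the first plan examines 1,000 times as many tuple pairs in its first join as the second plan does.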
• Common query optimization techniques:
1. Use indexes
2. Use the WHERE clause instead of HAVING where possible
3. Avoid queries inside a loop
4. Select only the needed columns instead of using SELECT *
5. Add EXPLAIN to the beginning of queries
6. Keep wildcards at the end of phrases
7. Use EXISTS instead of COUNT() to check for existence
8. Avoid Cartesian products
9. Consider denormalization
10. Optimize JOIN operations
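• Tips 1 and 5 can be tried together in SQLite, which spells the command `EXPLAIN QUERY PLAN`. The sketch below uses Python's built-in `sqlite3` module; the `emp` table, its rows, and the index name `idx_emp_dno` are hypothetical, and the exact plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_name TEXT, dno INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Asha", 5, 12000), ("Ben", 4, 9000), ("Chen", 5, 15000)],
)

query = "SELECT emp_name FROM emp WHERE dno = 5"

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)

plan_before = plan(query)                              # full table scan expected
conn.execute("CREATE INDEX idx_emp_dno ON emp (dno)")
plan_after = plan(query)                               # should now mention the index
```

Comparing `plan_before` with `plan_after` shows whether the optimizer actually uses the new index for this query.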
Functional dependencies in DBMS
• In relational database management, functional dependency is a
concept that specifies the relationship between two sets of
attributes where one attribute determines the value of another
attribute.
Also,
any relation that is in BCNF, is in 3NF;
any relation in 3NF is in 2NF; and
any relation in 2NF is in 1NF.
91.2914 69
Normalization
[Figure: nested normal forms, with 1NF outermost and BCNF innermost; a relation in BCNF is also in 3NF]
We consider a relation in BCNF to be fully normalized.
The benefit of higher normal forms is that update semantics for the affected data are
simplified.
This means that applications required to maintain the database are simpler.
A design that has a lower normal form than another design has more redundancy.
Uncontrolled redundancy can lead to data integrity problems.
A → B (attribute A functionally determines attribute B)
Example: Suppose we keep track of employee email addresses, and we only track one email address for each employee.
Suppose each employee is identified by their unique employee number.
Functional Dependencies
EmpNum | EmpEmail | EmpFname | EmpLname
123 | [email protected] | John | Doe
456 | [email protected] | Peter | Smith
555 | [email protected] | Alan | Lee
633 | [email protected] | Peter | Doe
787 | [email protected] | Alan | Lee
If EmpNum is the PK then the FDs:
EmpNum → EmpEmail
EmpNum → EmpFname
EmpNum → EmpLname
must exist.
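Whether a functional dependency is violated by a given set of rows can be checked mechanically. The sketch below uses a hypothetical helper `fd_holds` on the name columns of the table above; EmpEmail is left out because the addresses are not legible in the source. Note that a sample of rows can only refute a dependency, it cannot prove that the dependency holds in general.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether lhs -> rhs holds in this relation instance."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        if seen.setdefault(key, value) != value:
            return False  # same lhs values, different rhs values
    return True

employees = [
    {"EmpNum": 123, "EmpFname": "John", "EmpLname": "Doe"},
    {"EmpNum": 456, "EmpFname": "Peter", "EmpLname": "Smith"},
    {"EmpNum": 555, "EmpFname": "Alan", "EmpLname": "Lee"},
    {"EmpNum": 633, "EmpFname": "Peter", "EmpLname": "Doe"},
    {"EmpNum": 787, "EmpFname": "Alan", "EmpLname": "Lee"},
]

key_determines_names = fd_holds(employees, ["EmpNum"], ["EmpFname", "EmpLname"])
fname_determines_lname = fd_holds(employees, ["EmpFname"], ["EmpLname"])
names_determine_key = fd_holds(employees, ["EmpFname", "EmpLname"], ["EmpNum"])
```

Here EmpNum determines both names, while EmpFname does not determine EmpLname (two employees are named Peter) and the name pair does not determine EmpNum (two employees are named Alan Lee).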
FDs may be depicted in three different ways. For example, as separate dependencies:
EmpNum → EmpEmail
EmpNum → EmpFname
EmpNum → EmpLname
or as a single diagram in which EmpNum points to EmpEmail, EmpFname, and EmpLname.
Determinant
In the functional dependency EmpNum → EmpEmail, the attribute on the left-hand side (EmpNum) is the determinant.
Transitive dependency
First Normal Form
We say a relation is in 1NF if all values stored in the relation are single-valued and atomic.
The following is not in 1NF:
EmpNum | EmpPhone | EmpDegrees
123 | 233-9876 |
333 | 233-1231 | BA, BSc, PhD
679 | 233-1231 | BSc, MSc
To obtain 1NF relations we must, without loss of information, replace the above with two
relations:
Employee
EmpNum | EmpPhone
123 | 233-9876
333 | 233-1231
679 | 233-1231
EmployeeDegree
EmpNum | EmpDegree
333 | BA
333 | BSc
333 | PhD
679 | BSc
679 | MSc
An outer join between Employee and EmployeeDegree will produce the information we saw before
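The decomposition and the outer join can be sketched in a few lines of Python. The variable names and the helper `left_outer_join` are hypothetical; the data is the example above, with an empty list standing for employee 123, who has no degrees.

```python
# Non-1NF input: EmpDegrees holds a list of values in a single attribute.
unnormalized = [
    {"EmpNum": 123, "EmpPhone": "233-9876", "EmpDegrees": []},
    {"EmpNum": 333, "EmpPhone": "233-1231", "EmpDegrees": ["BA", "BSc", "PhD"]},
    {"EmpNum": 679, "EmpPhone": "233-1231", "EmpDegrees": ["BSc", "MSc"]},
]

# Decompose into two 1NF relations: every value is now single-valued and atomic.
employee = [(r["EmpNum"], r["EmpPhone"]) for r in unnormalized]
employee_degree = [(r["EmpNum"], d) for r in unnormalized for d in r["EmpDegrees"]]

def left_outer_join(employee, employee_degree):
    """Pair each employee with each of its degrees, or None if it has none."""
    result = []
    for emp_num, phone in employee:
        degrees = [d for n, d in employee_degree if n == emp_num]
        for degree in degrees or [None]:
            result.append((emp_num, phone, degree))
    return result

rejoined = left_outer_join(employee, employee_degree)
```

The outer join keeps employee 123 in the result even though there is no matching row in EmployeeDegree, so no information is lost.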
Second Normal Form
A relation is in 2NF if it is in 1NF, and every non-key attribute is fully dependent on each
candidate key. (That is, we don’t have any partial functional dependency.)
• 2NF (and 3NF) both involve the concepts of key and non-key attributes.
• A key attribute is any attribute that is part of a key; any attribute that is not a key
attribute, is a non-key attribute.
Third Normal Form
• A relation is in 3NF if the relation is in 1NF and all determinants of non-key attributes are
candidate keys
That is, for any functional dependency X → Y, where Y is a non-key attribute (or a set of non-key
attributes), X is a candidate key.
• This definition of 3NF differs from BCNF only in the specification of non-key attributes - 3NF is
weaker than BCNF. (BCNF requires all determinants to be candidate keys.)
• A relation in 3NF will not have any transitive dependencies
of non-key attribute on a candidate key through another non-key attribute.
EmpNum EmpName DeptNum DeptName
Second Normal Form:
Each column must depend on the *entire* primary key.
Third Normal Form:
Each column must depend *directly* on the
primary key.
Boyce-Codd Normal Form (BCNF)
A relation is in BCNF, if and only if, every determinant is a
candidate key.
For example, in the ClientInterview relation, two tuples have to be updated if the roomNo needs to be
changed for staffNo SG5 on 13-May-02.
Example of BCNF (2)
To transform the ClientInterview relation to BCNF, we must remove the violating functional dependency by
creating two new relations called Interview and StaffRoom, as shown below:
Interview
ClientNo | interviewDate | interviewTime | staffNo
CR76 | 13-May-02 | 10.30 | SG5
CR76 | 13-May-02 | 12.00 | SG5
CR74 | 13-May-02 | 12.00 | SG37
CR56 | 1-Jul-02 | 10.30 | SG5
StaffRoom
staffNo | interviewDate | roomNo
SG5 | 13-May-02 | G101
SG37 | 13-May-02 | G102
SG5 | 1-Jul-02 | G102
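The effect of the decomposition can be sketched in Python. The function `client_interview` and the room number G103 are hypothetical; the rows are those of Interview and StaffRoom above. After the split, a room change for one staff member on one date touches a single StaffRoom entry, and the original relation is recovered by joining on (staffNo, interviewDate).

```python
interview = [
    ("CR76", "13-May-02", "10.30", "SG5"),
    ("CR76", "13-May-02", "12.00", "SG5"),
    ("CR74", "13-May-02", "12.00", "SG37"),
    ("CR56", "1-Jul-02", "10.30", "SG5"),
]
staff_room = {
    ("SG5", "13-May-02"): "G101",
    ("SG37", "13-May-02"): "G102",
    ("SG5", "1-Jul-02"): "G102",
}

def client_interview(interview, staff_room):
    """Natural join of Interview and StaffRoom on (staffNo, interviewDate)."""
    return [
        (client, date, time, staff, staff_room[(staff, date)])
        for client, date, time, staff in interview
    ]

# One update in StaffRoom replaces the two updates needed in ClientInterview.
staff_room[("SG5", "13-May-02")] = "G103"  # G103 is a hypothetical room number
rejoined = client_interview(interview, staff_room)
rooms = [row[4] for row in rejoined]
```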