Lesson 06

Uploaded by

pramuapex
Copyright
© © All Rights Reserved
ADVANCED DATABASE MANAGEMENT SYSTEMS
ICT3273

Query Processing Part II

Nuwan Laksiri
Department of ICT
Faculty of Technology
University of Ruhuna
Lecture 06
WHAT WE DISCUSS TODAY ……..
• RECAP QUERY PROCESSING PART I
• OVERVIEW OF QUERY PROCESSING PART II
• SORTING
• JOIN OPERATION
• OTHER OPERATIONS
• EVALUATION OF EXPRESSIONS

NEXT WEEK
• QUERY OPTIMIZATION PART I
• INTRODUCTION
• TRANSFORMATION OF RELATIONAL EXPRESSIONS
• EQUIVALENT RULES
• COST BASED OPTIMIZATION
• HEURISTIC OPTIMIZATION
RECAP
• OVERVIEW
• MEASURES OF QUERY COST
• SELECTION OPERATION
• BASIC ALGORITHMS
• SELECTIONS USING INDICES
• SELECTIONS INVOLVING COMPARISONS
• IMPLEMENTATION OF COMPLEX SELECTIONS
Sorting
• What is Sorting in the context of databases?
• SQL queries can specify that the output be
sorted.
• Several of the relational operations, such as
joins, can be implemented efficiently if the input
relations are sorted.
Sorting
• We may build an index on the relation, and
then use the index to read the relation in sorted
order. This may lead to one disk block access for
each tuple.
• For relations that fit in memory, techniques like
quicksort can be used.
• For relations that don’t fit in memory, external
sort-merge is a good choice.
External Sort Merge

• Let M denote the memory size (in blocks).

1. Create sorted runs.
Repeatedly do the following until the end of
the relation:
a. Read M blocks of the relation into memory
b. Sort the in-memory blocks
c. Write the sorted data to disk as a run
2. Merge the runs
External Sorting Using Sort-Merge
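The two steps of external sort-merge can be sketched in Python as below. This is a minimal illustration, not a real disk-based implementation: runs are kept in ordinary lists instead of files, the function name `external_sort_merge` is hypothetical, and merging at most M - 1 runs at a time (one buffer block per input run plus one for output) is the usual textbook convention rather than something stated on the slide.

```python
import heapq

def external_sort_merge(blocks, M):
    """Sort a relation given as a list of blocks (each block is a list of keys).

    M is the number of blocks that fit in memory. Runs are held in Python
    lists here; a real system would write each run to disk.
    """
    # Step 1: create sorted runs of at most M blocks each.
    runs = []
    for i in range(0, len(blocks), M):
        in_memory = [t for block in blocks[i:i + M] for t in block]
        in_memory.sort()            # sort the in-memory blocks
        runs.append(in_memory)      # stands in for "write the run to disk"

    # Step 2: merge the runs, at most M - 1 at a time (assumed fan-in).
    fan_in = max(M - 1, 2)
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

blocks = [[24, 19], [31, 33], [14, 16], [21, 3], [2, 7]]
sorted_keys = external_sort_merge(blocks, M=2)
```

With M = 2 the five blocks produce three runs, which are then merged pairwise in two passes.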
Join Operation
• Most important relational operator
• Potentially very expensive
• Required in all practical queries and
applications
• Often appears in groups of joins
• Many variations with different characteristics,
suited for different situations
Join Operation (Nested Loop Join)
• In its simplest form, a nested loops join
compares each row from one table (known as
the outer table) to each row from the other table
(known as the inner table) looking for rows that
satisfy the join predicate.
Join Operation (Nested Loop Join)
Algorithm
for each tuple tr in r do begin
    for each tuple ts in s do begin
        test pair (tr, ts) to see if they satisfy the
        join condition θ
        if they do, add tr · ts to the result
    end
end
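The pseudocode above translates almost line for line into Python. In this sketch, tuples are Python tuples, `theta` is any function returning True or False for a pair, and the sample `student` and `orders` rows are made up for illustration.

```python
def nested_loop_join(r, s, theta):
    """Compare every tuple of the outer relation r with every tuple of s."""
    result = []
    for tr in r:                      # outer relation
        for ts in s:                  # inner relation
            if theta(tr, ts):         # join condition
                result.append(tr + ts)  # concatenate the two tuples
    return result

# Hypothetical sample data: student(id, name), orders(order_no, student_id)
student = [(1, "Ann"), (2, "Bob")]
orders = [(101, 1), (102, 2), (103, 1)]
joined = nested_loop_join(student, orders, lambda tr, ts: tr[0] == ts[1])
```

Because the condition is an arbitrary function, the same loop works for any join condition, not only equality.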
Join Operation (Nested Loop Join)
• In the worst case, if there is enough memory only to hold one block of each
relation, the estimated cost is
nr ∗ bs + br block transfers, plus
nr + br seeks
(nr = number of tuples in r; br, bs = number of blocks in r and s)
• If the smaller relation fits entirely in memory, use that as the inner relation.
• Reduces cost to br + bs block transfers and 2 seeks
• Example: Student and Orders
• No. of records → Student 5,000; Orders 10,000
• No. of blocks → Student 100; Orders 400
• Assuming worst-case memory availability, the cost estimate is
• With Student as the outer relation:
• 5000 ∗ 400 + 100 = 2,000,100 block transfers,
• 5000 + 100 = 5,100 seeks
• With Orders as the outer relation:
• 10000 ∗ 100 + 400 = 1,000,400 block transfers and 10,400 seeks

• If the smaller relation (Student) fits entirely in memory, the cost estimate will be 500
block transfers.
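The arithmetic in this example can be checked with a small helper. The function name and parameter names are hypothetical; the formulas are the worst-case ones from the slide (n_outer * b_inner + b_outer block transfers and n_outer + b_outer seeks).

```python
def nested_loop_cost(n_outer, b_outer, b_inner):
    """Worst-case nested-loop join cost: (block transfers, seeks)."""
    transfers = n_outer * b_inner + b_outer
    seeks = n_outer + b_outer
    return transfers, seeks

# Student: 5000 tuples in 100 blocks; Orders: 10000 tuples in 400 blocks
student_outer = nested_loop_cost(n_outer=5000, b_outer=100, b_inner=400)
orders_outer = nested_loop_cost(n_outer=10000, b_outer=400, b_inner=100)

# Best case (smaller relation fits in memory): b_r + b_s block transfers
best_case_transfers = 100 + 400
```

Comparing the two tuples shows the trade-off in the example: Orders as the outer relation halves the block transfers but doubles the seeks.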
Join Operation (Block Nested-Loop Join)
• Variant of nested-loop join in which every block of inner relation is
paired with every block of outer relation.

for each block Br of r do begin
    for each block Bs of s do begin
        for each tuple tr in Br do begin
            for each tuple ts in Bs do begin
                check if (tr, ts) satisfy the join condition
                if they do, add tr · ts to the result
            end
        end
    end
end
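A Python sketch of the same four nested loops, assuming each relation is given as a list of blocks and each block is a list of tuples. The function name and the sample data are hypothetical.

```python
def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Pair every block of r with every block of s, then join tuple by tuple."""
    result = []
    for block_r in r_blocks:              # one read of each outer block
        for block_s in s_blocks:          # inner relation scanned once per outer block
            for tr in block_r:
                for ts in block_s:
                    if theta(tr, ts):
                        result.append(tr + ts)
    return result

r_blocks = [[(1,), (2,)], [(3,)]]
s_blocks = [[(2, "x"), (3, "y")], [(1, "z")]]
pairs = block_nested_loop_join(r_blocks, s_blocks, lambda tr, ts: tr[0] == ts[0])
```

The result contains the same tuples as a plain nested-loop join; only the order of comparisons changes, which is what reduces the number of times the inner relation is read.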
Join Operation (Block Nested-Loop Join)
• Worst case estimate: br ∗ bs + br block transfers + 2 ∗ br seeks
• Each block in the inner relation s is read once for each block in the
outer relation
• Best case: br + bs block transfers + 2 seeks.
• Improvements to nested-loop and block nested-loop algorithms:
• In block nested-loop, use M − 2 disk blocks as the blocking unit for the outer
relation, where M = memory size in blocks; use the remaining two blocks
to buffer the inner relation and the output
• Cost = ⌈br / (M − 2)⌉ ∗ bs + br block transfers +
2 ⌈br / (M − 2)⌉ seeks
• If the equi-join attribute forms a key on the inner relation, stop the inner loop on
the first match
• Scan the inner loop forward and backward alternately, to make use of
the blocks remaining in the buffer (with LRU replacement)
• Use an index on the inner relation if available
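The effect of the memory size M on the block nested-loop cost can be explored with a short calculation. This sketch assumes the formula ceil(b_r / (M - 2)) * b_s + b_r block transfers and 2 * ceil(b_r / (M - 2)) seeks; the function name is hypothetical, and the block counts reuse the Student (100 blocks) and Orders (400 blocks) example.

```python
import math

def block_nested_loop_cost(b_r, b_s, M=3):
    """Cost (block transfers, seeks) with M - 2 blocks buffering the outer relation.

    M = 3 leaves one block for the outer relation, which is the worst case.
    """
    chunks = math.ceil(b_r / (M - 2))   # number of passes over the inner relation
    return chunks * b_s + b_r, 2 * chunks

worst = block_nested_loop_cost(100, 400, M=3)
medium = block_nested_loop_cost(100, 400, M=52)
best = block_nested_loop_cost(100, 400, M=102)
```

Once M - 2 is at least b_r, the outer relation is read in a single chunk and the formula reduces to the best case of b_r + b_s transfers and 2 seeks.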
Join Operation (Indexed Nested-Loop Join)
• Index lookups can replace file scans if
• Join is an equi-join or natural join and
• An index is available on the inner relation’s join attribute
• Can construct an index just to compute a join.
• For each tuple tr in the outer relation r, use the index to look
up tuples in s that satisfy the join condition with tuple tr.
• Worst case: buffer has space for only one page of r, and, for
each tuple in r, we perform an index lookup on s.
• Cost of the join: br (tT + tS) + nr ∗ c
• where tT is the time to transfer one block, tS is the time for one seek, and
c is the cost of traversing the index and fetching all matching s
tuples for one tuple of r
• c can be estimated as the cost of a single selection on s using the join
condition.
• If indices are available on join attributes of both r and s,
use the relation with fewer tuples as the outer relation.
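The idea of replacing the inner file scan with an index lookup can be sketched as follows. A Python dictionary stands in for the index here, which is an assumption for illustration only: a real system would typically use a B+-tree or a hash index on disk. The function names and sample rows are hypothetical.

```python
from collections import defaultdict

def build_index(s, key):
    """Build an in-memory index on s: join-attribute value -> matching tuples."""
    index = defaultdict(list)
    for ts in s:
        index[key(ts)].append(ts)
    return index

def indexed_nested_loop_join(r, s_index, r_key):
    """For each outer tuple, look up matching inner tuples through the index."""
    result = []
    for tr in r:
        for ts in s_index.get(r_key(tr), []):   # index lookup, no scan of s
            result.append(tr + ts)
    return result

student = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(101, 1), (102, 2), (103, 1)]
orders_index = build_index(orders, key=lambda ts: ts[1])
matches = indexed_nested_loop_join(student, orders_index, r_key=lambda tr: tr[0])
```

Building the index inside the query, as `build_index` does, corresponds to the slide's remark that an index can be constructed just to compute a join.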
Example of nested-loop join costs
• Compute student ⋈ orders, with student as the outer relation.
• Let orders have a primary B+-tree index on the attribute id, which
contains 20 entries in each index node.
• Since orders has 10,000 tuples, the height of the tree is 4, and one more
access is needed to find the actual data
• student has 5,000 tuples
• Cost of block nested-loop join
• 400 ∗ 100 + 100 = 40,100 block transfers + 2 ∗ 100 = 200 seeks
• Assuming worst-case memory
• May be significantly less with more memory

• Cost of indexed nested-loop join
• 100 + 5000 ∗ 5 = 25,100 block transfers and seeks.
• CPU cost likely to be less than that for block nested-loop join
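The indexed nested-loop figure in this example can be reproduced step by step. The sketch below assumes a simplified height estimate in which every index node is full (20 entries), and counts each index access plus the final data access as one block transfer and one seek; both function names are hypothetical.

```python
def btree_height(n_tuples, fanout):
    """Smallest height h such that fanout ** h >= n_tuples (full nodes assumed)."""
    height, capacity = 1, fanout
    while capacity < n_tuples:
        capacity *= fanout
        height += 1
    return height

def indexed_nested_loop_cost(b_r, n_r, c):
    """b_r block reads of the outer relation plus c accesses per outer tuple."""
    return b_r + n_r * c

height = btree_height(10_000, 20)   # 20**3 = 8,000 < 10,000 <= 20**4
c = height + 1                      # tree traversal plus one access for the data
cost = indexed_nested_loop_cost(b_r=100, n_r=5000, c=c)
```

The per-tuple cost c dominates the total, which is why the choice of the outer relation (the one with fewer tuples) matters so much for this algorithm.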
Merge-Join
1. Sort both relations on their join attribute (if not already
sorted on the join attributes).
2. Merge the sorted relations to join them
   1. Join step is similar to the merge stage of the sort-merge
   algorithm.
   2. Main difference is the handling of duplicate values in the join
   attribute: every pair with the same value on the join attribute
   must be matched

** If interested, please refer to the detailed algorithm in the
reference book
Merge-Join
• Can be used only for equi-joins and natural joins
• Each block needs to be read only once (assuming all tuples for any given
value of the join attributes fit in memory)
• Thus the cost of merge join is:
br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks
(bb = number of buffer blocks allocated to each relation)
• + the cost of sorting if the relations are unsorted.
• Hybrid merge-join: if one relation is sorted, and the other has a secondary
B+-tree index on the join attribute
• Merge the sorted relation with the leaf entries of the B+-tree.
• Sort the result on the addresses of the unsorted relation’s tuples
• Scan the unsorted relation in physical address order and merge with
the previous result, to replace addresses by the actual tuples
• Sequential scan is more efficient than random lookup
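A minimal in-memory sketch of merge-join, focusing on the handling of duplicate join values. It is not the detailed algorithm from the reference book: relations are Python lists, sorting is done with `sorted`, and the function and parameter names are hypothetical.

```python
def merge_join(r, s, r_key, s_key):
    """Equi-join two relations by sorting both and merging them."""
    r = sorted(r, key=r_key)          # step 1: sort on the join attribute
    s = sorted(s, key=s_key)
    result = []
    i = j = 0
    while i < len(r) and j < len(s):  # step 2: merge
        kr, ks = r_key(r[i]), s_key(s[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the whole group of s tuples that share this join value.
            j_end = j
            while j_end < len(s) and s_key(s[j_end]) == kr:
                j_end += 1
            # Match every r tuple with this value against the whole s group.
            while i < len(r) and r_key(r[i]) == kr:
                for ts in s[j:j_end]:
                    result.append(r[i] + ts)
                i += 1
            j = j_end
    return result

r = [(2, "b"), (1, "a"), (4, "d"), (2, "c")]
s = [(4, "w"), (2, "x"), (3, "z"), (2, "y")]
out = merge_join(r, s, r_key=lambda t: t[0], s_key=lambda t: t[0])
```

The inner group scan is what requires all tuples with one join value to fit in memory, matching the assumption stated on the slide.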
Hash-Join
• Applicable for equi-joins and natural joins.
• A hash function h is used to partition tuples of both relations
• h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the
common attributes of r and s used in the natural join.
• r0, r1, ..., rn denote partitions of r tuples
• Each tuple tr ∈ r is put in partition ri, where i = h(tr[JoinAttrs]).
• s0, s1, ..., sn denote partitions of s tuples
• Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
Hash-Join
• r tuples in ri need only be compared with s
tuples in si; they need not be compared with s tuples in
any other partition, since:
• An r tuple and an s tuple that satisfy the join
condition will have the same value for the join
attributes.
• If that value is hashed to some value i, the r
tuple has to be in ri and the s tuple in si.
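The partitioning argument can be illustrated with a short Python sketch. The slides describe only the partitioning step; the per-partition build and probe step shown here is the usual continuation and is included as an assumption. Python's built-in `hash` modulo n + 1 plays the role of h, and all names and sample rows are hypothetical.

```python
def hash_join(r, s, r_key, s_key, n=3):
    """Partition r and s with the same hash function, then join partition by partition."""
    def h(value):
        return hash(value) % (n + 1)      # maps join values to {0, 1, ..., n}

    r_parts = [[] for _ in range(n + 1)]
    s_parts = [[] for _ in range(n + 1)]
    for tr in r:
        r_parts[h(r_key(tr))].append(tr)
    for ts in s:
        s_parts[h(s_key(ts))].append(ts)

    result = []
    for ri, si in zip(r_parts, s_parts):  # only ri and si with the same i are compared
        build = {}
        for ts in si:                     # build an in-memory table on si
            build.setdefault(s_key(ts), []).append(ts)
        for tr in ri:                     # probe it with each tuple of ri
            for ts in build.get(r_key(tr), []):
                result.append(tr + ts)
    return result

student = [(1, "Ann"), (2, "Bob"), (5, "Eve")]
orders = [(101, 1), (102, 2), (103, 1), (104, 7)]
joined = hash_join(student, orders, r_key=lambda t: t[0], s_key=lambda t: t[1])
```

Tuples whose join values differ can still land in the same partition, so the equality check inside each partition (here the dictionary lookup) remains necessary.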
Other Operations
• Duplicate elimination
• Can be implemented via hashing or sorting
• Projection
• Perform projection on each tuple, followed by
duplicate elimination
• Aggregation
• Can be implemented in a manner similar to
duplicate elimination
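Projection followed by hash-based duplicate elimination can be sketched in a few lines. This assumes the whole set of distinct projected tuples fits in memory, so a Python `set` can serve as the hash table; the function name and sample rows are hypothetical.

```python
def project_distinct(relation, columns):
    """Project each tuple onto the given column positions and drop duplicates."""
    seen = set()                  # hash table of projected tuples already output
    output = []
    for t in relation:
        projected = tuple(t[c] for c in columns)
        if projected not in seen:
            seen.add(projected)
            output.append(projected)
    return output

employee = [("Ann", 5, 3000), ("Bob", 5, 3000), ("Cy", 4, 2500)]
dept_salary = project_distinct(employee, columns=[1, 2])
```

The sorting alternative mentioned on the slide would instead sort the projected tuples and drop adjacent equal ones.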
Other Operations (Set Operations)
• r ∪ s (Union)
• r ∩ s (Intersection)
• r − s (Set difference)
Aggregate Operations
• MAX(), MIN()
• Can be computed by a table scan or by using an
appropriate index
• E.g.: SELECT MAX(SALARY) FROM EMPLOYEE;

• COUNT(), AVG(), and SUM()
• A dense index on the attribute can be used, since the
computation can be applied to the values in the index
Aggregate Operations
If GROUP BY clause is included
• The table must first be partitioned into subsets of
tuples
• Each partition (group) has the same value for the
grouping attributes
• E.g.: SELECT DNO, AVG(SALARY)
FROM EMPLOYEE
GROUP BY DNO;
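The grouping step for the query above can be sketched with hash-based partitioning. This is one possible approach (sorting on DNO would be another); it assumes EMPLOYEE rows are given as (DNO, SALARY) pairs, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def avg_salary_by_dno(employee):
    """Partition rows by DNO, keeping a running sum and count per group."""
    totals = defaultdict(lambda: [0, 0])   # DNO -> [sum of salaries, row count]
    for dno, salary in employee:
        acc = totals[dno]
        acc[0] += salary
        acc[1] += 1
    return {dno: total / count for dno, (total, count) in totals.items()}

employee = [(1, 100), (2, 300), (1, 200)]
averages = avg_salary_by_dno(employee)
```

Keeping only a sum and a count per group means the groups never have to be materialized in full, which is why aggregation can be cheaper than plain partitioning.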
Evaluation of Expressions
Operator Tree
Evaluation of Expressions
• So far: we have seen algorithms for individual
operations
• Alternatives for evaluating an entire expression tree
• Materialization: generate results of an expression
whose inputs are relations or are already computed,
materialize (store) it on disk. Repeat.
• Pipelining: pass on tuples to parent operations
even as an operation is being executed
Materialization
• Materialized evaluation: evaluate one operation at a time,
starting at the lowest level. Use intermediate results
materialized into temporary relations to evaluate next-level
operations.
• Ex: In the operator tree for
Π name (σ building="Watson" (department) ⋈ instructor),
compute and store
σ building="Watson" (department)
then compute and store its join with instructor,
and finally compute the projection on name.
Materialization
• Materialized evaluation is always applicable
• Cost of writing results to disk and reading them back can be quite
high
• Our cost formulas for operations ignore cost of writing results to
disk, so
• Overall cost = sum of costs of individual operations +
cost of writing intermediate results to disk
• Double buffering: use two output buffers for each operation, when
one is full write it to disk while the other is getting filled
• Allows overlap of disk writes with computation and reduces
execution time
Pipelining
• Pipelined evaluation : evaluate several operations
simultaneously, passing the results of one operation on to the
next.
• Ex: In the previous expression tree, don’t store the result of
σ building="Watson" (department)
• Instead, pass tuples directly to the join. Similarly, don’t
store the result of the join; pass tuples directly to the projection.
• Much cheaper than materialization: no need to store a
temporary relation to disk.
• Pipelining may not always be possible – e.g., sort, hash-join.
• For pipelining to be effective, use evaluation algorithms that
generate output tuples even as tuples are received for inputs to
the operation.
• Pipelines can be executed in two ways: demand driven and
producer driven
Pipelining
• In demand driven or lazy evaluation
• System repeatedly requests next tuple from top level operation
• Each operation requests next tuple from children operations as
required, in order to output its next tuple
• In between calls, operation has to maintain “state” so it knows what to
return next

• In producer-driven or eager pipelining


• Operators produce tuples eagerly and pass them up to their parents
• Buffer maintained between operators, child puts tuples in buffer, parent
removes tuples from buffer
• If buffer is full, child waits till there is space in the buffer, and then
generates more tuples
• System schedules operations that have space in output buffer and can
process more input tuples

• Alternative names: pull and push models of pipelining
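Demand-driven (pull) pipelining maps naturally onto Python generators: each operator produces its next tuple only when its parent asks for one, and the generator keeps the "state" between calls. The sketch below evaluates the selection, join, and projection from the Watson example without storing intermediate results. The operator names, column positions, and sample rows are assumptions made for illustration.

```python
def scan(relation):
    for t in relation:
        yield t

def select(child, predicate):
    for t in child:                       # pulls one tuple at a time from the child
        if predicate(t):
            yield t

def join(outer, inner_relation, theta):
    for to in outer:                      # outer input is pipelined
        for ti in inner_relation:         # inner relation is rescanned per outer tuple
            if theta(to, ti):
                yield to + ti

def project(child, columns):
    for t in child:
        yield tuple(t[c] for c in columns)

# Hypothetical data: department(dept_name, building), instructor(name, dept_name)
department = [("Physics", "Watson"), ("Music", "Packard")]
instructor = [("Einstein", "Physics"), ("Mozart", "Music")]

plan = project(
    join(select(scan(department), lambda d: d[1] == "Watson"),
         instructor,
         lambda d, i: d[0] == i[1]),
    columns=[2],
)
names = list(plan)   # the top-level operation repeatedly requests the next tuple
```

Nothing is computed when `plan` is built; work happens only as `list` pulls tuples from the top-level operator, which is the lazy behavior the slide describes.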


Evaluation Algorithms for Pipelining
• Some algorithms are not able to output results
even as they get input tuples
• E.g. merge join, or hash join
• Intermediate results written to disk and then read
back

• Blocking operations
• Operations are pipelined
HOME WORK
• Find more details about the concepts discussed in
class by referring to the reference books
SUMMARY

• RECAP B+ TREE
• OVERVIEW
• MEASURES OF QUERY COST
• SELECTION OPERATION
• BASIC ALGORITHMS
• SELECTIONS USING INDICES
• SELECTIONS INVOLVING COMPARISONS
• IMPLEMENTATION OF COMPLEX SELECTIONS
REFERENCES

• Fundamentals of Database Systems (6th edition) by Ramez Elmasri and
Shamkant B. Navathe

• Database Management Systems (3rd edition) by Raghu Ramakrishnan and
Johannes Gehrke, McGraw Hill, 2003.

• Advanced Database Management Systems by Rini Chakrabarti,
Shibhadra Dasgupta
THANK YOU
