DBMS_Unit5_Lecture1
Lecture 1
Outline
• Query Processing
• Query Cost Estimation
• Query Operations
Query Processing
• Query processing refers to the range of activities involved in
extracting data from a database
• The activities include
• translation of queries in high-level database languages into expressions that
can be used at the physical level of the file system,
• a variety of query-optimizing transformations,
• and actual evaluation of queries
Query Processing …
• It is a stepwise process: the query is translated into a form that can be
used at the physical level of the file system, optimized, and then executed to
get the result
• It requires the basic concepts of relational algebra and file structure
• The actual updating and retrieval of data is performed through various “low-
level” operations.
• Examples of such operations for a relational DBMS can be relational algebra
operations such as project, join, select, Cartesian product, etc
Basic Steps in Query Processing [1]
1. Parsing and translation
2. Optimization
3. Evaluation
Basic Steps in Query Processing [2]
• Parsing and translation
• Translate the query into its internal form and then into relational algebra
• Parser checks syntax and verifies relations
• Optimization
• Amongst all equivalent evaluation plans choose the one with lowest cost
• Cost is estimated using statistical information from the database catalog, such as
the number of tuples in each relation, size of tuples, etc.
• Evaluation
• The query-execution engine takes a query-evaluation plan, executes that plan, and
returns the answers to the query
Evaluation Plans [1]
• A relational algebra expression may have many equivalent expressions
• Consider a query
select salary
from instructor
where salary < 75000
This query can be translated into either of the following relational-algebra
expressions:
• E.g., $\sigma_{salary<75000}(\Pi_{salary}(instructor))$ is equivalent to
$\Pi_{salary}(\sigma_{salary<75000}(instructor))$
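A minimal sketch of why the two orderings of selection and projection are equivalent for this query. The toy instructor rows and the helper names select and project are assumptions made for illustration, and project keeps duplicates the way SQL does:

```python
# Toy instructor relation; names and salaries are made-up illustration data.
instructor = [
    {"name": "A", "salary": 65000},
    {"name": "B", "salary": 90000},
    {"name": "C", "salary": 72000},
]

def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def project(rows, attributes):
    """Projection: keep only the named attributes (duplicates are kept)."""
    return [{a: row[a] for a in attributes} for row in rows]

def cheap(row):
    return row["salary"] < 75000

# Select after project, and project after select.
plan_a = select(project(instructor, ["salary"]), cheap)
plan_b = project(select(instructor, cheap), ["salary"])
```

Both plans return the same rows here; they differ only in how much data each intermediate step has to handle.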
Evaluation Plans [2]
• Each relational algebra operation can be evaluated using one of several different
algorithms
• For example, to implement the preceding selection, every tuple in instructor
can be searched to find tuples with salary less than 75000
• If a B+ tree index is available on the attribute salary, the index can be used
instead to locate the tuple
• Correspondingly, a relational-algebra expression can be evaluated in many
ways
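The two ways of evaluating the selection can be sketched as follows. This is only an in-memory illustration: a sorted list searched with bisect stands in for a B+ tree index, and the salary values are assumed sample data:

```python
from bisect import bisect_left

# Position in the list plays the role of a record id (assumed toy data).
salaries = [65000, 90000, 72000, 80000]

def linear_scan(limit):
    """Examine every record and keep those with salary < limit."""
    return [rid for rid, s in enumerate(salaries) if s < limit]

# Stand-in for an index on salary: (salary, record id) pairs in sorted order.
salary_index = sorted((s, rid) for rid, s in enumerate(salaries))

def index_scan(limit):
    """Locate the first entry with salary >= limit, return everything before it."""
    cut = bisect_left(salary_index, (limit,))
    return [rid for _, rid in salary_index[:cut]]
```

Both functions find the same records; the index version avoids examining entries at or above the limit.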
Evaluation Plans [3]
• Fully specifying how to evaluate a query requires both
• the relational algebra expression and
• annotations on it with instructions specifying how to evaluate each operation
• Annotations may state the algorithm to be used for a specific
operation or the particular index or indices to use
• A relational-algebra operation annotated with instructions on how to
evaluate it is called an evaluation primitive
Evaluation Plans [4]
• A sequence of primitive operations that can be used
to evaluate a query is a query-execution plan or
query-evaluation plan
• Annotated expression specifying detailed evaluation
strategy
• E.g.:
• Use an index on salary to find instructors with
salary < 75000,
• Or perform complete relation scan and discard
instructors with salary ≥ 75000
Fig. A Query Evaluation plan
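One way to picture an evaluation plan is as a tree of operations, each carrying an annotation. The sketch below is an assumed representation, not how any particular DBMS stores plans; the class PlanNode and the annotation strings are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanNode:
    operation: str                      # relational-algebra operation
    annotation: str                     # how to evaluate it (illustrative text)
    child: Optional["PlanNode"] = None  # input to this operation, if any

def describe(node, depth=0):
    """Return the plan as indented lines, root first."""
    lines = ["  " * depth + f"{node.operation} [{node.annotation}]"]
    if node.child is not None:
        lines += describe(node.child, depth + 1)
    return lines

plan = PlanNode(
    "project salary", "apply to selection output",
    PlanNode(
        "select salary < 75000", "use index on salary",
        PlanNode("instructor", "stored relation"),
    ),
)
```

Calling describe(plan) lists the projection at the root, the annotated selection below it, and the stored relation at the leaf.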
Basic Steps: Optimization
• Query Optimization:
• Amongst all equivalent evaluation plans, choose the one with lowest cost
• Cost is estimated using statistical information from the database catalog
• e.g. number of tuples in each relation, size of tuples, etc
• To Learn
• To measure query costs
• Algorithms for evaluating relational algebra operations
• To combine algorithms for individual operations in order to evaluate a complete expression
• To optimize queries: how to find an evaluation plan with lowest estimated cost
Measures of Query Cost
• Cost is generally measured as total elapsed time for answering query
• Many factors contribute to time cost
• disk accesses, CPU, or even network communication
• Typically disk access is the predominant cost, and is also relatively easy to
estimate.
• Measured by taking into account
• Number of seeks * average-seek-cost
• Number of blocks read * average-block-read-cost
• Number of blocks written * average-block-write-cost
• Cost to write a block is greater than cost to read a block
• data is read back after being written to ensure that the write was successful
Measures of Query Cost (Cont.)
• For simplicity we just use the number of block transfers from disk and the
number of seeks as the cost measures
• tT – time to transfer one block
• tS – time for one seek
• Cost for b block transfers plus S seeks
b * tT + S * tS
• We ignore CPU costs for simplicity
• Real systems do take CPU cost into account
• The cost of writing output to disk is not included in the cost formula
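The formula b * tT + S * tS can be sketched as a small function. The timings (tT = 100 microseconds, tS = 4000 microseconds) and the block counts are assumed example values, chosen as integers to keep the arithmetic exact:

```python
def io_cost(block_transfers, seeks, t_T, t_S):
    """Cost of b block transfers plus S seeks: b * tT + S * tS."""
    return block_transfers * t_T + seeks * t_S

T_T = 100    # assumed transfer time per block, in microseconds
T_S = 4000   # assumed time per seek, in microseconds

# Reading 100 blocks stored contiguously: one seek, then 100 transfers.
sequential = io_cost(100, 1, T_T, T_S)

# Reading 100 blocks scattered over the disk: one seek per block.
scattered = io_cost(100, 100, T_T, T_S)
```

With these assumed timings the scattered read costs far more than the sequential one, which is why the number of seeks is tracked separately from the number of transfers.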
Measures of Query Cost (Cont.)
• tT – time to transfer one block
• tS – time for one seek
• tS and tT depend on where data is stored;
• with 4 KB blocks:
• High-end magnetic disk: tS = 4 msec and tT = 0.1 msec
• SSD: tS = 20-90 microsec and tT = 2-10 microsec for 4 KB
• Costs of algorithms depend on the size of the buffer in main memory, as having
more memory reduces need for disk access
• Thus memory size should be a parameter while estimating cost; often use worst case
estimates
• The cost estimate of algorithm A is referred to as EA
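To see how strongly the storage device affects the estimate, the sketch below applies the same cost measure to the timings quoted above, converted to microseconds. The workload of 1000 block transfers and 10 seeks is a hypothetical example:

```python
# Timings in microseconds for 4 KB blocks, from the figures quoted above.
DEVICES = {
    "magnetic disk": {"t_S": 4000, "t_T": 100},
    "SSD (best case)": {"t_S": 20, "t_T": 2},
    "SSD (worst case)": {"t_S": 90, "t_T": 10},
}

def io_cost_us(b, S, t_T, t_S):
    """b block transfers plus S seeks, in microseconds."""
    return b * t_T + S * t_S

# Hypothetical workload: 1000 block transfers and 10 seeks.
costs = {name: io_cost_us(1000, 10, **t) for name, t in DEVICES.items()}
```

The same plan is estimated at 140000 microseconds on the magnetic disk and between 2200 and 10900 microseconds on the SSD, so a plan that is cheapest on one device need not be cheapest on another.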
Catalog Information for Cost Estimation
• nr : number of tuples in relation r.
• br : number of blocks containing tuples of r.
• sr : size of a tuple of r in bytes.
• fr : blocking factor of r — i.e., the number of tuples of r that fit into one block.
• V(A, r): number of distinct values that appear in r for attribute A; same as the
size of $\Pi_A(r)$.
• SC(A, r): selection cardinality of attribute A of relation r; average number
of records that satisfy equality on A.
• If tuples of r are stored together physically in a file, then: br = ⌈nr / fr⌉
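A short worked example of these statistics. The block size, tuple size, and counts are assumed values; the sketch also assumes that tuples do not span blocks and that SC(A, r) is averaged over the distinct values present, which gives nr / V(A, r):

```python
def blocking_factor(block_size_bytes, s_r):
    """f_r: whole tuples of size s_r that fit into one block."""
    return block_size_bytes // s_r

def blocks_for_relation(n_r, f_r):
    """b_r: blocks needed for n_r tuples, rounded up (integer ceiling)."""
    return (n_r + f_r - 1) // f_r

def selection_cardinality(n_r, distinct_values):
    """SC(A, r) as an average over the distinct values of A (assumption)."""
    return n_r / distinct_values

f_r = blocking_factor(4096, 100)          # assumed 4 KB blocks, 100-byte tuples
b_r = blocks_for_relation(10_000, f_r)    # assumed 10,000 tuples
sc = selection_cardinality(10_000, 50)    # assumed 50 distinct values of A
```

Here f_r is 40 and b_r is 250; one extra tuple would push b_r to 251, which is why the division is rounded up.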
Selection Operation
• File scan
• search algorithms that locate and retrieve records that fulfill a selection condition.
• Algorithm A1 (linear search)
• Scan each file block and test all records to see whether they satisfy the selection condition
• Cost estimate = br block transfers + 1 seek
Cost = br * tT + tS
• If selection is on a key attribute, can stop on finding record
• Average case, cost = (br / 2) block transfers + 1 seek
Cost = (br / 2) * tT + tS
• Linear search can be applied regardless of selection condition or ordering of
records in the file, or availability of indices
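The A1 estimates can be written as a small function. The relation size (250 blocks) and the timings in microseconds are assumed example values:

```python
def linear_search_cost(b_r, t_T, t_S, equality_on_key=False):
    """A1: one seek plus b_r transfers, or b_r / 2 on average for a key lookup."""
    transfers = b_r / 2 if equality_on_key else b_r
    return transfers * t_T + t_S

T_T, T_S = 100, 4000   # assumed timings in microseconds

full_scan = linear_search_cost(250, T_T, T_S)
key_lookup = linear_search_cost(250, T_T, T_S, equality_on_key=True)
```

The key lookup is cheaper only on average; in the worst case the matching record is in the last block and the whole file is still read.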
Selections Using Indices
• Index scan – search algorithms that use an index
• selection condition must be on search-key of index.
• A2 (primary index, equality on key). Retrieve a single record that satisfies the corresponding
equality condition
• Cost = (hi + 1) * (tT + tS)
• Where, hi denotes the height of the index. Index lookup traverses the height of the tree plus
one I/O to fetch the record
• Each of the I/O operations requires a seek and a block transfer
• A3 (primary index, equality on nonkey) Retrieve multiple records.
• Records will be on consecutive blocks
• Let b = number of blocks containing matching records
• Cost = hi * (tT + tS) + tS + tT * b
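A sketch of the A2 and A3 estimates. The index height of 3, the 5 matching blocks, and the timings in microseconds are assumptions picked for illustration:

```python
def a2_cost(h_i, t_T, t_S):
    """A2: primary index, equality on key. (h_i + 1) I/Os, each a seek plus a transfer."""
    return (h_i + 1) * (t_T + t_S)

def a3_cost(h_i, b, t_T, t_S):
    """A3: primary index, equality on nonkey. Index traversal, then one seek
    and b transfers for the consecutive blocks holding the matches."""
    return h_i * (t_T + t_S) + t_S + t_T * b

T_T, T_S = 100, 4000   # assumed timings in microseconds

single_record = a2_cost(3, T_T, T_S)
five_blocks = a3_cost(3, 5, T_T, T_S)
```

Fetching five consecutive blocks costs only slightly more than fetching one record, because the matching blocks need a single seek.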
Selections Using Indices ..
• A4 (secondary index, equality on key or nonkey).
• Retrieve a single record if the search-key is a candidate key
• Cost = (hi + 1) * (tT + tS)
• This case is similar to primary index
• Retrieve multiple records if search-key is not a candidate key
• each of n matching records may be on a different block
• Cost = (hi + n) * (tT + tS)
• Can be very expensive!
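To illustrate why the nonkey case can be very expensive, the sketch below compares the A4 estimate with a plain linear scan. The index height, the 50 matching records, the 250-block relation, and the timings are assumed values:

```python
def a4_nonkey_cost(h_i, n, t_T, t_S):
    """A4, nonkey: each of n matching records may need its own seek and transfer."""
    return (h_i + n) * (t_T + t_S)

def linear_scan_cost(b_r, t_T, t_S):
    """A1: one seek plus b_r block transfers."""
    return b_r * t_T + t_S

T_T, T_S = 100, 4000   # assumed timings in microseconds

via_index = a4_nonkey_cost(3, 50, T_T, T_S)
via_scan = linear_scan_cost(250, T_T, T_S)
```

With these numbers the secondary index is several times more expensive than scanning the whole file, so an optimizer would be expected to prefer the scan.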
Selections Involving Comparisons
• Can implement selections of the form $\sigma_{A \le v}(r)$ or $\sigma_{A \ge v}(r)$ by using
• a linear file scan,
• or by using indices in the following ways:
• A5 (primary index, comparison). (Relation is sorted on A)
• For $\sigma_{A \ge v}(r)$ use index to find first tuple $\ge v$ and scan relation sequentially from there
• For $\sigma_{A \le v}(r)$ just scan relation sequentially till first tuple $> v$; do not use index
• Identical to the case of A3, equality on nonkey
• A6 (secondary index, comparison).
• For $\sigma_{A \ge v}(r)$ use index to find first index entry $\ge v$ and scan index sequentially from
there, to find pointers to records.
• For $\sigma_{A \le v}(r)$ just scan leaf pages of index finding pointers to records, till first entry $> v$
• Identical to the case of A4, equality on nonkey
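The A5 strategy on a relation sorted on A can be sketched in memory. A sorted Python list stands in for the sorted file and bisect stands in for the index lookup; the values are assumed sample data:

```python
from bisect import bisect_left

# Values of attribute A, stored in sorted order (assumed toy data).
sorted_a = [10, 20, 20, 30, 40, 50]

def select_ge(values, v):
    """A >= v: jump to the first value >= v, then read sequentially to the end."""
    start = bisect_left(values, v)
    return values[start:]

def select_le(values, v):
    """A <= v: read from the start and stop at the first value > v (no index needed)."""
    result = []
    for a in values:
        if a > v:
            break
        result.append(a)
    return result
```

For example, select_ge(sorted_a, 20) skips the leading 10, while select_le(sorted_a, 30) stops reading as soon as it reaches 40.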
Implementation of Complex Selections
• Conjunction: $\sigma_{\theta_1 \land \theta_2 \land \dots \land \theta_n}(r)$
• A7 (conjunctive selection using one index).
• Select a combination of $\theta_i$ and algorithms A1 through A6 that results in the least cost for
$\sigma_{\theta_i}(r)$.
• Test other conditions on tuple after fetching it into memory buffer.
• A8 (conjunctive selection using composite index).
• Use appropriate composite (multiple-key) index if available.
• A9 (conjunctive selection by intersection of identifiers).
• Requires indices with record pointers.
• Use corresponding index for each condition, and take intersection of all the obtained sets of
record pointers.
• Then fetch records from file
• If some conditions do not have appropriate indices, apply test in memory.
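A9 can be sketched with Python sets standing in for the sets of record pointers. The records, attribute names, and the build_index helper are assumptions for illustration; salary deliberately has no index, so its condition is tested in memory after the fetch:

```python
# Record id -> record (assumed toy data).
records = {
    1: {"dept": "Physics", "building": "Watson", "salary": 95000},
    2: {"dept": "Music", "building": "Packard", "salary": 40000},
    3: {"dept": "Physics", "building": "Watson", "salary": 70000},
    4: {"dept": "Physics", "building": "Taylor", "salary": 60000},
}

def build_index(attribute):
    """Stand-in for an index: attribute value -> set of record ids."""
    index = {}
    for rid, rec in records.items():
        index.setdefault(rec[attribute], set()).add(rid)
    return index

dept_index = build_index("dept")
building_index = build_index("building")

# dept = 'Physics' AND building = 'Watson' AND salary < 75000
candidate_ids = dept_index.get("Physics", set()) & building_index.get("Watson", set())

# Fetch the candidates, then test the condition that has no index in memory.
result_ids = sorted(rid for rid in candidate_ids if records[rid]["salary"] < 75000)
```

Only the records in the intersection are fetched from the file, so records that match just one of the indexed conditions are never read.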
Algorithms for Complex Selections
• Disjunction: $\sigma_{\theta_1 \lor \theta_2 \lor \dots \lor \theta_n}(r)$.
• A10 (disjunctive selection by union of identifiers).
• Applicable if all conditions have available indices.
• Otherwise use linear scan.
• Use corresponding index for each condition, and take union of all the obtained sets of record
pointers.
• Then fetch records from file
• Negation: $\sigma_{\neg\theta}(r)$
• Use linear scan on file
• If very few records satisfy $\neg\theta$, and an index is applicable to $\theta$
• Find satisfying records using index and fetch from file
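A10 and the negation case can be sketched in the same in-memory style. Sets stand in for record pointers, and the records, attribute names, and build_index helper are assumed for illustration:

```python
# Record id -> record (assumed toy data).
records = {
    1: {"dept": "Physics", "building": "Watson"},
    2: {"dept": "Music", "building": "Packard"},
    3: {"dept": "Physics", "building": "Watson"},
    4: {"dept": "Physics", "building": "Taylor"},
}

def build_index(attribute):
    """Stand-in for an index: attribute value -> set of record ids."""
    index = {}
    for rid, rec in records.items():
        index.setdefault(rec[attribute], set()).add(rid)
    return index

dept_index = build_index("dept")
building_index = build_index("building")

# Disjunction: dept = 'Music' OR building = 'Taylor'.
# Every condition has an index, so take the union of the pointer sets.
union_ids = dept_index.get("Music", set()) | building_index.get("Taylor", set())

# Negation: NOT (dept = 'Physics'), evaluated with a linear scan of the file.
negation_ids = [rid for rid, rec in records.items() if rec["dept"] != "Physics"]
```

The union works only because both conditions have an index; if either lacked one, a scan of the file would be needed anyway, so the indices would save nothing.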
Next