Advanced Dbms Unit2
UNIT-2
INTRODUCTION:
• The DBMS describes the data that it manages, including tables and indexes. This descriptive data, or
metadata, is stored in special tables called the system catalogs and is used to find the best way to evaluate a
query.
• SQL queries are translated into an extended form of relational algebra, and query evaluation plans
are represented as trees of relational operators, along with labels that identify the algorithm to use at
each node.
• Relational operators serve as building blocks for evaluating queries and the implementation of these
operators is carefully optimized for good performance.
• Queries are composed of several operators, and the algorithms for individual operators can be
combined in many ways to evaluate a query. The process of finding a good evaluation plan is called
query optimization.
At a minimum, we have system-wide information such as the size of the buffer pool and page size, and the
following information about individual tables, indexes and views.
• For each table:
- Its table name, file name and the file structure (e.g., heap file) of the file in which it is stored.
- The attribute name and type of each of its attributes.
- The index name of each index on the table.
- The integrity constraints (e.g., primary key and foreign key constraints) on the table.
An elegant aspect of a relational DBMS is that the system catalog is itself a collection of
tables. For example, we might store information about the attributes of tables in a catalog table called
Attribute_Cat:
Attribute_Cat(attr_name: string,rel_name: string, type: string, position: integer)
Figure 1 shows the tuples in the Attribute_Cat table that describe the attributes of these two tables.

attr_name   rel_name        type      position
---------   -------------   -------   --------
attr_name   Attribute_Cat   string    1
rel_name    Attribute_Cat   string    2
type        Attribute_Cat   string    3
position    Attribute_Cat   integer   4
sid         Sailors         integer   1
sname       Sailors         string    2
rating      Sailors         integer   3
age         Sailors         real      4
sid         Reserves        integer   1
bid         Reserves        integer   2
day         Reserves        dates     3
rname       Reserves        string    4
These other tuples illustrate an important point: the catalog tables describe all the tables in the
database, including the catalog tables themselves. When information about a table is needed, it is
obtained from the system catalog. Of course, at the implementation level, whenever the DBMS needs
to find the schema of a catalog table, the code that retrieves this information must be handled
specially.
The fact that the system catalog is also a collection of tables is very useful. For example, catalog tables
can be queried just like any other table, using the query language of the DBMS! Further, all the
techniques available for implementing and managing tables apply directly to catalog tables. The
choice of catalog tables and their schemas is not unique and is made by the implementor of the DBMS.
Real systems vary in their catalog schema design, but the catalog is always implemented as a
collection of tables, and it essentially describes all the data stored in the database.
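As a concrete illustration of querying the catalog like any other table, here is a small sketch using SQLite from Python. This is not part of the notes: SQLite's catalog table is called sqlite_master, and per-column metadata analogous to Attribute_Cat is exposed via PRAGMA table_info; the Sailors/Reserves schemas are taken from the examples used throughout this unit.

```python
import sqlite3

# SQLite keeps its catalog in an ordinary table, sqlite_master,
# so it can be queried with plain SQL like any user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
conn.execute("CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT, rname TEXT)")

rows = conn.execute(
    "SELECT name, type FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
print(rows)  # → [('Reserves', 'table'), ('Sailors', 'table')]

# Per-attribute metadata, analogous to Attribute_Cat(attr_name, rel_name,
# type, position): each row is (position, name, type, notnull, default, pk).
cols = conn.execute("PRAGMA table_info(Sailors)").fetchall()
print([(c[1], c[2], c[0]) for c in cols])
# → [('sid', 'INTEGER', 0), ('sname', 'TEXT', 1), ('rating', 'INTEGER', 2), ('age', 'REAL', 3)]
```

Note that SQLite numbers column positions from 0, whereas the Attribute_Cat example above numbers them from 1; the idea is the same.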
Access Paths:
An access path is a method of retrieving tuples from a table and consists of either (1) a file scan or (2) an index
plus a matching selection condition (in the query). Every relational operator accepts one or more tables as input, and
the access methods used to retrieve tuples contribute significantly to the cost of the operator.
Consider a simple selection that is a conjunction of conditions of the form attr op value, where
op is one of the comparison operators <, =, or >. Such selections are said to be in
conjunctive normal form (CNF), and each condition is called a conjunct. Intuitively, an index
matches a selection condition if the index can be used to retrieve just the tuples that satisfy the
condition.
A tree index matches a CNF selection if there is a term of the form attribute op value
for each attribute in a prefix of the index's search key. (<a> and <a,b> are prefixes of key
<a,b,c>, but <a,c> and <b,c> are not.) E.g., a tree index on <a,b,c> matches the selection a=5
AND b=3, and a=5 AND b>6, but not b=3.
The selectivity of an access path is the number of pages retrieved (index pages plus data pages) if we
use this access path to retrieve all desired tuples. If a table contains an index that matches a given
selection, there are at least two access paths: the index and a scan of the data file. Sometimes, of
course, we can scan the index itself, giving us a third access path.
• Find the most selective access path, retrieve tuples using it, and apply any remaining terms
that don’t match the index:
• Most selective access path: An index or file scan that we estimate will require the fewest
page I/Os.
• Terms that match this index reduce the number of tuples retrieved; other terms are used to
discard some retrieved tuples, but do not affect number of tuples/pages fetched.
Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; then, bid=5
and sid=3 must be checked for each retrieved tuple. Similarly, a hash index on <bid, sid> could be
used; day<8/9/94 must then be checked.
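The access-path logic above can be sketched in a few lines of Python. This is an illustration only: a sorted list plus bisect stands in for a B+ tree on day, the tuple values are invented, and the date 8/9/94 is read as 8 September 1994.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical Reserves tuples (sid, bid, day), kept sorted on day so that a
# sorted list + bisect can stand in for a B+ tree index on day.
reserves = sorted(
    [(3, 5, date(1994, 9, 1)), (3, 5, date(1994, 9, 9)),
     (1, 5, date(1994, 9, 5)), (3, 2, date(1994, 9, 7))],
    key=lambda t: t[2],
)
days = [t[2] for t in reserves]

# Primary conjunct day < 8/9/94: the "index" fetches only qualifying tuples...
candidates = reserves[:bisect_left(days, date(1994, 9, 8))]
# ...and the remaining conjuncts (bid=5 AND sid=3) are checked per tuple;
# they discard tuples but do not affect how many tuples were fetched.
result = [t for t in candidates if t[1] == 5 and t[0] == 3]
print(result)  # → [(3, 5, datetime.date(1994, 9, 1))]
```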
Algorithms for relational operations:
1. Selection: The selection operation is a simple retrieval of tuples from a table, and its implementation is
essentially covered in our discussion of access paths.
Given a selection of the form σ R.attr op value (R), if there is no index on R.attr, we have to scan R.
If one or more indexes on R match the selection, we can use the index to retrieve matching tuples and apply any
remaining selection conditions to further restrict the result set.
Using an Index for Selections
• Cost depends on #qualifying tuples, and clustering.
• Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large
w/o clustering).
Consider:
SELECT * FROM Reserves R WHERE R.rname < 'C%';
Assuming a uniform distribution of names, about 10% of the tuples qualify (100 pages, 10,000 tuples).
With a clustered index, the cost is little more than 100 I/Os; if unclustered, up to 10,000 I/Os!
As a rule of thumb, it is probably cheaper to simply scan the entire table (instead of using an unclustered index)
if over 5% of the tuples are to be retrieved.
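A rough back-of-envelope sketch of the comparison above, assuming Reserves has 1000 pages and 100,000 tuples (the figures used in the join examples later in this unit), so that 10% is 100 pages and 10,000 tuples:

```python
pages, tuples = 1000, 100_000      # Reserves: 1000 pages, 100 tuples/page
qualifying = 10                    # percent of tuples that qualify

full_scan   = pages                          # read every page: 1000 I/Os
clustered   = pages * qualifying // 100      # matches co-located: ~100 I/Os
unclustered = tuples * qualifying // 100     # worst case one page I/O per
                                             # qualifying tuple: 10,000 I/Os
print(full_scan, clustered, unclustered)  # → 1000 100 10000
```

With 10% qualifying, the unclustered index is already an order of magnitude worse than a plain scan, which is where the 5% rule of thumb comes from.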
SSIT, Tumkur Page 4
DEPT. OF CSE
2. Projection: The projection operation requires us to drop certain fields of the input, which is easy to do. The
expensive part is removing duplicates.
SQL systems don’t remove duplicates unless the keyword DISTINCT is specified in a query.
SELECT DISTINCT R.sid, R.bid FROM Reserves R;
• Sorting Approach: Sort on <sid, bid> and remove duplicates. (Can optimize this by dropping
unwanted information while sorting.)
• Hashing Approach: Hash on <sid, bid> to create partitions. Load partitions into memory one at a
time, build in-memory hash structure, and eliminate duplicates.
• If there is an index with both R.sid and R.bid in the search key, may be cheaper to sort data entries!
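The hashing approach can be sketched as follows. This is a toy in-memory illustration with invented data; a real system would write each partition to disk and process partitions one at a time.

```python
# Duplicate elimination by hashing: partition on h(sid, bid), then dedupe
# each partition independently with an in-memory hash structure. Equal
# tuples hash equal, so all duplicates land in the same partition.
reserves = [(1, 100), (2, 100), (1, 100), (2, 101), (1, 100)]

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]
for sid, bid in reserves:                    # phase 1: one partitioning pass
    partitions[hash((sid, bid)) % NUM_PARTITIONS].append((sid, bid))

result = []
for part in partitions:                      # phase 2: per-partition dedupe
    seen = set()                             # in-memory hash structure
    for t in part:
        if t not in seen:
            seen.add(t)
            result.append(t)
print(sorted(result))  # → [(1, 100), (2, 100), (2, 101)]
```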
3. Join: Joins are expensive operations and very common, so systems typically support several algorithms to
carry out joins.
Consider the join of Reserves and Sailors, with the join condition Reserves.sid=Sailors.sid. Suppose one of
the tables, say Sailors, has an index on the sid column. We can scan Reserves and for each tuple, use the index
to probe Sailors for matching tuples. This approach is called index nested loops join.
Ex: Consider the cost of scanning Reserves and using the index to retrieve the matching Sailors tuple for each
Reserves tuple. The cost of scanning Reserves is 1000 I/Os. There are 100*1000 = 100,000 tuples in Reserves
(100 tuples per page). For each of these tuples, retrieving the index page containing the rid of the matching
Sailors tuple costs 1.2 I/Os (on average); in addition, we have to retrieve the Sailors page containing the
qualifying tuple. Therefore we have 100,000*(1+1.2) I/Os to retrieve matching Sailors tuples. The total cost
is 221,000 I/Os.
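The arithmetic above can be checked in a few lines (the 1.2 I/Os per index probe is the average stated in the example; integer arithmetic avoids floating-point noise):

```python
m_pages, tuples_per_page = 1000, 100      # Reserves: 1000 pages, 100 tuples/page
outer_tuples = m_pages * tuples_per_page  # 100,000 Reserves tuples

scan_cost  = m_pages                      # scan Reserves once
probe_cost = outer_tuples * 12 // 10      # 1.2 index I/Os per outer tuple
fetch_cost = outer_tuples                 # 1 I/O for the matching Sailors page

total = scan_cost + probe_cost + fetch_cost
print(total)  # → 221000
```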
• If we do not have an index that matches the join condition on either table, we cannot use index nested
loops. In this case, we can sort both tables on the join column, and then scan them to find matches.
This is called sort-merge join.
Ex: We can sort Reserves and Sailors in two passes each. Reading and writing Reserves in each pass, the sorting cost
is 2*2*1000 = 4000 I/Os. Similarly, we can sort Sailors at a cost of 2*2*500 = 2000 I/Os. In addition, the second
phase of the sort-merge join algorithm requires an additional scan of both tables. Thus the total cost is
4000+2000+1000+500 = 7500 I/Os.
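A minimal sketch of the merge phase of sort-merge join, assuming both inputs are already sorted on the join column sid and that sid is a key of Sailors (so each sid appears at most once there); the table contents are invented for illustration:

```python
reserves = sorted([(1, 101), (2, 102), (2, 103)])        # (sid, bid)
sailors  = sorted([(1, "Ann"), (2, "Bob"), (3, "Cal")])  # (sid, sname)

result, i, j = [], 0, 0
while i < len(reserves) and j < len(sailors):
    if reserves[i][0] < sailors[j][0]:
        i += 1                    # advance the side with the smaller sid
    elif reserves[i][0] > sailors[j][0]:
        j += 1
    else:
        # sids match: join every Reserves tuple with this sid to the
        # (unique, since sid is Sailors' key) matching Sailors tuple
        sid = reserves[i][0]
        while i < len(reserves) and reserves[i][0] == sid:
            result.append((sid, reserves[i][1], sailors[j][1]))
            i += 1
        j += 1
print(result)  # → [(1, 101, 'Ann'), (2, 102, 'Bob'), (2, 103, 'Bob')]
```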
Introduction to query optimization:
• Query optimization is one of the most important tasks of a relational DBMS. A more detailed view of
the query optimization and execution layer in the DBMS architecture is shown in Fig(2). Queries
are parsed and then presented to the query optimizer, which is responsible for identifying an efficient
execution plan. The optimizer generates alternative plans and chooses the plan with the least estimated
cost.
• The space of plans considered by a typical relational query optimizer can be understood by recognizing
that a query is essentially treated as a σ-π-× algebra expression, with the remaining operations carried
out on the result of the σ-π-× expression.
[Fig(2): Query → Query Parser → Parsed query → Query Optimizer → Evaluation plan]
[Fig(3): relational algebra tree — π sname over σ bid=100 ∧ rating>5 over (Reserves ⋈ sid=sid Sailors)]
Fig(3). Query Expressed as a Relational Algebra Tree
The algebra expression partially specifies how to evaluate the query – we first compute the natural join of
Reserves and Sailors, then perform the selections and finally project the sname field.
To obtain a fully specified evaluation plan, we must decide on an implementation for each of the algebra
operations involved. For example, we can use a page-oriented simple nested loops join with Reserves as the outer
table and apply selections and projections to each tuple in the result of the join as it is produced; the result of
the join before the selections and projections is never stored in its entirety. This query evaluation plan is
shown in Fig.(4)
[Fig(4): π sname and the selections applied on the fly to the output of the nested loops join of Reserves and Sailors]
[Fig(5): a plan joining tables A, B and C — result tuples of the first join (A ⋈ B) are pipelined into the join with C]
In Fig(5) both joins can be evaluated in pipelined fashion using some version of a nested loops join.
Conceptually, the evaluation is initiated from the root, and the node joining A and B produces tuples as
and when they are requested by its parent node. When the root node gets a page of tuples from its left child, all
the matching inner tuples are retrieved and joined with the matching outer tuples; the current page of outer tuples
is then discarded. A request is then made to the left child for the next page of tuples, and the process is
repeated. Pipelined evaluation is thus a control strategy governing the rate at which different joins in the plan
proceed. It has the great virtue of not writing the result of intermediate joins to a
temporary file, because the results are produced, consumed and discarded one page at a time.
A query evaluation plan is a tree of relational operators and is executed by calling the operators in
some (possibly interleaved) order. Each operator has one or more inputs and an output, which are
also nodes in the plan, and tuples must be passed between operators according to the plan's tree
structure.
To simplify the code responsible for coordinating the execution of a plan, the relational operators that
form the nodes of a plan tree (which is to be evaluated using pipelining) typically support a uniform
iterator interface, hiding the internal implementation details of each operator. The iterator
interface for an operator includes the functions open, get_next, and close. The open function
initializes the state of the iterator by allocating buffers for its inputs and output, and is also used to
pass in arguments such as selection conditions that modify the behavior of the operator. The code
for the get_next function calls the get_next function on each input node and calls operator-specific
code to process the input tuples. The output tuples generated by the processing are placed in the
output buffer of the operator, and the state of the iterator is updated to keep track of how much input
has been consumed. When all output tuples have been produced through repeated calls to get_next, the
close function is called (by the code that initiated execution of this operator) to deallocate state
information.
The iterator interface supports pipelining of results naturally: the decision to pipeline or materialize
input tuples is encapsulated in the operator-specific code that processes input tuples. If the
algorithm implemented for the operator allows input tuples to be processed completely when
they are received, input tuples are not materialized and the evaluation is pipelined. If the algorithm
examines the same input tuples several times, they are materialized. This decision, like other details
of the operator's implementation, is hidden by the iterator interface for the operator.
The iterator interface is also used to encapsulate access methods such as B+ trees and hash-based
indexes. Externally, access methods can be viewed simply as operators that produce a stream of
output tuples. In this case, the open function can be used to pass the selection conditions that match
the access path.
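A toy sketch of this iterator interface in Python: the method names open, get_next and close follow the text, while the operators, tuple format and predicate are invented for illustration.

```python
class Scan:
    """Leaf operator: produces tuples from an in-memory 'table'."""
    def __init__(self, table):
        self.table = table
    def open(self):
        self.pos = 0                       # initialize iterator state
    def get_next(self):
        if self.pos >= len(self.table):
            return None                    # end of stream
        self.pos += 1
        return self.table[self.pos - 1]
    def close(self):
        self.pos = None                    # deallocate state

class Select:
    """Pipelined selection: open() passes in the condition as an argument."""
    def __init__(self, child):
        self.child = child
    def open(self, predicate):
        self.predicate = predicate
        self.child.open()
    def get_next(self):
        # Calls get_next on the input node and applies operator-specific
        # code; tuples are consumed as produced, so nothing is materialized.
        while (t := self.child.get_next()) is not None:
            if self.predicate(t):
                return t
        return None
    def close(self):
        self.child.close()

plan = Select(Scan([(1, 9), (2, 3), (3, 7)]))
plan.open(lambda t: t[1] > 5)
out = []
while (t := plan.get_next()) is not None:
    out.append(t)
plan.close()
print(out)  # → [(1, 9), (3, 7)]
```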
Pushing Selections:
A join is a relatively expensive operation, and a good heuristic is to reduce the sizes of the tables to be
joined as much as possible. One approach is to apply selections early; if a selection operator appears after
a join operator, it is worth examining whether the selection can be 'pushed' ahead of the join. As an
example, the selection bid=100 involves only the attributes of Reserves and can be applied to Reserves
before the join. Similarly, the selection rating>5 involves only attributes of Sailors and can be applied to
Sailors before the join. Let us suppose that the selections are performed using simple file scans, that the
result of each selection is written to a temporary table on disk, and that the temporary tables are then
joined using a sort-merge join. The resulting query evaluation plan is shown in Figure(5).
[Figure(5): π sname applied on the fly to a sort-merge join of T1 and T2, where T1 = σ bid=100 (Reserves) and T2 = σ rating>5 (Sailors), each computed by a scan and written to a temporary table]
EXTERNAL SORTING
Why Sort?
Sorting a collection of records on some search key is a very useful operation. The key can be a single
attribute or an ordered list of attributes. Sorting is required in a variety of situations, including the following
important ones:
• Users may want answers in some order; for example, by increasing age.
• A widely used algorithm for performing a very important relational algebra operation, called join,
requires a sorting step.
Although main memory sizes are growing rapidly, the ubiquity of database systems has led to increasingly
larger datasets as well. When the data to be sorted is too large to fit into available main memory, we need an
external sorting algorithm. Such algorithms seek to minimize the cost of disk accesses.
We begin by presenting a simple algorithm to illustrate the idea behind external sorting. This algorithm
utilizes only three pages of main memory, and it is presented only for pedagogical purposes. When sorting a
file, several sorted subfiles are typically generated in intermediate steps. Here we refer to each subfile as a
run.
Even if the entire file does not fit into the available main memory, we can sort it by breaking it into smaller
subfiles, sorting these subfiles and then merging them using a minimal amount of main memory at any given
time. In the first pass, the pages in the file are read in one at a time. After a page is read in, the records on it
are sorted and the sorted page is written out. Quicksort or any other in-memory sorting technique can be used
to sort the records on a page. In subsequent passes, pairs of runs from the output of the previous pass are read
in and merged to produce runs that are twice as long. This algorithm is shown in Fig(1).
[Fig(1): pseudocode for two-way external merge sort — Pass 0 sorts each page; in each pass i=1,2,…, pairs of runs from the previous pass are read in and merged]
If the number of pages in the input file is 2^k, for some k, then:
In each pass, we read every page in the file, process it and write it out. Therefore we have two disk I/Os per
page, per pass. The number of passes is ⌈log2 N⌉ + 1, where N is the number of pages in the file. The overall
cost is 2N(⌈log2 N⌉ + 1) I/Os.
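The two-way algorithm can be sketched in Python, with lists standing in for disk pages (page-at-a-time buffering and actual I/O are glossed over; the seven input pages are the ones used in the worked example that follows):

```python
from heapq import merge

def two_way_extsort(pages):
    """Toy two-way external merge sort; returns the sorted run and pass count."""
    runs = [sorted(p) for p in pages]            # Pass 0: sort each page
    passes = 1
    while len(runs) > 1:                         # Pass i: merge pairs of runs
        runs = [list(merge(runs[k], runs[k + 1] if k + 1 < len(runs) else []))
                for k in range(0, len(runs), 2)]
        passes += 1
    return runs[0], passes

run, passes = two_way_extsort([[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]])
print(passes)  # → 4
print(run)     # → [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]
```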
[Fig(2): two-way merge sort of a seven-page file — Pass 0 produces seven sorted one-page runs; Pass 1 produces 2-page runs, Pass 2 produces 4-page runs, and Pass 3 produces the single sorted 8-page run 1,2, 2,3, 3,4, 4,5, 6,6, 7,8, 9]
The sort takes four passes, and in each pass, we read and write seven pages, for a total of 56 I/Os. This
result agrees with the preceding analysis because 2·7·(⌈log2 7⌉+1) = 56. The dark pages in the figure illustrate
what would happen on a file of eight pages; the number of passes remains at four (⌈log2 8⌉+1 = 4), but we read
and write an additional page in each pass for a total of 64 I/Os.
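Checking the cost formula against both file sizes:

```python
from math import ceil, log2

def two_way_cost(n_pages):
    # Two I/Os per page per pass; ceil(log2 N) + 1 passes in total.
    return 2 * n_pages * (ceil(log2(n_pages)) + 1)

print(two_way_cost(7), two_way_cost(8))  # → 56 64
```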
The algorithm requires just three buffer pages in main memory, as fig(3) illustrates. This
observation raises an important point: even if we have more buffer space available, this simple algorithm
does not utilize it effectively.
[Fig(3): two-way merge sort with three buffer pages — two input buffers and one output buffer between disk and main memory]
1. In pass 0, read in B pages at a time and sort internally to produce ⌈N/B⌉ runs of B pages each (except
for the last run, which may contain fewer pages). This modification is illustrated in fig(4), using the
input from fig(2) and a buffer pool with four pages.
2. In passes i=1,2,… use B-1 buffer pages for input and use the remaining page for output; hence, you do
a (B-1)-way merge in each pass. The utilization of buffer pages in the merging passes is illustrated in
fig(5).
[Fig(4): Pass 0 of external merge sort on the seven-page input file with a buffer pool of B=4 pages — each group of four input pages is read in, sorted in memory, and written out as one output run]
[Fig(5): a merging pass — B-1 input buffers (Input 1 … Input B-1) and one output buffer, with pages streamed between disk and the buffer pool]
The first refinement reduces the number of runs produced by Pass 0 to N1 = ⌈N/B⌉, versus N for the two-way merge. The
second refinement is even more important. By doing a (B-1)-way merge, the number of passes is reduced dramatically:
including the initial pass, it becomes ⌈log B-1 N1⌉ + 1 versus ⌈log2 N⌉ + 1 for the two-way merge algorithm presented earlier.
Because B is typically quite large, the savings can be substantial. The external merge sort algorithm is shown in Fig(6).
[Fig(6): pseudocode for external merge sort — Pass 0 reads and sorts B pages at a time; each pass i=1,2,… performs a (B-1)-way merge of runs from the previous pass]
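A toy version of the full algorithm, with lists standing in for pages (real disk I/O and page-at-a-time buffering are elided; assumes B >= 3):

```python
from heapq import merge

def external_merge_sort(pages, B):
    """Toy external merge sort with B buffer pages (assumes B >= 3)."""
    # Pass 0: read B pages at a time, sort internally -> ceil(N/B) runs.
    runs = [sorted(sum(pages[k:k + B], [])) for k in range(0, len(pages), B)]
    passes = 1
    while len(runs) > 1:
        # Passes 1, 2, ...: (B-1)-way merge (B-1 input buffers, 1 output).
        runs = [list(merge(*runs[k:k + B - 1]))
                for k in range(0, len(runs), B - 1)]
        passes += 1
    return runs[0], passes

# The seven-page example file with B=4: Pass 0 makes ceil(7/4)=2 runs and
# Pass 1 merges them, so only two passes are needed (versus four for two-way).
run, passes = external_merge_sort(
    [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]], B=4)
print(passes)  # → 2
```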
As an example, suppose that we have five buffer pages available and want to sort a file with 108 pages.
Pass 0 produces ⌈108/5⌉ = 22 sorted runs of five pages each, except for the last run, which is only three pages
long.
Pass 1 does a four-way merge to produce ⌈22/4⌉ = six sorted runs of 20 pages each, except for the last run, which is
only eight pages long.
Pass 2 produces ⌈6/4⌉ = two sorted runs; one with 80 pages and one with 28 pages.
Pass 3 merges the two runs produced in Pass 2 to produce the sorted file.
In each pass we read and write 108 pages; thus the total cost is 2*108*4 = 864 I/Os. Applying our formula, we
have N1 = ⌈108/5⌉ = 22 and cost = 2*N*(⌈log4 22⌉+1) = 2*108*(3+1) = 864, as expected.
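The pass count can also be computed directly; using ceiling division per merging pass sidesteps floating-point log altogether:

```python
from math import ceil

def extsort_passes(N, B):
    runs = ceil(N / B)            # N1: runs produced by Pass 0
    passes = 1
    while runs > 1:               # each subsequent pass is a (B-1)-way merge
        runs = ceil(runs / (B - 1))
        passes += 1
    return passes

p = extsort_passes(108, 5)
print(p, 2 * 108 * p)  # → 4 864
```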
To emphasize the potential gains in using all available buffers, in fig(7), we show the number of passes,
computed using our formula, for several values of N and B. To obtain the cost, the number of passes should
be multiplied by 2N. In practice, one would expect to have more than 257 buffers, but this table illustrates the
importance of a high fan-in during merging.
Fig(7): Number of passes of external merge sort

N            B=3   B=5   B=9   B=17   B=129   B=257
100            7     4     3     2      1       1
1000          10     5     4     3      2       2
10000         13     7     5     4      2       2
100000        17     9     6     5      3       3
1000000       20    10     7     5      3       3
10000000      23    12     8     6      4       3
100000000     26    14     9     7      4       4
1000000000    30    15    10     8      5       4