Spring 2017

EXTERNAL SORTING
(CH. 13 IN THE COW BOOK)

2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, 2013 1

Motivation for External Sort
• Often have a large relation (size greater than the available
main memory) that we need to sort.
• Why are we sorting:
– Query processing: e.g. there are sort-based join and
aggregate algorithms
– Bulkload B+-tree: recall you had to sort the data
entries in the leaf level for this.
– One can specify ORDER BY in SQL, which sorts the
output of the query
–…
Problem Statement
• Given M memory pages and a relation R of size N pages,
where N > M, sort R on a sort key to produce an output
relation R’ that is sorted on the sort key.
• Example: Sort the following table on zipcode
CREATE TABLE Tweets (
uniqueMsgID INTEGER, -- unique message id
tstamp TIMESTAMP, -- when was the tweet posted
uid INTEGER, -- unique id of the user
msg VARCHAR (140), -- the actual message
zip INTEGER, -- zipcode when posted
retweet BOOLEAN -- retweeted?
);

• Another example:
SELECT * FROM Tweets
WHERE tstamp = TODAY
ORDER BY zip
Note: the sort key can be composite.
Goal of a good sort algorithm
• Sort efficiently!
• Sort well!
– Able to sort large relations with “small” amounts of
main memory
Question: Where does the memory come from?
• What does sort efficiently mean:
– Minimize the number of disk I/Os
– Try using sequential I/Os rather than random I/Os
– Minimize the CPU costs
– Overlap I/O operations with CPU operations
Quick note: Sorting is very important in MapReduce. The reducer
expects data to arrive in sorted order from the mappers.
2-Way Sort: Requires 3 Buffers
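• Pass 0: Read a page, sort it, write it (a run).
– only one buffer page is used
• Pass 1, 2, …, etc.: merge pairs of runs into runs twice as long.
– three buffer pages used.
Question: Which algorithms can be used for sorting in memory?
[Figure: runs are read from disk into two input buffers (INPUT 1, INPUT 2) in main memory, merged into one OUTPUT buffer, and written back to disk.]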


Two-Way External Merge Sort
• Read & write entire file in each pass
• N pages, # passes = ⌈log₂ N⌉ + 1
• So total cost is: 2N (⌈log₂ N⌉ + 1)
• Divide and conquer
Question: How can we utilize more than three buffer pages?

Example (7-page input file; | separates runs):
Input file:           3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs): 3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs): 2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs): 2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs): 1,2 2,3 3,4 4,5 6,6 7,8 9
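The two-way scheme can be sketched in a few lines of Python. This is only an in-memory simulation under simplifying assumptions: a "page" is a Python list, "disk" is a list of runs, and the names merge_two and two_way_external_sort are made up for this illustration. It replays the seven-page example from the figure.

```python
def merge_two(left, right):
    """Merge two sorted runs, comparing one record from each input at a time."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])   # at most one of these two tails is non-empty
    out.extend(right[j:])
    return out


def two_way_external_sort(pages):
    """Return (sorted_records, number_of_passes) for a non-empty list of pages."""
    # Pass 0: sort each page on its own, producing 1-page runs.
    runs = [sorted(page) for page in pages]
    passes = 1
    # Passes 1, 2, ...: merge runs pairwise until a single run is left.
    while len(runs) > 1:
        runs = [
            merge_two(runs[k], runs[k + 1]) if k + 1 < len(runs) else runs[k]
            for k in range(0, len(runs), 2)
        ]
        passes += 1
    return runs[0], passes


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
records, passes = two_way_external_sort(pages)
io_cost = 2 * len(pages) * passes  # every pass reads and writes every page
print(records, passes, io_cost)
```

For the seven pages above this takes 4 passes (pass 0 through pass 3), which agrees with ⌈log₂ 7⌉ + 1 = 4, and the cost comes to 2 * 7 * 4 = 56 page I/Os.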
General External Merge Sort
• Sort a file with N pages using B buffer pages:
– Pass 0: use B buffer pages (run size = B pgs).
Produce ⌈N/B⌉ sorted runs of B pages each.
– Pass 1, 2, …: merge B-1 runs.
B-1 way merge. Total buffer pages: B
Question: Where are the main memory buffer pages allocated?
[Figure: B-1 input buffers (INPUT 1, INPUT 2, ..., INPUT B-1) and one OUTPUT buffer in main memory, between the disk holding the runs and the disk receiving the merged output.]
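A minimal sketch of the general algorithm, again as an in-memory simulation rather than a real disk-based implementation. The function name external_merge_sort and the list-of-lists page model are assumptions made for illustration; heapq.merge from the Python standard library stands in for the (B-1)-way merge.

```python
import heapq


def external_merge_sort(pages, B):
    """Sort a list of pages with B buffer pages; return (records, passes)."""
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    # Pass 0: load B pages at a time, sort them in memory, emit one run.
    runs = []
    for start in range(0, len(pages), B):
        chunk = [rec for page in pages[start:start + B] for rec in page]
        runs.append(sorted(chunk))
    passes = 1
    # Later passes: (B-1)-way merge, one buffer per input run plus one for output.
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[k:k + B - 1]))
            for k in range(0, len(runs), B - 1)
        ]
        passes += 1
    return (runs[0] if runs else []), passes


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
print(external_merge_sort(pages, 3))  # 3 runs after pass 0, then 2-way merges
print(external_merge_sort(pages, 4))  # 2 runs after pass 0, then one 3-way merge
```

With B = 3 the seven pages need 3 passes instead of the 4 passes of the two-way sort, because pass 0 already produces 3-page runs. With B = 4 two passes are enough.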
Cost of External Sort Merge
• # passes = 1 + ⌈log_(B-1) ⌈N/B⌉⌉
• I/O Cost = # passes * 2N
• Consider sorting a file with 1000 pages, using 11
buffer pages.
– At the end of the first pass, we have ⌈1000/11⌉ = 91 runs of
size 11 pages (the last run is shorter)
– Next pass produces ⌈91/10⌉ = 10 runs
of size 110 pages each (again, the last run is shorter)
– The next pass produces the fully
sorted file
– So 3 passes, for an I/O cost of 3 * 2 * 1000 = 6000
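The arithmetic of this example can be traced pass by pass. The sketch below is only bookkeeping (no data is sorted), and the helper name trace_passes is hypothetical. It uses integer ceiling division to avoid floating-point rounding.

```python
def trace_passes(N, B):
    """List (pass_number, run_count, max_run_pages) for each pass."""
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    runs = -(-N // B)                  # ceil(N / B) runs after the first pass
    run_pages = min(B, N)
    trace = [(0, runs, run_pages)]
    while runs > 1:
        runs = -(-runs // (B - 1))     # each merge combines up to B-1 runs
        run_pages = min(run_pages * (B - 1), N)
        trace.append((len(trace), runs, run_pages))
    return trace


trace = trace_passes(1000, 11)
io_cost = 2 * 1000 * len(trace)        # every pass reads and writes all N pages
for step in trace:
    print(step)
print("I/O cost:", io_cost)
```

Here the first pass is numbered 0. The trace shows 91 runs of up to 11 pages, then 10 runs of up to 110 pages, then a single run of 1000 pages, for a total of 6000 page I/Os.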


Number of Passes of External Sort
N (# of pages)    B=3    B=17    B=257
100                 7       2        1
10,000             13       4        2
1,000,000          20       5        3
10,000,000         23       6        3
100,000,000        26       7        4
1,000,000,000      30       8        4
Note (last row): 32K page size, 32 TB relation. At 1 ms per read, 1111 hours = 46 days!
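The table can be regenerated by counting passes directly. This is a sketch under the same cost model as the general external merge sort (B-page runs after the first pass, then (B-1)-way merges); num_passes is a name chosen here. The last lines show one way to arrive at the 1111-hour figure, assuming it counts only the reads of a 4-pass sort of 1,000,000,000 pages at 1 ms each. That reading of the note is an assumption, not something the slide states.

```python
def num_passes(N, B):
    """Passes needed to sort N pages with B buffer pages (B >= 3)."""
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    runs = -(-N // B)               # ceil(N / B) runs after the first pass
    passes = 1
    while runs > 1:
        runs = -(-runs // (B - 1))  # (B-1)-way merge
        passes += 1
    return passes


sizes = (100, 10_000, 1_000_000, 10_000_000, 100_000_000, 1_000_000_000)
buffers = (3, 17, 257)
for N in sizes:
    print(f"{N:>13,}", *(f"{num_passes(N, B):>4}" for B in buffers))

# Assumed reading of the note: reads only, 1 ms per page read, B = 257.
read_ms = num_passes(1_000_000_000, 257) * 1_000_000_000
hours = read_ms / 1000 / 3600
print(round(hours), "hours,", round(hours / 24), "days")
```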
Internal Sort Algorithm: Replacement Sort
Question: Size of the buffer pool?
Example: M = 2 pages, 2 tuples per page.
Input Sequence: 10, 20, 30, 40, 25, 35, 9, 8, 7, 6, 5, …
1. In-memory: 10, 20, 30, 40
2. Read 25, Output 10. In-memory: 20, 25, 30, 40
3. Read 35, Output 20. In-memory: 25, 30, 35, 40
4. Read 9, Output 25. In-memory: 9, 30, 35, 40
5. Read 8, Output 30. In-memory: 8, 9, 35, 40
6. Read 7, Output 35. In-memory: 7, 8, 9, 40
7. Read 6, Output 40. In-memory: 6, 7, 8, 9
8. Read 5, Flush output, Start new run. In-memory …
On Disk: 10, 20, 25, 30, 35, 40

Average length of a run in replacement sort is 2M
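The run-generation step can be sketched with a heap. What follows is the textbook replacement-selection scheme, which reproduces the first run of the example; it is not necessarily the exact procedure the slides have in mind. The name replacement_selection and the record-count capacity parameter (4 records for M = 2 pages of 2 tuples) are assumptions for illustration. Records smaller than the last output cannot join the current run, so they are set aside for the next one.

```python
import heapq

_END = object()  # sentinel marking the end of the input


def replacement_selection(records, capacity):
    """Split records into sorted runs, holding at most `capacity` in memory."""
    it = iter(records)
    heap = []
    for rec in it:                 # fill memory before producing any output
        heap.append(rec)
        if len(heap) == capacity:
            break
    heapq.heapify(heap)
    runs, current, frozen = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)              # output the smallest record
        nxt = next(it, _END)                  # read one new record, if any
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)     # still fits into the current run
            else:
                frozen.append(nxt)            # has to wait for the next run
        if not heap:                          # current run is complete
            runs.append(current)
            current, heap, frozen = [], frozen, []
            heapq.heapify(heap)
    return runs


runs = replacement_selection([10, 20, 30, 40, 25, 35, 9, 8, 7, 6, 5], 4)
print(runs[0])  # first run, to compare with "On Disk" in the example
```

Because this sketch outputs a record before reading the next one, the boundaries of later runs may differ slightly from a variant that reads first, as the numbered steps do. The first run is the same in both: six records from a memory that holds only four.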

Internal Sort Algorithm
• Quicksort is a fast way to sort in memory.
• An alternative is replacement sort, which is also called tournament
sort or heapsort
– Top: Read in M pages of the relation R
– Output: move smallest record to output buffer
– Read in a new record r
– insert r into “sorted heap”
– if r not smallest, then GOTO Output
– else remove r from “heap”
– output “heap” in order; GOTO Top
• Worst-Case: What is min length of a run? How does this arise?
• Best-Case: What is max length of a run? How does this arise?
• Quicksort is faster, but longer runs often mean fewer passes!
Blocked I/Os
• So far we have been reading/writing one page at a time, but we
know that reading a block of pages sequentially is faster.
• Make each buffer (input/output) be a block of pgs.
– Will reduce fan-out during merge passes! Side-effect?
– Reduces per page I/O cost.
– First Pass: Each run 2B pages, ⌈N/2B⌉ runs (where B is the size
of the buffer pool in #pages)
• Which internal sort algorithm are we using?
– Merge Tree Fanout: F = ⌊B/b⌋ - 1, b is block size
– # passes: ⌈log_F …⌉ + 1
– In practice, buffer pools are large, so most files are sorted in 2-3
passes
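The trade-off between block size and fan-out can be made concrete with a small calculation. This sketch takes the figures from the bullets above at face value (initial runs of 2B pages, fan-out F = ⌊B/b⌋ - 1) and counts merge passes by repeated division; the function name blocked_merge_plan and the example sizes are made up for illustration.

```python
def blocked_merge_plan(N, B, b):
    """Return (fanout, passes) for N pages, B buffer pages, blocks of b pages."""
    fanout = B // b - 1            # F = floor(B/b) - 1: one block is kept for output
    if fanout < 2:
        raise ValueError("need at least two input blocks to merge")
    runs = -(-N // (2 * B))        # first pass: ceil(N / 2B) runs of about 2B pages
    passes = 1
    while runs > 1:
        runs = -(-runs // fanout)  # each merge pass combines up to F runs
        passes += 1
    return fanout, passes


print(blocked_merge_plan(1_000_000, 1_000, 1))   # page-at-a-time I/O
print(blocked_merge_plan(1_000_000, 1_000, 32))  # 32-page blocks
```

With these hypothetical sizes, page-at-a-time I/O gives a fan-out of 999 and 2 passes, while 32-page blocks cut the fan-out to 30 and need 3 passes. Each of those passes does cheaper, more sequential I/O, which is the trade-off the slide points at.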
Double Buffering
• Overlap CPU and IO processing
• Prefetch into shadow block.
– Potentially, more passes; in practice, 2-3 passes.
Reduces response time. What about throughput?
[Figure: B main memory buffers, k-way merge. Each input (INPUT 1, INPUT 2, ..., INPUT k) and the OUTPUT has a shadow block (INPUT 1', ..., INPUT k', OUTPUT') of block size b, between the input disk and the output disk.]
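The prefetching idea can be illustrated with a background thread that fetches the next block while the consumer works on the current one. This is a rough sketch, not how a DBMS buffer manager is built: prefetching_reader and the read_block callback are hypothetical names, and a bounded queue.Queue from the standard library plays the role of the shadow block.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def prefetching_reader(read_block, num_blocks):
    """Yield blocks 0..num_blocks-1 while later blocks are fetched in the background."""
    slot = queue.Queue(maxsize=1)    # bounds how far the reader can run ahead

    def producer():
        for i in range(num_blocks):
            slot.put(read_block(i))  # waits here while the slot is still full
        slot.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        block = slot.get()           # usually ready already: I/O overlapped with CPU work
        if block is _DONE:
            return
        yield block


blocks = [[1, 2], [3, 4], [5]]
for block in prefetching_reader(lambda i: blocks[i], len(blocks)):
    print(sum(block))                # stand-in for the CPU work on a block
```

The memory used for prefetching is memory that cannot be used for more input runs, which is why the slide notes that double buffering can, in principle, increase the number of passes.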

Using B+ Trees for Sorting
• Scenario: Table to be sorted has B+ tree index on
sorting column(s).
• Idea: Can retrieve records in order by traversing leaf
pages.
• Is this a good idea?
• Cases to consider:
– B+ tree is clustered: Good idea!
– B+ tree is not clustered: Could be a very bad idea!


Clustered B+ Tree Used for Sorting
• Go to the left-most leaf, then retrieve all leaf pages
• If the data entries contain the records, then we are done!
• If the data entries have rids, each data page is
fetched just once (since this is a clustered index)
Faster than external sorting!
Question: Why not scan the data file directly?
[Figure: B+ tree with the index (directs search) on top, the data entries ("sequence set") at the leaf level, and the data records below.]


Unclustered B+ Tree Used for Sorting
• Unclustered B+-trees only have rids in the data entries
• So, in general, one I/O per data record!

Question: When can this be useful?
[Figure: B+ tree with the index (directs search) on top, the data entries ("sequence set") at the leaf level, and the data records below.]
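A back-of-the-envelope comparison shows why the unclustered case can be so bad. The sketch below uses a deliberately crude cost model: it ignores the descent from the root to the left-most leaf, assumes the clustered index stores rids, and charges one I/O per record for the unclustered index. The function name and all the sizes are hypothetical.

```python
def sort_cost_estimates(data_pages, records_per_page, leaf_pages, passes):
    """Rough page-I/O estimates for three ways of producing sorted output."""
    records = data_pages * records_per_page
    return {
        # external merge sort: every pass reads and writes the whole file
        "external_sort": 2 * data_pages * passes,
        # clustered index with rids: scan the leaves, fetch each data page once
        "clustered_index": leaf_pages + data_pages,
        # unclustered index: scan the leaves, one I/O per record in general
        "unclustered_index": leaf_pages + records,
    }


costs = sort_cost_estimates(
    data_pages=10_000, records_per_page=100, leaf_pages=1_000, passes=3
)
for method, io in costs.items():
    print(f"{method:>18}: {io:>9,} I/Os")
```

With these made-up sizes the clustered index needs about 11,000 I/Os, the external sort 60,000, and the unclustered index about a million, most of them random.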


Sorting Records!
• Sorting is a competitive sport!
• See http://sortbenchmark.org/
– Task is to sort 100 byte records.
– Different flavors of metrics that people compete on.
– Sort a trillion records as fast as you can,
• using general purpose sorting code (Daytona) or
• code specialized just for the benchmark (Indy)
