10 Sorting
1 Sorting
DBMSs need to sort data because tuples in a table have no specific order under the relational model. Sorting
is (potentially) used in ORDER BY, GROUP BY, JOIN, and DISTINCT operators. If the data that needs to
be sorted fits in memory, then the DBMS can use a standard sorting algorithm (e.g., quicksort). If the data
does not fit, then the DBMS needs to use external sorting that is able to spill to disk as needed and prefers
sequential over random I/O.
If a query contains an ORDER BY with a LIMIT, then the DBMS only needs to scan the data once to find the
top-N elements. This is called the Top-N Heap Sort. The ideal scenario for heapsort is when the top-N
elements fit in memory, so that the DBMS only has to maintain an in-memory sorted priority queue while
scanning the data.
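The single scan above can be sketched with a bounded in-memory heap. The following Python sketch is illustrative only (the function name and the flat list of sort keys are assumptions, not a real DBMS interface):

```python
import heapq

def top_n(keys, n):
    """Find the n smallest keys in one pass, as for ORDER BY key LIMIT n.
    A max-heap (min-heap over negated keys) holds the current top-n."""
    heap = []
    for key in keys:
        if len(heap) < n:
            heapq.heappush(heap, -key)
        elif -heap[0] > key:  # current worst of the top-n is larger
            heapq.heapreplace(heap, -key)
    return sorted(-k for k in heap)

print(top_n([5, 1, 9, 3, 7, 2], 3))  # -> [1, 2, 3]
```

The heap never holds more than n entries, so as long as n is small the whole structure stays in memory regardless of the input size.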
The standard algorithm for sorting data which is too large to fit in memory is external merge sort. It
is a divide-and-conquer sorting algorithm that splits the data set into separate runs and then sorts them
individually. It can spill runs to disk as needed, then read them back in one at a time. The algorithm is
composed of two phases:
Phase #1 – Sorting: First, the algorithm sorts small chunks of data that fit in main memory, and then writes
the sorted pages back to disk.
Phase #2 – Merge: Then, the algorithm combines the sorted sub-files into a larger single file.
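The two phases can be sketched in Python, with temporary files standing in for runs spilled to disk (the run size and the one-value-per-line file format are illustrative assumptions):

```python
import heapq
import tempfile

def external_merge_sort(values, run_size):
    """Phase 1: sort memory-sized chunks and write each as a run file.
    Phase 2: k-way merge the sorted runs back into one sorted stream."""
    runs = []
    for i in range(0, len(values), run_size):
        run = sorted(values[i:i + run_size])      # fits in "memory"
        f = tempfile.TemporaryFile(mode="w+")
        f.write("\n".join(map(str, run)))          # spill run to disk
        f.seek(0)
        runs.append(f)
    # heapq.merge streams the runs, reading each back one value at a time
    streams = [(int(line) for line in f) for f in runs]
    return list(heapq.merge(*streams))
```

`heapq.merge` only keeps one element per run in memory at a time, which is why the merge phase works even when the total data is far larger than memory.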
Using B+Trees
It is sometimes advantageous for the DBMS to use an existing B+tree index to aid in sorting rather than
using the external merge sort algorithm. In particular, if the index is a clustered index, the DBMS can just
traverse the B+tree. Since the index is clustered, the data will be stored in the correct order, so the I/O access
will be sequential. This means it is always better than external merge sort since no extra sorting computation is required.
On the other hand, if the index is unclustered, traversing the tree is almost always worse, since each record
could be stored in any page, so nearly all record accesses will require a disk read.
2 Aggregations
An aggregation operator in a query plan collapses the values of one or more tuples into a single scalar value.
There are two approaches for implementing an aggregation: (1) sorting and (2) hashing.
Sorting
The DBMS first sorts the tuples on the GROUP BY key(s). It can use either an in-memory sorting algorithm
if everything fits in the buffer pool (e.g., quicksort) or the external merge sort algorithm if the size of the
data exceeds memory. The DBMS then performs a sequential scan over the sorted data to compute the
aggregation. The output of the operator will be sorted on the keys.
When performing sorting aggregations, it is important to order the query operations to maximize efficiency.
For example, if the query requires a filter, it is better to perform the filter first and then sort the filtered data
to reduce the amount of data that needs to be sorted.
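A minimal sketch of a sort-based SUM aggregation, assuming rows arrive as (key, value) pairs and the filter is a simple predicate (both are illustrative, not the DBMS's actual tuple layout):

```python
def sort_sum(rows, predicate):
    """Filter first, then sort on the grouping key, then one
    sequential pass emits (key, SUM(value)) per group."""
    rows = sorted(r for r in rows if predicate(r))  # filter before sort
    out = []
    for key, value in rows:
        if out and out[-1][0] == key:   # same group as previous tuple
            out[-1] = (key, out[-1][1] + value)
        else:                            # group boundary: start new group
            out.append((key, value))
    return out
```

Because the input is sorted, all tuples of a group are adjacent, so each group closes as soon as the key changes, and the output comes out already sorted on the grouping key.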
Hashing
Hashing can be computationally cheaper than sorting for computing aggregations. The DBMS populates
an ephemeral hash table as it scans the table. For each record, it checks whether there is already an entry in
the hash table and performs the appropriate modification. If the size of the hash table is too large to fit in
memory, then the DBMS has to spill it to disk. There are two phases to accomplishing this:
• Phase #1 – Partition: Use a hash function h1 to split tuples into partitions on disk based on the
target hash key. This will put all tuples with matching keys into the same partition. The DBMS spills partitions to
disk via output buffers.
• Phase #2 – ReHash: For each partition on disk, read its pages into memory and build an in-memory
hash table based on a second hash function h2 (where h1 ≠ h2). Then go through each bucket of
this hash table to bring together matching tuples to compute the aggregation. This assumes that each
partition fits in memory.
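The two phases above can be sketched as follows for a SUM aggregation. In-memory lists stand in for partitions spilled to disk, and Python's built-in dict plays the role of the h2-keyed in-memory table (all names are illustrative):

```python
def external_hash_sum(rows, num_partitions):
    """Phase 1: use h1 to route (key, value) tuples to partitions.
    Phase 2: per partition, aggregate in an in-memory hash table."""
    h1 = lambda key: hash(key) % num_partitions
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:                      # spill via output buffers
        partitions[h1(key)].append((key, value))
    result = {}
    for part in partitions:                      # assume each fits in memory
        table = {}                               # dict stands in for h2 table
        for key, value in part:
            table[key] = table.get(key, 0) + value
        result.update(table)                     # keys never span partitions
    return result
```

Since h1 sends every tuple with the same key to the same partition, each group is fully contained in one partition, and the per-partition tables can simply be concatenated.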
During the ReHash phase, the DBMS can store pairs of the form (GroupByKey→RunningValue) to compute
the aggregation. The contents of the RunningValue depend on the aggregation function. To insert a new tuple
into the hash table:
• If it finds a matching GroupByKey, then update the RunningValue appropriately.
• Else insert a new (GroupByKey→RunningValue) pair.
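For example, AVG can use a (count, sum) pair as its RunningValue and divide only when the table is finalized. A minimal sketch, assuming (key, value) input tuples:

```python
def hash_avg(rows):
    """In-memory hash aggregation for AVG: the RunningValue is a
    (count, sum) pair, finalized to sum/count per group."""
    table = {}  # GroupByKey -> (count, sum)
    for key, value in rows:
        if key in table:                 # matching GroupByKey: update
            cnt, tot = table[key]
            table[key] = (cnt + 1, tot + value)
        else:                            # else insert a new pair
            table[key] = (1, value)
    return {k: tot / cnt for k, (cnt, tot) in table.items()}
```

Keeping (count, sum) instead of the average itself is what makes the update a constant-time operation per tuple; the division happens once per group at the end.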