15 Optimization

The document outlines the course structure for Database Systems focusing on Query Planning and Optimization for Fall 2024, led by Prof. Andy Pavlo. It includes important dates for project and homework submissions, upcoming database seminars, and a summary of the last class discussing DBMS architecture and query planning. Additionally, it covers topics such as query optimization techniques, logical vs. physical plans, and various optimization strategies.


Database Systems
Query Planning & Optimization
15-445/645 FALL 2024 PROF. ANDY PAVLO



ADMINISTRIVIA
Project #3 is due Sunday Nov 17th @ 11:59pm
→ Recitation will be next week

Homework #4 is due Sunday Nov 3rd @ 11:59pm


15-445/645 (Fall 2024)



UPCOMING DATABASE TALKS


Exon (DB Seminar)
→ Monday Oct 28th @ 4:30pm
→ Zoom

Synnada (DB Seminar)
→ Monday Nov 4th @ 4:30pm
→ Zoom

InfluxDB (DB Seminar)
→ Monday Nov 11th @ 4:30pm
→ Zoom


LAST CLASS
We talked about how to design the DBMS's architecture to execute queries in parallel.

The query plan is composed of physical operators that specify the algorithm to invoke at each step of the plan.

But how do we go from SQL to a query plan?


MOTIVATION
SELECT DISTINCT ename
FROM Emp E JOIN Dept D
ON E.did = D.did
WHERE D.dname = 'Toy'

Catalog
Emp(ssn,ename,addr,sal,did): 10,000 records, 1,000 pages (indexes: clustered, unclustered, unclustered)
Dept(did,dname,floor,mgr): 500 records, 50 pages (indexes: clustered, unclustered)

Plan (root first), Total: 2M I/Os
π ename: 4 reads + 1 write
σ dname = 'Toy': 2,000 reads + 4 writes (10K/500 = 20 emps per dept)
σ Emp.did = Dept.did: 1,000,000 reads + 2,000 writes (FK join, 10k tuples in temp T2)
Emp × Dept: (50 + 50,000) reads + 1,000,000 writes (write temp file T1, 5 tuples per page in T1)

MOTIVATION
SELECT DISTINCT ename
FROM Emp E JOIN Dept D
ON E.did = D.did
WHERE D.dname = 'Toy'

Catalog
Emp(ssn,ename,addr,sal,did): 10,000 records, 1,000 pages (indexes: clustered, unclustered, unclustered)
Dept(did,dname,floor,mgr): 500 records, 50 pages (indexes: clustered, unclustered)

Plan (root first), Total: 54k I/Os
π ename: 4 reads + 1 write (read temp T2)
σ dname = 'Toy': 2,000 reads + 4 writes (read temp T1, write temp T2)
⋈ Emp.did = Dept.did, Page Nested-Loop Join over Emp and Dept: (50 + 50,000) reads + 2,000 writes (write temp T1)

MOTIVATION
SELECT DISTINCT ename
FROM Emp E JOIN Dept D
ON E.did = D.did
WHERE D.dname = 'Toy'

Catalog
Emp(ssn,ename,addr,sal,did): 10,000 records, 1,000 pages (indexes: clustered, unclustered, unclustered)
Dept(did,dname,floor,mgr): 500 records, 50 pages (indexes: clustered, unclustered)

Materialization Model Total: 7,159 I/Os
Vectorization Model Total: 3,151 I/Os

Plan (root first)
π ename: 4 reads + 1 write (read temp T2)
σ dname = 'Toy': 2,000 reads + 4 writes (read temp T1, write temp T2)
⋈ Emp.did = Dept.did, Sort-Merge Join (50 Buffers) over Emp and Dept: 3 × (|Emp| + |Dept|) = 3,150 reads + 2,000 writes (write temp T1)

MOTIVATION
SELECT DISTINCT ename
FROM Emp E JOIN Dept D
ON E.did = D.did
WHERE D.dname = 'Toy'

Catalog
Emp(ssn,ename,addr,sal,did): 10,000 records, 1,000 pages (indexes: clustered, unclustered, unclustered)
Dept(did,dname,floor,mgr): 500 records, 50 pages (indexes: clustered, unclustered)

Plan (root first), Total: 37 I/Os
π ename: 4 reads + 1 write (read temp T2)
⋈ Emp.did = Dept.did, Index Nested-Loop Join with inputs σ(Dept) and Emp: 1 + 3 (idx) + 20 (ptr chase) reads + 4 writes
σ dname = 'Toy' on Dept, Access: Index(dname): 3 reads + 1 write
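
As a quick check of the totals on the MOTIVATION slides, the per-operator page I/O counts can simply be summed. The sketch below copies the counts from the slides and assumes the final projection costs 4 reads + 1 write in every plan, which is what the 7,159 and 37 totals imply; the plan labels are descriptive names chosen here, not terms from the lecture.

```python
# Page I/O counts (reads + writes) per operator, listed bottom-up for each plan.
# Assumption: the projection is 4 reads + 1 write in all four plans.
plans = {
    "cross product, then filters": [
        50 + 50_000 + 1_000_000,   # Emp x Dept, write temp T1
        1_000_000 + 2_000,         # selection on Emp.did = Dept.did
        2_000 + 4,                 # selection on dname = 'Toy'
        4 + 1,                     # projection on ename
    ],
    "page nested-loop join": [50 + 50_000 + 2_000, 2_000 + 4, 4 + 1],
    "sort-merge join": [3 * (1_000 + 50) + 2_000, 2_000 + 4, 4 + 1],
    "index selection + index nested-loop join": [3 + 1, 1 + 3 + 20 + 4, 4 + 1],
}

totals = {name: sum(costs) for name, costs in plans.items()}
for name, total in totals.items():
    print(f"{name}: {total:,} I/Os")
```

The same query therefore ranges from roughly two million I/Os down to a few dozen, depending only on the plan chosen.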

TODAY'S AGENDA
Background
Heuristic / Rule-based Optimization
Cost-based Optimization
Cost Model Estimation


ARCHITECTURE OVERVIEW
Application
→ (1) SQL Query → Parser
→ (2) Abstract Syntax Tree → Binder (Name→Internal ID, from the System Catalog)
→ (3) Logical Plan → Optimizer (Schema Info from the System Catalog, Estimates from the Cost Model)
→ (4) Physical Plan


LOGICAL VS. PHYSICAL PLANS


The optimizer generates a mapping of a logical algebra expression to the optimal equivalent physical algebra expression.

Physical operators define a specific execution strategy using an access path.
→ They can depend on the physical format of the data that they process (e.g., sorting, compression).
→ Not always a 1:1 mapping from logical to physical.


QUERY OPTIMIZATION (QO)


1. Identify candidate equivalent trees (logical). It is an NP-hard problem, so the space is large.
2. For each candidate, find the execution plan (physical). Estimate the cost of each plan.
3. Choose the best (physical) plan.

Practically: Choose from a subset (w.r.t. # joins) of all possible plans.

[Figure: candidate plans p1, p2, p3, …, pi, …, pn. Entire search space very large, as QO is NP-hard.]


QUERY OPTIMIZATION
Heuristics / Rules
→ Rewrite the query to remove (guessed) inefficiencies.
→ Examples: always do selections first or push down
projections as early as possible.
→ These techniques may need to examine the catalog, but they do not need to examine data.

Cost-based Search
→ Use a model to estimate the cost of executing a plan.
→ Enumerate multiple equivalent plans for a query and pick
the one with the lowest cost.


LOGICAL PLAN OPTIMIZATION


Transform a logical plan into an equivalent logical plan using pattern-matching rules.
The goal is to increase the likelihood of enumerating the optimal plan in the search.
→ Many equivalence rules for relational algebra!

Cannot compare plans because there is no cost model but can "direct" a transformation to a preferred side.


PREDICATE PUSHDOWN
Original: π ename (σ dname = 'Toy' (Dept ⋈ Emp))
Rewrite: π ename (Emp ⋈ σ dname = 'Toy' (Dept))
(join predicate in both plans: Emp.did = Dept.did)
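
The pushdown rewrite can be sketched as a small pattern-matching rule over a plan tree. This is a minimal illustration, not the lecture's implementation: the node classes (`Scan`, `Select`, `Join`, `Project`) and the `push_down` function are hypothetical names, predicates are kept as plain strings, and the rule assumes inner joins and a selection that references a single table. Unlike the slide, the sketch leaves the join inputs in their original order.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scan:
    table: str

@dataclass(frozen=True)
class Select:
    table: str        # table that owns the filtered column
    predicate: str
    child: object

@dataclass(frozen=True)
class Join:
    left: object
    right: object
    predicate: str

@dataclass(frozen=True)
class Project:
    columns: tuple
    child: object

def tables(node):
    """Return the set of base tables referenced at or below a plan node."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Join):
        return tables(node.left) | tables(node.right)
    return tables(node.child)

def push_down(node):
    """Move each Select below a Join when its table appears on one join side."""
    if isinstance(node, Scan):
        return node
    if isinstance(node, Project):
        return Project(node.columns, push_down(node.child))
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right), node.predicate)
    # Remaining case: node is a Select.
    child = push_down(node.child)
    if isinstance(child, Join):
        if node.table in tables(child.left):
            pushed = push_down(Select(node.table, node.predicate, child.left))
            return Join(pushed, child.right, child.predicate)
        if node.table in tables(child.right):
            pushed = push_down(Select(node.table, node.predicate, child.right))
            return Join(child.left, pushed, child.predicate)
    return Select(node.table, node.predicate, child)

before = Project(
    ("ename",),
    Select("Dept", "dname = 'Toy'",
           Join(Scan("Dept"), Scan("Emp"), "Emp.did = Dept.did")),
)
after = push_down(before)
print(after)
```

After the rewrite the selection sits directly on top of the `Dept` scan, so the join only sees the departments that match.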

REPLACE CARTESIAN PRODUCT

Original: … (σ Dept.did = Emp.did (Dept × Emp))
Rewrite: … (Emp ⋈ Emp.did = Dept.did Dept)



PROJECTION PUSHDOWN

Original: π Emp.ename (… ⋈ did Emp)
Rewrite: π Emp.ename (… ⋈ did (π ename, did Emp))



COST-BASED QUERY OPTIMIZATION


We will start with cost-based, bottom-up QO
→ Aka the "classic" IBM System R optimizer
Approach: Enumerate different plans for the query
and estimate their costs.
→ Single relation.
→ Multiple relations.
→ Nested sub-queries.
It chooses the best plan it has seen for the query
after exhausting all plans or some timeout.


SINGLE-RELATION QUERY PLANNING


Pick the best access method.
→ Sequential Scan
→ Binary Search (clustered indexes)
→ Index Scan
Predicate evaluation ordering.

Simple heuristics are often good enough for this.


MULTI-RELATION QUERY PLANNING


Approach #1: Generative / Bottom-Up
→ Start with nothing and then iteratively assemble and add
building blocks to generate a query plan.
→ Examples: System R, Starburst

Approach #2: Transformation / Top-Down


→ Start with the outcome that the query wants, and then
transform it to equivalent alternative sub-plans to find the
optimal plan that gets to that goal.
→ Examples: Volcano, Cascades


BOTTOM-UP OPTIMIZATION
Use static rules to perform initial optimization. Then use dynamic programming to determine the best join order for tables using a divide-and-conquer search method.

Examples: IBM System R, DB2, MySQL, Postgres, most open-source DBMSs.


SYSTEM R OPTIMIZER
Break query into blocks and generate logical operators for each block.
For each logical operator, generate a set of physical operators that implement it.
→ All combinations of join algorithms and access paths
Then, iteratively construct a "left-deep" join tree that minimizes the estimated amount of work to execute the plan.

[Figure: a Left-Deep Tree over A (outer), B (inner), C, D, and a Bushy Tree over A, B, C, D.]

SYSTEM R OPTIMIZER
SELECT ARTIST.NAME
FROM ARTIST, APPEARS, ALBUM
WHERE ARTIST.ID=APPEARS.ARTIST_ID
AND APPEARS.ALBUM_ID=ALBUM.ID
AND ALBUM.NAME='Andy''s OG Remix'
ORDER BY ARTIST.ID

Step #1: Choose the best access paths to each table
→ ARTIST: Sequential Scan
→ APPEARS: Sequential Scan
→ ALBUM: Index Look-up on NAME

Step #2: Enumerate all possible join orderings for tables
→ ARTIST ⨝ APPEARS ⨝ ALBUM
→ APPEARS ⨝ ALBUM ⨝ ARTIST
→ ALBUM ⨝ APPEARS ⨝ ARTIST
→ APPEARS ⨝ ARTIST ⨝ ALBUM
→ ARTIST × ALBUM ⨝ APPEARS
→ ALBUM × ARTIST ⨝ APPEARS
→ ⋮

Step #3: Determine the join ordering with the lowest cost
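
Steps #2 and #3 can be sketched as a dynamic program over sets of tables that only builds left-deep orders. This is a simplified illustration under loud assumptions: the cardinalities and join selectivities are made up, the cost of a plan is taken to be the sum of its estimated intermediate result sizes, and physical operators and sort orders are ignored. The names `CARD`, `SEL`, `join_size`, and `best_left_deep_order` are hypothetical.

```python
from itertools import combinations

# Hypothetical statistics, not taken from the slides.
CARD = {"ARTIST": 1_000, "APPEARS": 10_000, "ALBUM": 500}
SEL = {
    frozenset({"ARTIST", "APPEARS"}): 1 / 1_000,
    frozenset({"APPEARS", "ALBUM"}): 1 / 500,
}

def join_size(left_tables, left_rows, table):
    """Estimate the output rows of joining an intermediate result with one more table."""
    rows = left_rows * CARD[table]
    for t in left_tables:
        # A missing entry means no join predicate, i.e. a cross product.
        rows *= SEL.get(frozenset({t, table}), 1.0)
    return rows

def best_left_deep_order(tables):
    """Return (cost, rows, order) of the cheapest left-deep join order found."""
    # best[S] holds the cheapest plan found so far for the set of tables S.
    best = {frozenset({t}): (0.0, float(CARD[t]), (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            for t in subset:  # t is the table joined last (the inner input)
                cost, rows, order = best[s - {t}]
                out = join_size(s - {t}, rows, t)
                candidate = (cost + out, out, order + (t,))
                if s not in best or candidate[0] < best[s][0]:
                    best[s] = candidate
    return best[frozenset(tables)]

cost, rows, order = best_left_deep_order(["ARTIST", "APPEARS", "ALBUM"])
print(order, cost)
```

With these made-up numbers, the orders that start with the cross product ARTIST × ALBUM are far more expensive than the orders that join through APPEARS first, so the search never picks them.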



SYSTEM R OPTIMIZER
[Figure: search graph with Logical Op and Physical Op nodes, from the base tables ARTIST, ALBUM, APPEARS up to ARTIST ⨝ APPEARS ⨝ ALBUM. In the operator names, A1 = ARTIST, A2 = ALBUM, A3 = APPEARS.]

First level: enumerate physical joins for each two-table logical join.
→ ARTIST⨝APPEARS (ARTIST.ID=APPEARS.ARTIST_ID): HASH_JOIN(A1,A3), MERGE_JOIN(A1,A3)
→ ALBUM⨝APPEARS (ALBUM.ID=APPEARS.ALBUM_ID): HASH_JOIN(A2,A3), MERGE_JOIN(A2,A3)
→ APPEARS⨝ALBUM (APPEARS.ALBUM_ID=ALBUM.ID): HASH_JOIN(A3,A2), MERGE_JOIN(A3,A2)
→ •••
After pruning, one physical operator remains per logical join: HASH_JOIN(A1,A3), HASH_JOIN(A2,A3), MERGE_JOIN(A3,A2), •••

Second level: join each two-table result with the remaining table (ALBUM, ARTIST, ARTIST).
→ APPEARS.ALBUM_ID=ALBUM.ID: HASH_JOIN(A1⨝A3,A2), MERGE_JOIN(A1⨝A3,A2)
→ APPEARS.ARTIST_ID=ARTIST.ID: HASH_JOIN(A2⨝A3,A1), MERGE_JOIN(A2⨝A3,A1)
→ APPEARS.ARTIST_ID=ARTIST.ID: HASH_JOIN(A3⨝A2,A1), MERGE_JOIN(A3⨝A2,A1)
→ •••
After pruning: HASH_JOIN(A1⨝A3,A2), HASH_JOIN(A2⨝A3,A1), HASH_JOIN(A3⨝A2,A1), •••

Final plan: HASH_JOIN(A2⨝A3,A1) on top of HASH_JOIN(A2,A3).

The query has ORDER BY on ARTIST.ID but the logical plans do not contain sorting properties.

Hack: Keep track of best plans with and without data in proper physical form, and then check whether tacking on a sort operator at the end is better.


TOP-DOWN OPTIMIZATION
Start with a logical plan of what we want the query
to be. Perform a branch-and-bound search to
traverse the plan tree by converting logical
operators into physical operators.
→ Keep track of global best plan during search.
→ Treat physical properties of data as first-class entities
during planning.

Examples: MSSQL, Greenplum, CockroachDB



TOP-DOWN OPTIMIZATION
Start with a logical plan of what we want the query to be.
→ Root: ARTIST ⨝ APPEARS ⨝ ALBUM with ORDER-BY(ARTIST.ID)

Invoke rules to create new nodes and traverse tree.
→ Logical→Logical: JOIN(A,B) to JOIN(B,A)
→ Logical→Physical: JOIN(A,B) to HASH_JOIN(A,B)

Can create "enforcer" rules that require input to have certain properties.

[Figure: search tree with Logical Op and Physical Op nodes. Root: ARTIST ⨝ APPEARS ⨝ ALBUM, ORDER-BY(ARTIST.ID). Physical nodes explored for the root: MERGE_JOIN(A1⨝A2,A3), HASH_JOIN(A1⨝A2,A3), QUICKSORT(A1.ID). Logical sub-plans: ARTIST⨝APPEARS, ALBUM⨝APPEARS, ARTIST⨝ALBUM. Physical nodes explored below them: HASH_JOIN(A1,A2), MERGE_JOIN(A1,A2). Leaves: ARTIST, ALBUM, APPEARS.]


OBSERVATION
Applications often execute nested queries.
→ We could optimize each block using the methods we have
discussed.
→ However, this may be inefficient since we optimize each
block separately without a global approach.

What if we could flatten a nested query into a single block and optimize it?
→ Then, apply single-block query optimization methods.
→ Even if one cannot flatten to a single block, flattening to
fewer blocks is still beneficial.


NESTED SUB-QUERIES
The DBMS treats nested sub-queries in the WHERE clause as functions that take parameters and return a single value or set of values.

Approach #1: Rewrite to de-correlate and/or flatten them.
Approach #2: Decompose nested query and store results in a temporary table.


NESTED SUB-QUERIES: REWRITE

Original:
SELECT name FROM sailors AS S
WHERE EXISTS (
SELECT * FROM reserves AS R
WHERE S.sid = R.sid
AND R.day = '2022-10-25'
)

Rewrite:
SELECT name
FROM sailors AS S, reserves AS R
WHERE S.sid = R.sid
AND R.day = '2022-10-25'
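
The two forms can be compared on a toy database. The sketch below uses Python's built-in `sqlite3` module with made-up rows, and it assumes each sailor has at most one reservation on the given day. Without that assumption the join form would return a name once per matching reservation, so it would need `DISTINCT` (or a semi-join) to stay equivalent to `EXISTS`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sailors (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reserves (sid INTEGER, bid INTEGER, day TEXT);
    -- Hypothetical sample rows: at most one reservation per sailor per day.
    INSERT INTO sailors VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO reserves VALUES
        (1, 101, '2022-10-25'),
        (2, 102, '2022-10-26'),
        (3, 103, '2022-10-25');
""")

# Nested form: correlated EXISTS sub-query.
nested = conn.execute("""
    SELECT name FROM sailors AS S
    WHERE EXISTS (SELECT * FROM reserves AS R
                  WHERE S.sid = R.sid AND R.day = '2022-10-25')
    ORDER BY name
""").fetchall()

# Flattened form: a single block with a join.
flattened = conn.execute("""
    SELECT name FROM sailors AS S, reserves AS R
    WHERE S.sid = R.sid AND R.day = '2022-10-25'
    ORDER BY name
""").fetchall()

print(nested == flattened, nested)
```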


DECOMPOSING QUERIES
For harder queries, the optimizer breaks up queries into blocks and then concentrates on one block at a time.

Sub-queries are written to temporary tables that are discarded after the query finishes.


DECOMPOSING QUERIES

SELECT S.sid, MIN(R.day)
FROM sailors S, reserves R, boats B
WHERE S.sid = R.sid
AND R.bid = B.bid
AND B.color = 'red'
AND S.rating = (SELECT MAX(S2.rating)
FROM sailors S2)
GROUP BY S.sid
HAVING COUNT(*) > 1

Nested Block: (SELECT MAX(S2.rating) FROM sailors S2)

DECOMPOSING QUERIES

Inner Block:
SELECT MAX(rating) FROM sailors

Outer Block (### stands for the result of the inner block):
SELECT S.sid, MIN(R.day)
FROM sailors S, reserves R, boats B
WHERE S.sid = R.sid
AND R.bid = B.bid
AND B.color = 'red'
AND S.rating = ###
GROUP BY S.sid
HAVING COUNT(*) > 1
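
The decomposition can be mimicked by hand: run the inner block once, keep its result, and feed it into the outer block. The sketch below does this with Python's built-in `sqlite3` module; the table contents are made up, and a bound parameter stands in for the temporary table that a DBMS would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sailors (sid INTEGER PRIMARY KEY, rating INTEGER);
    CREATE TABLE boats (bid INTEGER PRIMARY KEY, color TEXT);
    CREATE TABLE reserves (sid INTEGER, bid INTEGER, day TEXT);
    -- Hypothetical sample rows.
    INSERT INTO sailors VALUES (1, 10), (2, 10), (3, 7);
    INSERT INTO boats VALUES (101, 'red'), (102, 'red'), (103, 'blue');
    INSERT INTO reserves VALUES
        (1, 101, '2022-10-25'), (1, 102, '2022-10-20'),
        (2, 101, '2022-10-21'), (2, 103, '2022-10-22'),
        (3, 101, '2022-10-23'), (3, 102, '2022-10-24');
""")

# Inner block: evaluated once, independent of the outer block.
(max_rating,) = conn.execute("SELECT MAX(rating) FROM sailors").fetchone()

# Outer block: the nested block is replaced by the stored result.
outer = conn.execute("""
    SELECT S.sid, MIN(R.day)
    FROM sailors S, reserves R, boats B
    WHERE S.sid = R.sid
      AND R.bid = B.bid
      AND B.color = 'red'
      AND S.rating = ?
    GROUP BY S.sid
    HAVING COUNT(*) > 1
""", (max_rating,)).fetchall()

print(max_rating, outer)
```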

EXPRESSION REWRITING
An optimizer transforms a query’s expressions (e.g.,
WHERE/ON clause predicates) into the minimal set of
expressions.

Implemented using if/then/else clauses or a pattern-matching rule engine.
→ Search for expressions that match a pattern.
→ When a match is found, rewrite the expression.
→ Halt if there are no more rules that match.


EXPRESSION REWRITING
Impossible / Unnecessary Predicates
SELECT * FROM A WHERE 1 = 0;
→ SELECT * FROM A WHERE false;
SELECT * FROM A WHERE NOW() IS NULL;
→ SELECT * FROM A WHERE false;

Merging Predicates
SELECT * FROM A
WHERE val BETWEEN 1 AND 100
OR val BETWEEN 50 AND 150;
→ SELECT * FROM A
WHERE val BETWEEN 1 AND 150;
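
The merging rule amounts to taking the union of overlapping inclusive intervals on the same column. A minimal sketch, assuming the predicates have already been parsed into `(low, high)` pairs that are OR-ed together; `merge_between` and `to_sql` are hypothetical helper names:

```python
def merge_between(ranges):
    """Merge overlapping inclusive (low, high) ranges that are OR-ed on one column."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1]:
            # Overlaps the previous range: extend its upper bound if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged

def to_sql(column, ranges):
    """Render the ranges back into an OR of BETWEEN predicates."""
    return " OR ".join(f"{column} BETWEEN {lo} AND {hi}" for lo, hi in ranges)

print(to_sql("val", merge_between([(1, 100), (50, 150)])))
# prints: val BETWEEN 1 AND 150
```

Ranges that merely touch without overlapping (for example 1 to 100 and 101 to 150) are left separate here, because merging them is only safe when the column is known to hold integers.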


OBSERVATION
We have formulas for the operator algorithms (e.g. the cost formulas for hash join, sort merge join, …), but we also need to estimate the size of the output that an operator produces.

This is hard because the output of each operator depends on its input.

[Figure: plan with π ename over ⋈ Emp.did = Dept.did, whose inputs are Dept and π ename,did (Emp); the output size of each operator is marked "???".]


COST ESTIMATION
The DBMS uses a cost model to predict the
behavior of a query plan given a database state.
→ This is an internal cost that allows the DBMS to compare
one plan with another.

It is too expensive to run every possible plan to determine this information, so the DBMS needs a way to derive this information.


COST MODEL COMPONENTS


Choice #1: Physical Costs
→ Predict CPU cycles, I/O, cache misses, RAM consumption,
network messages…
→ Depends heavily on hardware.

Choice #2: Logical Costs


→ Estimate output size per operator.
→ Independent of the operator algorithm.
→ Need estimations for operator result sizes.

POSTGRES COST MODEL
Uses a combination of CPU and I/O costs that are
weighted by “magic” constant factors.

Default settings are obviously for a disk-resident
database without a lot of memory:
→ Processing a tuple in memory is 400x faster than reading a
tuple from disk.
→ Sequential I/O is 4x faster than random I/O.
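
As a rough illustration of how such weighted constants combine, the sketch below estimates the cost of a sequential scan. The four constants are the default values of the PostgreSQL settings seq_page_cost, random_page_cost, cpu_tuple_cost, and cpu_operator_cost, and the 4x and 400x ratios above appear to correspond to 4.0 / 1.0 and 4.0 / 0.01. The function seq_scan_cost and the table size are hypothetical, and the formula is a simplification, not the planner's actual code.

```python
# Default values of the PostgreSQL planner cost constants.
SEQ_PAGE_COST = 1.0         # cost to read one page sequentially
RANDOM_PAGE_COST = 4.0      # cost to read one page at random
CPU_TUPLE_COST = 0.01       # cost to process one tuple
CPU_OPERATOR_COST = 0.0025  # cost to evaluate one operator on one tuple


def seq_scan_cost(pages, tuples, num_quals=0):
    """Simplified sequential-scan cost: weighted I/O plus weighted CPU."""
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = tuples * (CPU_TUPLE_COST + num_quals * CPU_OPERATOR_COST)
    return io_cost + cpu_cost


# Hypothetical table: 358 pages holding 10,000 tuples.
print(round(seq_scan_cost(358, 10_000), 2))               # no WHERE clause
print(round(seq_scan_cost(358, 10_000, num_quals=1), 2))  # one predicate
```

The result is a unitless number that is only meaningful when compared against the cost of another plan.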

STATISTICS
The DBMS stores internal statistics about tables,
attributes, and indexes in its internal catalog.
Different systems update them at different times.

Manual invocations:
→ Postgres/SQLite: ANALYZE
→ Oracle/MySQL: ANALYZE TABLE
→ SQL Server: UPDATE STATISTICS
→ DB2: RUNSTATS

SELECTION CARDINALITY
The selectivity (sel) of a predicate P
is the fraction of tuples that qualify.

Equality Predicate: A=constant
→ sel(A=constant) = #occurrences/|R|
→ Example: sel(age=9) = 4/45

SELECT * FROM people
WHERE age = 9

[Figure: # of occurrences for each distinct value of attribute age (1 to 15); SC(age=9)=4]
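
The equality formula can be written out as a small Python sketch. It assumes exact per-value counts are available and the column is non-empty; the function name equality_selectivity and the ages column are hypothetical, built only so that age = 9 occurs 4 times among 45 tuples as in the example.

```python
from collections import Counter
from fractions import Fraction


def equality_selectivity(column_values, constant):
    """sel(A=constant) = #occurrences / |R| for a non-empty column."""
    counts = Counter(column_values)
    return Fraction(counts[constant], len(column_values))


# Hypothetical column: 45 tuples, 4 of them with age = 9.
ages = [9] * 4 + [30] * 41
print(equality_selectivity(ages, 9))
```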
SELECTION CARDINALITY
Assumption #1: Uniform Data
→ The distribution of values (except for the heavy hitters) is
the same.

Assumption #2: Independent Predicates
→ The predicates on attributes are independent.

Assumption #3: Inclusion Principle
→ The domain of join keys overlaps such that each key in the
inner relation will also exist in the outer table.

CORRELATED ATTRIBUTES
Consider a database of automobiles:
→ # of Makes = 10, # of Models = 100
And the following query:
→ (make=“Honda” AND model=“Accord”)
With the independence and uniformity
assumptions, the selectivity is:
→ 1/10 × 1/100 = 0.001
But since only Honda makes Accords, the real
selectivity is 1/100 = 0.01.

Source: Guy Lohman
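
The gap between the two numbers can be reproduced with a small Python sketch. The cars table is hypothetical: 100 models, each built by exactly one of 10 makes, with one row per model. The helper sel is also an assumption made for illustration.

```python
from fractions import Fraction

# Hypothetical data: model m is built only by make m // 10.
cars = [(f"make{m // 10}", f"model{m}") for m in range(100)]


def sel(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return Fraction(sum(1 for row in rows if predicate(row)), len(rows))


sel_make = sel(cars, lambda r: r[0] == "make0")
sel_model = sel(cars, lambda r: r[1] == "model3")

# Independence assumption: multiply the individual selectivities.
estimated = sel_make * sel_model
# Real selectivity of the conjunction on the correlated data.
actual = sel(cars, lambda r: r[0] == "make0" and r[1] == "model3")

print(estimated, actual)
```

Because the model determines the make, the make predicate filters out nothing extra, and the independence estimate is too low by a factor of 10.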


STATISTICS
Choice #1: Histograms
→ Maintain an occurrence count per value (or range of
values) in a column.

Choice #2: Sketches
→ Probabilistic data structure that gives an approximate
count for a given value.

Choice #3: Sampling
→ DBMS maintains a small subset of each table that it then
uses to evaluate expressions to compute selectivity.

HISTOGRAMS
Our formulas are nice, but we assume that data
values are uniformly distributed.

[Figure: histogram of # of occurrences for each distinct value of the attribute (1 to 15)]
15 Keys × 32-bits = 60 bytes

EQUI-WIDTH HISTOGRAM
Maintain counts for a group of values instead of
each unique key. All buckets have the same width
(i.e., same # of values).

Equi-Width Histogram
Bucket   Range   Count
#1       1-3     8
#2       4-6     4
#3       7-9     15
#4       10-12   3
#5       13-15   14
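
A minimal Python sketch of building and using such a histogram follows. The per-value counts in VALUE_COUNTS are hypothetical, chosen only so that buckets of width 3 reproduce the counts 8, 4, 15, 3, and 14 shown above. The function names are assumptions, and the estimate assumes values are spread uniformly inside each bucket.

```python
# Hypothetical per-value occurrence counts for values 1..15.
VALUE_COUNTS = {1: 1, 2: 3, 3: 4, 4: 2, 5: 1, 6: 1, 7: 6, 8: 5, 9: 4,
                10: 1, 11: 1, 12: 1, 13: 4, 14: 5, 15: 5}


def build_equi_width(value_counts, width):
    """Group consecutive values into buckets that each span `width` values."""
    lo, hi = min(value_counts), max(value_counts)
    buckets = []
    for start in range(lo, hi + 1, width):
        end = min(start + width - 1, hi)
        count = sum(value_counts.get(v, 0) for v in range(start, end + 1))
        buckets.append((start, end, count))
    return buckets


def estimate_occurrences(buckets, value):
    """Assume the bucket's count is spread evenly over the values it covers."""
    for start, end, count in buckets:
        if start <= value <= end:
            return count / (end - start + 1)
    return 0.0


buckets = build_equi_width(VALUE_COUNTS, width=3)
print(buckets)
print(estimate_occurrences(buckets, 9))
```

The estimate for a single value is the bucket average, so it can differ from the true per-value count that the histogram no longer stores.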

EQUI-DEPTH HISTOGRAMS
Vary the width of buckets so that the total number
of occurrences for each bucket is roughly the same.

[Figure: equi-depth histogram (quantiles) with bucket ranges 1-5, 6-8, 9-13, 14-15]
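
One way to pick the bucket boundaries is a greedy pass over the sorted values, sketched below in Python. This is an illustration under assumptions: VALUE_COUNTS holds hypothetical per-value counts, build_equi_depth is a hypothetical name, and real systems may compute quantiles differently.

```python
# Hypothetical per-value occurrence counts for values 1..15.
VALUE_COUNTS = {1: 1, 2: 3, 3: 4, 4: 2, 5: 1, 6: 1, 7: 6, 8: 5, 9: 4,
                10: 1, 11: 1, 12: 1, 13: 4, 14: 5, 15: 5}


def build_equi_depth(value_counts, num_buckets):
    """Close a bucket once it holds about total / num_buckets occurrences."""
    target = sum(value_counts.values()) / num_buckets
    buckets, lo, running = [], None, 0
    for value in sorted(value_counts):
        if lo is None:
            lo = value
        running += value_counts[value]
        # The last bucket stays open so that it absorbs all remaining values.
        if running >= target and len(buckets) < num_buckets - 1:
            buckets.append((lo, value, running))
            lo, running = None, 0
    if lo is not None:
        buckets.append((lo, max(value_counts), running))
    return buckets


print(build_equi_depth(VALUE_COUNTS, num_buckets=4))
```

Each tuple in the result is (low value, high value, occurrence count), and the counts stay close to each other even though the ranges have different widths.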

SKETCHES
Probabilistic data structures that generate
approximate statistics about a data set.
The cost model can replace histograms with sketches to
improve the accuracy of its selectivity estimates.

Most common examples:
→ Count-Min Sketch (2003): Approximate frequency count
of elements in a set.
→ HyperLogLog (2007): Approximate the number of distinct
elements in a set.
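
A Count-Min Sketch can be sketched in a few lines of Python. This is a toy version under assumptions: the class name, the default width and depth, and the use of SHA-256 with a per-row prefix as the hash family are all choices made for illustration, not taken from any DBMS.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One counter per row, chosen by a row-specific hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._slots(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._slots(item))


sketch = CountMinSketch()
for age in [9, 9, 9, 9, 30, 30, 41]:
    sketch.add(age)
print(sketch.estimate(9))
```

The memory used depends only on width and depth, not on the number of distinct values, which is the trade-off against a per-value histogram.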

SAMPLING
Modern DBMSs also collect samples
from tables to estimate selectivities.

Update samples when the underlying
table changes significantly.

SELECT AVG(age)
FROM people
WHERE age > 50

people (1 billion tuples)
id    name       age  status
1001  Obama      63   Rested
1002  Swift      34   Paid
1003  Tupac      25   Dead
1004  Bieber     30   Crunk
1005  Andy       43   Illin
1006  TigerKing  61   Jailed

Table Sample
id    name       age  status
1001  Obama      63   Rested
1003  Tupac      25   Dead
1005  Andy       43   Illin

sel(age>50) = 1/3
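
The sample-based estimate can be reproduced with a short Python sketch that evaluates the predicate on the three sampled rows. The function name sample_selectivity is hypothetical, and scaling the result to the full table assumes the sample is representative.

```python
from fractions import Fraction

# The three sampled rows from the people table: (id, name, age, status).
sample = [
    (1001, "Obama", 63, "Rested"),
    (1003, "Tupac", 25, "Dead"),
    (1005, "Andy", 43, "Illin"),
]


def sample_selectivity(rows, predicate):
    """Fraction of sampled rows that satisfy the predicate."""
    matches = sum(1 for row in rows if predicate(row))
    return Fraction(matches, len(rows))


sel = sample_selectivity(sample, lambda row: row[2] > 50)
# Scale to the full table size, assuming the sample is representative.
estimated_rows = sel * 1_000_000_000
print(sel, int(estimated_rows))
```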
CONCLUSION
Query optimization is critical for a database system.
→ SQL → Logical Plan → Physical Plan
→ Flatten queries before going to the optimization part.
Expression handling is also important.
→ Estimate costs using models based on summarizations.

QO enumeration can be bottom-up or top-down.

If you like this and want to make cash money after
you leave CMU, take 15-799 in spring 2025.

NEXT CLASS
Transactions!
→ aka the second hardest part about database systems
