Advanced SQL: Intro To Database Systems Andy Pavlo


02 Advanced SQL

Intro to Database Systems Andy Pavlo


15-445/15-645
Fall 2019
Computer Science
Carnegie Mellon University

RELATIONAL LANGUAGES

User only needs to specify the answer that they want, not how to compute it.

The DBMS is responsible for efficient evaluation of the query.
→ Query optimizer: re-orders operations and generates query plan

CMU 15-445/645 (Fall 2019)



SQL HISTORY

Originally “SEQUEL” from IBM’s System R prototype.
→ Structured English Query Language
→ Adopted by Oracle in the 1970s.

IBM releases DB2 in 1983.

ANSI Standard in 1986. ISO in 1987.
→ Structured Query Language


SQL HISTORY

Current standard is SQL:2016
→ SQL:2016 → JSON, Polymorphic tables
→ SQL:2011 → Temporal DBs, Pipelined DML
→ SQL:2008 → TRUNCATE, Fancy sorting
→ SQL:2003 → XML, windows, sequences, auto-gen IDs.
→ SQL:1999 → Regex, triggers, OO

Most DBMSs at least support SQL-92
→ System Comparison: http://troels.arvin.dk/db/rdbms/


RELATIONAL LANGUAGES

Data Manipulation Language (DML)
Data Definition Language (DDL)
Data Control Language (DCL)

Also includes:
→ View definition
→ Integrity & Referential Constraints
→ Transactions

Important: SQL is based on bags (duplicates) not sets (no duplicates).

Aggregations + Group By
String / Date / Time Operations
Output Control + Redirection
Nested Queries
Common Table Expressions
Window Functions


EXAMPLE DATABASE

student(sid,name,login,age,gpa)
sid    name    login       age  gpa
53666  Kanye   kayne@cs    39   4.0
53688  Bieber  jbieber@cs  22   3.9
53655  Tupac   shakur@cs   26   3.5

enrolled(sid,cid,grade)
sid    cid     grade
53666  15-445  C
53688  15-721  A
53688  15-826  B
53655  15-445  B
53666  15-721  C

course(cid,name)
cid     name
15-445  Database Systems
15-721  Advanced Database Systems
15-826  Data Mining
15-823  Advanced Topics in Databases


AGGREGATES

Functions that return a single value from a bag of tuples:
→ AVG(col)→ Return the average col value.
→ MIN(col)→ Return minimum col value.
→ MAX(col)→ Return maximum col value.
→ SUM(col)→ Return sum of values in col.
→ COUNT(col)→ Return # of values for col.


AGGREGATES

Aggregate functions can only be used in the SELECT output list (or in a HAVING or ORDER BY clause).

Get # of students with a “@cs” login:

SELECT COUNT(login) AS cnt
  FROM student WHERE login LIKE '%@cs'

SELECT COUNT(*) AS cnt
  FROM student WHERE login LIKE '%@cs'

SELECT COUNT(1) AS cnt
  FROM student WHERE login LIKE '%@cs'
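
The three COUNT queries above return the same number because the WHERE clause already discards rows whose login is NULL. The following is a minimal sketch, assuming SQLite through Python's built-in sqlite3 module (a choice made here for illustration, not something the slides prescribe) and one hypothetical extra student with no login, of where COUNT(col) and COUNT(*) would differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (sid INTEGER, name TEXT, login TEXT, age INTEGER, gpa REAL);
INSERT INTO student VALUES
  (53666, 'Kanye',  'kayne@cs',   39, 4.0),
  (53688, 'Bieber', 'jbieber@cs', 22, 3.9),
  (53655, 'Tupac',  'shakur@cs',  26, 3.5),
  (53700, 'Nobody', NULL,         30, 3.0);  -- hypothetical row with no login
""")

def count(expr, where=""):
    """Run SELECT COUNT(expr) FROM student with an optional WHERE clause."""
    return conn.execute(f"SELECT COUNT({expr}) FROM student {where}").fetchone()[0]

# With the '@cs' filter all three spellings agree.
cs_counts = [count(e, "WHERE login LIKE '%@cs'") for e in ("login", "*", "1")]

# Without the filter, COUNT(login) skips the NULL login; COUNT(*) counts every row.
all_logins, all_rows = count("login"), count("*")
print(cs_counts, all_logins, all_rows)
```

COUNT(col) counts only non-NULL values of col, while COUNT(*) and COUNT(1) count rows.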

MULTIPLE AGGREGATES

Get the number of students and their average GPA that have a “@cs” login.

SELECT AVG(gpa), COUNT(sid)
  FROM student WHERE login LIKE '%@cs'

AVG(gpa)  COUNT(sid)
3.25      12


DISTINCT AGGREGATES

COUNT, SUM, AVG support DISTINCT

Get the number of unique students that have an “@cs” login.

SELECT COUNT(DISTINCT login)
  FROM student WHERE login LIKE '%@cs'

COUNT(DISTINCT login)
10


AGGREGATES

Output of other columns outside of an aggregate is undefined.

Get the average GPA of students enrolled in each course.

SELECT AVG(s.gpa), e.cid
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid

AVG(s.gpa)  e.cid
3.5         ???


GROUP BY

Project tuples into subsets and calculate aggregates against each subset.

SELECT AVG(s.gpa), e.cid
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
 GROUP BY e.cid

Joined input:
e.sid  s.sid  s.gpa  e.cid
53435  53435  2.25   15-721
53439  53439  2.70   15-721
56023  56023  2.75   15-826
59439  59439  3.90   15-826
53961  53961  3.50   15-826
58345  58345  1.89   15-445

Output:
AVG(s.gpa)  e.cid
2.48        15-721
3.38        15-826
1.89        15-445
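
To check the grouping by hand, here is a small sketch that loads the six joined rows from this slide into an in-memory SQLite database (SQLite via Python's sqlite3 module is an assumption made for illustration; the slides do not name a system here) and runs the same GROUP BY query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (sid INTEGER, gpa REAL);
CREATE TABLE enrolled (sid INTEGER, cid TEXT);
INSERT INTO student VALUES
  (53435, 2.25), (53439, 2.70), (56023, 2.75),
  (59439, 3.90), (53961, 3.50), (58345, 1.89);
INSERT INTO enrolled VALUES
  (53435, '15-721'), (53439, '15-721'), (56023, '15-826'),
  (59439, '15-826'), (53961, '15-826'), (58345, '15-445');
""")

rows = conn.execute("""
    SELECT e.cid, AVG(s.gpa)
      FROM enrolled AS e, student AS s
     WHERE e.sid = s.sid
     GROUP BY e.cid
     ORDER BY e.cid
""").fetchall()

# One output tuple per distinct cid, each carrying that group's average.
avg_by_course = dict(rows)
print(avg_by_course)
```

Each distinct e.cid produces exactly one output tuple, so three courses give three rows.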


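GROUP BY

Non-aggregated values in SELECT output clause must appear in GROUP BY clause.

X (invalid):
SELECT AVG(s.gpa), e.cid, s.name
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
 GROUP BY e.cid

Corrected:
SELECT AVG(s.gpa), e.cid, s.name
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
 GROUP BY e.cid, s.name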


HAVING

Filters results based on aggregation computation.
Like a WHERE clause for a GROUP BY

X (invalid):
SELECT AVG(s.gpa) AS avg_gpa, e.cid
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
   AND avg_gpa > 3.9
 GROUP BY e.cid

Corrected:
SELECT AVG(s.gpa) AS avg_gpa, e.cid
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid
 GROUP BY e.cid
HAVING avg_gpa > 3.9;

Without HAVING:
AVG(s.gpa)  e.cid
3.75        15-415
3.950000    15-721
3.900000    15-826

With HAVING:
avg_gpa   e.cid
3.950000  15-721
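
The difference between WHERE and HAVING can be tried out with a short sketch. It assumes SQLite through Python's sqlite3 module and uses hypothetical rows picked so that the three course averages are 3.75, 3.95 and 3.9. The HAVING clause repeats the aggregate expression rather than the avg_gpa alias, because not every DBMS accepts an output alias there (PostgreSQL, for one, does not).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (sid INTEGER, gpa REAL);
CREATE TABLE enrolled (sid INTEGER, cid TEXT);
-- Hypothetical rows chosen to give averages of 3.75, 3.95 and 3.9.
INSERT INTO student VALUES (1, 3.7), (2, 3.8), (3, 3.9), (4, 4.0);
INSERT INTO enrolled VALUES
  (1, '15-415'), (2, '15-415'), (3, '15-721'), (4, '15-721'), (3, '15-826');
""")

base = """
    SELECT e.cid, AVG(s.gpa) AS avg_gpa
      FROM enrolled AS e, student AS s
     WHERE e.sid = s.sid {extra_where}
     GROUP BY e.cid {having}
"""

# An aggregate inside WHERE is rejected: WHERE filters rows before groups exist.
try:
    conn.execute(base.format(extra_where="AND AVG(s.gpa) > 3.9", having=""))
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

# HAVING filters whole groups after the aggregate has been computed.
kept = conn.execute(
    base.format(extra_where="", having="HAVING AVG(s.gpa) > 3.9")
).fetchall()
kept_cids = [cid for cid, _ in kept]
print(where_rejected, kept_cids)
```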

STRING OPERATIONS

          String Case  String Quotes
SQL-92    Sensitive    Single Only
Postgres  Sensitive    Single Only
MySQL     Insensitive  Single/Double
SQLite    Sensitive    Single/Double
DB2       Sensitive    Single Only
Oracle    Sensitive    Single Only

SQL-92:
WHERE UPPER(name) = UPPER('KaNyE')

MySQL:
WHERE name = "KaNyE"



STRING OPERATIONS

LIKE is used for string matching.

String-matching operators
→ '%' Matches any substring (including empty strings).
→ '_' Matches any one character

SELECT * FROM enrolled AS e
 WHERE e.cid LIKE '15-%'

SELECT * FROM student AS s
 WHERE s.login LIKE '%@c_'
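
A quick way to see the difference between '%' and '_' is to run both patterns against a handful of logins. This sketch assumes SQLite via Python's sqlite3 module; the logins alice@ece and bob@cmu and the helper names_matching are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (name TEXT, login TEXT);
INSERT INTO student VALUES
  ('Kanye', 'kayne@cs'), ('Bieber', 'jbieber@cs'),
  ('Alice', 'alice@ece'), ('Bob', 'bob@cmu');  -- last two are hypothetical
""")

def names_matching(pattern):
    """Return the names whose login matches the given LIKE pattern."""
    sql = "SELECT name FROM student WHERE login LIKE ? ORDER BY name"
    return [name for (name,) in conn.execute(sql, (pattern,))]

any_suffix = names_matching("%@c%")  # '%' matches any substring, even an empty one
one_char = names_matching("%@c_")    # '_' matches exactly one character
print(any_suffix, one_char)
```

bob@cmu matches '%@c%' but not '%@c_', because two characters follow '@c'.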


STRING OPERATIONS

SQL-92 defines string functions.
→ Many DBMSs also have their own unique functions
Can be used in both output and predicates:

SELECT SUBSTRING(name,0,5) AS abbrv_name
  FROM student WHERE sid = 53688

SELECT * FROM student AS s
 WHERE UPPER(s.name) LIKE 'KAN%'


STRING OPERATIONS

SQL standard says to use || operator to concatenate two or more strings together.

SQL-92:
SELECT name FROM student
 WHERE login = LOWER(name) || '@cs'

MSSQL:
SELECT name FROM student
 WHERE login = LOWER(name) + '@cs'

MySQL:
SELECT name FROM student
 WHERE login = CONCAT(LOWER(name), '@cs')
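
As a rough illustration of string functions used in the output list, the sketch below evaluates three expressions on literals. It assumes SQLite via Python's sqlite3 module, which follows the standard || operator; SUBSTR is SQLite's spelling of the substring function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
login, same_name, prefix = conn.execute("""
    SELECT LOWER('Kanye') || '@cs',          -- standard concatenation operator
           UPPER('KaNyE') = UPPER('kanye'),  -- case-insensitive comparison
           SUBSTR('Bieber', 1, 3)            -- first three characters
""").fetchone()
print(login, same_name, prefix)
```

SQLite has no boolean type, so the comparison comes back as the integer 1.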


DATE/TIME OPERATIONS

Operations to manipulate and modify DATE/TIME attributes.
Can be used in both output and predicates.
Support/syntax varies wildly…

Demo: Get the # of days since the beginning of the year.
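
The demo itself is not reproduced in these slides. One possible version, assuming SQLite (whose julianday function and 'start of year' modifier are SQLite-specific, in line with the warning that syntax varies between systems) and a fixed, hypothetical date so the result is reproducible:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
today = "2019-08-29"  # hypothetical fixed date; 'now' would use the current date

# julianday() returns a fractional day number, so subtracting two of them
# yields the number of days between the two dates.
(days,) = conn.execute(
    "SELECT CAST(julianday(?) - julianday(?, 'start of year') AS INTEGER)",
    (today, today),
).fetchone()
print(days)
```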


OUTPUT REDIRECTION

Store query results in another table:
→ Table must not already be defined.
→ Table will have the same # of columns with the same types as the input.

SQL-92:
SELECT DISTINCT cid INTO CourseIds
  FROM enrolled;

MySQL:
CREATE TABLE CourseIds (
  SELECT DISTINCT cid FROM enrolled);


OUTPUT REDIRECTION

Insert tuples from query into another table:
→ Inner SELECT must generate the same columns as the target table.
→ DBMSs have different options/syntax on what to do with duplicates.

SQL-92:
INSERT INTO CourseIds
  (SELECT DISTINCT cid FROM enrolled);
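
Both forms of output redirection can be sketched in SQLite, assumed here for illustration and driven from Python's sqlite3 module. SQLite does not accept SELECT ... INTO, so the first step uses its CREATE TABLE ... AS SELECT spelling; the grade filter on the second step is a hypothetical addition that makes the duplicate visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO enrolled VALUES
  (53666, '15-445', 'C'), (53688, '15-721', 'A'), (53688, '15-826', 'B'),
  (53655, '15-445', 'B'), (53666, '15-721', 'C');

-- Store query results in a new table (it must not exist yet).
CREATE TABLE CourseIds AS SELECT DISTINCT cid FROM enrolled;

-- Insert tuples from a query into the now-existing table.
INSERT INTO CourseIds SELECT DISTINCT cid FROM enrolled WHERE grade = 'A';
""")

course_ids = [cid for (cid,) in
              conn.execute("SELECT cid FROM CourseIds ORDER BY cid")]
print(course_ids)
```

With no constraint on CourseIds, the second statement adds '15-721' again, so the table ends up holding a duplicate.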


OUTPUT CONTROL

ORDER BY <column*> [ASC|DESC]
→ Order the output tuples by the values in one or more of their columns.

SELECT sid, grade FROM enrolled
 WHERE cid = '15-721'
 ORDER BY grade

sid    grade
53123  A
53334  A
53650  B
53666  D

SELECT sid FROM enrolled
 WHERE cid = '15-721'
 ORDER BY grade DESC, sid ASC

sid
53666
53650
53123
53334
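
The second ordering can be reproduced with the four rows shown on this slide. This is a sketch assuming SQLite via Python's sqlite3 module; the rows are inserted in a scrambled order on purpose so that the sort is doing the work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO enrolled VALUES
  (53334, '15-721', 'A'), (53666, '15-721', 'D'),
  (53123, '15-721', 'A'), (53650, '15-721', 'B');
""")

# Sort by grade descending first; ties on grade are broken by ascending sid.
sids = [sid for (sid,) in conn.execute("""
    SELECT sid FROM enrolled
     WHERE cid = '15-721'
     ORDER BY grade DESC, sid ASC
""")]
print(sids)
```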


OUTPUT CONTROL

LIMIT <count> [offset]
→ Limit the # of tuples returned in output.
→ Can set an offset to return a “range”

SELECT sid, name FROM student
 WHERE login LIKE '%@cs'
 LIMIT 10

SELECT sid, name FROM student
 WHERE login LIKE '%@cs'
 LIMIT 20 OFFSET 10
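
LIMIT and OFFSET are commonly combined to page through a result. The sketch below assumes SQLite via Python's sqlite3 module; the helper name page and the page size of 2 are hypothetical. An ORDER BY is added because, without one, the DBMS may return rows in any order and the pages would not be reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (sid INTEGER, name TEXT, login TEXT);
INSERT INTO student VALUES
  (53666, 'Kanye', 'kayne@cs'), (53688, 'Bieber', 'jbieber@cs'),
  (53655, 'Tupac', 'shakur@cs');
""")

def page(limit, offset=0):
    """Return one page of sids, skipping `offset` rows of the sorted result."""
    sql = """SELECT sid FROM student WHERE login LIKE '%@cs'
             ORDER BY sid LIMIT ? OFFSET ?"""
    return [sid for (sid,) in conn.execute(sql, (limit, offset))]

first_page = page(2)
second_page = page(2, offset=2)
print(first_page, second_page)
```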


NESTED QUERIES

Queries containing other queries.
They are often difficult to optimize.

Inner queries can appear (almost) anywhere in query.

SELECT name FROM student WHERE         -- Outer Query
  sid IN (SELECT sid FROM enrolled)    -- Inner Query


NESTED QUERIES

Get the names of students in '15-445'

→ sid in the set of people that take 15-445

SELECT name FROM student
 WHERE sid IN (
   SELECT sid FROM enrolled
    WHERE cid = '15-445'
 )


NESTED QUERIES

ALL→ Must satisfy expression for all rows in sub-query.

ANY→ Must satisfy expression for at least one row in sub-query.

IN→ Equivalent to '=ANY()'.

EXISTS→ At least one row is returned.


NESTED QUERIES

Get the names of students in ‘15-445’


SELECT name FROM student
WHERE sid = ANY(
SELECT sid FROM enrolled
WHERE cid = '15-445'
)


NESTED QUERIES

Get the names of students in ‘15-445’

SELECT (SELECT S.name FROM student AS S
         WHERE S.sid = E.sid) AS sname
  FROM enrolled AS E
 WHERE cid = '15-445'


NESTED QUERIES

Find student record with the highest id that is enrolled in at least one course.

X (invalid):
SELECT MAX(e.sid), s.name
  FROM enrolled AS e, student AS s
 WHERE e.sid = s.sid;

Won't work in SQL-92. This runs in SQLite, but not Postgres or MySQL (v5.7 with strict mode).


NESTED QUERIES

Find student record with the highest id that is enrolled in at least one course.

The WHERE clause must express "is greater than every other sid":

SELECT sid, name FROM student
 WHERE sid >= ALL(
   SELECT sid FROM enrolled
 )

sid    name
53688  Bieber

Alternatives:

SELECT sid, name FROM student
 WHERE sid IN (
   SELECT MAX(sid) FROM enrolled
 )

SELECT sid, name FROM student
 WHERE sid IN (
   SELECT sid FROM enrolled
    ORDER BY sid DESC LIMIT 1
 )
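
The two IN-based formulations can be checked against the example database with a short sketch. It assumes SQLite via Python's sqlite3 module; the ALL form is left out because SQLite does not implement the ALL quantifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (sid INTEGER, name TEXT);
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO student VALUES (53666, 'Kanye'), (53688, 'Bieber'), (53655, 'Tupac');
INSERT INTO enrolled VALUES
  (53666, '15-445', 'C'), (53688, '15-721', 'A'), (53688, '15-826', 'B'),
  (53655, '15-445', 'B'), (53666, '15-721', 'C');
""")

# Inner query computes the highest enrolled sid with an aggregate.
with_max = conn.execute("""
    SELECT sid, name FROM student
     WHERE sid IN (SELECT MAX(sid) FROM enrolled)
""").fetchall()

# Inner query sorts the enrolled sids and keeps only the first one.
with_limit = conn.execute("""
    SELECT sid, name FROM student
     WHERE sid IN (SELECT sid FROM enrolled ORDER BY sid DESC LIMIT 1)
""").fetchall()
print(with_max, with_limit)
```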


NESTED QUERIES

Find all courses that have no students enrolled in them.

SELECT * FROM course
 WHERE NOT EXISTS(
   SELECT * FROM enrolled
    WHERE course.cid = enrolled.cid
 )

cid     name
15-823  Advanced Topics in Databases
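
The NOT EXISTS query is a correlated sub-query: the inner SELECT refers to course.cid from the outer query, so it is logically evaluated once per course. A runnable sketch over the example database, assuming SQLite via Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course (cid TEXT, name TEXT);
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO course VALUES
  ('15-445', 'Database Systems'), ('15-721', 'Advanced Database Systems'),
  ('15-826', 'Data Mining'), ('15-823', 'Advanced Topics in Databases');
INSERT INTO enrolled VALUES
  (53666, '15-445', 'C'), (53688, '15-721', 'A'), (53688, '15-826', 'B'),
  (53655, '15-445', 'B'), (53666, '15-721', 'C');
""")

# A course is kept only if the inner query returns no enrolled row for it.
empty_courses = conn.execute("""
    SELECT * FROM course
     WHERE NOT EXISTS (
       SELECT * FROM enrolled
        WHERE course.cid = enrolled.cid
     )
""").fetchall()
print(empty_courses)
```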


WINDOW FUNCTIONS

Performs a "sliding" calculation across a set of tuples that are related.
Like an aggregation but tuples are not grouped into a single output tuple.

SELECT ... FUNC-NAME(...) OVER (...)
  FROM tableName

→ FUNC-NAME(...): Aggregation Functions, Special Functions
→ OVER (...): Can also sort

WINDOW FUNCTIONS

Aggregation functions:
→ Anything that we discussed earlier
Special window functions:
→ ROW_NUMBER()→ # of the current row
→ RANK()→ Order position of the current row.

SELECT *, ROW_NUMBER() OVER () AS row_num
  FROM enrolled

sid    cid     grade  row_num
53666  15-445  C      1
53688  15-721  A      2
53688  15-826  B      3
53655  15-445  B      4
53666  15-721  C      5


WINDOW FUNCTIONS

The OVER keyword specifies how to group together tuples when computing the window function.
Use PARTITION BY to specify group.

SELECT cid, sid,
       ROW_NUMBER() OVER (PARTITION BY cid)
  FROM enrolled
 ORDER BY cid

cid     sid    row_number
15-445  53666  1
15-445  53655  2
15-721  53688  1
15-721  53666  2
15-826  53688  1
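
A sketch of PARTITION BY on the example enrolled table follows, assuming SQLite 3.25 or newer (the first release with window functions) through Python's sqlite3 module. Unlike the slide's query, it adds ORDER BY sid inside the window; that is a deliberate change so the numbering within each course is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO enrolled VALUES
  (53666, '15-445', 'C'), (53688, '15-721', 'A'), (53688, '15-826', 'B'),
  (53655, '15-445', 'B'), (53666, '15-721', 'C');
""")

# Numbering restarts at 1 for every cid partition; no rows are collapsed.
numbered = conn.execute("""
    SELECT cid, sid,
           ROW_NUMBER() OVER (PARTITION BY cid ORDER BY sid) AS row_num
      FROM enrolled
     ORDER BY cid, row_num
""").fetchall()
print(numbered)
```

All five input tuples survive, which is the contrast with GROUP BY: the window function annotates rows instead of merging them.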


WINDOW FUNCTIONS

You can also include an ORDER BY in the window grouping to sort entries in each group.

SELECT *,
       ROW_NUMBER() OVER (ORDER BY cid)
  FROM enrolled
 ORDER BY cid


WINDOW FUNCTIONS

Find the student with the highest grade for each course.

SELECT * FROM (
  SELECT *,
         RANK() OVER (PARTITION BY cid     -- Group tuples by cid
                      ORDER BY grade ASC)  -- Then sort by grade
         AS rank
    FROM enrolled) AS ranking
 WHERE ranking.rank = 1
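
The ranking query can be run on the example enrolled table as a sketch, again assuming SQLite 3.25 or newer via Python's sqlite3 module. The alias is renamed from rank to rnk here as a precaution, since RANK is a reserved word in some systems; the outer ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO enrolled VALUES
  (53666, '15-445', 'C'), (53688, '15-721', 'A'), (53688, '15-826', 'B'),
  (53655, '15-445', 'B'), (53666, '15-721', 'C');
""")

# Letter grades sort alphabetically, so ASC puts the best grade at rank 1.
top_per_course = conn.execute("""
    SELECT sid, cid, grade FROM (
      SELECT *,
             RANK() OVER (PARTITION BY cid ORDER BY grade ASC) AS rnk
        FROM enrolled) AS ranking
     WHERE ranking.rnk = 1
     ORDER BY cid
""").fetchall()
print(top_per_course)
```

Because RANK() gives tied rows the same position, a course with two students sharing the top grade would return both.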


COMMON TABLE EXPRESSIONS

Provides a way to write auxiliary statements for use in a larger query.
→ Think of it like a temp table just for one query.
Alternative to nested queries and views.

WITH cteName AS (
  SELECT 1
)
SELECT * FROM cteName


COMMON TABLE EXPRESSIONS

You can bind output columns to names before the AS keyword.

WITH cteName (col1, col2) AS (
  SELECT 1, 2
)
SELECT col1 + col2 FROM cteName


COMMON TABLE EXPRESSIONS

Find student record with the highest id that is enrolled in at least one course.

WITH cteSource (maxId) AS (
  SELECT MAX(sid) FROM enrolled
)
SELECT name FROM student, cteSource
 WHERE student.sid = cteSource.maxId
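
The CTE version can be run against the example database as a sketch, assuming SQLite via Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (sid INTEGER, name TEXT);
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO student VALUES (53666, 'Kanye'), (53688, 'Bieber'), (53655, 'Tupac');
INSERT INTO enrolled VALUES
  (53666, '15-445', 'C'), (53688, '15-721', 'A'), (53688, '15-826', 'B'),
  (53655, '15-445', 'B'), (53666, '15-721', 'C');
""")

# cteSource behaves like a one-row, one-column temp table named for this query only.
names = [name for (name,) in conn.execute("""
    WITH cteSource (maxId) AS (
      SELECT MAX(sid) FROM enrolled
    )
    SELECT name FROM student, cteSource
     WHERE student.sid = cteSource.maxId
""")]
print(names)
```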


CTE RECURSION

Print the sequence of numbers from 1 to 10.


WITH RECURSIVE cteSource (counter) AS (
(SELECT 1)
UNION ALL
(SELECT counter + 1 FROM cteSource
WHERE counter < 10)
)
SELECT * FROM cteSource

Demo: Postgres CTE!
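
The Postgres demo is not included in these slides. As a stand-in, here is a sketch of the same recursive CTE in SQLite via Python's sqlite3 module (an assumption for illustration). The parentheses around the two SELECTs are dropped because SQLite does not accept parenthesized members in a compound SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The first SELECT seeds the sequence; the second refers back to cteSource
# and keeps adding rows until its WHERE clause produces nothing new.
numbers = [n for (n,) in conn.execute("""
    WITH RECURSIVE cteSource (counter) AS (
      SELECT 1
      UNION ALL
      SELECT counter + 1 FROM cteSource
       WHERE counter < 10
    )
    SELECT * FROM cteSource
""")]
print(numbers)
```

Without the counter < 10 predicate the recursion would never terminate, so the stopping condition is part of the query's correctness.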


CONCLUSION

SQL is not a dead language.

You should (almost) always strive to compute your answer as a single SQL statement.


NEXT CLASS

Storage Management
