Thinking in Sets


Thinking in Sets: how to program in SQL

By Joe Celko Copyright 2002

Joe Celko - Articles


Member of ANSI X3H2 since 1987
SQL for Smarties - DBMS Magazine
Celko on SQL - DBP&D
SQL Puzzle - Boxes & Arrows
DBMS/Report - Systems Integration
WATCOM SQL Column - PBDJ
Celko on Software - COMPUTING (UK)
Celko - Intelligent Enterprise
SELECT FROM Austin - DB/M (Netherlands)

Joe Celko - Books


JOE CELKO'S SQL FOR SMARTIES - 1995, 1999, Morgan-Kaufmann
INSTANT SQL - 1995, Wrox Press
JOE CELKO'S SQL PUZZLES & ANSWERS - 1997, Morgan-Kaufmann
DATA & DATABASES - 2000, Morgan-Kaufmann

SQL is Not Procedural


SQL is NOT a procedural language
All computable problems can be solved in a non-procedural language like SQL - big fat hairy deal!
SQL works best with whole tables, not with single rows.
All relations are shown as columns and values
You tell it what you want; it figures out how to get it
Good specifications are hard to write!

SQL is Not Computational


SQL is NOT a computational language
Standard SQL has only four-function math; everything else is a vendor extension
Rounding and truncation are implementation-defined
You really ought to pass the data to a report writer or statistical tool for any fancy calculations.

Principles

Think in sets, not single values and rows
Much procedural code can be moved inside a query with the CASE expression and COALESCE() function
GROUP BY is very useful
WHERE and HAVING are not the same
Algebra is important!
Logic is very important!
A good data model will save you a lot of pain.

Celko's Heuristics - 1

Do not draw boxes and arrows
  the arrows imply a flow of something
  flow means process
  process means procedures
Draw circles -- set diagrams
  Sets can be nested, disjoint, overlapping, etc.
  These are relationships
Test for empty sets, NULLs and special values
Develop with small sets, but test with large sets

Celko's Heuristics - 2

Do not use temp tables
  they usually hold steps in a process
  process means procedures
Do use derived tables
  They are part of the query and the optimizer can get to them
Nesting of functions is good and you can do it more than you think

  INSERT INTO Sequence (keycol, ...)
  VALUES (COALESCE((SELECT MAX(keycol) FROM Sequence) + 1, 0), ...);

Think in Aggregates

Think in Aggregates -2

Do not try to figure out the details!
The completed block is 4 by 5 by 5 units, so its volume is 100 cubic units.
It is missing 3 blocks, which are 2 cubic units each
100 - 6 = 94 cubic units
94/2 = 47 blocks

What makes an Entity?


Puzzle is in three parts and shows 14 leprechauns

What makes an Entity? -2


Swap the top two pieces and you have 15 leprechauns

What makes an Entity? -3


This is a false question
Each set of leprechauns is a totally different aggregation of leprechaun parts.
It depends on not having a clear rule for knowing what makes a leprechaun.
Break a piece of chalk in half and you have two pieces of chalk!

Don't sweat details

Sequence -1

There are no FOR-NEXT loops in SQL
Instead of doing things one at a time, you have to do them all at once, in a set
To get a subset of integers, first you need to have a set of integers
Build an auxiliary table of sequential numbers from 1 to some value (n)
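As a minimal sketch of the auxiliary table, the Python script below builds Sequence in an in-memory SQLite database. The recursive common table expression is an assumption about the DBMS (it is SQL:1999 syntax that not every product supports), and the helper name build_sequence and the limit of 100 are purely illustrative.

```python
import sqlite3


def build_sequence(conn, n):
    """Create Sequence(seq) holding the integers 1..n with one set-oriented INSERT."""
    conn.execute(
        "CREATE TABLE Sequence (seq INTEGER NOT NULL PRIMARY KEY CHECK (seq > 0))"
    )
    # The recursive CTE generates the whole set of integers inside the engine;
    # the host program never loops over individual rows.
    conn.execute(
        """
        WITH RECURSIVE Counter(i) AS (
            SELECT 1
            UNION ALL
            SELECT i + 1 FROM Counter WHERE i < ?
        )
        INSERT INTO Sequence (seq) SELECT i FROM Counter
        """,
        (n,),
    )


conn = sqlite3.connect(":memory:")
build_sequence(conn, 100)

# Once the set of integers exists, any subset is just a WHERE clause.
evens_up_to_10 = [
    row[0]
    for row in conn.execute(
        "SELECT seq FROM Sequence WHERE seq <= 10 AND seq % 2 = 0 ORDER BY seq"
    )
]
```

Where recursive queries are not available, the same table can be loaded once from any list of integers; the point is only that the integers exist as a set before queries need them.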

Sequence -2

Sequence tables should start with one and not with zero
Other columns in the table can be
  Random numbers
  Number words (ordinal and cardinal)
  Complicated functions

Sequence -3

Sequence can be used as a loop replacement
Example: given a string with a comma-separated list, cut it into integers:
  12,345,99,765 becomes a column in a table
Procedural approach:
  parse the string left to right until you hit a comma
  slice off the substring to the left of the comma
  cast that substring as an integer
  loop and lop until the string is empty

Sequence -4

Non-procedural approach
  find all the commas at once
  find all the digits bracketed by pairs of sequential commas
  convert that set of substrings into integers as a set
Hint: first find all the commas

  SELECT I1.keycol, S1.seq
    FROM InputStrings AS I1, Sequence AS S1
   WHERE SUBSTRING(',' || instring || ',' FROM S1.seq FOR 1) = ',';

Now find the pairs of commas

Sequence -5

  SELECT I1.keycol,
         CAST(SUBSTRING(',' || instring || ','
                        FROM S1.seq + 1
                        FOR S2.seq - S1.seq - 1) AS INTEGER)
    FROM InputStrings AS I1, Sequence AS S1, Sequence AS S2
   WHERE SUBSTRING(',' || instring || ',' FROM S1.seq FOR 1) = ','
     AND SUBSTRING(',' || instring || ',' FROM S2.seq FOR 1) = ','
     AND S1.seq < S2.seq
     AND S2.seq = (SELECT MIN(S3.seq)   -- the next comma after S1.seq
                     FROM Sequence AS S3
                    WHERE S1.seq < S3.seq
                      AND SUBSTRING(',' || instring || ',' FROM S3.seq FOR 1) = ',');
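The comma-pair idea can be tried out with the sketch below, which runs it against the slide's sample string in an in-memory SQLite database. SQLite spells the function substr(string, start, length) rather than SUBSTRING ... FROM ... FOR, the 50-row Sequence table is an arbitrary size chosen to be longer than the padded string, and the innermost subquery is written so that it looks for the next comma position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE InputStrings (keycol INTEGER NOT NULL PRIMARY KEY,
                               instring TEXT NOT NULL);
    CREATE TABLE Sequence (seq INTEGER NOT NULL PRIMARY KEY);
    INSERT INTO InputStrings VALUES (1, '12,345,99,765');
    """
)
# Sequence must reach at least the length of the padded string.
conn.executemany("INSERT INTO Sequence VALUES (?)", [(i,) for i in range(1, 51)])

SPLIT_SQL = """
SELECT I1.keycol,
       CAST(substr(',' || I1.instring || ',', S1.seq + 1, S2.seq - S1.seq - 1)
            AS INTEGER) AS val
  FROM InputStrings AS I1, Sequence AS S1, Sequence AS S2
 WHERE substr(',' || I1.instring || ',', S1.seq, 1) = ','
   AND substr(',' || I1.instring || ',', S2.seq, 1) = ','
   AND S1.seq < S2.seq
   AND S2.seq = (SELECT MIN(S3.seq)            -- nearest comma to the right
                   FROM Sequence AS S3
                  WHERE S3.seq > S1.seq
                    AND substr(',' || I1.instring || ',', S3.seq, 1) = ',')
 ORDER BY I1.keycol, S1.seq
"""

# One row per pair of adjacent commas, i.e. one row per list element.
values = [row[1] for row in conn.execute(SPLIT_SQL)]
```

Padding the string with a leading and a trailing comma is what lets the first and last elements be treated exactly like the ones in the middle.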

Sequence -6

Problem: find the smallest missing number in Sequence, with the restriction that it has to be positive.
The obvious answer is to set up a loop and look for the first gap.
Let's look for gaps

  SELECT MIN(seq) + 1
    FROM Sequence
   WHERE (seq + 1) NOT IN (SELECT seq FROM Sequence);

But it does not work. When you have {(1),(2),(3)} you get (4); when you have {(2),(3)}, you also get (4). Oops. Try to repair this so that the lower limits are not excluded:

Sequence -7

First attempt at a repair job - UNION of special cases

  SELECT MIN(seq)
    FROM (SELECT MIN(seq) + 1
            FROM Sequence
           WHERE (seq + 1) NOT IN (SELECT seq FROM Sequence)
          UNION ALL
          SELECT MAX(seq) - 1
            FROM Sequence
           WHERE (seq - 1) NOT IN (SELECT seq FROM Sequence)) AS X(seq);

But now {(1), (2), (3)} returns (0) as a result, which you did not want

Sequence -8

It looks like there are many special cases to consider, so let's think about a CASE expression.

  SELECT CASE WHEN MAX(seq) = (SELECT COUNT(*) FROM Sequence)
              THEN MAX(seq) + 1   -- no gaps, so go higher
              WHEN MIN(seq) > 1   -- one is missing
              THEN 1
              ELSE                -- first gap
                   (SELECT MIN(seq) + 1
                      FROM Sequence
                     WHERE seq + 1 NOT IN (SELECT seq FROM Sequence))
          END
    FROM Sequence;
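To check the CASE expression against the troublesome inputs from the earlier slides, here is a small sketch that runs it in an in-memory SQLite database. The helper name smallest_missing is hypothetical; the SQL is the query from this slide.

```python
import sqlite3

SMALLEST_MISSING_SQL = """
SELECT CASE
         WHEN MAX(seq) = (SELECT COUNT(*) FROM Sequence)
           THEN MAX(seq) + 1        -- no gaps, so go higher
         WHEN MIN(seq) > 1
           THEN 1                   -- one is missing
         ELSE (SELECT MIN(seq) + 1  -- first gap
                 FROM Sequence
                WHERE seq + 1 NOT IN (SELECT seq FROM Sequence))
       END
  FROM Sequence
"""


def smallest_missing(values):
    """Load the given positive integers into Sequence and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Sequence (seq INTEGER NOT NULL PRIMARY KEY CHECK (seq > 0))"
    )
    conn.executemany("INSERT INTO Sequence VALUES (?)", [(v,) for v in values])
    return conn.execute(SMALLEST_MISSING_SQL).fetchone()[0]


no_gaps = smallest_missing([1, 2, 3])         # falls into the first WHEN
missing_one = smallest_missing([2, 3])        # falls into the second WHEN
inner_gap = smallest_missing([1, 2, 4, 5])    # falls into the ELSE
```

The first WHEN works because the key is unique and positive: if the largest value equals the row count, the table must hold exactly 1 through MAX(seq).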

Nested Sets - 1

Not everything works on equality
Less than, greater than and BETWEEN define subsets nested within a larger set
This sort of query usually involves a self-join where one copy of the table defines the elements of the subset and the other defines the boundary of the subset

Nested Sets - 2


Top(n) Values

One version exists in Microsoft ACCESS and SQL Server
  Those implementations use an ORDER BY clause
This is best done procedurally with the Partition routine from QuickSort
MAX() and MIN() are okay because they return a scalar value
TOP(n) returns a set of rows, so it is not a function

Top (n) Values -2


Procedural approach:
  Sort the file in descending order
  Return the top (n) of them with a loop

Problems: the spec is bad
  How do you handle multiple copies of a value?
  How do you handle exactly (n) values?
  How do you handle fewer than (n) values?

Top (n) Values - 3


Subset approach:
  Decide if ties count or not; this is the pure set model versus SQL's multi-set model
  Find the subset with (n) or fewer members whose values are equal to the (n) highest values in the entire set
  Use one copy of the table as the elements of the subset and one to establish the boundary of it.

Nested Sets - 2

  SELECT DISTINCT E1.salary
    FROM Employees AS E1             -- elements
   WHERE :n                          -- n is parameter
         > (SELECT COUNT(*)
              FROM Employees AS E2   -- boundary
             WHERE E1.salary < E2.salary);

The subquery counts the rows that outrank E1, so a salary is kept when fewer than (n) rows are above it.
Use < or <=, depending on where you put the boundary in relation to the elements.
Use SELECT or SELECT DISTINCT, depending on how you want to count elements
Use COUNT(*) or COUNT(DISTINCT <col>), depending on how you want to count NULL elements
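The sketch below runs the subset approach in an in-memory SQLite database so that the awkward cases in the spec (ties, fewer than (n) values) can be seen directly. It is written to return the highest salaries, so the boundary copy E2 counts the rows with a salary above E1; the emp_id column, the helper top_salaries and the sample salaries are all made up for illustration.

```python
import sqlite3

TOP_N_SQL = """
SELECT DISTINCT E1.salary
  FROM Employees AS E1                  -- elements
 WHERE :n > (SELECT COUNT(*)
               FROM Employees AS E2     -- boundary
              WHERE E1.salary < E2.salary)
 ORDER BY E1.salary DESC
"""


def top_salaries(salaries, n):
    """Return the distinct salaries that have fewer than n rows above them."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, salary INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO Employees (salary) VALUES (?)", [(s,) for s in salaries]
    )
    return [row[0] for row in conn.execute(TOP_N_SQL, {"n": n})]


sample = [100, 90, 90, 80, 70]
top_two = top_salaries(sample, 2)
top_three = top_salaries(sample, 3)      # the tie at 90 uses up two of the slots
short_table = top_salaries([10, 20], 5)  # fewer than n rows in the table
```

Because COUNT(*) counts rows rather than distinct values, asking for three here still yields only two salary values; swapping in COUNT(DISTINCT E2.salary) is the other reading of the spec.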

Nested Sets - 3

An equivalent version can also be done with a self-join and a GROUP BY clause

  SELECT E1.salary
    FROM Employees AS E1, Employees AS E2
   WHERE E1.salary <= E2.salary
   GROUP BY E1.salary   -- boundary value
  HAVING COUNT(DISTINCT E2.salary) <= :n;

The join has to use <= so that the highest salary, which has nothing above it, still finds itself in E2 and survives the join.
The same possible versions of the query exist here

Relational Division - 1

Relational division is easier to explain with an example. We have a table of pilots and the planes they can fly (dividend); we have a table of planes in the hangar (divisor); we want the names of the pilots who can fly every plane in the hangar (quotient).

  CREATE TABLE PilotSkills
  (pilot CHAR(15) NOT NULL,
   plane CHAR(15) NOT NULL);

  CREATE TABLE Hanger
  (plane CHAR(15));

Relational Division -2

The standard solution is to find the pilots for whom there does not exist a plane in the hangar for which they have no skills.

  SELECT DISTINCT pilot
    FROM PilotSkills AS PS1
   WHERE NOT EXISTS
         (SELECT *
            FROM Hanger
           WHERE NOT EXISTS
                 (SELECT *
                    FROM PilotSkills AS PS2
                   WHERE (PS1.pilot = PS2.pilot)
                     AND (PS2.plane = Hanger.plane)));

Relational Division - 3

Imagine that each pilot gets a set of stickers that he pastes to each plane in the hangar he can fly. If the number of planes in the hangar is the same as the number of stickers he used, then he can fly all the planes in the hangar.

  SELECT PS1.pilot
    FROM PilotSkills AS PS1, Hanger AS H1
   WHERE PS1.plane = H1.plane
   GROUP BY PS1.pilot
  HAVING COUNT(PS1.plane) = (SELECT COUNT(*) FROM Hanger);
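Both forms of the division, the nested NOT EXISTS and the sticker count, can be compared side by side with the sketch below, which uses an in-memory SQLite database. The pilot and plane names are invented sample data, and the primary keys are an added assumption: the counting version relies on neither table holding duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PilotSkills (pilot CHAR(15) NOT NULL,
                              plane CHAR(15) NOT NULL,
                              PRIMARY KEY (pilot, plane));
    CREATE TABLE Hanger (plane CHAR(15) NOT NULL PRIMARY KEY);
    INSERT INTO PilotSkills VALUES
        ('Ann', 'Cub'), ('Ann', 'Jet'), ('Ann', 'Glider'),
        ('Bob', 'Cub'),
        ('Cy', 'Cub'), ('Cy', 'Jet');
    INSERT INTO Hanger VALUES ('Cub'), ('Jet');
    """
)

# "There is no plane in the hangar that this pilot cannot fly."
NOT_EXISTS_SQL = """
SELECT DISTINCT PS1.pilot
  FROM PilotSkills AS PS1
 WHERE NOT EXISTS
       (SELECT *
          FROM Hanger
         WHERE NOT EXISTS
               (SELECT *
                  FROM PilotSkills AS PS2
                 WHERE PS1.pilot = PS2.pilot
                   AND PS2.plane = Hanger.plane))
 ORDER BY PS1.pilot
"""

# "The pilot used as many stickers as there are planes in the hangar."
COUNT_SQL = """
SELECT PS1.pilot
  FROM PilotSkills AS PS1, Hanger AS H1
 WHERE PS1.plane = H1.plane
 GROUP BY PS1.pilot
HAVING COUNT(PS1.plane) = (SELECT COUNT(*) FROM Hanger)
 ORDER BY PS1.pilot
"""

by_not_exists = [row[0] for row in conn.execute(NOT_EXISTS_SQL)]
by_count = [row[0] for row in conn.execute(COUNT_SQL)]
```

Ann's extra Glider skill does not disturb the count, because the join to Hanger discards skills for planes that are not in the hangar.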

Relational Division - 4

The SQL-92 set difference operator, EXCEPT, can be used to write a version of relational division.

  SELECT DISTINCT pilot
    FROM PilotSkills AS P1
   WHERE NOT EXISTS
         (SELECT plane FROM Hanger
          EXCEPT
          SELECT plane
            FROM PilotSkills AS P2
           WHERE P1.pilot = P2.pilot);

Trees in SQL

Trees are graph structures used to represent
  Hierarchies
  Parts explosions
  Organizational charts

Three methods in SQL
  Adjacency list model
  Nested set model
  Transitive closure list

Tree as Graph

[Figure: the tree drawn as a graph. Root is at the top with children A0 and B0; A0 has children A1 and A2.]

Tree as Nested Sets

[Figure: the same tree drawn as nested sets. The Root set contains A0 and B0; the A0 set contains A1 and A2.]

Graph as Table
node   parent
====   ======
Root   NULL
A0     Root
A1     A0
A2     A0
B0     Root

Graph with Traversal

Root left = 1 right =10

A0 left = 2 right = 7

B0 left = 8 right = 9

A1 left = 3 right = 4

A2 left = 5 right = 6

Nested Sets with Numbers

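[Figure: the nested-sets diagram with each set's boundaries labeled by its lft and rgt numbers: Root 1-10, A0 2-7, A1 3-4, A2 5-6, B0 8-9.]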

Nested Sets as Numbers


Node   lft   rgt
====   ===   ===
Root     1    10
A0       2     7
A1       3     4
A2       5     6
B0       8     9

Problems with Adjacency list -1


Not normalized - change A0 and see
You have to use cursors or self-joins to traverse the tree
Cursors are not a table -- their order has meaning -- Closure violation!
Cursors take MUCH longer than queries
Ten-level self-joins are worse than cursors

Problems with Adjacency list -2


Often mix structure (organizational chart, edges) with elements (personnel, nodes)
These are different kinds of things and should be in separate tables
Another advantage of separating them is that you can have multiple hierarchies on one set of nodes

Example of Self-Join

Find great-grandchildren of X

  SELECT T1.node, T2.node, T3.node, T4.node
    FROM Tree AS T1, Tree AS T2, Tree AS T3, Tree AS T4
   WHERE T1.node = 'X'
     AND T1.node = T2.parent
     AND T2.node = T3.parent
     AND T3.node = T4.parent;

Find Superiors of X

Adjacency list model: traversal up the tree via a procedure or an N-way self-join
Nested set model: one query

  SELECT Supers.*
    FROM Tree AS T1, Tree AS Supers
   WHERE T1.node = 'X'
     AND T1.lft BETWEEN Supers.lft AND Supers.rgt;

Find Subordinates of X

Adjacency list model: traversal down the tree via cursors or an N-way self-join
Nested set model: one query

  SELECT Subordinates.*
    FROM Tree AS T1, Tree AS Subordinates
   WHERE T1.node = 'X'
     AND Subordinates.lft BETWEEN T1.lft AND T1.rgt;
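Here is a short sketch that loads the five-node tree and its lft/rgt numbers from the earlier slides into an in-memory SQLite database and runs the superiors and subordinates queries. The helper names superiors and subordinates are hypothetical wrappers, and only the node column is selected to keep the results easy to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Tree (node CHAR(5) NOT NULL PRIMARY KEY,
                       lft INTEGER NOT NULL,
                       rgt INTEGER NOT NULL);
    INSERT INTO Tree VALUES
        ('Root', 1, 10), ('A0', 2, 7), ('A1', 3, 4), ('A2', 5, 6), ('B0', 8, 9);
    """
)


def superiors(node):
    """Every node whose (lft, rgt) range contains the given node's lft."""
    sql = """
        SELECT Supers.node
          FROM Tree AS T1, Tree AS Supers
         WHERE T1.node = ?
           AND T1.lft BETWEEN Supers.lft AND Supers.rgt
         ORDER BY Supers.lft
    """
    return [row[0] for row in conn.execute(sql, (node,))]


def subordinates(node):
    """Every node whose lft falls inside the given node's (lft, rgt) range."""
    sql = """
        SELECT Subordinates.node
          FROM Tree AS T1, Tree AS Subordinates
         WHERE T1.node = ?
           AND Subordinates.lft BETWEEN T1.lft AND T1.rgt
         ORDER BY Subordinates.lft
    """
    return [row[0] for row in conn.execute(sql, (node,))]


path_to_a1 = superiors("A1")
under_a0 = subordinates("A0")
```

Note that BETWEEN is inclusive, so each query also returns the starting node itself; ordering by lft lists the path from the root downward.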

Totals by Level in Tree


In the adjacency list model you put traversal results in a temp table, then group and total
In the nested set model it is one query

  SELECT T1.node, SUM(C1.cost)
    FROM Tree AS T1, Tree AS T2, Costs AS C1
   WHERE C1.node = T2.node
     AND T2.lft BETWEEN T1.lft AND T1.rgt
   GROUP BY T1.node;
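The subtree totals can be sketched the same way. The slides do not show the Costs table, so the version below (one cost per node, with invented figures) is an assumption made only to have something to sum; the Tree rows are the ones from the nested-set slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Tree (node CHAR(5) NOT NULL PRIMARY KEY,
                       lft INTEGER NOT NULL,
                       rgt INTEGER NOT NULL);
    INSERT INTO Tree VALUES
        ('Root', 1, 10), ('A0', 2, 7), ('A1', 3, 4), ('A2', 5, 6), ('B0', 8, 9);

    -- Hypothetical Costs table: one cost figure per node.
    CREATE TABLE Costs (node CHAR(5) NOT NULL PRIMARY KEY,
                        cost INTEGER NOT NULL);
    INSERT INTO Costs VALUES
        ('Root', 100), ('A0', 1), ('A1', 10), ('A2', 20), ('B0', 5);
    """
)

# T1 is the head of each subtree, T2 ranges over the nodes inside it,
# and C1 supplies the cost of each of those nodes.
TOTALS_SQL = """
SELECT T1.node, SUM(C1.cost)
  FROM Tree AS T1, Tree AS T2, Costs AS C1
 WHERE C1.node = T2.node
   AND T2.lft BETWEEN T1.lft AND T1.rgt
 GROUP BY T1.node
"""

subtree_totals = dict(conn.execute(TOTALS_SQL))
```

Each node's total includes its own cost plus everything nested inside it, so the Root row is the grand total for the whole tree.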

The Median

The median is a statistic that measures central tendency in a set of values
The median is the value such that there are as many cases below the median value as there are above it
If the number of elements is odd, no problem
If the number of elements is even, then average the middle values

Procedural Way

Sort the values
Count the size of the set (n)
If n is odd then
  read (n/2) records
  Print the next record
If n is even then
  read (n/2) and (n/2)+1 records
  Average them and print results

Think in Sets

Do not ask for values, but for a set of values
The median is the average of the subset of values which sit in the middle
A middle implies something on either side of it
The subset of greater values has the same cardinality as the subset of lesser values

  lesser | median | greater

Median by Partition -1

Now the question is how to define a median in terms of the partitions.
Clearly, the definition of a median means that if (lesser = greater) then the value in the middle is the median.
Let's use Chris Date's Parts table and find the median weight of the Parts.

Median by Partition -2

If there are more greater values than half the size of the table, then weight cannot be a median.
If there are more lesser values than half the size of the table, then the middle value(s) cannot be a median.
If (lesser + equal) = greater, then the middle value(s) is a left hand median.
If (greater + equal) = lesser, then the middle value(s) is a right hand median.
If the middle value(s) is the median, then both lesser and greater have to have tallies less than half the size of the table.

First Attempt

  SELECT AVG(DISTINCT weight)
    FROM (SELECT P1.pno, P1.weight,
                 SUM(CASE WHEN P2.weight < P1.weight THEN 1 ELSE 0 END),
                 SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END),
                 SUM(CASE WHEN P2.weight > P1.weight THEN 1 ELSE 0 END)
            FROM Parts AS P1, Parts AS P2
           GROUP BY P1.pno, P1.weight)
         AS Partitions (pno, weight, lesser, equal, greater)
   WHERE lesser = greater
      OR (lesser <= (SELECT COUNT(*) FROM Parts)/2.0
          AND greater <= (SELECT COUNT(*) FROM Parts)/2.0);

Weighted vs. Unweighted


You can use either AVG(DISTINCT i) or AVG(i) in the SELECT clause. The AVG(DISTINCT i) will return the usual median when there are two values. This happens when you have an even number of rows and a partition in the middle, such as (1,2,2, 3, 3, 3) which has (2, 3) in the middle, which gives us 2.5 for the median. The AVG(i) will return the weighted median instead. The table with (1,2,2, 3, 3, 3) would return (2,2, 3, 3, 3) in the middle, which gives us 2.6 for the weighted median. The weighted median is a more accurate description of the data.

Optimize-1

The WHERE clause needs algebra
  it deals only with aggregate functions and scalar subqueries
  move it into a HAVING clause
Moving things from the WHERE clause into the HAVING clause in a grouped query is important for performance, but it is not always possible.

Optimize -2

  lesser <= (SELECT COUNT(*) FROM Parts)/2.0

We can replace the scalar subquery with

  lesser <= (lesser + equal + greater)/2.0

Algebra is good

Optimize -5

But this is the same as

  2.0 * lesser <= lesser + equal + greater
  2.0 * lesser - lesser <= equal + greater
  lesser <= equal + greater

Final Query (?)


  SELECT AVG(DISTINCT weight)
    FROM (SELECT P1.pno, P1.weight,
                 SUM(CASE WHEN P2.weight < P1.weight THEN 1 ELSE 0 END),
                 SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END),
                 SUM(CASE WHEN P2.weight > P1.weight THEN 1 ELSE 0 END)
            FROM Parts AS P1, Parts AS P2
           GROUP BY P1.pno, P1.weight)
         AS Partitions (pno, weight, lesser, equal, greater)
   WHERE lesser = greater
      OR (lesser <= equal + greater
          AND greater <= equal + lesser);

Keep Working!

  WHERE lesser = greater
     OR (equal >= lesser - greater
         AND equal >= greater - lesser)

But this is the same as:

  WHERE lesser = greater
     OR equal >= ABS(lesser - greater)

But if the first condition was true (lesser = greater), the second must necessarily also be true (i.e. equal >= 0), so the first clause is redundant and can be eliminated completely.

  WHERE equal >= ABS(lesser - greater)

Keep Working -2

Instead of a WHERE clause operating on the columns of the derived table, why not perform the same test as a HAVING clause on the inner query which derives Partitions?

  SELECT AVG(DISTINCT weight)
    FROM (SELECT P1.weight
            FROM Parts AS P1, Parts AS P2
           GROUP BY P1.pno, P1.weight
          HAVING SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END)
                 >= ABS(SUM(CASE WHEN P2.weight < P1.weight THEN 1
                                 WHEN P2.weight > P1.weight THEN -1
                                 ELSE 0 END)))
         AS Partitions;
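To see the HAVING version at work, the sketch below runs it in an in-memory SQLite database on the six values (1, 2, 2, 3, 3, 3) from the weighted versus unweighted slide, returning both AVG(DISTINCT weight) and AVG(weight) from the same middle subset. The medians helper and the generated pno keys are assumptions for illustration, not Date's actual Parts data.

```python
import sqlite3

MEDIAN_SQL = """
SELECT AVG(DISTINCT weight), AVG(weight)
  FROM (SELECT P1.weight AS weight
          FROM Parts AS P1, Parts AS P2
         GROUP BY P1.pno, P1.weight
        HAVING SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END)
               >= ABS(SUM(CASE WHEN P2.weight < P1.weight THEN 1
                               WHEN P2.weight > P1.weight THEN -1
                               ELSE 0 END))) AS Partitions
"""


def medians(weights):
    """Return (unweighted median, weighted median) for a list of weights."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Parts (pno INTEGER NOT NULL PRIMARY KEY, weight INTEGER NOT NULL)"
    )
    # pno is just a generated row key: 1, 2, 3, ...
    conn.executemany(
        "INSERT INTO Parts (pno, weight) VALUES (?, ?)",
        list(enumerate(weights, start=1)),
    )
    return conn.execute(MEDIAN_SQL).fetchone()


unweighted, weighted = medians([1, 2, 2, 3, 3, 3])
```

For this input the rows that pass the HAVING test are (2, 2, 3, 3, 3), so the two averages come out as 2.5 and 2.6, matching the figures quoted on the slide.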

Characteristic functions -1

A characteristic function returns a one or a zero depending on whether a predicate is TRUE or FALSE
We can write it with a CASE expression in SQL-92
You can use it inside aggregate functions to get descriptions of subsets
It gives you set properties

Characteristic functions -2

Example: find the number of men and women in the company in each department

  SELECT department,
         SUM(CASE WHEN sex = 'm' THEN 1 ELSE 0 END) AS men,
         SUM(CASE WHEN sex = 'f' THEN 1 ELSE 0 END) AS women
    FROM Personnel
   GROUP BY department;
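A minimal runnable version of the head count, using an in-memory SQLite database, is sketched below. The Personnel columns other than department and sex, and all of the sample rows, are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Personnel (emp_name TEXT NOT NULL PRIMARY KEY,
                            department TEXT NOT NULL,
                            sex CHAR(1) NOT NULL CHECK (sex IN ('m', 'f')));
    INSERT INTO Personnel VALUES
        ('Alice', 'Accounting', 'f'),
        ('Bea',   'Accounting', 'f'),
        ('Carl',  'Accounting', 'm'),
        ('Dan',   'Shipping',   'm'),
        ('Ed',    'Shipping',   'm');
    """
)

# Each CASE is a characteristic function: 1 when the predicate holds, else 0.
# Summing it counts the members of that subset within each group.
HEADCOUNT_SQL = """
SELECT department,
       SUM(CASE WHEN sex = 'm' THEN 1 ELSE 0 END) AS men,
       SUM(CASE WHEN sex = 'f' THEN 1 ELSE 0 END) AS women
  FROM Personnel
 GROUP BY department
 ORDER BY department
"""

headcount = conn.execute(HEADCOUNT_SQL).fetchall()
```

Because the ELSE branch returns 0 rather than NULL, a department with no women still reports a count of zero instead of dropping out or showing NULL.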

CASE expressions -1

Use in place of procedural code
Example: raise the price of cheap books by 10%, and reduce expensive books by 15%; use $25 as the break point

First attempt:

  BEGIN
  UPDATE Books SET price = price * 1.10 WHERE price <= $25.00;
  UPDATE Books SET price = price * 0.85 WHERE price > $25.00;
  END;

Look at what happens to a book priced $25.00: the first UPDATE raises it to $27.50, and then the second UPDATE cuts that new price by 15%.

CASE expressions -2

Second attempt: use a cursor

Third attempt: procedural code

  BEGIN
  IF (SELECT price FROM Books WHERE isbn = :my_book) <= $25.00
  THEN UPDATE Books SET price = price * 1.10 WHERE isbn = :my_book;
  ELSE UPDATE Books SET price = price * 0.85 WHERE isbn = :my_book;
  END IF;
  END;

CASE expressions -3

Use the CASE expression inside the UPDATE statement

  UPDATE Books
     SET price = CASE WHEN price <= $25.00 THEN price * 1.10
                      WHEN price > $25.00 THEN price * 0.85
                      ELSE price END;

The ELSE clause says "leave it alone" as a safety precaution
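The difference between the two-statement first attempt and the single CASE update shows up clearly on a book priced exactly at the break point. The sketch below uses an in-memory SQLite database; the isbn values and prices are placeholders, the price column is a floating-point REAL for simplicity, and the literals are written as 25.00 because SQLite does not accept the $ prefix.

```python
import sqlite3


def make_books():
    """Fresh Books table with one book below, at, and above the break point."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Books (isbn TEXT NOT NULL PRIMARY KEY, price REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO Books VALUES (?, ?)",
        [("cheap", 20.0), ("edge", 25.0), ("dear", 40.0)],
    )
    return conn


def prices(conn):
    return dict(conn.execute("SELECT isbn, price FROM Books"))


# First attempt: two statements, so the second one sees the first one's results.
two_step = make_books()
two_step.execute("UPDATE Books SET price = price * 1.10 WHERE price <= 25.00")
two_step.execute("UPDATE Books SET price = price * 0.85 WHERE price > 25.00")
two_step_prices = prices(two_step)

# CASE version: every row is classified once, against its original price.
one_step = make_books()
one_step.execute(
    """
    UPDATE Books
       SET price = CASE WHEN price <= 25.00 THEN price * 1.10
                        WHEN price > 25.00 THEN price * 0.85
                        ELSE price END
    """
)
one_step_prices = prices(one_step)
```

In the two-step run the book that started at 25.00 is raised and then discounted, while in the CASE run it is only raised; the other two books end up the same either way.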

CASE expressions -4

Use the CASE expression to replace procedural code whenever possible
  usually faster
  more portable
Problem: Given this table, find the smallest missing integer from the sequence

  CREATE TABLE Sequence
  (seq INTEGER NOT NULL PRIMARY KEY
       CHECK (seq > 0));

Contiguous Regions - 1

You have a table with two columns that show the start and ending values of a duration or other sequence of numbers.

  CREATE TABLE Tickets
  (buyer CHAR(5) NOT NULL,
   start_nbr INTEGER NOT NULL,
   end_nbr INTEGER NOT NULL,
   CHECK (start_nbr <= end_nbr),
   PRIMARY KEY (start_nbr, end_nbr));

  INSERT INTO Tickets
  VALUES ('John', 1, 12), ('John', 13, 20), ('John', 22, 30);

Contiguous Regions - 2

We want to merge the contiguous regions into one row

  ('John', 1, 12)
  ('John', 13, 20)
  ('John', 22, 30)

Becomes:

  ('John', 1, 20)   -- merged row
  ('John', 22, 30)

The usual approach is to use a cursor sorted by buyer and start_nbr, loop through the data, and merge two rows when the start_nbr of the current row is equal to the end_nbr of the previous row plus one.

Contiguous Regions - 3

Pick a start and an end point, then see if the sum of the segments inside that range is equal to the total length of the range. This will give you all the ranges, not just the biggest one.

  SELECT T1.buyer, MIN(T1.start_nbr), MAX(T2.end_nbr), COUNT(T3.start_nbr)
    FROM Tickets AS T1, Tickets AS T2, Tickets AS T3
   WHERE T1.buyer = T2.buyer
     AND T1.buyer = T3.buyer
     AND T2.start_nbr BETWEEN T1.start_nbr AND T3.start_nbr
     AND T2.end_nbr BETWEEN T1.end_nbr AND T3.end_nbr
   GROUP BY T1.buyer
  HAVING SUM(T3.end_nbr - T3.start_nbr + 1)
         = MAX(T3.end_nbr) - MIN(T1.start_nbr) + 1;

Contiguous Regions - 4

You have two choices at this point
  Use the count of the segments, COUNT(T3.start_nbr), contained in the range to find the longest ranges.
  Insert the new rows into the original table, and execute a DELETE FROM statement:

  DELETE FROM Tickets
   WHERE NOT EXISTS
         (SELECT *
            FROM Tickets AS T1
           WHERE T1.buyer = Tickets.buyer
             AND (Tickets.start_nbr BETWEEN T1.start_nbr + 1 AND T1.end_nbr
                  OR Tickets.start_nbr BETWEEN T1.start_nbr AND T1.end_nbr - 1));

Questions & Answers
