Showing posts with label SQL. Show all posts

Wednesday, September 25, 2013

Design by committee

Design by committee is usually a term of abuse, but sometimes it's perhaps not the worst alternative. At the opposite end of the spectrum, there is design by disconnected individuals. That is how you get
ALTER TABLE tbl OWNER TO something
but
ALTER TABLE tbl SET SCHEMA something
in PostgreSQL.

Maybe a committee faced with this inconsistency would arrive at the compromise

ALTER TABLE tbl [SET] {OWNER|SCHEMA} [TO] something
?

Wednesday, January 12, 2011

Perception vs. Reality

So this is interesting: in the Stack Overflow Annual User Survey, 62.2% of respondents claim they are "proficient" in SQL. This tops all the languages listed. That is perhaps not surprising, but at the same time I had — subjectively — noticed an abundance of, let's say, clueless questions and suboptimal answers on SQL and RDBMS topics in the Stack Exchange network. Quite clearly, SQL differs from algorithmic programming languages in that there is a gap between being familiar with the language and really understanding its effects.

Friday, May 28, 2010

System-Versioned Tables

After my report on the upcoming SQL:2011, some people asked me about the system-versioned table feature, which is arguably the only major new feature there. Here is how it works:
CREATE TABLE tab (
    useful_data int,
    more_data varchar,
    start timestamp GENERATED ALWAYS AS SYSTEM VERSION START,
    end timestamp GENERATED ALWAYS AS SYSTEM VERSION END
) WITH SYSTEM VERSIONING;
(This hilariously verbose syntax arises because it is defined to fit into the more general generated columns feature, e.g., GENERATED ALWAYS AS IDENTITY, similar to PostgreSQL's serial type.)
INSERT INTO tab (useful_data, more_data) VALUES (...);
This sets the "start" column to the current transaction timestamp, and the "end" column to the highest possible timestamp value.
UPDATE tab SET useful_data = something WHERE more_data = whatever;
For each row that would normally be updated, set the "end" timestamp to the current transaction timestamp, and insert a new row with the "start" timestamp set to the current transaction timestamp. DELETE works analogously.
SELECT * FROM tab;
This only shows rows where current_timestamp is between "start" and "end". To show the non-current data, the following options are possible:
SELECT * FROM tab AS OF SYSTEM TIME timestamp;
SELECT * FROM tab VERSIONS BEFORE SYSTEM TIME timestamp;
SELECT * FROM tab VERSIONS AFTER SYSTEM TIME timestamp;
SELECT * FROM tab VERSIONS BETWEEN SYSTEM TIME timestamp AND timestamp;
There's also the option of
CREATE TABLE tab ( ... ) WITH SYSTEM VERSIONING KEEP VERSIONS FOR interval;
to automatically delete old versions.

That's more or less it. It's pretty much xmin/xmax/vacuum on a higher level with timestamps instead of numbers. And it's a revival of the old time travel feature. Obviously, you can do most or all of this with triggers already.
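For illustration, here is a rough sketch of the trigger approach in today's PostgreSQL. The history table and all names are made up, and a real implementation would need to handle more details, such as recording the start timestamp of each row version:

```sql
-- Hypothetical sketch: keep outgoing row versions in a separate history table.
CREATE TABLE tab (
    useful_data int,
    more_data varchar
);

CREATE TABLE tab_history (
    useful_data int,
    more_data varchar,
    end_ts timestamp
);

CREATE FUNCTION tab_save_version() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Copy the outgoing row version, stamped with the transaction timestamp.
    INSERT INTO tab_history
        VALUES (OLD.useful_data, OLD.more_data, current_timestamp);
    RETURN OLD;
END;
$$;

CREATE TRIGGER tab_versioning
    AFTER UPDATE OR DELETE ON tab
    FOR EACH ROW EXECUTE PROCEDURE tab_save_version();
```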

Tuesday, May 11, 2010

MERGE Syntax

The SQL MERGE statement has gotten my attention again. For many years, PostgreSQL users have been longing for a way to do an "upsert" operation, meaning do an UPDATE, and if no record was found do an INSERT (or the other way around). Especially MySQL users are familiar with the REPLACE statement and the INSERT ... ON DUPLICATE KEY UPDATE statement, which are two variant ways to attempt to solve that problem (that have interesting issues of their own). Of course, you can achieve this in PostgreSQL with some programming, but the solutions tend to be specific to the situation, and they tend to be lengthier than one would want.
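For reference, the classic hand-rolled solution looks roughly like this sketch (along the lines of the example in the PostgreSQL documentation; it assumes a table balances(name text, balance numeric) with a unique constraint on name):

```sql
-- Hypothetical upsert function: update the balance if the row exists,
-- otherwise insert it, retrying on concurrent insertion.
CREATE FUNCTION upsert_balance(n text, amount numeric) RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    LOOP
        UPDATE balances SET balance = balance + amount WHERE name = n;
        IF found THEN
            RETURN;
        END IF;
        -- Row not there: try to insert; if someone else inserted it
        -- concurrently, loop around and retry the UPDATE.
        BEGIN
            INSERT INTO balances (name, balance) VALUES (n, amount);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing, retry
        END;
    END LOOP;
END;
$$;
```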

Discussions of this usually proceed to speculate that the SQL-standard MERGE statement ought to be the proper way to solve it, and then it turns out that no one completely understands the MERGE syntax or semantics, especially as they apply to the upsert problem. (I was in that group.) And that has been the end of it so far. OK, as I write this I am pointed, via Robert Haas's blog post, to an older mailing list post by Simon Riggs (surely one of the people most qualified to drive an eventual implementation) that contains a hint toward the solution, though it's hard to find in that post if you want to try.

This subject had gotten my attention again at the SQL standard working group meeting I attended a few weeks ago, where I learned that in SQL:2011, a DELETE branch has been added to MERGE. We also took some time after the official part of the meeting to work through some examples that illustrate the uses of the MERGE statement.

Let's take a look at what the MERGE statement is originally supposed to do, and where the term "merge" arose from. Let's say you have a table with outstanding balances, such as
CREATE TABLE balances (
    name text,
    balance numeric
);
and at intervals you get a list of payments that your organization has received, such as
CREATE TABLE payments (
    name text,
    payment numeric
);
What you want to do then is to "merge" the payments table into the balances table in the following way:
  • If a balance exists, subtract from it.
  • If the balance goes to zero, delete the entire record.
  • If no balance exists, create a record (maybe someone pre-paid).
The command to do this would be:
MERGE INTO balances AS b
    USING payments AS p
    ON p.name = b.name
    WHEN MATCHED AND b.balance - p.payment = 0 THEN DELETE
    WHEN MATCHED AND b.balance - p.payment <> 0 THEN UPDATE SET balance = balance - p.payment
    WHEN NOT MATCHED THEN INSERT (name, balance) VALUES (p.name, -p.payment);
Of course there are simpler cases, but this shows all of the interesting features of this command.

How does this get us upsert? There, you don't have two tables, but only one table and some values. I have seen claims and examples about this in the wild that turn out to be wrong because they evidently violate the syntax rules of the SQL standard. So I did the only sensible thing and implemented the MERGE syntax in the PostgreSQL parser on the flight back, because that seemed to be the best way to verify the syntax. So the correct way, I believe, to do, say, an upsert of the balances table would be:

MERGE INTO balances AS b
    USING (VALUES ('foo', 10.00), ('bar', 20.00)) AS p (name, payment)
    ON p.name = b.name
    WHEN MATCHED AND b.balance - p.payment = 0 THEN DELETE
    WHEN MATCHED AND b.balance - p.payment <> 0 THEN UPDATE SET balance = balance - p.payment
    WHEN NOT MATCHED THEN INSERT (name, balance) VALUES (p.name, -p.payment);
Not all that nice and compact, but that's how it works.

Note that the AS clause after VALUES is required. If you leave it off, the PostgreSQL parser complains that a subquery in FROM needs an AS clause. That is obviously not what this is, but it uses the same grammar rules, and the complaint makes sense here because you need a correlation name to join against. It was also one of those rare moments when something you implemented gives you correct feedback that you didn't even explicitly provide for.

Anyway, the examples above all parse correctly, but they don't do anything yet. But if someone wants to implement this further or just try out the syntax, I'll send my code.

Monday, April 19, 2010

News from the SQL Standard

Last week, I attended the meeting of DIN NA 043-01-32 AA, which is the German "mirror group" of ISO/IEC JTC1 SC32, whereof WG3 produces ISO/IEC 9075, which is titled "Database languages - SQL".  Once you dig through all these numbers and letters and find out who is responsible for what, it's actually quite simple to get involved there and review or contribute things.

For the benefit of everyone who is interested but hasn't had the chance to get involved in that process, here is what is currently going on:
  • A new standard is currently in the "Final Committee Draft" (FCD) phase, which basically means "beta".  The final release is expected in 2011, so you will begin to see mentions of "SQL:2011".
  • The new standard will only contain parts 1/Framework, 2/Foundation, 4/PSM, 11/Schemata, 14/XML. The other parts are currently not being developed, which doesn't mean they are dead or withdrawn, but that no one bothers to add things to them at the moment.
  • All new features in SQL:2011 will be in the form of optional features.  So the level of core conformance is not impacted by the release of a new standard.
There isn't actually that much new in SQL:2011, besides countless fixes and clarifications.  I counted only a dozen or so new features.  Here are some things that might be of interest to the PostgreSQL community:
  • The syntax ALTER TABLE ... ALTER COLUMN ... SET/DROP NOT NULL has been taken into the standard.  PostgreSQL has supported that since version 7.3.  Coincidence?  Not sure.
  • Constraints can optionally be set to NOT ENFORCED.  That means the database system won't enforce them but still assumes they are valid, for example for optimization.
  • System-versioned tables: Perhaps the largest new feature, this is a way to make data (rows) visible only during certain times. Some pundits might recall a moral predecessor of this feature labeled SQL/Temporal.  I haven't fully analyzed this feature yet, so I'll post later about the details.
  • Combined data change and retrieval. PostgreSQL does something like this with RETURNING, but this feature is more elaborate and allows the writing of "delta tables".
  • Named arguments in function calls. PostgreSQL 9.0 supports that, but using the syntax foo(3 AS a) instead of what ended up in the standard, foo(a => 3).
  • Default values for function arguments. PostgreSQL 9.0 supports that as well.
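To illustrate the last two items, here is a sketch of the call syntaxes (the function foo is made up; the PostgreSQL spelling is the one described above):

```sql
CREATE FUNCTION foo(a int DEFAULT 0) RETURNS int
    LANGUAGE SQL AS $$ SELECT $1 + 1 $$;

SELECT foo(3 AS a);  -- named argument, PostgreSQL 9.0 spelling
SELECT foo(a => 3);  -- named argument, SQL standard spelling
SELECT foo();        -- argument omitted, the default 0 is used
```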
This time I attended the meeting as a guest. We are discussing some procedural and financial issues to make this an official membership.  If anyone else is interested in getting involved in the SQL standard development, let me know and I can point you in the right direction.  If we have enough interest, we can set up a discussion group within the PostgreSQL project.

Sunday, January 3, 2010

Missing Features for PostgreSQL SQL Conformance

A thought to start the new year: Perhaps it's time for the final push to complete the core SQL conformance for PostgreSQL.

Where do we stand? The PostgreSQL documentation lists in its appendix the currently supported and unsupported SQL features. As explained there, a certain subset of these features represents the "Core" features, which every conforming SQL implementation must supply, while the rest is purely optional. The unsupported features page currently lists 14 remaining Core features and subfeatures that are missing from PostgreSQL. Two of those are about client-side module support that is actually not mandatory if the implementation provides an embedded language (e.g., ECPG), so there are 12 items left.

So that's not so bad. Here's a list of the missing features:

E081-09 USAGE privilege

This would mean adding a USAGE privilege to domains.

Maybe this isn't very useful, although perhaps those working on SELinux support might have a more qualified opinion on it.  But let's say if we get all the other things done and this is left, this would be a fairly straightforward and well-defined feature to add.

(This would then complete feature E081 Basic Privileges.)
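Syntactically, it would presumably mirror the USAGE privilege that already exists for other object classes. A hypothetical sketch (the domain and role names are made up):

```sql
CREATE DOMAIN order_qty AS int CHECK (VALUE > 0);

-- Hypothetical once the feature exists: controls which roles
-- may use the domain, e.g., in their own column definitions.
GRANT USAGE ON DOMAIN order_qty TO some_role;
REVOKE USAGE ON DOMAIN order_qty FROM PUBLIC;
```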

E153 Updatable queries with subqueries

This presupposes updatable views and requires views to be updatable even if their WHERE clause contains a subquery.

This is probably the big one. In the current PostgreSQL architecture, updatable views are apparently quite difficult to implement correctly. The mailing list archives contain plenty of details.

F311-04 CREATE VIEW: WITH CHECK OPTION

This also presupposes updatable views and requires the CHECK OPTION feature. See above.

(This would then complete feature F311 Schema definition statement.)

F812 Basic flagging

This feature means that there should be some implementation-specific facility that raises a notice or warning when a non-standard-conforming SQL statement or clause is used. Or, in other words, a facility that warns when a PostgreSQL extension is used.

A naive implementation might consist of just adding something like elog(WARNING, "not SQL standard") in about five hundred places, but the trick would be to implement it in a way that is easy to maintain in the future. The mailing list archives also contain some discussions about this; search for the keyword "SQL flagger".

S011 Distinct data types

This is a way to define user-defined types based on existing types, like
CREATE TYPE new AS old;
Unlike domains, this way the new type does not inherit any of the functions and operators from the old type. This might sound useless at first, but it can actually create better type safety. For example, you could create a type like
CREATE TYPE order_number AS int;
while preventing anyone from, say, multiplying order numbers.

The implementation effort would probably be similar to that for domains or enums. Also, search the mailing list archives for "distinct types".

(This includes feature S011-01 USER_DEFINED_TYPES view.)

T321 Basic SQL-invoked routines

There are a number of bits missing from fully SQL-compatible SQL function definitions, besides the specific subfeatures mentioned below.
  • Instead of a routine body like AS $$ ... $$, allow one unquoted SQL statement as routine body (see example below under RETURN).
  • LANGUAGE SQL is the default.
  • SPECIFIC xyz clause, allowing the assignment of an explicit "specific routine name" that can be used to refer to the function even when overloaded. Probably not terribly useful for PostgreSQL.
  • DETERMINISTIC / NOT DETERMINISTIC clause. DETERMINISTIC means the same as IMMUTABLE in PostgreSQL; NOT DETERMINISTIC is then STABLE or VOLATILE.
  • CONTAINS SQL / READS SQL DATA / MODIFIES SQL DATA clause. These also appear to overlap with the volatility property in PostgreSQL: MODIFIES would make the function VOLATILE, READS would make it STABLE.
Also, for DROP FUNCTION the ability to drop a function by its "specific name" is required:
DROP SPECIFIC FUNCTION specific_name;
There are probably some more details missing, so part of finishing this item would also be some research.

T321-02 User-defined stored procedures with no overloading

Add a new command CREATE PROCEDURE that does the same thing as CREATE FUNCTION ... RETURNS void, and a DROP PROCEDURE command.

T321-04 CALL statement

Add a new command CALL procname() that does the same thing as SELECT procname() but requires procname() to not return a value, meaning it has to be a procedure in the above sense.
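Putting this and the previous item together, a hypothetical sketch of what this would look like (not valid in PostgreSQL today; the log table is made up):

```sql
CREATE PROCEDURE purge_old_log()
    LANGUAGE SQL
    AS $$ DELETE FROM log WHERE log_time < current_date - 30 $$;

-- Invoke the procedure; no result value is returned.
CALL purge_old_log();

DROP PROCEDURE purge_old_log();
```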

T321-05 RETURN statement

Add a new command RETURN callable only from within SQL functions. Then, instead of writing a function like
CREATE FUNCTION name(args) RETURNS type LANGUAGE SQL
AS $$ SELECT something $$;
write
CREATE FUNCTION name(args) RETURNS type LANGUAGE SQL
RETURN something;

That's it! Plus all the stuff I missed, of course. We only have about 2 weeks left(!) until the final commit fest for the 8.5 release, so it's a bit late to tackle these issues now, but maybe for the release after that?

Wednesday, July 29, 2009

How to find all tables without primary key

Getting a list of all tables that don't have a primary key is occasionally useful when you need to update a legacy schema or repair someone else's database "design". There are a few recipes out there that peek into the PostgreSQL system catalogs, but here is a way that uses the information schema:
SELECT table_catalog, table_schema, table_name
FROM information_schema.tables
WHERE (table_catalog, table_schema, table_name) NOT IN
(SELECT table_catalog, table_schema, table_name
FROM information_schema.table_constraints
WHERE constraint_type = 'PRIMARY KEY')
AND table_schema NOT IN ('information_schema', 'pg_catalog');
The last line is useful because many predefined tables don't have primary keys.

Since you're using the standardized information schema, this query should be portable to other SQL database systems. The identical query works on MySQL, for example.

Wednesday, July 8, 2009

Schema Search Paths Considered Pain in the Butt


OK, I've had it. Who ever thought that a search path was a good idea anyway? I'm talking about the schema search path in PostgreSQL in particular, but any other ones are just as bad. If you review the last 48 hours in the PostgreSQL commit log, you can see what kind of nonsense this creates, and what amount of resources it drains (with apologies to my fellow developers who had to clean up my mistakes).

The Schema Search Path

What is the schema search path? One way to describe it is that it allows you to specify in which schemas objects can be addressed without schema qualification, and in which order those schemas are searched to resolve a name. So if you write
CREATE TABLE schema1.foo ( ... );
CREATE TABLE schema2.bar ( ... );
SET search_path TO schema1, schema2;
then you can address these tables as just
SELECT ... FROM foo, bar ...
instead of
SELECT ... FROM schema1.foo, schema2.bar ...
Another way to describe it is that it allows you to set hidden traps that mysteriously make your perfectly good SQL behave in completely unpredictable ways. So a perfectly harmless query like
SELECT * FROM pg_class WHERE relname ~ 'stuff$';
can be changed to do pretty much anything, if the user creates his own pg_class and puts it first into the search path, before the system schema pg_catalog.

The way to deal with this, to the extent that it is dealt with at all, is to either set the schema search path everywhere before accessing any object (for some value of everywhere, such as every script, every session, every function) or to explicitly schema-qualify every object reference. The problem with the former is that it negates the point of allowing the user to set the search path in the first place. What you'd need to do instead is to prepend your known search path to what was set already, which is a pain to program in SQL, which is why probably hardly anyone does it that way. The problem with the latter is that it makes schemas pointless, because if you have to qualify everything, you might as well make the schema part of the name ("pg_catalog_pg_class") and forget about the extra indirection.
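To illustrate the prepending variant, it amounts to something like this, which is already more fiddly than most scripts bother with (and this sketch still ignores quoting of odd schema names):

```sql
-- Prepend my_schema to the caller's existing search path
-- instead of clobbering it.
SELECT set_config('search_path',
                  'my_schema, ' || current_setting('search_path'),
                  false);
```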

Qualify Everything!

The way we usually deal with this in PostgreSQL core code, for example in the psql describe code, is to add pg_catalog qualifications everywhere; see this last commit for instance. In my small example above, this would become
SELECT * FROM pg_catalog.pg_class WHERE relname ~ 'stuff$';
But this, and the above commit and all of psql is still wrong, because the actually correct way to write this is
SELECT * FROM pg_catalog.pg_class WHERE relname OPERATOR(pg_catalog.~) 'stuff$';
because PostgreSQL of course allows user-defined operators, and those are subject to search path rules like anything else. Proof:
CREATE OPERATOR = ( PROCEDURE = charne, LEFTARG = "char", RIGHTARG = "char" );
SET search_path = public, pg_catalog;
\diS
This command is supposed to show all indexes, but now it will show everything that is not an index. Try this yourself; it's a great way to mess with your colleagues' heads. :-) Anyway, surely no one wants to schema-qualify all operators, which is why attempting to qualify everything is probably a hopeless proposition.

Unix Shells

Let's look for some analogies to this dreaded technology. Unix shells have had search paths for decades. I don't have a whole lot of insight into their history or what they were originally intended for, but the way I see it now, from the point of view of an operating system development contributor myself, they are just as much a pain in the butt in their full generality. Because users can set hidden traps that can make any old harmless command do anything, various programming and security pains persist. The usual answers are again: set the path explicitly in every shell script, or add an explicit path to every command. The latter is obviously rarely done, and would by the way possibly fall into the same operator trap that I mentioned above. Who knew that [ (as in if [ "$foo" = bar ]) might actually be a separate executable:
$ ls -l '/usr/bin/['
-rwxr-xr-x 1 root root 38608 2009-06-05 03:17 /usr/bin/[

Many authors of shell scripts probably know that setting an explicit path at the top of the script is probably a good idea. Or is it? On this Debian system, I have 242 shell scripts in my /usr/bin/, 181 of which don't appear to set a path. 14 out of 28 scripts in /etc/init.d/ don't set a path. And out of the 14 that do, there are 12 different variants of the path that they actually choose to set. Which makes any actually sensible use of the path impossible. One such use might be to put a locally modified version of a common program into /usr/local/bin/. Now some scripts will use it, some will not.

Linux distributions have been trying to move away from relying on this path business anyway. The Filesystem Hierarchy Standard only allows for a few directories for executable programs. Everything else is supposed to be organized by the package manager, and in practice perhaps by the occasional symlink. Solaris has a much greater mess of bin directories, such as /usr/gnu/bin and /usr/xpg4/bin, but even there I sense that they don't want to keep going in that direction.

Whither PostgreSQL?

In PostgreSQL, you might get the worst of all worlds:
  • Explicitly qualifying everything is completely unwieldy, and perhaps impossible if you can't modify the SQL directly.

  • Programmatically modifying the search path in a correct way is pretty complicated.

  • There is no standard for what goes into what schema.

  • There is talk of adding more "standard schemas", potentially breaking all those applications that managed to get one of the above points right.

  • Using the search path to override a system object by your own implementation doesn't work, at least in half of the applications that choose to do the supposedly right thing and alter the search path to their liking.

  • Security definer functions (analogous to setuid scripts) don't have a safe default for the search path. You need to make sure you set a safe path in every one of them yourself (see also CVE-2007-2138). One wonders how many of those out there are broken in this regard. At least setuid shell scripts are usually not possible.
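Since PostgreSQL 8.3, the recommended mitigation is to pin the search path in the function definition itself. A sketch (the schema and table names are made up):

```sql
CREATE FUNCTION purge_entries() RETURNS void
    LANGUAGE SQL
    SECURITY DEFINER
    -- Fix the search path so callers cannot substitute their own objects:
    SET search_path = my_app, pg_catalog
    AS $$ DELETE FROM entries WHERE entry_time < current_date $$;
```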
What might be a better solution? More modern programming languages such as Java or Python, with their "import" mechanisms where you explicitly choose what you want in which namespace, appear to have fewer problems of the sort I described. Of course there is still a CLASSPATH or a PYTHONPATH in the background, but that's actually just a workaround because the underlying operating system relies on search paths and doesn't have an import mechanism itself.

I think the straightforward solution at the moment is to ignore the whole schema issue as much as you can and don't let random people issue random SQL statements. Probably not completely safe, but at least it doesn't tangle you up in a giant mess.

If someone wants to defend the current system or has proposals for fixing it, please leave comments.

(picture by david.nikonvscanon CC-BY)