PGSQL Oracle
PGSQL Oracle
Introduction
Recently, driven by the ever more exorbitant prices of Oracle RDBMS components, many companies started looking for an alternative platform that would help them decrease the license cost of the Oracle RDBMS. Of course, as it became customary these days, the first place to look for an alternative is the open source community. Open source community maintains several open source databases, the most popular among them are MySQL and PostgreSQL. With Oracle Corp. acquisition of SUN Microsystems, Oracle also acquired MySQL. There are other versions, like MariaDB and Drizzle, but the general feeling is that MySQL is no longer a serious alternative. That leaves us with PostgreSQL. This article strives to clarify PostgreSQL strengths and weaknesses for such purpose. I will refrain from giving a verdict of the database suitability as a general Oracle replacement. It depends on the particular project and the nature of the database that is being replaced. Overall impression is that Oracle is an 18wheeler, while PostgreSQL is a half ton truck. Each is useful its own area, but one cannot realistically expect to replace one by another.
Characteristics of PostgreSQL
PostgreSQL is, in a way, very similar to Oracle. That similarity is intentional, as there are several commercial products which use that similarity for marketing purposes. The most frequent complaints of the Oracle developers, when trying out open source databases, are the problems with the CONNECT BY clause, subquery factoring and analytic functions, also known as window functions to PostgreSQL developers. PostreSQL supports subquery factoring or Common Table Expressions (CTE), known to us oraclites as WITH clause. It also supports very decent set of window functions like row_number, ntile, rank, dense_rank, lead, lag and some others. Furthermore, recursive WITH is a very good replacement for Oracle's CONNECT BY query. Incidentally, Oracle supports recursive WITH queries as of version 11.2. Below are both Oracle and PostgreSQL versions of the same recursive query:
Oracle 11.2: 1 with e(empno,mgr,lev) as ( 2 select empno,mgr,0 from emp where empno=7839 3 union all 4 select emp.empno,emp.mgr,e.lev+1 5 from emp,e 6 where emp.mgr=e.empno) 7* select * from e SQL> / EMPNO MGR LEV
7839 0 7566 7839 1 7698 7839 1 7782 7839 1 7499 7698 2 7521 7698 2 7654 7698 2 7788 7566 2 7844 7698 2 7900 7698 2 7902 7566 2 7934 7782 2 7369 7902 3 7876 7788 3 14 rows selected.
Oracle requires UNION ALL and prohibits the use of LEVEL keyword, so I replaced it by LEV. The result is essentially the same as if we did a self join with CONNECT BY prior empno= mgr. PostgreSQL can do the same thing:
PostgreSQL 9.0: mgogala=# with recursive e(empno,mgr,level) as ( mgogala(# select empno,mgr,0 from emp where empno=7839 mgogala(# union mgogala(# select emp.empno,emp.mgr,e.level+1 mgogala(# from emp,e mgogala(# where emp.mgr=e.empno) mgogala-# select * from e; empno | mgr | level -------+------+-----7839 | | 0 7566 | 7839 | 1 7698 | 7839 | 1 7782 | 7839 | 1 7499 | 7698 | 2 7521 | 7698 | 2 7654 | 7698 | 2 7788 | 7566 | 2 7844 | 7698 | 2 7900 | 7698 | 2 7902 | 7566 | 2 7934 | 7782 | 2 7369 | 7902 | 3 7876 | 7788 | 3 (14 rows) mgogala=#
PostgreSQL doesn't object to my using the word level in the query, it's not a keyword in PostgreSQL. It also doesn't require UNION ALL, like Oracle. The query is essentially the same, which simplifies porting all those proprietary CONNECT BY constructs to PostgreSQL. Similarity between the SQL syntax of Oracle and PostgreSQL is remarkable and should be of great help to Oracle developers, tasked with developing an application using PostgreSQL database.
This function computes the partition into which the row needs to be inserted, based on the month and year in the CREATED_AT field in the incoming row and executes insert using the variable NEW as a bind value. Oracle is much more rigorous when it comes to coding and doesn't easily allow tricks like this. PostgreSQL doesn't compile PL/PgSQL code, it stores it as a text literal. The code is parsed for each session that executes it and cached in the private memory of the parsing process. That means that there is no code sharing. If one process parses a PL/PgSQL function, other processes will not benefit from it because PostgreSQL doesn't have anything even remotely like Oracle shared pool. That applies to all procedural extension to PostgreSQL, not just PL/PgSQL. In other words, database functions written in Perl or Python are also stored as text literals, parsed as needed and cached on per process basis.
Partitioning
Partitioning in PostgreSQL leaves much to be desired. It is essentially the same technology used in Oracle7, with partition views. In PostgreSQL, partitioning is based on the inheritance mechanism. There is one parent table and partitions are separate tables that inherit the table structure from the parent. Rows are routed into different partitions using triggers. PostgreSQL can do partition elimination, but only during the parse phase. That means that it will not do partition pruning with an already parsed SQL and a bind variable. Also, there is no global indexes, global constraints or any other mechanism except of the very limited partition pruning. There is no global statistics, so the optimizer cannot do anything on the table level, it has to descend to partition level and optimize the SQL on per partition basis. Frequently, optimizer is not up to the task, so using partitions results in horribly inefficient queries that take a very long time to complete. Partitioning can be useful, but it requires a very careful design, which involves careful data model consideration and extremely careful query design. There is no parallel query or parallel DML, not even for the partitioned tables.
PostgreSQL optimizer
PostgreSQL has a very sophisticated cost based optimizer, somewhat different from Oracle's. In addition to the value histograms, that both databases collect, PostgreSQL also compiles the list of most frequent column values. The number of the most frequent values and the size of the histograms is determined by the default_statistics_target parameter and can range from 0 to 10000, much greater range than Oracle's. In version 9, it is possible to set the histogram size for each table column separately, using the ALTER TABLE statement. The same option is also present in Oracle 11g, which allows the DBA to set the statistics preferences on the table level. As all cost based optimizers, PostgreSQL optimizer is also sometimes counter intuitive and stubbornly refuses to use an index. To alleviate the pain, database designers have made available a myriad of set session mode options which turn on and off particular access methods. There are no hints, which is a very big problem for porting projects. If a query starts misbehaving, there is no way to just make a decision for the optimizer to use the particular index, one has to re-engineer sizable portion of the data model, just to get things running. This makes porting projects prohibitively expensive as one should expect to be forced to change the data model and essentially rewrite the application from scratch. This is particularly big problem with Java ORM software, like Hibernate, which tends to generate cumbersome and complex queries, which can be very tricky to optimize, even under the best of circumstances. Also, the lack of hints makes the application much harder to maintain. Every application contains SQL that will start causing troubles over time, because of the underlying data model changes or simply because of the growth. Oracle allows the developers to quickly plug in such holes with hints, without necessarily having to re-engineer the entire data model. PostgreSQL doesn't allow that choice. Performance problems are usually addressed by fixing the underlying data model, which can be prohibitively expensive and require prolonged downtime. Optimizer will not use indexes in case of running a parsed query with bind variables because it doesn't know anything about the values of the bind variables. The attitude of the PostgreSQL designers toward hints is extremely negative and it is unlikely that this very useful tool will be provided anytime soon. It is a curiosity, because almost all other relational databases support hints in one form or another, except the PostgreSQL. Performance of complex queries, involving many tables, can deteriorate dramatically, especially when ported from Oracle, where such queries are customary. The enmity of the PostgreSQL developers to hints is almost religious in nature. Curiously, some of the most vocal opponents of hints
work for EnterpriseDB, whose flagship product is a commercial fork of PostgreSQL which supports hints. PostgreSQL also doesn't support index organized tables or clusters. It does support bitmap indexes, but this is a murky subject which is usually avoided like a plague by the PostgreSQL experts. Until version 9, the general advice was that bitmap indexes are used at one's own risk. There is no star schema or star transformation or any other tool to help with data warehouse type tasks. PostgreSQL optimizer is geared toward OLTP type applications which are not overly complex and which do not include large tables, tens of gigabyte in size, or even larger. Of course, there are examples like Facebook and Skype, which use PostgreSQL as their backbone, but these are still exceptional things, done by top class professionals and very hard to achieve by any standard. Recently, Skype has been acquired by Microsoft and they started using SQL Server instead of PostgreSQL. The lack of parallelism and unwillingness to support hints makes PostgreSQL rather awkward when it comes to porting applications from Oracle. Also, PostgreSQL optimizer is rather poorly documented, there no books like those written by Jonathan Lewis or Tom Kyte in the PostgreSQL world. There is no method for overriding PostgreSQL optimizer decisions. PostgreSQL may be smart, but it's still not smarter than us humans. Very few optimizers can adequately deal with correlated data. PostgreSQL optimizer cannot deal with that. What does that mean, precisely? Let's say that we have a table with the information about a group of people. The information includes the date of birth and zodiac sign. There is 1/12 probability that any given person is a Capricorn. There is also 1/12 probability that any given person was born in May. The optimizer would reach the conclusion that the probability that a person born in May has Capricorn as his zodiac sign is 1/12 * 1/12 = 1/144. That is still a hefty percent that might result in the full table scan. The reality is, however, that people with Capricorn zodiac sign are born in January, not in May. Such query would return no results and PostgreSQL cannot know anything about it. Correlated and interdependent data columns are a very frequent occurrence, especially with financial data. This is far more frequent situation than it appears. And the conclusion that the optimizer can reach is always incorrect. Having no means to override optimizer decision means that we would have to remove all interdependent columns from our data model, just as the 3rd normal form would require. That is a lot of work and time, especially if the application that is being ported is modestly complex and contains foreign keys, interdependent columns, triggers, views and procedural code.
Generally speaking, Oracle I/O is much more sophisticated and noticeably faster than PostgreSQL. Here is the result from the same machine and the table that was copied from Oracle to PostgreSQL using a Perl script.
Oracle 11.2: Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 -Production With the Partitioning, OLAP, Data Mining and Real Application Testing options SQL> set timing on SQL> select count(*) from t1; COUNT(*) 3000000 Elapsed: 00:00:00.62 Postgres 9.0: Timing is on. mgogala=# select count(*) from t1; count 3000000 (1 row) Time: 856.847 ms mgogala=#
The table is extremely simple and was created just for this comparison:
mgogala=# \d t1 Table "public.t1" Column | Type | Modifiers --------+------------------------------+----------c1 | bigint | not null c2 | character varying(30) | Indexes: "t1_pk" PRIMARY KEY, btree (c1)
In other words, oracle reads 3M very simple records in 620 milliseconds while PostgreSQL spends 856 milliseconds on the same table. The difference is rather significant. Both databases are running on the same machine, so the comparison is valid. PostgreSQL performance mailing list is full of complaints about the slowness of the full table scan but the complaints are not being addressed. The usual advice is to tune the IO subsystem. The bigger the table gets, the bigger the difference in full table scan speed is. There is no difference whatsoever when reading tables like EMP and DEPT. If the table is frequently updated, PostgreSQL full table scan gets even slower. The tune IO subsystem advice is often impractical and requires significant investments in hardware. Frequently, the advice includes solid state disks, better controllers, more CPU power or something of that nature, that would make most of the CIOs very unhappy.
Miscellaneous
Every relational database has a data dictionary. Oracle exposes its own data dictionary through the series of views, which are rather simple and well documented. Postgres dictionary is much more cryptic and much less documented than Oracle's. For instance, the pg_constraints table does not contain the id of a referenced primary key or unique key constraint for foreign key constraints. It contains a column list, which makes navigation through the dependency graph much harder. Monitoring tools are also significantly less usable than the tools available for Oracle. First and foremost, the fact that PostgreSQL has no shared pool means that it is not possible to see the execution plan of a SQL statement being executed by some application. It is possible to run explain plan, but one never knows what knobs were turned in the application itself and whether the execution environment is the same for the person monitoring as for the executing application being monitored. A positive thing is that it is possible to send a cancel query signal to a session, without killing the session. There is no such thing as performance trace, there is no wait event interface. PostgreSQL tuning is still a form of black magic, not an exact methodology. For Oracle, the methodology was laid out in Cary Millsap's groundbreaking book Optimizing Oracle Performance. Basically, Cary Millsap and Anjo Kolk introduced method and measurement to performance tuning. One could see where the time is spent, what kind of event does the application wait for and what to do about it. PostgreSQL has not developed any such methodology, PostgreSQL tuning is still a black art, of the kind that would get people burnt at stake in some less happy times. There are some extensions which provide execution plans for all SQL statements which execute longer than an preset time, but that is very messy indeed. PostgreSQL log file parsers like PgFouine leave a lot to be desired. PostgreSQL also has a problem with many connections. One should not try having 300 or so direct simultaneous connections to a PostgreSQL cluster (a PostgreSQL term for what Oracle calls instance). In such cases, a pooling tool like PgPool-II or PgBounce is a must. To tell the truth, these two tools do a good job and allow PostgreSQL to scale to very large number of connections. PostgreSQL can maintain a l standby database, which is extremely useful for high availability and performance, because reporting can be offloaded to the standby. PostgreSQL standby database is open for reading, which is similar to the Oracle Active Data Guard, a separately licensed option, available only with the version 11G.
Conclusion
PostgreSQL is a nice little open source database which can be used for offloading internal application systems from Oracle. It is very far from being on the same level with Oracle and it cannot replace Oracle in the enterprise server room, not by any stretch of the imagination. The main problem of PostgreSQL is the attitude of its developers who are resisting providing features like hints, demanded by the corporate DBA personnel. That results in very minor usage. Recent MongoDB user group had many more users and sponsors than the 2011 PgEast in New York City, which was a major flop. It was that attitude that made me cancel the ambitious porting projects that the company I used to work for has had. Simply said, providing the necessary level of service and performance wasn't possible in the allotted time and with the resources I had at my disposal. In order to become really useful PostgreSQL would need to have the following:
1. 2. 3. 4.
Optimizer hints Wait event interface Much improved partitioning. Some form of parallelism, at least for the partitioned tables.
While it is true that there are commercial versions of PostgreSQL (EnterpriseDB, VoltDB, Vertica, Greenplum, Netezza) providing the solution to some of the open source version problems, it is true that the open source version is rather limited in its present state. If we go commercial, we have to include a whole lot of other products, like Microsoft SQL Server and IBM DB2 into consideration, which is a completely different story. I have no experience with the commercial versions of PostgreSQL and I cannot make any claims about those versions.