Oracle 10g vs PostgreSQL 8 vs MySQL 5

In contrast to my recent rant about database comparisons, I was pretty impressed with the Oracle 10g vs PostgreSQL 8 vs MySQL 5 article. The two articles don’t compare the same issues, but I’d like to point out how well this one done.

Right from the start the author states his bias upfront (he is a long time Oracle admin), along with what was being compared. In this case the focus was on installing on a lower end Windows 2000 system. All of the factors were spelled out and the scoring method was described in sufficient detail. Each factor had some discussion and and a score for each database.

Overall I’d consider this a good example of a database comparison, it has a well defined (and narrow) focus with an upfront scoring system. My only complaint with this review was in the “Documentation and Getting Started Support” section. Although PostgreSQL and MySQL basically have the same issues in the discussion area, the scores are different. Given how well the rest of the review was done I don’t think this issue completely ruins the review, just provides a low point.

I hope that the author, Lewis R Cunningham, continues to do additional reviews of these three databases using this same format.

Benchmarks, Not Enough Details

Once again there is another attempt at comparing MySQL vs. PostgreSQL. Like so many others before it, this benchmark falls prey to a classic mistake, not enough details. So lets go through their review shall we?

Comparison
Operating System: All of the OSs listed for MySQL also apply to PostgreSQL. No explanation is given as to why they only listed two for PostgreSQL. There are even FAQ entries for some OSs, like AIX, HPUX, IRIX and Solaris.

Performance: I’ll go into more detail on this when later, suffice it say that this is a gross over simplification.

Other APIs: I’m not sure why they describe MySQL with “Most of languages” and mention specific languages for PostgreSQL. I suspect that virtually any language that has MySQL support also has PostgreSQL support. I can’t say this for sure of course because I have surveyed every single language that supports one or both of these databases, but you get the idea.

For the rest of the items they do a reasonable job, although they do leave out some additional features that might be of interest to some like domains, inheritance, sequences, etc.

Summary
There are some features that are extremely handy even for fairly small databases (like views). Look, MySQL does a pretty good job (aside from some issues) at what it is intended to do, stop trying to make excuses for though when you discover that it doesn’t fair well feature wise.

Benchmarks
There is no indication into how much tuning was done for either MySQL or PostgreSQL, if any was done at all.

Data Set & Results: There is no discussion about how many simultaneous connections/users there are. From the look of things all of these tests were done with a single connection/user. If that is the case then all of the inserts done in this should be thrown out and redone using MySQL’s LOAD DATA INFILE command and PostgreSQL’s COPY command. Doing bulk imports like this is always going to perform poorly using straight inserts. Not to mention that if you are only interested with single connection/user situations then you may well include things like MS Access for your benchmarks.

Another missing component is what type of table is being used in MySQL. Unless you specify what table type you want MySQL will use the default type, which is MyISAM. If the MySQL side of these tests were run using MyISAM tables then this whole test needs to be thrown out (inserts, queries and deletes) and redone using the InnoDB table type.

Until these sorts of issues are at least addressed all of the results should simply be ignored, there is simply not enough information to gain anything remotely useful from those numbers at this point.

Conclusion
Again, if you are interested in making bulk imports happen very quickly, at least use the right tool for the job.

Hardware & Software
Kudos for mentioning hardware and software details.

My Conclusion
Going good and meaningful benchmarks is hard work, taking a vastly over simplified approach like the one done here is not really helpful to anyone. There were no goals outlined as to what sort of usage they wanted to test against (although it looks like there were interested in single user heavy insert models) and insufficient discussion and details about how they were going to mimic that model as accurately as possible. Of the discussion that was provided, most it revolved around how to minimize the impact of rapid inserts wrapped in transactions for PostgreSQL. That discussion is waste because they weren’t using the right tool for the right job.

I was pointed to the review by a blog entry at SitePoint, which now has several comments. There is something of thread revolving around an issue with MySQL where if you try insert a 300 character long string into a field that only supports 250 characters, MySQL will simply truncate your data without throwing an error. This has been brought up before and it is simply wrong, the MySQL folks need to just fix this and move on instead of trying to find different ways to justify trashing your data when it is inserted. A counter point is brought up that good programmers always validate their data before attempting to do an insert, in this case making sure that your string is less than or equal to 250 characters. The sad thing about this stance is that there is some truth to it, but not in the sense that it is being used. It’s true that you should be checking for obvious problems in your data before you insert it so that you can give meaningful errors back to the user, however, that doesn’t change the fact that what MySQL is doing is corrupting data on insert.

This disagreement reminds me of the folks who simply add client side javascript error checking for form input, which allowed them to provide meaningful warnings and errors without having to process the form every time. The security folks were quick to point out that the same checks still had to be done on the server side because client side javascript checks were easy to by pass. Client side checks are a great thing for user, but they are no excuse to avoid those same checks on the server side. Checking the length of your strings is a good idea to provide good user feedback, but it is not excuse to allow your database to corrupt data.

Use the Microsoft test here, if MS SQL Server did this sort of thing, would you still be saying the same thing?

PostgreSQL 8.0

I’ve very happy to see that PostgreSQL 8.0 is now officially released. There are all sorts of new features in 8.0 in several different areas. Here are some of the highlights:

  • Microsoft Windows Native Server
  • Savepoints
  • Point-In-Time Recovery
  • Tablespaces
  • Improved Buffer Management, CHECKPOINT, VACUUM
  • Change Column Types
  • New Perl Server-Side Language
  • Comma-separated-value (CSV) support in COPY
  • Support cross-data-type index usage
  • Add ability to prolong vacuum to reduce performance impact
  • Implement dollar quoting to simplify single-quote

Congrats to the whole PostgreSQL team, thank you for all of your hard work and diligence.

UPDATE 11:30am 19 Jan 2005: The announcement of PostgreSQL 8.0 is very popular, my PubSub watch list for “PostgreSQL” is just exploding with entries today.

PostgreSQL Performance of "where exists"

I’m always interested in bits about the performance of specific types of SQL queries, so I was curious when I came across Bill Schneider‘s post about PostgreSQL performance of “where exists”. He was comparing the performance of two different queries that provide the same results, so I took a look at some of my data and converted them to fit data that I already had:

Query 1

select course_abbr from courses_basetable
where exists (
  select 1 from editions
  where editions.course_abbr = courses_basetable.course_abbr
)

Query 2

select distinct(editions.course_abbr) from editions
join courses on
  editions.course_abbr = courses.course_abbr

Both of these queries return the same list of course abbreviations, but perform slightly different. To avoid the suspense I’m going to give away the ending now: Query 1 runs faster than Query 2 (tested using PostgreSQL 7.4.6), but for Bill Query 2 ran faster (using PostgreSQL 7.3.x). For my tests Query 1 took an average time of 29 ms and Query 2 took an average of 31 ms. So what, 2 ms you say, I throw that much CPU time away several times a minute! The absolute times don’t really matter, the percentage difference does. In this case Query 1 is more than 5% faster than Query 2.

To find out more about why these two queries perform so differently I ran them both using EXPLAIN ANALYZE to see what was going on. It should be rather obvious why Query 2 takes longer after seeing the EXPLAIN ANALYZE results:

Query 1

Seq Scan on courses_basetable  (cost=0.00..35.82 rows=19 width=8)
(actual time=0.369..9.485 rows=36 loops=1)
  Filter: (subplan)
  SubPlan
    ->  Seq Scan on editions  (cost=0.00..1.81 rows=2 width=0)
          (actual time=0.207..0.207 rows=1 loops=38)
          Filter: ((course_abbr)::text = ($0)::text)

Query 2

Unique  (cost=6.06..6.38 rows=36 width=8)
(actual time=4.697..5.252 rows=36 loops=1)
  ->  Sort  (cost=6.06..6.22 rows=65 width=8)
        (actual time=4.682..4.849 rows=65 loops=1)
        Sort Key: editions.course_abbr
        ->  Hash Join  (cost=1.47..4.10 rows=65 width=8)
              (actual time=1.652..3.357 rows=65 loops=1)
              Hash Cond: (("outer".course_abbr)::text = ("inner".course_abbr)::text)
              ->  Seq Scan on editions  (cost=0.00..1.65 rows=65 width=8)
                    (actual time=0.017..0.462 rows=65 loops=1)
              ->  Hash  (cost=1.38..1.38 rows=38 width=8)
                    (actual time=1.034..1.034 rows=0 loops=1)
                    ->  Seq Scan on courses_basetable  (cost=0.00..1.38 rows=38 width=8)
                          (actual time=0.106..0.510 rows=38 loops=1)

Query 2 has more work to than Query 1. If anyone else has other data points for this query let me know.

On a side note, Blogger really needs to support TrackBack, I’m tired of giving a response in a post and then having to leave a comment to let the person know where to find it.

PostgreSQL Vacuum

Thanks to MVCC PostgreSQL does a pretty good job at avoiding locking contentions. Unfortunately this ability does not come without a cost. In this case the cost is having to run vacuum from time to time (the best interval for you will depend on usage) to recover rows and make that space available again. This is especially true on a table that is changing a lot (INSERT, UPDATE and DELETE). If you don’t do this you’ll notice things slowing down, just like Karl noticed with SELECT count(*). Setting aside the question of doing SELECT count(*) on a table for now, if you have a table that is getting millions of changes a day then you better be looking at how often you need to run vacuum. If it takes you years to hit a million changes then running vacuum won’t need to be as frequent.

I’ve not played with pg_autovacuum yet, but that is another resource to look at for determining how often you need to vacuum your tables.

Exposing PostgreSQL Regular Expression Check Constraints

Check constraints are a great feature, I was especially excited when I found out that PostgreSQL supports regular expressions in check constraints. Now I want to expose the regex check up to an application layer so that it can be used at input time, before the database chokes on it during an insert or an update. So I went looking for information on how to ask PostgreSQL for the check constraint for a given table and column name. Verlana’s PostgreSQL General Bits came to the rescue with the entry on Implementation of Constraints. The last query in the entry will lookup all on the constraints for a given table. I modified this query to extract just the constraint regular expression for a given table and column.

select substring(consrc from '^.+ [~|~*|!~|!~*] ''(.+)''.+$')
    AS constraint_regex
from pg_class r,
    pg_constraint c
where r.oid = c.conrelid
    and contype = 'c'
    and relname = 'your_table_name'
    and consrc ILIKE '((your_column_name)::text %'

With Features Like This Who Needs Bugs

When you come across things like this it is hard to resist beating them up over it one more time. Andrew Oliver has an entry about dealing with BLOBs in MySQL. He points out that if your BLOB is too long “… then rather than like error or warn it just silently truncates the data”. I’d really prefer that my database throw an error rather than alter my data. Andrew went from Oracle to PostgreSQL to MySQL only to discover this, um, interesting feature.

It should be noted that this happens when the data is inserted, which is when it should be throwing an error instead of truncating data. This limitation is clearly spelled out in the BLOB documentation, fifth paragraph:

If you assign a value to a BLOB or TEXT column that exceeds the column type’s maximum length, the value is truncated to fit.

Because BLOBs can be very large it seems that you would be unlikely to run into this with LONGBLOBs. Unfortunately that doesn’t make my gut feel any better about how wrong it is to simply truncate data that is too big when giving an error would really be the right thing to do.

MySQL's Funny Math

Daniel Lemire came across some funny math in MySQL. He ran ‘select 11/5;’ and had MySQL give him 2.20, where PostgreSQL gave him 2. There is nothing wrong with MySQL’s answer, but it might throw off most programmers who be more likely to expect the kind of result that PostgreSQL gives. The second query he ran was ‘select round(0.5);’ and MySQL returns 0 and PostgreSQL returns 1. As far as I can tell MySQL is just completely wrong here. Those were about all the details Daniel gave, so I thought I’d try this out for myself.

I was able to duplicate all of the results on MySQL 4.0.20 and PostgreSQL 7.4.3 systems. I thought I would take this one step further and try ‘select 11%5;’ on PostgreSQL, it gave me the correct answer of 1. This is nicely documented under Mathematical Functions and Operators in the PostgreSQL docs. Interestingly enough MySQL also returns 1 when running ‘select 11%5;’. MySQL does support integer division, but only by using the DIV() function. As for rounding, the MySQL docs for round() claim that:

… the behavior of ROUND() when the argument is halfway between two integers depends on the C library implementation. Different implementations round to the nearest even number, always up, always down, or always toward zero. If you need one kind of rounding, you should use a well-defined function such as TRUNCATE() or FLOOR() instead.

This explanation seems very odd to me since both the PostgreSQL and MySQL tests I ran were on the same system. Perhaps PostgreSQL implements round() independently of the c libraries in order to achieve more consistent (and correct) results? If so MySQL should do the same and not simply throw its arms up in the air if the result of round() comes out incorrectly. If Microsoft had done something like this in Excel the screaming Slashdot hoards would be all over them.

Note to MySQL developers: just fix this and move on.

Oracle Calls It MINUS, PostgreSQL Calls It EXCEPT

Just in case I run into this again, you can use EXCEPT in PostgreSQL as a replacement for using MINUS in Oracle. This was extremely handy this morning when I ran a query on Oracle and wanted to see if I could do the same thing on exported data in PostgreSQL. Doing a cut-n-paste returned a syntax error because the query used MINUS. A couple of minutes with Google and the PostgreSQL documentation and I discovered that EXCEPT does the same thing that Oracle’s MINUS does. After changing that one word in the query it ran perfectly.

Honestly, I tried to resist the urge, but I just couldn’t. According to the MySQL future feature list MINUS, INTERSECT, and FULL OUTER JOIN will be implemented in the “mid-term future” (after 5.1).

WordPress Using PostgreSQL

Nice to see there are other people out there who are interested in running WordPress on PostgreSQL instead of MySQL. It is even better to see that someone took the time to do an initial port of WordPress 1.2 to use PostgreSQL (WordPress-PG). Here is Keenan Tims announcement on wp-hackers. I wonder what the best way to continue support of PostgreSQL in WordPress is as new versions are released? There are so many things I’d like to dig in and spend time on but don’t have enough time to do them all :-(