Oracle 10g vs PostgreSQL 8 vs MySQL 5

In contrast to my recent rant about database comparisons, I was pretty impressed with the Oracle 10g vs PostgreSQL 8 vs MySQL 5 article. The two articles don’t compare the same issues, but I’d like to point out how well this one done.

Right from the start the author states his bias upfront (he is a long time Oracle admin), along with what was being compared. In this case the focus was on installing on a lower end Windows 2000 system. All of the factors were spelled out and the scoring method was described in sufficient detail. Each factor had some discussion and and a score for each database.

Overall I’d consider this a good example of a database comparison, it has a well defined (and narrow) focus with an upfront scoring system. My only complaint with this review was in the “Documentation and Getting Started Support” section. Although PostgreSQL and MySQL basically have the same issues in the discussion area, the scores are different. Given how well the rest of the review was done I don’t think this issue completely ruins the review, just provides a low point.

I hope that the author, Lewis R Cunningham, continues to do additional reviews of these three databases using this same format.

Benchmarks, Not Enough Details

Once again there is another attempt at comparing MySQL vs. PostgreSQL. Like so many others before it, this benchmark falls prey to a classic mistake, not enough details. So lets go through their review shall we?

Comparison
Operating System: All of the OSs listed for MySQL also apply to PostgreSQL. No explanation is given as to why they only listed two for PostgreSQL. There are even FAQ entries for some OSs, like AIX, HPUX, IRIX and Solaris.

Performance: I’ll go into more detail on this when later, suffice it say that this is a gross over simplification.

Other APIs: I’m not sure why they describe MySQL with “Most of languages” and mention specific languages for PostgreSQL. I suspect that virtually any language that has MySQL support also has PostgreSQL support. I can’t say this for sure of course because I have surveyed every single language that supports one or both of these databases, but you get the idea.

For the rest of the items they do a reasonable job, although they do leave out some additional features that might be of interest to some like domains, inheritance, sequences, etc.

Summary
There are some features that are extremely handy even for fairly small databases (like views). Look, MySQL does a pretty good job (aside from some issues) at what it is intended to do, stop trying to make excuses for though when you discover that it doesn’t fair well feature wise.

Benchmarks
There is no indication into how much tuning was done for either MySQL or PostgreSQL, if any was done at all.

Data Set & Results: There is no discussion about how many simultaneous connections/users there are. From the look of things all of these tests were done with a single connection/user. If that is the case then all of the inserts done in this should be thrown out and redone using MySQL’s LOAD DATA INFILE command and PostgreSQL’s COPY command. Doing bulk imports like this is always going to perform poorly using straight inserts. Not to mention that if you are only interested with single connection/user situations then you may well include things like MS Access for your benchmarks.

Another missing component is what type of table is being used in MySQL. Unless you specify what table type you want MySQL will use the default type, which is MyISAM. If the MySQL side of these tests were run using MyISAM tables then this whole test needs to be thrown out (inserts, queries and deletes) and redone using the InnoDB table type.

Until these sorts of issues are at least addressed all of the results should simply be ignored, there is simply not enough information to gain anything remotely useful from those numbers at this point.

Conclusion
Again, if you are interested in making bulk imports happen very quickly, at least use the right tool for the job.

Hardware & Software
Kudos for mentioning hardware and software details.

My Conclusion
Going good and meaningful benchmarks is hard work, taking a vastly over simplified approach like the one done here is not really helpful to anyone. There were no goals outlined as to what sort of usage they wanted to test against (although it looks like there were interested in single user heavy insert models) and insufficient discussion and details about how they were going to mimic that model as accurately as possible. Of the discussion that was provided, most it revolved around how to minimize the impact of rapid inserts wrapped in transactions for PostgreSQL. That discussion is waste because they weren’t using the right tool for the right job.

I was pointed to the review by a blog entry at SitePoint, which now has several comments. There is something of thread revolving around an issue with MySQL where if you try insert a 300 character long string into a field that only supports 250 characters, MySQL will simply truncate your data without throwing an error. This has been brought up before and it is simply wrong, the MySQL folks need to just fix this and move on instead of trying to find different ways to justify trashing your data when it is inserted. A counter point is brought up that good programmers always validate their data before attempting to do an insert, in this case making sure that your string is less than or equal to 250 characters. The sad thing about this stance is that there is some truth to it, but not in the sense that it is being used. It’s true that you should be checking for obvious problems in your data before you insert it so that you can give meaningful errors back to the user, however, that doesn’t change the fact that what MySQL is doing is corrupting data on insert.

This disagreement reminds me of the folks who simply add client side javascript error checking for form input, which allowed them to provide meaningful warnings and errors without having to process the form every time. The security folks were quick to point out that the same checks still had to be done on the server side because client side javascript checks were easy to by pass. Client side checks are a great thing for user, but they are no excuse to avoid those same checks on the server side. Checking the length of your strings is a good idea to provide good user feedback, but it is not excuse to allow your database to corrupt data.

Use the Microsoft test here, if MS SQL Server did this sort of thing, would you still be saying the same thing?

MySQL Performance On Linux, (Open|Net|Free)BSD and Solaris

The world of benchmarks is fraught with peril, those who enter are likely to get flamed, not matter what the results are. I had already prepared myself for the worst when I saw the title of Using MySQL to benchmark OS performance on NewsForge the other day. After reading through it, and part 2, Comparing MySQL performance, I was pleasantly surprised. At the end of the article I got the feeling that Tony Bourke had made an honest attempt at testing MySQL 4.0.22 on the following operating systems:

  • FreeBSD 4.11
  • FreeBSD 5.3
  • NetBSD 2.0
  • Linux 2.6
  • Linux 2.4
  • Solaris 10 x86 (build 69)
  • OpenBSD 3.6

While there are things that I’d recommend doing differently, it certainly seems like Tony did a good job to trying to make this as balanced as possible. Perhaps my biggest beef with his methods was the decision to run all of the tests locally, instead over the network. To his credit he does a good job explaining why he ended up not doing so, but that doesn’t change the fact that for those building apps (web or otherwise), don’t usually run that application on the same system that is running MySQL.

The results of the test still feel a little bit odd. I can’t really hold this against Tony though, I’m sure he was working on a deadline and if you put off publishing forever then why bother doing it in the first place. That said, I suspect that there is more that could be done if more time and resources were available. Some of the other obvious possibilities include using MySQL built for that OS (rpm’s, BSD ports, etc), looking at additional file system tweaks and differences (does Linux still default to async fs mounts?) and trying different versions of MySQL (4.1 just went into production, but 5.x betas have been around for awhile too).

What's New In MySQL 4.1

Back in October 2004 MySQL 4.1 was announced as being the officially ready for production. I’ve got a new MySQL install to do so I wanted to take the time to go through the new features in more detail. There are all sorts of good things in 4.1:

  • GIS: Although I’ve never done any development using GIS, it has always fascinated me. One of those things on my list of things to do :-)
  • Unicode: If you’ve got to deal with funky characters, you’ve got to have this.
  • New Windows Install: I’ve run MySQL on Windows before, but only for development.
  • Prepared Statements: Looks like just this is just a first step, later versions are supposed to support query plan caching for better performance.
  • Temporal Functionality: More time zone options and finer grained time resolution.
  • Subqueries (Sub-Selects): I can’t tell you how much I’ve wanted this in MySQL. This one feature is almost enough for me to make MySQL 4.1 the minimum version for some projects.

There is also on article on other little features. Some of these are pretty cool:

  • Better help
  • ‘ON DUPLICATE KEY UPDATE’ is an interesting short cut, it makes an INSERT more intelligence without having to do additional queries.
  • ‘GROUP BY … WITH ROLLUP’ is great! This provides you with the ability to get row sums right out of a SQL query. I wonder what it take to get this ability added to PostgreSQL?
  • A few handy new functions: COMPRESS(), UNCOMPRESS(), GROUP_CONCAT(), VARIANCE(), CRC32() and UUID().

These aren’t all of the new features, but these stood out to me. MySQL is obviously focusing on more of the traditional (read: Oracle) SQL database functionality that they’ve avoided for so long. It’s good to see, it will certainly make some of my projects that make use of MySQL much easier.

With Features Like This Who Needs Bugs

When you come across things like this it is hard to resist beating them up over it one more time. Andrew Oliver has an entry about dealing with BLOBs in MySQL. He points out that if your BLOB is too long “… then rather than like error or warn it just silently truncates the data”. I’d really prefer that my database throw an error rather than alter my data. Andrew went from Oracle to PostgreSQL to MySQL only to discover this, um, interesting feature.

It should be noted that this happens when the data is inserted, which is when it should be throwing an error instead of truncating data. This limitation is clearly spelled out in the BLOB documentation, fifth paragraph:

If you assign a value to a BLOB or TEXT column that exceeds the column type’s maximum length, the value is truncated to fit.

Because BLOBs can be very large it seems that you would be unlikely to run into this with LONGBLOBs. Unfortunately that doesn’t make my gut feel any better about how wrong it is to simply truncate data that is too big when giving an error would really be the right thing to do.

MySQL's Funny Math

Daniel Lemire came across some funny math in MySQL. He ran ‘select 11/5;’ and had MySQL give him 2.20, where PostgreSQL gave him 2. There is nothing wrong with MySQL’s answer, but it might throw off most programmers who be more likely to expect the kind of result that PostgreSQL gives. The second query he ran was ‘select round(0.5);’ and MySQL returns 0 and PostgreSQL returns 1. As far as I can tell MySQL is just completely wrong here. Those were about all the details Daniel gave, so I thought I’d try this out for myself.

I was able to duplicate all of the results on MySQL 4.0.20 and PostgreSQL 7.4.3 systems. I thought I would take this one step further and try ‘select 11%5;’ on PostgreSQL, it gave me the correct answer of 1. This is nicely documented under Mathematical Functions and Operators in the PostgreSQL docs. Interestingly enough MySQL also returns 1 when running ‘select 11%5;’. MySQL does support integer division, but only by using the DIV() function. As for rounding, the MySQL docs for round() claim that:

… the behavior of ROUND() when the argument is halfway between two integers depends on the C library implementation. Different implementations round to the nearest even number, always up, always down, or always toward zero. If you need one kind of rounding, you should use a well-defined function such as TRUNCATE() or FLOOR() instead.

This explanation seems very odd to me since both the PostgreSQL and MySQL tests I ran were on the same system. Perhaps PostgreSQL implements round() independently of the c libraries in order to achieve more consistent (and correct) results? If so MySQL should do the same and not simply throw its arms up in the air if the result of round() comes out incorrectly. If Microsoft had done something like this in Excel the screaming Slashdot hoards would be all over them.

Note to MySQL developers: just fix this and move on.

Top Five WordPress SQL Queries

Recently I mentioned a tool to analyze database queries, PQA (Practical Query Analysis). Originally this tool was designed to work with PostgreSQL and now has some support for MySQL. So I turned on query logging for my MySQL server that hosts my WordPress back end. Then I put together a little script to run PQA against the MySQL query log:

#!/bin/sh

/home/joseph/bin/pqa.rb -file /var/db/mysql/query.log -top 5 -normalize -logtype mysql

Nothing fancy: told it where the log was, how many “top” queries I wanted, normalize the queries and that the this was for MySQL (instead of PostgreSQL). I don’t get a lot of traffic to my site, so I’ve let the log run for almost a week to get a sample size of more than 100,000 queries. Now that I’ve hit that here are the results:

######## Overall statistics
101332 queries (150 unique) parsed in 27.470525 seconds
######## Queries by type
SELECTs: 101191 (100%)
INSERTs: 35 (0%)
UPDATEs: 44 (0%)
DELETEs: 13 (0%)
######## Most frequent queries
16465 times: SELECT * FROM wp_users WHERE user_login = ”
14423 times: SELECT post_date, post_name FROM wp_posts WHERE ID = 0
5391 times: SELECT option_name, option_value FROM wp_options
5391 times: SELECT * FROM wp_users
5347 times: SELECT * FROM wp_categories WHERE 0=0

Before I get into the queries themselves, lets talk about what other info this provided. I’ve got 101,332 queries, of which there are only 150 unique queries. PQA took about 27 seconds to process all of this info. Far and away the most common type of query was SELECT, this shouldn’t come as any surprise. A few INSERTS (new blog entries, comments, trackbacks, etc) and a few UPDATES (editing blog entries) and a few DELETES. This isn’t quite normal because the DELETES were the result of a comment spam attack. Fortunately there were only 13 of them so this doesn’t really impact our results much.

Now let take a look at the top five most executed queries and see what we can learn.

#1 (16465/101332): SELECT * FROM wp_users WHERE user_login = ”
The wp_users table keeps information about the registered users of a WordPress install, in this case I’m the only one. What should strike you as odd is the WHERE clause in this query, it is looking for user details for a username that is empty. My gut feeling is that this query should never, ever, ever be run by WordPress. Unless your WordPress install allows for blank usernames then this query will always return an empty result set. Unfortunately it will have to look through the wp_users table first (hopefully it is looking at an index on user_login) to determine this. Being able to avoid more than 16% of the total queries should help with speed, especially on systems that are running several copies of WordPress.

#2 (14423/101332): SELECT post_date, post_name FROM wp_posts WHERE ID = 0
As you can probably guess, the wp_posts table stores WordPress blog entries. Unlike the #1 query this one looks like it might be reasonable, but once again that WHERE clause caught my eye and made me wonder. So I started looking through all of the rows in the wp_posts table for my WordPress install. None of them have an ID of 0. Just like the first query I suspect that this query should never need to be run because it will always return nothing. At this point I can’t say this with 100% certainty though because I don’t know enough about the WordPress internals. My gut feeling is again that this query could be done away with, removing another 14% of the total queries.

#3 (5391/101332): SELECT option_name, option_value FROM wp_options
This query doesn’t have a WHERE clause for me to pick on. The wp_options table is used to store all of the different options that can be set for a WordPress install. All this query does is fetch all of those options. I doubt this has any possibilities for removal or optimization.

#4 (5391/101332): SELECT * FROM wp_users
This should look familiar, the only difference between this query and #1 above is the lack of a WHERE clause. I’ve got mixed feelings about this query. On the one hand I’m suspect of most queries that simply do a SELECT *, especially without a WHERE clause to limit it. What holds me back here is that I don’t think this query could be changed without a lot of changes to the source code of WordPress. Unless you have a lot of users in your WordPress install there isn’t much (like none at all) optimization that can be done here.

#5 (5347/101332): SELECT * FROM wp_categories WHERE 0=0
A new table to look at, wp_categories. This is a small table that holds information about entry categories (big surprise eh?). Once again our friend the WHERE clause provides with a bad feeling. If you aren’t familiar with SQL, the clause 0=0 always evaluates to true. In the context of this query it is completely useless. Some of you might be saying, so what, how much of a penalty could something small be performance wise? I asked myself the same question, so I put together a simple benchmark that ran this query with and without the WHERE 0=0 clause. In my simple tests the addition of WHERE 0=0 on this query against my WordPress database added an additional 0.0001 seconds PER QUERY! So why am I getting so excited about this, after all in my case that means that I’m only loosing about 1 second every two weeks, give or take. Well you would be right, for me personally it won’t make a much of a difference, but for a system that hosts many WordPress installs it could add up very quickly. Also this is one of those small optimizations that can add up when thrown together with other little optimizations through out the system.

So what have we learned from the top five WordPress queries? Well, about 30% of the SELECTs could likely be done away with entirely (#1 and #2). Another %5 of the SELECTs can gain a slight speed improvement of about one ten-thousandths of a second (#5). I’ll definitely be using PQA to look at more queries on other projects in the future.

There are some things to keep in mind here. These are total queries run over about one week, not all of them are being run on the same page necessarily. It is entirely possible that each of these queries is being run without the other four being present for a given page. I don’t think this is likely, but certainly a possibility. Another data point that might be helpful is the number of page views turn the same time period. I decided it wasn’t worth worrying about that at this point, since my focus was simply of the queries themselves.

For those of you who are now looking at me thinking, hey ding-a-ling, WordPress is an open source project, why don’t you take this information and send it to the developers so that everyone can benefit from this. Well, I already did. Alright that isn’t really true, I sent them some similar reports about other queries. One example would be three queries with WHERE 1=1 in them. I even included a diff. You’ll notice that this was sent in back in June, more than 4 months ago and it is still open. I haven’t been able to get much attention to getting these sorts of changes into the WordPress source.

If you are interested in doing some additional research yourself into some of the SQL queries that are used in WordPress check out my entry on MySQL Queries In WordPress, which explains how to turn on displaying all the queries run at the bottom of each page.

UPDATE 10:40am 11 Dec 2004: It was suggested in a comment (see #4) that 1=1 is a code optimization that is faster than checking to see if variable is empty first. See comment #5 for my response. Bottom line: my simple tests show that 1=1 is always slower than if(empty($some_var)) and that 1=1 breaks simple scalability schemes (multiple front end web servers, one big db server).

The State Of Full-Text Indexing For The Poor Man

I’ve yet to actually use full-text indexing on a project, but I’m continually drawn to the idea. I found mysql a bit surprised by the fact that it isn’t being used nearly as much as (I thought) it should be. The two obvious places that I thought full-text indexing would surely just take off are blogs and wikis. More than anything these two systems are designed to hold lots and lots of words. If the rise of Google has taught us anything it is that being able to easily, accurately and quickly search through huge amounts of words is vital to how useful those words are. What’s the point of having wikis and blogs store their data forever (more or less) if their own built in search features can’t find what you’re looking for? Some of you will immediately raise your hand to tell me that simply letting Google index your wiki and blog is much better than trying to have a good built in search. That may or may not be true, but that only works for publicly available resources. What about all of those wikis and blogs that are for internal use only? So once again we are back to relying on the built in search features of these tools.

So if I were to write a wiki or blog system it seems like using full-text indexing would be the obvious choice when it comes searching. So I started looking around to see if something like WordPress or MediaWiki. WordPress doesn’t appear to make any use of full-text indexing, but MediaWiki does. I did some quick looking around and couldn’t find any other wiki or blog system that made use this. I thought this was very strange since most of these systems use MySQL as their back end database. MySQL has had support for full-text indexing for years and looks to be pretty easy to use, so why isn’t it seeing more use? I think the answer can be found in the MySQL Full-Text docs, full-text indexing only works for MyISAM table types. For some reason I had never come across this limitation before, I just assumed that it also worked with InnoDB tables. It is on their ToDo list, but there doesn’t appear to be an expected timeline for it.

So anyone who wants to make use of full-text indexing has to use MyISAM on those tables under MySQL. This seems like a huge limitation. One of the reasons MySQL has advanced so much (in my view) is because of the features that InnoDB has brought to the table.

This brings me PostgreSQL and full-text indexing, made possible by the tsearch2 extension. The immediate difference between MySQL and PostgreSQL that jumps out is that MySQL has it built in and PostgreSQL has as a third party module. The install instructions look pretty easy, but I haven’t tried them yet so I can’t say for sure. Both of these approaches have advantages, but the advantage is targeted towards different groups. For PostgreSQL, having a third party develop this extension makes things easier for the PostgreSQL developers and harder for the users of software that uses PostgreSQL for the backend. Now not only do I have to make sure that my web host supports PostgreSQL (that’s another story in itself), but I have to make sure that they’ve installed tsearch2. For MySQL there is likely more work for the developers of MySQL because now they have one more thing that they have to support and test for each release. The benefit of that work is that full-text indexing is available to users of any install of MySQL. This is a grossly over simplified look at things, but I believe it fits the mind set of those who want to run software X on some random web host.

Unlike MySQL, PostgreSQL doesn’t have different table types. This means all of the features and abilities are available to you wether you want use full-text indexing or not. This factor would almost completely dry up if full-text indexing was supported in InnoDB. There are also some implementation details that are different between the two. MySQL uses (at least for syntax) something that looks like what you would use to create a regular index. In tsearch2 you end up adding a field to you table and then creating an index on that field.

So which is better? I don’t know, I hope to try out of both of them in the future, just to see how they work. Based on what I’ve read I don’t think either of these solutions provide a simple and powerful full-text indexing that can be used widely. MySQL’s support for it excludes their most powerful table type, InnoDB and PostgreSQL requires the install of an additional module in order to work.

Perhaps full-text indexing is a sufficiently complex task that there will never be a true poor mans solution.

Update (3:16 pm 12 Aug 2004): I guess it really is all about timing. There’s an intro article on PostgreSQL full-text indexing by Joshua Drake over at DevX dated 10 Aug 2004.

MySQL Queries In WordPress

I’ve started looking at the MySQL queries in WordPress to see what sort of optimizations are possible. To help with this I uncommented the HTML comments around the query count and the timer at the bottom of index.php. To look at all of the queries being run I add the following line to the top index.php define(‘SAVEQUERIES’, true);. This tells the $wpdb class to save all of the SQL queries, storing them in the $wpdb->savedqueries array. To look at those queries I added the following PHP code at the bottom of index.php (just above ):

    $num_queries = count($wpdb->savedqueries);
    for($i = 0; $i savedqueries[$i] =
                 wordwrap($wpdb->savedqueries[$i]);
    }
    print_r($wpdb->savedqueries);

It isn’t the prettiest in the world, but it works. So far I’ve found a handful of queries that have a 1=1 in the WHERE clause, which seemed pretty pointless. I’ve also noticed a query that is completely useless in the case where someone is browsing a WordPress blog without being logged in. The query will always fail if no one is logged in. So with an additional !empty() test, the number of queries went down from 19 to 18 in the case where no one is logged in.

I’m not sure how many other “easy” SQL oddities I’ll find though. Only time will tell.