MySQL to PostgreSQL and UUID/GUIDs

(Warning, I tend to use UUID instead of GUID. For purposes of this post consider them interchangeable.)

I’ve already mentioned the announcement of FeedLounge making the move from MySQL to PostgreSQL last month, but the discussion is still on going. If you haven’t yet, go read all of the comments on the announcement. This has to be some of the most productive and level headed (no flames yet) set of blog comments I’ve seen in a long time. For those of you who are coming late to this story let me bring you up to speed:

FeedLounge started with MySQL as their database, with the MyISAM table types. They ran into problems and so moved to the InnoDB table type in MySQL. Still not completely happy they did more research and decided to try out PostgreSQL for their database needs. Their tests showed that MySQL InnoDB database was 34GB, in PostgreSQL it was 9.6GB. Restores in MySQL took more than 24 hours, in PostgreSQL it was less than 5 hours. These numbers resulted in better performance and reduced time.

It has been more than two weeks since the initial announcement and there are still new comments being added. As of this morning there are 34 35 comments. Some of the comments have come from people who are very knowledgable about MySQL and the size of InnoDB indexes became the focus for why InnoDB was so much bigger than PostgreSQL. Using InnoDB with UUIDs as the primary key was identified as a problem, especially in the area of performance.

Things continued to get more interesting when Heikki Tuuri (the creator of InnoDB) left a comment confirming the size problems in InnoDB when using a UUID as a primary key. It seems that using a UUID in this way causes a couple of problems, the first being that InnoDB wants to keep things in index order, which is great for indexes on sequential numbers but bad for seemingly random data (PostgreSQL can do the same thing with cluster). The next problem is that secondary indexes in the same table get copies of the primary key index to making finding things faster. This, at least in part, was what caused the size difference.

Heikki’s solution (and others have mentioned this) is to use an auto_increment field as the primary key and create a unique secondary index on the UUID field. This would make the primary key index size much smaller and still allow InnoDB to do index lookups on the UUID field. That would seem to solve the problem, but there hasn’t been any feedback yet to confirm if it fixed all of the differences between InnoDB and PostgreSQL.

There is something about having to add an extra field to my table to make the database perform better that doesn’t seem quite right though. I don’t have anything concrete that I can point to showing that this isn’t a good idea, but it doesn’t feel right. For those who are running into these same types of problems though, it might be something to try, at least until you get a chance to try out PostgreSQL.

FeedLounge Switches To PostgreSQL

This could turn into an interesting case study, FeedLounge has switched databases, from MySQL to PostgreSQL. The two main factors they listed were database size (34Gig for MySQL/InnoDB vs. 9.6Gig for PostgreSQL) and restoration time (24+ hours to restore database in MySQL vs. less than 5 hours for PostgreSQL).

I’m surprised to here about the huge difference in database size between MySQL (w/InnoDB) and PostgreSQL. Perhaps the different indexing techniques are responsible for this, I’m not sure. For large databases the ability to cut your storage requirements by more than 1/3 is huge. My gut feeling though is that this would not be the norm though. More testing is definitely in order to make this comparison more meaningful.

The huge difference in database restoration time also strikes me as a bit high. Certainly another area where further tests would be interesting to see if this pattern holds true in general. Obviously being able to restore your database in less than 1/3 the time is a good thing.

The real kicker in all this though is the chance to make use of the additional features found in PostgreSQL (they point this out in their write up). I hope that Alex continues to write about how this transition progresses.