Link Dump 2004-11-30

Barcodes in Postscript: Make your own barcodes in Postscript. I’m not exactly sure how to use this, but it looks interesting :-)

SELECT DISTINCT: A SQL Case Study: An article on analyzing a query rewrite for PostgreSQL and making sure it is doing what you thought it should (via Random Notes).

Which PHP libraries do you use?: Harry is looking to see what PHP libraries people actually trust enough to use in real life.

Getting started with HTML_QuickForm: An intro to using PEAR’s HTML_QuickForm. For quite sometime I’ve stayed away for PEAR, but lately I’ve been finding some very helpful bits of code there. Time to swallow my pride and look a little closer.

Why Install Linux on Your Mac?: For the desktop user I have a hard time buying into any of these reasons. Sure you can get more power and flexibility, but the cost of those two is additional time. At this point I’m more interested in being productive with my desktop/notebook than customizing it.

Five Favorite Annoyances: Five annoyances on the Mac, with solutions. Despite what I just said above there are things that I would change about Mac OS X (no OS is perfect after all).

The Metadata Problem With Databases

It is the end of 2004 and virtually everyone uses databases to back end gobs data. Lately there has even been talk of “the database” being legacy technology, a small piece of the more exciting whole of an application. Mix a little bit of this thinking, my recent attempts to store column regular expressions in PostgreSQL constraints and a lot of day time work being spent on in-house apps interfacing with databases and you end up with this question. Where should I put my metadata?

Let me make sure that I’m clear on exactly what metadata I’m interested in (for now). Fields in a table generally have five attributes that we are commonly interested in: type, length, precision, nullness (I don’t think that is a word, but you know what I mean) and comment. That is all fine and good, but I find myself wanting more, that is what got me started on using regular expressions in constraints in the first place. I figured that might be a good place to store another field attribute and get the added bonus of having the database enforce it also. To start putting pen to paper here is an example table that we will work with:

CREATE TABLE users (
  user_id SERIAL NOT NULL,
  username VARCHAR(16) NOT NULL,
  lastname VARCHAR(50) NOT NULL,
  firstname VARCHAR(50) NULL,
  email_addr VARCHAR(250) NULL
);

Now imagine that you are writing a front end to this particular table (it could be web based, but that doesn’t matter) what sort additional attributes (more metadata!) would be handy to have? Regular expression limitations were one the first things on my list. I mean it is great that a username can only be 16 characters long, but what if only want to allow lowercase a through z and digits? A simple regex takes care of this: ‘^[a-z0-9]+$’. We could conceivably want to impose additional limitations on lastname, firstname and email_addr as well. Moving on we discover that lastname is a great database column name, but a crummy field label to use in an application, so every where that the lastname field is used we hard code a nicer label, like ‘Last Name’. This leaves me with a really unpleasant feeling, I’d much rather be able to ask for correct label to use everywhere that I’m displaying data for lastname. This problem could potentially be solved using the column comment feature in databases. So far so good, we could potentially store these two types of metadata in the database itself.

We have pretty much used up on places to put additional metadata, so what happens when we need to store more? One example would be something like an extended help or hint (bubble help?) for a field. In my applications are work I would like to be able to use to this to explain what a field is in greater detail (like why an insert or an update will fail if you try to add a username that doesn’t match ‘^[a-z0-9]+$’). I could use the comment field for this, but then I’m already using that for something else remember? For now that is the extent of the additional metadata I’m interested, but it would not be unreasonable to have additional metadata that I would be interested in later on.

Let use assume for a moment that all of the above issues can be satisfied, there is another concern I have with solutions I’ve outlined so far: metadata duplication. Going back to our example table, say we want to limit lastname to match ‘^[a-zA-Z]+$’. Looking around we discover that also have a customers table with a lastname field that we would also want to constrain in the same way, so add the same constraint to the customers.lastname as we did to users.lastname. Later on we become more enlightened and realize that it is reasonable to have spaces in lastname so we change the regular expression to ‘^[a-zA-Z ]+$’, in two different places. I almost hate to put it this way, but now our metadata is becoming heavily non-normalized. With PostgreSQL this particular problem could be addressed by the use of domains because it supports check constraints, but that still would leave us with a shared label and long description (PostgreSQL domains don’t have a comment attribute).

After going down this road awhile I came to the conclusion that I wasn’t going to be able to store all of the metadata that I wanted to using built-in database features. My next thought was to put this in the business logic layer of an application. I even half convinced myself this might be a good idea, but then that left me with a database that wouldn’t be able to enforce the regular expressions I was interested in. So I went back to idea of finding my original goal of finding a way to store the metadata in the database (with regex constraint support). It turns out that the most obvious solution works with a little bit of tweaking and one gotcha.

What might be this obvious solution be? Why a metadata table of course! Something like this perhaps:

CREATE TABLE metadata (
  metadata_id SERIAL NOT NULL,
  column_name VARCHAR(100) NOT NULL,
  display_name VARCHAR(250) NULL,
  long_note TEXT NULL,
  regex VARCHAR(1024) NULL
);

With this table you could share common metadata and add to as needed in the future. Then I looked at how to make this work with check constraints, so I tried something like this:

ALTER TABLE users ADD CONSTRAINT
username ~ (
  SELECT regex
  FROM metadata
  WHERE column_name = 'username'
);

That idea got shot down real fast. Unfortunately you can not do sub-queries in constraints on PostgreSQL. For a moment I thought that I was going to be faced with defeat before I even got out the door. After looking over more examples of constraints it looked like PostgreSQL would allow the use of a user defined function, so I tried this:

CREATE FUNCTION metadata_regex(varchar)
RETURNS varchar AS '
  SELECT regex
  FROM metadata
  WHERE column_name = $1;
' LANGUAGE SQL;
ALTER TABLE users ADD CONSTRAINT
username_ck CHECK (
  username ~ metadata_regex('username')
);

For what ever reason this works where the plain sub-query does not. There are a few nice features to this approach, you can set a constraint on every column without creating the metadata for it. If the metadata_regex() function returns null then it is an insert/update will always succeed (as if there was no constraint there at all except for the small overhead of doing the lookup). This was you can fill in metadata later if you want to. My boss was quick to ask what happens when you change an existing regex that would invalidate existing data? The short answer is that the system will let you change the regex and not give an error even with data that would now be invalid. It will enforce the new regex when you insert new data or update old data. Depending on your needs this may be a feature or a big problem. If you need to catch these regex changes and throw errors if data would be invalid with new regex there are two possible solutions. One solution would be to write a trigger what would verify that existing data would not be invalidated by the new regex and throw an error if it did.

Another solution to the regex change problem is to create functions that serve as an API to the metadata; so all updates, inserts and deletes would be done via these functions instead of directly manipulating the metadata table. These functions would act in a similar way as the trigger described above, preventing changes that would invalidate existing data. I’m not wholly convinced that one is better than the other, but I am starting to lean towards the API idea because it allows you hide the details of how the metadata is stored. This technique also lead me to another feature of PostgreSQL, creating new types.

Combining the API idea with creating new types allows you run regular SQL queries using the functions like tables. For some examples of how this can work take a look at ‘Defining and returning Rowtypes‘ in the 19 May 2003 General Bits. There is another example at General Bits under ‘SQL 2003 Standards Approved‘ from 17 April 2004. Between these two examples you should start to see the possibilities here, by creating an API to all your additional metadata not only can you limit changes you can allow the database to enforce them, either via triggers or constraints and expose them to the application layer through traditional SQL queries.

My goal is to be able to add new metadata (attributes) to database columns, allowing not just the upper application layers to make use of it but the database also. After some research and pondering it looks like PostgreSQL has the features necessary to make this possible in an easy and extensible way.

iPhone

Russ is back with a prediction of an Apple iPhone. Perhaps this is what Steve Jobs meant when he said that movies weren’t the next step for the iPod? This makes me wonder what features such a device would have. The obvious first step is supporting everything that the iPod does now (including photos and ideally video). Bluetooth would be a must, not just for syncing to my Powerbook, but for ear pieces and the Apple bluetooth keyboard and mouse. For storage the hard drive idea would have to go, likely some sort of flash instead, with an SD or compact flash slot for more memory. The ability to sync virtually everything, this would be a must. With all this data going back and forth it would be nice to have firewire 800 as a wired option to bluetooth, but I suspect USB would be much more likely (but it should be USB 2.0 though). In traditional Apple style a nice little bundle applications would come with the phone.

For major bonus points wifi (either 802.11b or 802.11g) in addition to all of the above would really be awesome (syncing over the net, etc).

Phantom of the Opera: Movie?

I have a lot of fond memories associated with The Phantom of the Opera (the San Francisco product at the Curran theatre). Sarah and I saw it at the Curran twice, the first time not long after we meet on one of our first dates, the second time I proposed to her during the intermission. I was little surprised to hear about a Phantom of the Opera movie. Now that I’ve seen the trailer for the movie I’m having very mixed feelings about the movie. On the one had this is Phantom, so I really want to like it. I already have so many good memories associated with it, but I can’t shake the feeling that making it into a movie is just wrong.

Exposing PostgreSQL Regular Expression Check Constraints

Check constraints are a great feature, I was especially excited when I found out that PostgreSQL supports regular expressions in check constraints. Now I want to expose the regex check up to an application layer so that it can be used at input time, before the database chokes on it during an insert or an update. So I went looking for information on how to ask PostgreSQL for the check constraint for a given table and column name. Verlana’s PostgreSQL General Bits came to the rescue with the entry on Implementation of Constraints. The last query in the entry will lookup all on the constraints for a given table. I modified this query to extract just the constraint regular expression for a given table and column.

select substring(consrc from '^.+ [~|~*|!~|!~*] ''(.+)''.+$')
    AS constraint_regex
from pg_class r,
    pg_constraint c
where r.oid = c.conrelid
    and contype = 'c'
    and relname = 'your_table_name'
    and consrc ILIKE '((your_column_name)::text %'