Tuesday, March 31, 2009

INT v. GUID - The Battle Rages On

So a friend of mine (and probably countless other geeks) continue to debate the data type that is best used for "surrogate" primary keys in database tables....a sarcastic comment on Facebook from yours truly sparked a bloody battle in the larger war.. ok I'm being dramatic - but it did kick off a good conversation... and brought out some of the underlying issues surrounding this oh-so-important debate....the discussion follows:

---

Brian Sneddon Looks like Twitter's record ids for storing tweets is around 1.4 billion. Hope they're using bigint already because the clock is ticking...

Kyle Rowland at 3:46pm March 30
guid

Brian Sneddon at 4:43pm March 30
If you use a guid you have no way to compare the time a record was created relative to other records unless you add a timestamp field. Then if you want to be able to search off it you need to index it. Can be quite a high cost versus just using something like a bigint.

Kyle Rowland at 4:46pm March 30
Right - but you're adding multiple purpose to the key - this is kind of a hack - at bare minimum it's unclear. Justifying it in terms of processing cost is like arguing that a GUI takes too much computing cost and we should therefore use only commandline prompts (yeah, i know i know...if only)... it's the same thing though, you're trading clarity for nominal processing cost.

Brian Sneddon at 4:56pm March 30
I can see your perspective, but I don't agree that it's a hack or unclear. I think having an autoincrementing primary key is a self-evident mechanism for both establishing comparative chronology as well as ensuring a unique reference to the row. I don't think there's anything wrong with taking self-evident and consistent behavior and making logical conclusions based on it. If I store bank balance information and want to use it not only to determine a customer's balance but to also make a business decision as to what tier I consider a customer to be in if the tier is consistently determined by the range of their account balance, I wouldn't see a problem in using that account banance for multiple purposes.

Kyle Rowland at 5:52pm March 30
There's nothing *wrong* with it per se - it's just better not to tie multiple meanings together like that. The bank balance thing is a perfect example - sure you could store it like that, but what if the business rule changes...in your world that rule is now embedded as an assumption of the code that is reading the data. If the rule changes you have to change whatever code was making assumptions about the rule. If you separated them out into two distinct fields (say a balance field and a tier field) - it would be very clear what was going on, and if your business rule changed it would require minimal changes to the client code. It's not a right or wrong thing - it's a readable, maintainable or not thing..

Brian Sneddon at 6:31pm March 30
To compare apples with apples in my example, the tier never changes. If it would change then it would make more sense to store it separately. In this example a tier is fundamentally tied to a specific balance range and is immutable. Not the best real world example, I admit. However using an autoincrementing primary key still allows you to both establish uniqueness as well as provide comparative chronology in both an efficient and straightforward manner. I think the argument that it's not clear could be taken ad absurdum to claim somewhat subjectively that we should abandon any complex datatypes simply because there might be some confusion as to how they should be used. There's a balance you strike here between efficiency and simplicity, and while any point on the spectrum isn't inherently wrong there will be disagreement about the necessity of leaning in either direction. I just doubt that anyone who works with database apps would actually be confused by it.

Brian Sneddon at 6:35pm March 30
Then to get into database implementations, when inserting into an index it is cheaper to insert into the end of the index as in typical tree implementations it requires much less maintenance of the tree in the process. With guid primary keys, which is of course what this stemmed from (no pun intended), you're constantly inserting into various points in the tree and can incur a lot of maintrnance cost in the process. I just don't see a whole lot of value to using guid vs a sufficiently large integer when you need a separate primary key.

Kyle Rowland at 8:16am March 31
Of course it's subjective - this is where software is an art. I'm not saying using an int primary key is evil - just that it slightly less clear (IMHO). In general, I like to avoid doubling up on meaning - this is just one design decision - but multiply it by 1000 and depending on the collection of individual choices one of two things emerges: an elegant, modular system that is a joy to maintain or that other thing. Both work, but I know which I'd prefer to work on. So to get back to the main point - there are really two arguments that you've proposed for using int for a primary key:

1) optimized database access
2) the convenient side effect of having chronology

They're both valid... however, until you show me some evidence that #1 even makes a difference in the real world, I don't believe it should be taken in to account - and I don't mean benchmarks, I mean in the context of a real, well written app. And really in total, I'd still prefer clarity over both of these...

Kyle Rowland at 8:18am March 31
Also re: The Mythical Immutable Requirement.... I don't believe it exists. In either case here or at all.

Brian Sneddon at 10:32am March 31
Check http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/#comment-81922 for an explanation as to why UUID can hurt performance. I know you said you don't want a benchmark, but you know I don't have access to any real-world scenerios. ;) However based on the simplicity of the benchmark at the end (i.e. bulk loading into a table) you can see that as the amount of data you need to insert rises the performance impact becomes clearly visible. This impact might not be noticable in smaller apps, but would certainly manifest as it grew.

Brian Sneddon at 10:37am March 31
of course at the same time Brian Aker's tests using UUID don't show that same performance degredation. So I guess the results aren't so clear. The great debate will live on! :)

Kyle Rowland at 11:46am March 31
yes it will :)

2 comments:

Brian said...

Man, you and this Brian guy seem smart! ;)

jamie said...

I like clustered keys composed of relevant columns ... INNODB style