"In fact, even in RDBMS’s developers should be using a natural key as their primary key instead of an auto-incremented ID. "<p>No, they shouldn't. The fact that your dataset may contain a piece of information that is usually unique its completely unrelated to the need of RDMSs to have identifiers to implement table relationships and to identify records in a table.<p>Not to mention that it can bring lots of headaches down the line. The classic example is using social security numbers (or another country's equivalent, such as CPF in Brazil) to identify individuals. Down the line you get a requirement to support international customers. That is now a major undertaking.<p>You are also offloading the responsibility of maintaining the uniqueness property to a third party, in effect trusting their database.
DO NOT USE USER-SUPPLIED DATA AS KEYS<p>Database keys are more-or-less equivalent to pointers. In the case of natural keys, they're pointers that (a) do not have ASLR, and (b) are exposed to the user.<p>As the article mostly ignores, you <i>will</i> guess wrong about what can be reasonably considered permanent or unique. Email addresses change (and aren't always unique), names change (and aren't unique), etc. This is much harder to fix if you used that guess as the basis for your entire data model.<p>The problem with proper keys was given as "auto_increment doesn't work on distributed systems". This is rather silly, as there are plenty of alternatives (GUID, UUID, the mentioned snowflake keys).<p>It's also interesting that proper keys are a "crutch" for designing data models and are best replaced with cleverness, and apparently are only useful to dictate record order (I know that Oracle at least doesn't work that way). This brings to mind "debugging code is twice as hard as writing code; if you write code as cleverly as possible, you are by definition not clever enough to debug it". DON'T BE (OVERLY) CLEVER. Do the least-fragile thing that doesn't add complexity (that is, doesn't add complexity <i>visible to you</i>.). Only do clever things that add fragility if you have an actual measured reason to do so.
"In fact, even in RDBMS’s developers should be using a natural key as their primary key instead of an auto-incremented ID. This will lead to better performance when the natural key is the most commonly used identifier, but is often not considered out of a habit."<p>As a database architect I always use auto-incrementing as the primary key. This ID is actually a surrogate key for the real primary composite key, which I assign to a unique key.<p>So, with the unique key in place, the performance will be nearly as good as you'd expect for direct looks by key attributes. However, the performance gain using the surrogate ID within join tables and as foreign keys is much greater.<p>"It’s also worth noting that over their years of use in SQL databases they’ve become somewhat of a crutch when designing the data model for a system."<p>They are not a crutch. They are a wonderful feature. One could model with GUIDs if necessary or something similar for your surrogate key but nothing beats the simplicity of an auto incrementing ID.<p>Even a NoSQL database like MongoDB creates a unique _id for you for each document.
An article promising to be quite technical demonstrates early on the author's inability to get their head around the spelling of the possessive form of "its". (Perhaps it's just me that finds this an unsettling juxtaposition. The About page wasn't useful in identifying native writer status.)<p>Then we dive into some mistruths about emails being unique per user (I know plenty of people who <i>still</i> share email addresses, both personal and also in shared commercial accounts). I worked at a place a few years ago that had designed their entire accounts system around the idea of email addresses being immutable. They'd subsequently had to introduce a secondary column for <i>actual</i> email addresses -- customers would login with their historical email address as their ID. The idea that a uid is a 'natural key' just seems perverse (or at least demonstrates a profound misunderstanding of what a natural key is).<p>A sideways reference - without actually using the phrase - to CAP theorem. Composite keys are suggested as a solution to a vaguely phrased problem without acknowledging the performance issues, and subsequent design constraints, they introduce. Ultimately there seems to be a misunderstanding about the differences between a (primary) key and an index.
The biggest problem with "natural" keys, as referenced in the article: they have a nasty habit of <i>changing</i>. In the article's example, making "username" or "email" the PK is going to be messy if you have foreign keys referencing the value when it updates.
In a recent large project, I moved to using GUIDs. Following advice from this article: <a href="http://www.codeproject.com/Articles/388157/GUIDs-as-fast-primary-keys-under-multiple-database" rel="nofollow">http://www.codeproject.com/Articles/388157/GUIDs-as-fast-pri...</a> it was fairly straightforward and performant.<p>Obviously YMMV (as may your definition of 'large' and 'performant').
"In fact, even in RDBMS’s developers should be using a natural key as their primary key instead of an auto-incremented ID. This will lead to better performance when the natural key is the most commonly used identifier, but is often not considered out of a habit."<p>Primary/unique keys are indexed. Creating an index of strings instead of integers is surely going to result in a huge index and slow performance?
We've been having this argument about primary keys for three decades at least. If the natural key vs. auto-generated ID theories can't be reduced to practice, is there any home for software development as a profession?
> An email address is a piece of information that is guaranteed to be unique to each user because everybody has their own email address.<p>I once let myself in for a lot of headaches by designing a system around this unwarranted assumption.
It looks like the purpose of the article is to explain why it's not a big deal that the company's database solution doesn't support auto-incrementing primary keys.
Instagram had a nice post on how they did their PK generation/sharding a couple years back. Still a good read, <a href="http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram" rel="nofollow">http://instagram-engineering.tumblr.com/post/10853187575/sha...</a>.<p>Previously discuss here, <a href="http://news.ycombinator.com/item?id=3058327" rel="nofollow">http://news.ycombinator.com/item?id=3058327</a>
There's nothing that requires most auto id keys to be monotonically sequential, which is the bottleneck when distributed. If uniqueness is all that's required, blocks of id's can be handed to different hosts and replenished on demand.<p>Gap-free, monotonically increasing id's should only be used for things that have regulatory compliance issues. For example, there shall be no gaps in invoice id's, and voided ones shall be marked.
Small project that may be of interest to those involved in this discussion: <a href="https://github.com/brandtg/alicia" rel="nofollow">https://github.com/brandtg/alicia</a><p>Kind of "eventually monotonic" distributed auto-increment keys. Your mileage may vary.
Could someone point out to me how filtering and selecting records over some parts of the primary key (ex {sensorID} over {deviceID}_{sensorID}_{timestamp}) is better than just having regular indexed fields (ex {sensorID}) containing those pieces of information?
Slightly off-topic:<p>With JS disabled, the page header is white text on a transparent background.<p>With JS enabled, I see no way to remove the header. I have little enough vertical screen real estate, adding an extra chunk of stuff on the top doesn't help. I ended up removing it entirely.