Honestly that's a poor blog post.<p>Randomly concludes "the best time-based ID seems to be xid" without saying why or comparing to others e.g. ksuid, UUIDv7 etc ("xid" is only mentioned twice in the entire blog, first in the above statement and second a link to the reference implementation). Equally unfortunate that they picked "xid" as their supposed "best" because Postgres has an internal identifier that is also called "xid" and is very much <i>NOT</i> to be used as a primary key !<p>Downplays the many issues with "serial", including somehow thinking the word might "might" has a place next to the words "not want to expose them to the world though" .... you <i>DON'T</i>, full stop. Exposing predictable identifiers to the world is never a good thing.<p>I'm not really sure what that blog post is supposed to be achieving really. I didn't learn anything.
While this is a good overview of the options for primary key generation, there's no silver bullet here. Most projects that are using SQL should just use the gold standard: an auto-incrementing integer for an internal primary key. And then decouple the public-facing primary key from it into a separate column, whether it be ULID, UUID, or a random-project-slug-123.<p>Also, during debugging, it's a lot nicer to look at short primary keys than to have UUIDs flooding the screen.
There is an umentioned security aspect that you should be aware of for adopting timestamp-based ids: you are leaking information about time, and this information could be sensitive.<p>This is how I would summarize a security perspective.<p><pre><code> * autoincrement id: leaks information about the system as a whole. Users can attack each other. Might be suitable for an internal-only application or an application that doesn't care about leaking this information and goes to great effort to be resilient to users attacking each other.
* timestamp + random id: leaks information about the time the individual record was created. An attacker can attempt to learn sensitive information about an individual. Suitable for a record that is already publicly shared with its time (e.g. a tweet). Might be suitable otherwise if ids are not public. That is only the record creator can view the id and you don't send out links with the ids to the user (particularly over insecure channels such as email).
* random id: does not leak information. suitable for any use case that is okay with the performance implications (of a non-sortable fragemented index).
</code></pre>
I am wary of how they call xid the best time-based id. It just removes all (run-time) randomness and thus performs the best. xid seems to be the same as MongoDB's oid. It is designed to be a conflict-free timestamp that can be used in a distributed system, and it is good at that. But in terms of protecting users for some use cases it could be worse than an auto-increment id because cross-user attacks are still possible (they will take many, many more attempts though) and it leaks information about time.
I've had good success with using auto-incrementing BIGINTs as internal IDs and creating an additional BYTEA field as external IDs. Foreign keys would be based on the internal IDs, anything user-facing would use external IDs. I think it's a good compromise as it keeps foreign key size small and still allows hiding internal structure from users.
Disclaimer: not a dba so my terms might not be appropriate<p>I’ve seen uuid4 which replaces the first 4 bytes with a timestamp. It was mentioned to me that this strategy allows postgres to write at the end of the index instead of arbitrarily on disk. I also presume it means it has some decent sorting.<p>[inspiration](<a href="https://github.com/tvondra/sequential-uuids/blob/master/sequential_uuids.c">https://github.com/tvondra/sequential-uuids/blob/master/sequ...</a>)
The author misses one advantage of UUIDs: if you’re working in high-throughput distributed systems, serial IDs create a bottleneck and single point of failure in the service handing out IDs.<p>With UUIDs any service can generate an ID itself and tell downstream services about it in parallel—even if one of them is down, slow, or needs retrying.
What does "SORT terribly" mean? That there is no semantically useful ordering? Well of course not, that's not what they are designed for. If you want ordering by time, then include a time-based column and sort on it. Does it mean that sorting performance is bad on UUID columns? Why?<p>And what does "index terribly" mean? You can index UUID columns just fine, so is it a performance concern? What is the concern?
Not a fantastic post, which is a shame as Supabase is quite an interesting company.<p>The answer is almost always "use biginteger identity", and almost never "use integer serial".<p>UUIDs have a place but are often better suited in larger, distributed and more complex data stores than postgres.<p>Using `xid` is such a poor choice I'm surprised it was even mentioned.<p>The "key" thing to remember is you don't <i>have</i> to expose your primary key to the world. Use UUIDs or shortcodes or whatever for external representations. Use bigints internally. This will prevent a world of pain.
For times when you need distributed generation, I’ve worked with a system that I liked. Server kept track of a sequence. Clients pull out batches of 1000 or so and then use them up. When the client starts to get low on available numbers it fetches another batch.<p>The ids generated are nice readable integers. Generally in sorted order, though not a guarantee, and you end up with gaps sometimes if a client doesn’t give out all its numbers before it’s restarted.<p>Would anyone be interested in a super robust version of that as a service?
Good intro article. I'd always heard that serial ints aren't guaranteed to be ordered but never knew why (because they are generated non-transactionally..so if an INSERT transaction rolls back the id that would have been used is effectively consumed/skipped).<p>What I see a lot in practice is a bigint numeric id for internal use (better for joins, FKs) and also a textual token for public use, perhaps with a typed prefix indicate the type of record it's identifying (U-AS234FDS for User, etc)
My opinion. Always if in any way possible pick a semantic key. There is usually something defining the thing you are working on. If there isnt work on your normalisation.<p>Main benefits to this:
Avoids accidental duplication (happens so much). Avoids additional round trips to fetch the id to make a mutation.<p>Of course if you work on something where you don’t know what it is yet (actually humans are a good example for that) uuid or int might make sense but I hear so many times picking non semantic as a default.
Pretty uninformative post. Goes from count(*) to some uuids, but fails to see the bigger picture.<p>Depending on the data, but assuming most data isn't big data:<p>- Use integers internally, maybe suffixed by a shard-id to prevent collision, but keep order.<p>- Use (random) external ids to access from the outside.<p>Note that certain uuids will still leak some information: time between records, number of machines, etc.
One major aspect of primary keys not mentioned in the article is the performance of queries. I did some benchmark recently, where serial (32 bits) was significantly faster than the native Postgres UUID type. The worst scheme you can use is a string; if you encode something like ksuid as a plain string rather than an efficiently packed byte sequence, query performance becomes significantly worse. I didn't benchmark sorting, but I assume it's similarly impacted.
Funny this keeps coming up!<p>I wrote about my experience using ulids the other day, specifically with Postgres and some of the dis/advantages you get with it.<p>It's a deeper dive into ulids than this article is, and shows some real world issues that crop up:<p><a href="https://blog.lawrencejones.dev/ulid/" rel="nofollow">https://blog.lawrencejones.dev/ulid/</a><p>That said, and spoiler alert: I'd probably go with bigint-sequence backed text IDs if I were choosing this over again.
There's some skepticism in the comments around the recommendation for xid. I'm curious if anyone here is using it in production at scale, and can comment on the practical realities.<p>I saw xid make the rounds about a year ago, and the promise of a pseudo-sortable 12-byte identifier that is "configuration free" struck me as a bit far-fetched.<p>In particular, I wondered if the xid scheme gives you enough entropy to be confident you wouldn't run into collisions. UUIDv4 doesn't eat a full 16 bytes of entropy for nothing. For example, if you look at the machine ID component of xid, it does some version of random assignment (either pulling the first three bytes from /etc/machine-id, or from a hash of the hostname). 3 bytes is 16777216 values, i.e., with 600 hosts you have a 1% chance of running into a collision. Probably too close for comfort?<p>There are settings where you can build some defense-in-depth against ID collisions, like a uniqueness constraint in your DB (effectively a centralized ticketing system). But there are many settings where that kind of thing wouldn't be practical. Off the top of my head, I'm thinking of monitoring-type applications like request or trace IDs.
One more note: uuid is not easily copy-pastable, due to dash `-` in it. I prefer to use it's raw bytes and encode with base32, which is copy-pastable.
Hybrid Logical Clock [1] could be of interest for readers.
This is a monotonically increasing clock based on a physical clock.<p>Combined with a machine identifier you can obtain globally unique identifiers that are totally ordered.<p>[1] <a href="https://cse.buffalo.edu/~demirbas/publications/hlc.pdf" rel="nofollow">https://cse.buffalo.edu/~demirbas/publications/hlc.pdf</a>
One note regarding uuid.
It doesn't need to imply it is random. That's specific v4.<p>V5 are predictable uuids.
That combine a ns uuid and a string, via sha1 based one way mapping resulting in a uuid.
Given the advantages of sortable UUIDs and this post's conclusion that XID is the best option right now, is Supabase planning to add pg_idkit to their list of supported extensions?
Size also matters. If you are running cheaper instances on cloud, they have limited IO and can be short on memory.<p>Smart data types for keys, enums instead of varchars etc helps to keep indexes small.
I’ve been using HiLo for so long this isn’t something I think about.<p>I don’t see a point in trying to hide the ID. Either it’s public or it’s private and should be verified before being accessed.
I only use uuid as PK in Postgres. So many benefits, it is easy to migrate data between environments, client can generate its own ids etc.<p>I don’t know about performance but I think in most cases that is not a big concern anyway.