Choosing a Postgres primary key

220 pointsby awaliasover 2 years ago

24 comments

Honestly that's a poor blog post.Randomly concludes "the best time-based ID seems to be xid" without saying why or comparing to others e.g. ksuid, UUIDv7 etc ("xid" is only mentioned twice in the entire blog, first in the above statement and second a link to the reference implementation). Equally unfortunate that they picked "xid" as their supposed "best" because Postgres has an internal identifier that is also called "xid" and is very much NOT to be used as a primary key !Downplays the many issues with "serial", including somehow thinking the word might "might" has a place next to the words "not want to expose them to the world though" .... you DON'T, full stop. Exposing predictable identifiers to the world is never a good thing.I'm not really sure what that blog post is supposed to be achieving really. I didn't learn anything.

评论 #34453773 未加载

评论 #34452074 未加载

评论 #34453589 未加载

评论 #34453119 未加载

评论 #34452243 未加载

评论 #34453060 未加载

评论 #34452898 未加载

ammmirover 2 years ago

While this is a good overview of the options for primary key generation, there's no silver bullet here. Most projects that are using SQL should just use the gold standard: an auto-incrementing integer for an internal primary key. And then decouple the public-facing primary key from it into a separate column, whether it be ULID, UUID, or a random-project-slug-123.Also, during debugging, it's a lot nicer to look at short primary keys than to have UUIDs flooding the screen.

评论 #34453002 未加载

评论 #34453105 未加载

评论 #34457702 未加载

评论 #34459101 未加载

评论 #34453897 未加载

gregwebsover 2 years ago

There is an umentioned security aspect that you should be aware of for adopting timestamp-based ids: you are leaking information about time, and this information could be sensitive.This is how I would summarize a security perspective.<pre><code> * autoincrement id: leaks information about the system as a whole. Users can attack each other. Might be suitable for an internal-only application or an application that doesn't care about leaking this information and goes to great effort to be resilient to users attacking each other. * timestamp + random id: leaks information about the time the individual record was created. An attacker can attempt to learn sensitive information about an individual. Suitable for a record that is already publicly shared with its time (e.g. a tweet). Might be suitable otherwise if ids are not public. That is only the record creator can view the id and you don't send out links with the ids to the user (particularly over insecure channels such as email). * random id: does not leak information. suitable for any use case that is okay with the performance implications (of a non-sortable fragemented index). </code></pre> I am wary of how they call xid the best time-based id. It just removes all (run-time) randomness and thus performs the best. xid seems to be the same as MongoDB's oid. It is designed to be a conflict-free timestamp that can be used in a distributed system, and it is good at that. But in terms of protecting users for some use cases it could be worse than an auto-increment id because cross-user attacks are still possible (they will take many, many more attempts though) and it leaks information about time.

评论 #34454115 未加载

ThePhysicistover 2 years ago

I've had good success with using auto-incrementing BIGINTs as internal IDs and creating an additional BYTEA field as external IDs. Foreign keys would be based on the internal IDs, anything user-facing would use external IDs. I think it's a good compromise as it keeps foreign key size small and still allows hiding internal structure from users.

评论 #34451919 未加载

评论 #34452323 未加载

评论 #34457873 未加载

nargellaover 2 years ago

Disclaimer: not a dba so my terms might not be appropriateI’ve seen uuid4 which replaces the first 4 bytes with a timestamp. It was mentioned to me that this strategy allows postgres to write at the end of the index instead of arbitrarily on disk. I also presume it means it has some decent sorting.[inspiration](<a href="https://github.com/tvondra/sequential-uuids/blob/master/sequential_uuids.c">https://github.com/tvondra/sequential-uuids/blob/master/sequ...</a>)

评论 #34453362 未加载

评论 #34453465 未加载

评论 #34453871 未加载

glacialsover 2 years ago

The author misses one advantage of UUIDs: if you’re working in high-throughput distributed systems, serial IDs create a bottleneck and single point of failure in the service handing out IDs.With UUIDs any service can generate an ID itself and tell downstream services about it in parallel—even if one of them is down, slow, or needs retrying.

评论 #34455502 未加载

评论 #34457478 未加载

评论 #34464821 未加载

geophileover 2 years ago

What does "SORT terribly" mean? That there is no semantically useful ordering? Well of course not, that's not what they are designed for. If you want ordering by time, then include a time-based column and sort on it. Does it mean that sorting performance is bad on UUID columns? Why?And what does "index terribly" mean? You can index UUID columns just fine, so is it a performance concern? What is the concern?

评论 #34462771 未加载

orfover 2 years ago

Not a fantastic post, which is a shame as Supabase is quite an interesting company.The answer is almost always "use biginteger identity", and almost never "use integer serial".UUIDs have a place but are often better suited in larger, distributed and more complex data stores than postgres.Using `xid` is such a poor choice I'm surprised it was even mentioned.The "key" thing to remember is you don't have to expose your primary key to the world. Use UUIDs or shortcodes or whatever for external representations. Use bigints internally. This will prevent a world of pain.

jdwyahover 2 years ago

For times when you need distributed generation, I’ve worked with a system that I liked. Server kept track of a sequence. Clients pull out batches of 1000 or so and then use them up. When the client starts to get low on available numbers it fetches another batch.The ids generated are nice readable integers. Generally in sorted order, though not a guarantee, and you end up with gaps sometimes if a client doesn’t give out all its numbers before it’s restarted.Would anyone be interested in a super robust version of that as a service?

rco8786over 2 years ago

Good intro article. I'd always heard that serial ints aren't guaranteed to be ordered but never knew why (because they are generated non-transactionally..so if an INSERT transaction rolls back the id that would have been used is effectively consumed/skipped).What I see a lot in practice is a bigint numeric id for internal use (better for joins, FKs) and also a textual token for public use, perhaps with a typed prefix indicate the type of record it's identifying (U-AS234FDS for User, etc)

评论 #34459193 未加载

lysecretover 2 years ago

My opinion. Always if in any way possible pick a semantic key. There is usually something defining the thing you are working on. If there isnt work on your normalisation.Main benefits to this: Avoids accidental duplication (happens so much). Avoids additional round trips to fetch the id to make a mutation.Of course if you work on something where you don’t know what it is yet (actually humans are a good example for that) uuid or int might make sense but I hear so many times picking non semantic as a default.

评论 #34452085 未加载

评论 #34452749 未加载

评论 #34452136 未加载

评论 #34453945 未加载

评论 #34457393 未加载

评论 #34451954 未加载

评论 #34453405 未加载

评论 #34454087 未加载

jbverschoorover 2 years ago

Pretty uninformative post. Goes from count(*) to some uuids, but fails to see the bigger picture.Depending on the data, but assuming most data isn't big data:- Use integers internally, maybe suffixed by a shard-id to prevent collision, but keep order.- Use (random) external ids to access from the outside.Note that certain uuids will still leak some information: time between records, number of machines, etc.

atombenderover 2 years ago

One major aspect of primary keys not mentioned in the article is the performance of queries. I did some benchmark recently, where serial (32 bits) was significantly faster than the native Postgres UUID type. The worst scheme you can use is a string; if you encode something like ksuid as a plain string rather than an efficiently packed byte sequence, query performance becomes significantly worse. I didn't benchmark sorting, but I assume it's similarly impacted.

lawrjoneover 2 years ago

Funny this keeps coming up!I wrote about my experience using ulids the other day, specifically with Postgres and some of the dis/advantages you get with it.It's a deeper dive into ulids than this article is, and shows some real world issues that crop up:<a href="https://blog.lawrencejones.dev/ulid/" rel="nofollow">https://blog.lawrencejones.dev/ulid/</a>That said, and spoiler alert: I'd probably go with bigint-sequence backed text IDs if I were choosing this over again.

jeffomaticover 2 years ago

There's some skepticism in the comments around the recommendation for xid. I'm curious if anyone here is using it in production at scale, and can comment on the practical realities.I saw xid make the rounds about a year ago, and the promise of a pseudo-sortable 12-byte identifier that is "configuration free" struck me as a bit far-fetched.In particular, I wondered if the xid scheme gives you enough entropy to be confident you wouldn't run into collisions. UUIDv4 doesn't eat a full 16 bytes of entropy for nothing. For example, if you look at the machine ID component of xid, it does some version of random assignment (either pulling the first three bytes from /etc/machine-id, or from a hash of the hostname). 3 bytes is 16777216 values, i.e., with 600 hosts you have a 1% chance of running into a collision. Probably too close for comfort?There are settings where you can build some defense-in-depth against ID collisions, like a uniqueness constraint in your DB (effectively a centralized ticketing system). But there are many settings where that kind of thing wouldn't be practical. Off the top of my head, I'm thinking of monitoring-type applications like request or trace IDs.

deepsunover 2 years ago

One more note: uuid is not easily copy-pastable, due to dash `-` in it. I prefer to use it's raw bytes and encode with base32, which is copy-pastable.

评论 #34465876 未加载

conaclosover 2 years ago

Hybrid Logical Clock [1] could be of interest for readers. This is a monotonically increasing clock based on a physical clock.Combined with a machine identifier you can obtain globally unique identifiers that are totally ordered.[1] <a href="https://cse.buffalo.edu/~demirbas/publications/hlc.pdf" rel="nofollow">https://cse.buffalo.edu/~demirbas/publications/hlc.pdf</a>

say_it_as_it_isover 2 years ago

Probably worth revising this post to include BIGINT and BIGSERIAL, and then changing the entire Supabase schema to use them as well.

je42over 2 years ago

One note regarding uuid. It doesn't need to imply it is random. That's specific v4.V5 are predictable uuids. That combine a ns uuid and a string, via sha1 based one way mapping resulting in a uuid.

quartzover 2 years ago

Given the advantages of sortable UUIDs and this post's conclusion that XID is the best option right now, is Supabase planning to add pg_idkit to their list of supported extensions?

评论 #34455624 未加载

jpalomakiover 2 years ago

Size also matters. If you are running cheaper instances on cloud, they have limited IO and can be short on memory.Smart data types for keys, enums instead of varchars etc helps to keep indexes small.

philliphaydonover 2 years ago

I’ve been using HiLo for so long this isn’t something I think about.I don’t see a point in trying to hide the ID. Either it’s public or it’s private and should be verified before being accessed.

评论 #34468600 未加载

deepsunover 2 years ago

As for DB ids I prefer to have some integer internal DB ID, only for foreign keys, and additionally something like UUID for client-side.

waspightover 2 years ago

I only use uuid as PK in Postgres. So many benefits, it is easy to migrate data between environments, client can generate its own ids etc.I don’t know about performance but I think in most cases that is not a big concern anyway.

24 comments

traceroute66over 2 years ago

评论 #34453773 未加载

评论 #34452074 未加载

评论 #34453589 未加载

评论 #34453119 未加载

评论 #34452243 未加载

评论 #34453060 未加载

评论 #34452898 未加载

ammmirover 2 years ago

评论 #34453002 未加载

评论 #34453105 未加载

评论 #34457702 未加载

评论 #34459101 未加载

评论 #34453897 未加载

gregwebsover 2 years ago

评论 #34454115 未加载

ThePhysicistover 2 years ago

评论 #34451919 未加载

评论 #34452323 未加载

评论 #34457873 未加载

nargellaover 2 years ago

评论 #34453362 未加载

评论 #34453465 未加载

评论 #34453871 未加载

glacialsover 2 years ago

评论 #34455502 未加载

评论 #34457478 未加载

评论 #34464821 未加载

geophileover 2 years ago

评论 #34462771 未加载

orfover 2 years ago

jdwyahover 2 years ago

rco8786over 2 years ago

评论 #34459193 未加载

lysecretover 2 years ago

评论 #34452085 未加载

评论 #34452749 未加载

评论 #34452136 未加载

评论 #34453945 未加载

评论 #34457393 未加载

评论 #34451954 未加载

评论 #34453405 未加载

评论 #34454087 未加载

jbverschoorover 2 years ago

atombenderover 2 years ago

lawrjoneover 2 years ago

jeffomaticover 2 years ago

deepsunover 2 years ago

One more note: uuid is not easily copy-pastable, due to dash `-` in it. I prefer to use it's raw bytes and encode with base32, which is copy-pastable.

评论 #34465876 未加载

conaclosover 2 years ago

say_it_as_it_isover 2 years ago

Probably worth revising this post to include BIGINT and BIGSERIAL, and then changing the entire Supabase schema to use them as well.

je42over 2 years ago

quartzover 2 years ago

Given the advantages of sortable UUIDs and this post's conclusion that XID is the best option right now, is Supabase planning to add pg_idkit to their list of supported extensions?

评论 #34455624 未加载

jpalomakiover 2 years ago

philliphaydonover 2 years ago

评论 #34468600 未加载

deepsunover 2 years ago

As for DB ids I prefer to have some integer internal DB ID, only for foreign keys, and additionally something like UUID for client-side.

waspightover 2 years ago