
UUID, serial or identity columns for PostgreSQL auto-generated primary keys?

204 points by lhenk almost 4 years ago

23 comments

3pt14159 almost 4 years ago

> Now, sometimes a table has a natural primary key, for example the social security number of a country's citizens.

You know, you think that, but it's never that simple. The field was added incorrectly and nobody noticed until the value is in countless tables that you now need to simultaneously update, or the value is something that's supposed to be semi-secret, so now low-level support staff can't reference the row when dealing with a request. Or the table's requirements change, and now you need to track two different kinds of data, or data that is missing the field.

Me, I always just have the table make its own ID. It is just simpler, even when you *think* it is overkill.
magicpointer almost 4 years ago

About UUIDs as primary keys and performance, the following article has some insights and benchmarks as well: https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/

Essentially, they observed sizeable performance improvements by using UUID generators that are tweaked to produce more sequential results, which yields better indexes. The article compares sequences, random UUIDs, and two kinds of sequential-ish UUID generators.
pritambarhate almost 4 years ago

A little late to comment here, but for database IDs I have found that Instagram's technique for generating IDs works very well: https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c

They are not serially incrementing but are still sortable, which prevents the index fragmentation issues observed with UUIDs. They are 8 bytes in length, so the index size is smaller than with UUIDs. So you get all the benefits of serial IDs, but they are not easily guessable, which prevents sequential-access attacks.
pmontra almost 4 years ago

Meta: this company has written an impressive number of articles about PostgreSQL since 2013. List at https://www.cybertec-postgresql.com/en/tag/postgresql/
conradfr almost 4 years ago

UUIDs are great when you use the ID "publicly" but an incremental value would be too revealing for various reasons.

So it's good to know that performance is not bad.
eric4smith almost 4 years ago

Simple rules:

Use integer primary keys internally for identifiers and relationships.

Use English/other-language permalinks for URLs.

Use UUIDs in places like APIs, one-time action links, and "private" links that you only want to share with other people.

This has worked fine for me for many, many years.
simonw almost 4 years ago

Something I really like about incrementing integer IDs is that you can run ad-hoc "select * from table order by id desc limit 10" queries to see the most recently inserted rows.

I end up doing this a lot when I'm trying to figure out how my applications are currently being used.

Strictly incrementing UUIDs can offer the same benefit.
barrkel almost 4 years ago

Another point: if there's any temporal locality to your future access patterns - if you're more likely to access multiple rows which were inserted at roughly the same time - then allocating sequential identifiers brings those entries closer together in the primary key index.

I used to work on a reconciliation system which inserted all its results into the database. Only the most recent results were heavily queried, with a long tail of occasional lookups into older results. We never had a problem with primary key indexes (though this was in MySQL, which uses a clustered index on the primary key for row storage, so it's an even bigger benefit); the MD5 column used for identifying repeating data, on the other hand, would blow out the cache on large customers' instances.
foresto almost 4 years ago

I once pondered how I might generate IDs that were as compact as a machine word, without a value (or small set of values) revealing the size of the data set. One application might be user-visible customer numbers that don't easily reveal how many customers there are.

I eventually came across the idea of using maximal-period linear-feedback shift registers to transform an integer variable through every possible value (minus one), but in a non-incremental sequence that depends on the LFSR arrangement.

I never ended up putting the idea to use, but I've always been curious about people who have and how it worked out for them. [Edit to clarify: it was meant for obfuscation, not security against a determined attacker.]
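The idea can be sketched with the well-known maximal-period 16-bit Galois LFSR (tap mask 0xB400): stepping the register visits every nonzero 16-bit value exactly once, in an order that doesn't reveal how many IDs have been issued. A real deployment would want a wider register (e.g. 64-bit taps); 16 bits is used here so the full cycle is cheap to verify.

```python
def lfsr_permutation(start: int = 1):
    """Yield all 65535 nonzero 16-bit states of a maximal-period
    Galois LFSR (tap mask 0xB400), beginning after `start` and
    ending when the cycle returns to it."""
    assert 1 <= start <= 0xFFFF
    state = start
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400  # apply feedback taps
        yield state
        if state == start:  # full period completed
            return
```

To hand out obfuscated customer numbers, one would persist the current register state and step it once per new customer, rather than storing a counter.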
topspin almost 4 years ago

I just started a little side project and chose to use UUIDs for PostgreSQL keys. The schema is highly generic and I anticipate the possibility of merging instances. UUIDs preclude collisions in such a case.
cratermoon almost 4 years ago

Postgres (and other relational DBs) really need to implement something like Snowflake [1] or KSUID [2].

[1] https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html

[2] https://segment.com/blog/a-brief-history-of-the-uuid/
vbsteven almost 4 years ago

I'm currently prototyping a little database+api+cli todo app and I want identifiers that can be abbreviated in the same way partial git commit hashes can be used on the command line. What should I use?

I was thinking of generating random character strings and simply retrying when the DB throws a duplicate-key error on insert. No sharding is necessary and I'd like to have efficient foreign keys. Any thoughts?
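A minimal sketch of that retry-on-collision idea plus git-style prefix resolution (an in-memory set stands in for the database's unique constraint; names are illustrative):

```python
import secrets

ALPHABET = "0123456789abcdef"  # hex, so IDs look like short commit hashes

def new_id(existing: set[str], length: int = 12) -> str:
    """Random hex ID; retry on the (rare) collision, as suggested above."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing:
            existing.add(candidate)
            return candidate

def resolve(prefix: str, existing: set[str]) -> str:
    """Git-style abbreviation: a prefix resolves only if it is unambiguous."""
    matches = [i for i in existing if i.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matches {len(matches)} ids")
    return matches[0]
```

In a real schema the set would be the primary-key unique index, and `resolve` would be a `LIKE 'prefix%'` lookup that errors when more than one row matches.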
rsync almost 4 years ago

I have no particular expertise with modern databases and it has been *decades* since I did any work as a DBA.

However, I cannot imagine creating table entries without a datestamp. No matter what else you are doing, or what you index by, I would want YYYY-MM-DD_HH-MM-SS in every row.

Maybe I'm just weird that way ...
BatteryMountain almost 4 years ago

I feel the whole debate is overkill: 99% of businesses/systems will never have so much data that they NEED to use UUIDs. I personally don't like using integers for keys either, as I've been burnt by them before. I also doubt any software I build today, or have built in the last 10 years, will be used 100 years from now.

Recently I built a new system (a typical business-type backend) and was forced to use SQLite + C# + Dapper. With this combination I cannot use GUIDs/UUIDs, as Dapper cannot properly map them back to C# from SQLite, and my dislike of ints got me thinking. I have a random string generator (used for years for things like OTPs and other reference numbers) that takes an alphabet plus the desired string length. Using 8 to 12 characters, I can get a few million unique permutations - that is, if used as a primary key, a few million per database table. Then I hear, in the back of my head, guys from work arguing I would run out of unique combinations or would have to do lookups to see if they exist. So I decided to slap the year and month on as a prefix, so a key might look like this: 2105HSUAMWPA. This gets indexed really well too, and there is some inherent information visible in the key: year 21, month 5, and then the unique bits. It's basically 4 lines of code that gets called on every new database entity. I think it will be easy to shard/partition the data too if the need arises in the future, by simply looking at the first 4 digits.

Thus to summarize: data is sliced by entity type (customer, invoice, etc.), then by date (2105 for May 2021), then by unique string.

What do you guys think about this approach? Anyone been burnt by something like this?
panny almost 4 years ago

It seems like int vs. bigint is brushed off rather quickly here. bigint is twice the size of int, so indexes will be larger as well. Furthermore, all the FK storage and indexing will also be bloated by this choice. If you design a customer table with a bigint PK, and everything points to customer (invoices, billing statements, etc.), that's not an insignificant amount of space. While most of us may want to have "billions served" like McDonald's, the reality is my company and your company will never have 2 billion customer accounts, even in the wildest of imaginations. If you ever did reach that point, it's "a good problem to have" and relatively easy to move from int -> bigint. Moving in the reverse direction is likely difficult or impossible.

It would be nice to see real benchmarking on millions of rows to compare the three, but my gut tells me you use int by default, bigint if you outgrow int, and UUID if you have plenty of money for hardware and need the distribution capabilities a UUID would enable.
strangeattractr almost 4 years ago

This is making me reconsider how I do IDs. I thought the performance of sequential IDs was significantly better, so my approach was to use a standard auto-increment primary ID and then obfuscate it as id * p mod m, where p and m are coprime and very large. Then I get back the original ID using the modular inverse. Should I just be using UUIDs?
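That scheme, as a runnable sketch (the particular p and m are illustrative; m here is the prime 2^31 - 1, so any smaller p is automatically coprime with it):

```python
M = 2_147_483_647      # modulus: the Mersenne prime 2^31 - 1 (illustrative)
P = 387_420_489        # multiplier, coprime with M since M is prime
P_INV = pow(P, -1, M)  # modular inverse of P mod M (Python 3.8+)

def obfuscate(seq_id: int) -> int:
    """Map a sequential ID to a scrambled public ID."""
    return (seq_id * P) % M

def deobfuscate(public_id: int) -> int:
    """Invert the mapping using the modular inverse."""
    return (public_id * P_INV) % M
```

Multiplication by a unit mod m is a bijection on [0, m), so every internal ID maps to a distinct public ID and back, at the cost of only two multiplications per lookup.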
zzzeek almost 4 years ago

> You are well advised to choose a primary key that is not only unique, but also never changes during the lifetime of a table row. This is because foreign key constraints typically reference primary keys, and changing a primary key that is referenced elsewhere causes trouble or unnecessary work.

In one sense I agree with the author that things are generally just easier when you use surrogate primary keys; however, they really should note here that the FOREIGN KEY constraint itself is not a problem at all, as you can just use ON UPDATE CASCADE.
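A self-contained demonstration of that clause, using SQLite only so it runs without a server (note SQLite needs the pragma, while PostgreSQL enforces foreign keys by default; the ON UPDATE CASCADE syntax is the same in both):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required in SQLite
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE child (
        parent_id INTEGER REFERENCES parent(id) ON UPDATE CASCADE
    )""")
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (1)")

# Rewriting the referenced primary key updates the FK automatically.
conn.execute("UPDATE parent SET id = 42 WHERE id = 1")
```

After the update, `child.parent_id` reads 42 without any manual fixup of the referencing table.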
foobarbazetc almost 4 years ago

Always, always use a bigserial.

(Actually, all serials are bigserials, but the "base type" they add to the table differs, and it'll always come back to bite you later. Ask me how I know…)
ainar-g almost 4 years ago

I don't think I've ever seen this mentioned anywhere, but if you need a unique ID for an entity without a lot of records planned (≤10,000,000), why not use a random int64 with a simple for loop on the application side to catch the occasional collisions? Are there any downsides besides making the application side a tiny bit more complex?
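A sketch of that loop (an in-memory set stands in for the database's unique constraint; in practice the retry would be triggered by a duplicate-key error on insert):

```python
import secrets

def random_id64(taken: set[int]) -> int:
    """Random positive int64; loop on collision as suggested above.

    With 10 million existing rows, a single draw collides with
    probability roughly 1e7 / 2^63, about 1e-12, so retries are
    vanishingly rare at this scale.
    """
    while True:
        candidate = secrets.randbits(63)  # 63 bits keeps it positive for SQL bigint
        if candidate and candidate not in taken:
            taken.add(candidate)
            return candidate
```

The main downsides are the ones other replies raise for random UUIDs too: the IDs are not sortable by insert time, and the random key scatters index inserts.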
staticassertion almost 4 years ago

Another benefit of using sequential integers is that you can leverage a number of optimizations.

For one thing, you can represent a range of data more efficiently by just storing offsets. This means that instead of having to store a 'start' and 'end' at 8 + 8 bytes, you can store something like 'start' and 'offset', where the offset could be based on your window size, like 2 bytes.

You can leverage those offsets in metadata too. For example, I could cache something like "rows (N..N+Offset) all have field X set to null" or some such thing. Now I can query my cache for a given value and avoid the DB lookup, and I can also store way more data in the cache since I can encode ranges. Obviously which things you cache will be data dependent.

Sequential ints make great external indexes for this reason. Maybe I tombstone rows in big chunks to some other data store - again, I can just encode that as a range, and then given a lookup within that range I know to look in the other datastore. With a UUID approach I'd have to tombstone each row individually.

These aren't universal optimizations, but if you *can* leverage them they can be significant.
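The range-encoded tombstone idea can be sketched like this (a hypothetical in-memory index, not any particular store: each archived chunk costs one (start, offset) pair instead of one entry per row):

```python
import bisect

class TombstonedRanges:
    """Track archived ID ranges as (start, offset) pairs, so a lookup
    can decide whether to go to the other datastore."""

    def __init__(self) -> None:
        self.starts: list[int] = []   # sorted range starts
        self.offsets: list[int] = []  # parallel list of range lengths

    def tombstone(self, start: int, offset: int) -> None:
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.offsets.insert(i, offset)

    def is_tombstoned(self, row_id: int) -> bool:
        # Find the last range starting at or before row_id.
        i = bisect.bisect_right(self.starts, row_id) - 1
        return i >= 0 and row_id < self.starts[i] + self.offsets[i]
```

With random UUID keys there is no such contiguity to exploit, which is exactly the point the comment makes.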
rini17 almost 4 years ago

I'm a fan of generating the primary key by copying the natural key (if it's one integer) or a hash of the natural key. This is done only once when the row is created and is never updated, even if the natural key changes. You are then left with a valuable bit of information that something happened to the natural key.
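A sketch of the hash variant (truncating SHA-256 to 63 bits so the value fits a signed bigint column is an assumption, as is the function name):

```python
import hashlib

def surrogate_from_natural(natural_key: str) -> int:
    """Derive a 63-bit surrogate from the natural key once at insert
    time; it is never updated afterwards, so a later mismatch signals
    that the natural key changed."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 1  # fit a signed bigint
```

Since the derivation is deterministic, recomputing it for the current natural key and comparing against the stored surrogate is exactly the "something happened" check the comment describes.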
rossmohax almost 4 years ago

Another alternative is ULID, which can be stored as a UUID on the Postgres side but is more b-tree friendly.
hardwaresofton almost 4 years ago

Yeah, just use a UUID unless the bits to store the UUID really are your driving limitation (they're not). Having a UUID that is non-linear is almost always the most straightforward option for identifying things, at the cost of human readability (though you can get some of that back with prefixes and some other schemes). I'm not going to rehash the benefits that people have brought up for UUIDs; they're in this thread. At this point what I'm concerned about is just... what is the best kind of UUID to use. I've recently started using mostly v1, because the time relationship is important to me (despite the unfortunate ordering issues) and v6 [0] isn't quite so widespread yet. Here's a list of other approaches out there worth looking at:

- isntauuid [1] (mentioned in this thread; I've given it a name here)

- timeflake [2]

- HiLo [3][4]

- ulid [5]

- ksuid [6] (made popular by segment.io)

- v1-v6 UUIDs (the ones we all know and some love)

- sequential interval-based UUIDs in Postgres [7]

Just add a UUID - this almost surely isn't going to be what bricks your architecture, unless you have some crazy high-write use case like time series or IoT or something, *maybe*.

[0]: http://gh.peabody.io/uuidv6/

[1]: https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c

[2]: https://github.com/anthonynsimon/timeflake

[3]: https://en.wikipedia.org/wiki/Hi/Lo_algorithm

[4]: https://www.npgsql.org/efcore/modeling/generated-properties.html#hilo-autoincrement-generation

[5]: https://github.com/edoceo/pg-ulid

[6]: https://github.com/segmentio/ksuid

[7]: https://www.2ndquadrant.com/en/blog/sequential-uuid-generators