My notes on Gitlab's Postgres schema design (2022)

488 points by daigoba66 over 1 year ago

15 comments

yellowapple over 1 year ago

> It is generally a good practice to not expose your primary keys to the external world. This is especially important when you use sequential auto-incrementing identifiers with type integer or bigint since they are guessable.

What value would there be in preventing guessing? How would that even be possible if requests have to be authenticated in the first place?

I see this "best practice" advocated often, but to me it reeks of security theater. If an attacker is able to do anything useful with a guessed ID without being authenticated and authorized to do so, then something else has gone horribly, horribly, *horribly* wrong and *that* should be the focus of one's energy instead of adding needless complexity to the schema.

The only case I know of where this might be valuable is from a business intelligence standpoint, i.e. you don't want competitors to know how many customers you have. My sympathy for such concerns is quite honestly pretty low, and I highly doubt GitLab cares much about that.

In GitLab's case, I'm reasonably sure the decision to use id + iid is less driven by "we don't want people guessing internal IDs" and more driven by query performance needs.
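For readers unfamiliar with the pattern being referenced, a minimal sketch of what an id + iid layout can look like; the table and column names here are hypothetical, not GitLab's actual DDL:

    -- Internal id is the global primary key; iid is the per-project number shown in URLs.
    CREATE TABLE projects (
        id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        path text NOT NULL
    );

    CREATE TABLE issues (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- internal, global
        project_id bigint NOT NULL REFERENCES projects (id),
        iid        bigint NOT NULL,                                  -- external, per-project
        title      text NOT NULL,
        UNIQUE (project_id, iid)  -- lookups like /:project/issues/:iid use this index
    );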
kelnos over 1 year ago

> For example, Github had 128 million public repositories in 2020. Even with 20 issues per repository it will cross the serial range. Also changing the type of the table is expensive.

I expect the majority of those public repositories are forks of other repositories, and those forks only exist so someone could create pull requests against the main repository. As such, they won't ever have any issues, unless someone makes a mistake.

Beyond that, there are probably a lot of small, toy projects that have no issues at all, or at most a few. Quickly-abandoned projects will suffer the same fate.

I suspect that even though there are certainly some projects with hundreds and thousands of issues, the average across all 128M of those repos is likely pretty small, probably keeping things well under the 2B limit.

Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs, github.com included.
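For concreteness, the ceiling under discussion is the signed 32-bit maximum, and widening the column later forces a full table rewrite under an exclusive lock. A rough sketch, assuming a table simply named issues:

    -- serial/integer ids top out at 2^31 - 1, and the quoted estimate crosses it:
    SELECT 2147483647 AS serial_max, 128000000::bigint * 20 AS estimated_issue_ids;

    -- Widening afterwards works, but rewrites the whole table and blocks writes while it runs:
    ALTER TABLE issues ALTER COLUMN id TYPE bigint;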
zetalyrae over 1 year ago

The point about the storage size of UUID columns is unconvincing. 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

A much more salient concern for me is performance. UUIDv4 is widely supported but is completely random, which is not ideal for index performance. UUIDv7 [0] is closer to Snowflake [1] and has some temporal locality but is less widely implemented.

There's an orthogonal approach, which is using bigserial and encrypting the keys: https://github.com/abevoelker/gfc64

But this means 1) you can't rotate the secret and 2) if it's ever leaked everyone can now Fermi-estimate your table sizes.

Having separate public and internal IDs seems tedious and sacrifices performance (if the public-facing ID is a UUIDv4).

I think UUIDv7 is the solution that checks the most boxes.

[0]: https://uuid7.com/

[1]: https://en.wikipedia.org/wiki/Snowflake_ID
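A minimal sketch of the two column choices being weighed, assuming a reasonably recent Postgres (gen_random_uuid() is built in since version 13 and produces UUIDv4; UUIDv7 generation typically still comes from an extension or the application):

    -- bigserial-style key: 8 bytes, monotonic, index-friendly, but enumerable.
    CREATE TABLE events_bigint (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- UUIDv4 key: 16 bytes, non-guessable, but fully random, so inserts land on
    -- random btree pages and index locality suffers.
    CREATE TABLE events_uuid (
        id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        created_at timestamptz NOT NULL DEFAULT now()
    );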
traceroute66 over 1 year ago

Slight nit-pick, but I would pick up the author on the text vs varchar section.

The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".

This horse bolted a long time ago. It's not "not much", it's "none".

The Postgres wiki [1] explicitly tells you to use text unless you have a very good reason not to. And indeed the docs themselves [2] tell us that "For many purposes, character varying acts as though it were a domain over text", and further down, in the green Tip box, that "There is no performance difference among these three types".

Therefore Gitlab's use of (mostly) text would indicate that they have RTFM and that they have designed their schema for their choice of database (Postgres) instead of attempting to implement some stupid "portable" schema.

[1] https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_varchar.28n.29_by_default
[2] https://www.postgresql.org/docs/current/datatype-character.html
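The way this advice usually gets applied, sketched below with an illustrative table: use text, and if a length cap is genuinely required, enforce it with a constraint rather than baking it into the column type:

    -- Same storage and performance as varchar(n), but the limit lives in a named
    -- constraint that can be dropped or changed independently of the column type.
    CREATE TABLE users (
        id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        username text NOT NULL,
        CONSTRAINT username_length CHECK (char_length(username) <= 255)
    );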
exabrial over 1 year ago

"Foreign keys are expensive" is an oft-repeated, rarely benchmarked claim. There are tons of ways to do it incorrectly. But in your stack you are always enforcing integrity _somewhere_ anyway. Leveraging the database instead of reimplementing it requires knowledge and experimentation, and more often than not it will save your bacon.
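One way to actually put numbers on the claim rather than repeat it; a rough micro-benchmark sketch with made-up table names (run each INSERT with \timing enabled in psql and compare):

    CREATE TABLE parents (id bigint PRIMARY KEY);
    CREATE TABLE children_fk   (id bigint PRIMARY KEY, parent_id bigint REFERENCES parents (id));
    CREATE TABLE children_nofk (id bigint PRIMARY KEY, parent_id bigint);

    INSERT INTO parents SELECT g FROM generate_series(1, 100000) g;

    -- Same bulk insert, with and without the constraint check:
    INSERT INTO children_fk   SELECT g, (g % 100000) + 1 FROM generate_series(1, 1000000) g;
    INSERT INTO children_nofk SELECT g, (g % 100000) + 1 FROM generate_series(1, 1000000) g;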
bluerooibos over 1 year ago

Has anyone written about or noticed the performance differences between Gitlab and GitHub?

They're both Rails-based applications, but I find page load times on Gitlab in general to be horrific compared to GitHub.
vinnymac over 1 year ago

I always wondered what that extra “I” in the CI variables `CI_PIPELINE_IID` and `CI_MERGE_REQUEST_IID` was for. I always assumed it was a database-related choice, but this article confirms it.
gfody over 1 year ago

> 1 quintillion is equal to 1000000000 billions

It is pretty wild that we generally choose between int32 and int64. We really ought to have a 5-byte integer type, which would support cardinalities of ~1T.
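The arithmetic behind that figure, for reference: a 5-byte (40-bit) integer has 2^40 distinct values, which is just over a trillion.

    SELECT 2::numeric ^ 40 AS five_byte_cardinality;  -- 1,099,511,627,776, i.e. ~1.1 trillion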
azlev over 1 year ago

It's reasonable not to have auto-increment ids, but it's not clear to me whether there are benefits to having two IDs, one internal and one external. This increases the number of columns / indexes, makes you always do a lookup first, and I can't see a security scenario where I would change the internal key without changing the external key. Am I missing something?
josephg over 1 year ago

> As I discussed in an earlier post [3] when you use Postgres native UUID v4 type instead of bigserial table size grows by 25% and insert rate drops to 25% of bigserial. This is a big difference.

Does anyone know why UUIDv4 is so much worse than bigserial? UUIDs are just 128-bit numbers. Are they super expensive to generate or something? What's going on here?
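A rough sketch of how one might reproduce the comparison locally; as another comment in this thread notes, the gap is generally attributed to fully random keys scattering inserts across the btree (hurting index locality) rather than to the cost of generating the UUIDs themselves. Table names here are made up:

    CREATE TABLE t_bigserial (id bigserial PRIMARY KEY, payload text);
    CREATE TABLE t_uuid      (id uuid PRIMARY KEY DEFAULT gen_random_uuid(), payload text);

    -- With \timing on in psql, compare insert times, then compare on-disk sizes:
    INSERT INTO t_bigserial (payload) SELECT 'x' FROM generate_series(1, 1000000);
    INSERT INTO t_uuid      (payload) SELECT 'x' FROM generate_series(1, 1000000);

    SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
    FROM pg_class
    WHERE relname IN ('t_bigserial', 't_uuid');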
eezing over 1 year ago

We shouldn't assume that this schema was designed all at once; rather, it is the product of evolution. For example, maybe the external_id was added after the initial release in order to support the creation of unique ids in the application layer.
martinald over 1 year ago

Is it just me who thinks schema design and development in general is stuck in the stone ages?

I mainly know dotnet stuff, which does have migrations in EF (I note the point about Gitlab not using this kind of thing because of database compatibility). It can point out common data-loss risks while generating them.

However, it is still always quite scary doing migrations, especially bigger ones that refactor something. Throw jsonb columns into the mix and I feel it is really easy to screw things up and suffer bad data loss.

For example, renaming a column (at least in EF) will result in a column drop and column create in the autogenerated migrations. Why can't I give the compiler/migration tool more context on this easily?

Also, on the point about external IDs and internal IDs - why can't the database/ORM do this more automatically?

I feel there really hasn't been much progress on this since migration tooling came around 10+ years ago. I know ORMs are leaky abstractions, but I feel everyone reinvents this stuff themselves and every project does these common things a different way.

Are there any tools people use for this?
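To make the rename example concrete: the difference between what a drop-and-recreate migration emits and the single statement the change actually needs, using a hypothetical users table:

    -- What autogenerated migrations tend to produce when they can't tell a rename
    -- from a drop + add (the data in the old column is lost):
    ALTER TABLE users DROP COLUMN user_name;
    ALTER TABLE users ADD COLUMN username text;

    -- What the rename actually needs (a metadata-only change, no data loss):
    ALTER TABLE users RENAME COLUMN user_name TO username;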
rob137 over 1 year ago

I found this post very useful. I'm wondering where I could find others like it?
sidcool over 1 year ago
Great read! And even better comments here.
firemelt over 1 year ago

So, does anyone use schema.rb in production? Even DHH's ONCE Campfire uses .sql instead of schema.rb.