> It is generally a good practice to not expose your primary keys to the external world. This is especially important when you use sequential auto-incrementing identifiers with type integer or bigint since they are guessable.

What value would there be in preventing guessing? How would that even be possible if requests have to be authenticated in the first place?

I see this "best practice" advocated often, but to me it reeks of security theater. If an attacker is able to do anything useful with a guessed ID without being authenticated and authorized to do so, then something else has gone horribly, horribly, *horribly* wrong, and *that* should be the focus of one's energy instead of adding needless complexity to the schema.

The only case I know of where this might be valuable is from a business intelligence standpoint, i.e. you don't want competitors to know how many customers you have. My sympathy for such concerns is quite honestly pretty low, and I highly doubt GitLab cares much about that.

In GitLab's case, I'm reasonably sure the decision to use id + iid is less driven by "we don't want people guessing internal IDs" and more driven by query performance needs.
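For reference, the id + iid pattern looks roughly like this (a sketch with illustrative table and column names, not GitLab's actual schema):

    CREATE TABLE projects (
        id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE issues (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- internal, globally unique
        project_id bigint NOT NULL REFERENCES projects (id),
        iid        bigint NOT NULL,  -- per-project issue number shown in URLs ("#42")
        title      text   NOT NULL,
        UNIQUE (project_id, iid)     -- lookups by (project, issue number) hit this index
    );

Every page that renders an issue resolves (project_id, iid) through that composite index, while all joins and foreign keys keep using the compact internal id.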
> *For example, Github had 128 million public repositories in 2020. Even with 20 issues per repository it will cross the serial range. Also changing the type of the table is expensive.*

I expect the majority of those public repositories are forks of other repositories, and those forks only exist so someone could create pull requests against the main repository. As such, they won't ever have any issues, unless someone makes a mistake.

Beyond that, there are probably a lot of small, toy projects that have no issues at all, or at most a few. Quickly-abandoned projects will suffer the same fate.

I suspect that even though there are certainly some projects with hundreds and thousands of issues, the average across all 128M of those repos is likely pretty small, probably keeping things well under the 2B limit.

Having said that, I agree that using a 4-byte type (well, 31-bit, really) for that table is a ticking time bomb for some orgs, github.com included.
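For concreteness, the ceiling being discussed:

    max serial (int4)    = 2^31 - 1 = 2,147,483,647             (~2.1B)
    128M repos * 20 issues          = 2,560,000,000             (over the limit)
    max bigserial (int8) = 2^63 - 1 = 9,223,372,036,854,775,807

So the article's 20-issues-per-repo average would blow through int4, but the realistic average only needs to stay under ~16 issues per repo to fit.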
The point about the storage size of UUID columns is unconvincing. 128 bits vs. 64 bits doesn't matter much when the table has five other columns.

A much more salient concern for me is performance. UUIDv4 is widely supported but completely random, which is not ideal for index performance. UUIDv7 [0] is closer to Snowflake [1] and has some temporal locality, but is less widely implemented.

There's an orthogonal approach, which is using bigserial and encrypting the keys: https://github.com/abevoelker/gfc64

But this means 1) you can't rotate the secret, and 2) if it's ever leaked, everyone can now Fermi-estimate your table sizes.

Having separate public and internal IDs seems both tedious and a performance sacrifice (if the public-facing ID is a UUIDv4).

I think UUIDv7 is the solution that checks the most boxes.

[0]: https://uuid7.com/

[1]: https://en.wikipedia.org/wiki/Snowflake_ID
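The "less widely implemented" gap is closing, for what it's worth: Postgres 18 ships a native uuidv7(); on older versions you'd generate v7 IDs in the application or via an extension. A sketch:

    -- Postgres 18+ only; older versions need an extension or
    -- application-side UUIDv7 generation.
    CREATE TABLE events (
        id         uuid        PRIMARY KEY DEFAULT uuidv7(),
        payload    jsonb,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- The timestamp-ordered prefix means new rows land on the rightmost
    -- B-tree pages, unlike gen_random_uuid() (v4), which scatters inserts
    -- across the whole index.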
Slight nit-pick, but I would pick up the author on the text vs. varchar section.

The author effectively wastes many words trying to prove a non-existent performance difference and then concludes "there is not much performance difference between the two types".

This horse bolted a long time ago. It's not "not much", it's "none".

The Postgres wiki [1] explicitly tells you to use text unless you have a very good reason not to. And indeed the docs themselves [2] tell us that "For many purposes, character varying acts as though it were a domain over text", and further down, in the green Tip box, "There is no performance difference among these three types".

Therefore GitLab's use of (mostly) text would indicate that they have RTFM and that they have designed their schema for their choice of database (Postgres) instead of attempting to implement some stupid "portable" schema.

[1] https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_varchar.28n.29_by_default

[2] https://www.postgresql.org/docs/current/datatype-character.html
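Concretely, the wiki's advice when you do need a length cap is text plus a CHECK constraint, since the constraint can be swapped later without a table rewrite (a sketch with illustrative names; the constraint name below is Postgres's default for a column check):

    CREATE TABLE users (
        id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        username text   NOT NULL CHECK (length(username) <= 255)
    );

    -- Raising the cap later is a constraint swap, not a rewrite of the table:
    ALTER TABLE users DROP CONSTRAINT users_username_check;
    ALTER TABLE users ADD CONSTRAINT users_username_check
        CHECK (length(username) <= 500);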
"Foreign keys are expensive" is an oft-repeated, rarely benchmarked claim. There are tons of ways to do it incorrectly. But in your stack you are always enforcing integrity *somewhere* anyway. Leveraging the database instead of reimplementing it requires knowledge and experimentation, and more often than not it will save your bacon.
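One of the classic ways to do it incorrectly: Postgres does not automatically index the referencing column, only the referenced one. A sketch, assuming a users table with a bigint primary key:

    CREATE TABLE orders (
        id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        user_id bigint NOT NULL REFERENCES users (id)
    );

    -- The referenced side (users.id) is indexed via its PK, but the
    -- referencing side is not; without this index, every DELETE or key
    -- UPDATE on users seq-scans orders to check the constraint:
    CREATE INDEX orders_user_id_idx ON orders (user_id);

Most "foreign keys killed our performance" stories I've seen trace back to exactly this, not to the constraint checking itself.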
Has anyone written about or noticed the performance differences between GitLab and GitHub?

They're both Rails-based applications, but I find page load times on GitLab in general to be horrific compared to GitHub.
I always wondered what that extra “I” in the CI variables `CI_PIPELINE_IID` and `CI_MERGE_REQUEST_IID` was for. I always assumed it was a database-related choice, and this article confirms it.
> 1 quintillion is equal to 1000000000 billions

It is pretty wild that we generally choose between int32 and int64. We really ought to have a 5-byte integer type, which would support cardinalities of ~1T.
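The math behind "~1T":

    2^31 - 1 = 2,147,483,647              (int32, ~2.1B)
    2^39 - 1 = 549,755,813,887            (5 bytes, signed: ~550B)
    2^40 - 1 = 1,099,511,627,775          (5 bytes, unsigned: ~1.1T)
    2^63 - 1 = 9,223,372,036,854,775,807  (int64, ~9.2 quintillion)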
It's reasonable not to have auto-increment IDs, but it's not clear to me whether there are benefits to having two IDs, one internal and one external. This increases the number of columns/indexes, makes you always do a lookup first, and I can't see a security scenario where I would change the internal key without changing the external key. Am I missing something?
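For what it's worth, the usual argument for the split is that joins and foreign keys stay on a compact, locality-friendly bigint while only the API boundary sees an unguessable UUID. A sketch with illustrative names:

    CREATE TABLE accounts (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,   -- internal: joins/FKs
        external_id uuid   NOT NULL UNIQUE DEFAULT gen_random_uuid(),  -- exposed in the API
        email       text   NOT NULL
    );

    -- One extra indexed lookup at the edge; everything internal stays bigint:
    SELECT id FROM accounts
    WHERE external_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';

Whether that one extra lookup is worth the narrower, faster indexes everywhere else seems like the actual trade-off, not security.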
> As I discussed in an earlier post[3] when you use Postgres native UUID v4 type instead of bigserial table size grows by 25% and insert rate drops to 25% of bigserial. This is a big difference.

Does anyone know why UUIDv4 is so much worse than bigserial? UUIDs are just 128-bit numbers. Are they super expensive to generate or something? What's going on here?
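The usual explanation is B-tree locality rather than generation cost: bigserial always appends to the rightmost leaf page, while random v4 keys dirty pages all over the index. A rough way to see it yourself in psql (a sketch, not the linked post's benchmark):

    CREATE TABLE t_serial (id bigserial PRIMARY KEY, n int);
    CREATE TABLE t_uuid   (id uuid DEFAULT gen_random_uuid() PRIMARY KEY, n int);

    \timing on
    INSERT INTO t_serial (n) SELECT g FROM generate_series(1, 1000000) g;
    INSERT INTO t_uuid   (n) SELECT g FROM generate_series(1, 1000000) g;

    -- compare on-disk footprint (table + indexes) as well:
    SELECT pg_size_pretty(pg_total_relation_size('t_serial')) AS serial_size,
           pg_size_pretty(pg_total_relation_size('t_uuid'))   AS uuid_size;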
We shouldn't assume that this schema was designed all at once; rather, it's the product of evolution. For example, maybe the external_id was added after the initial release in order to support the creation of unique IDs in the application layer.
Is it just me, or is schema design and development in general stuck in the stone ages?

I mainly know dotnet stuff, which does have migrations in EF (I note the point about GitLab not using this kind of thing because of database compatibility). It can point out common data-loss scenarios while generating them.

However, it is still always quite scary doing migrations, especially bigger ones refactoring something. Throw jsonb columns into the mix and I feel it is really easy to screw things up and suffer bad data loss.

For example, renaming a column (at least in EF) will result in a column drop and column create in the auto-generated migrations. Why can't I easily give the compiler/migration tool more context on this?

Also, the point about external IDs and internal IDs: why can't the database/ORM do this more automatically?

I feel there really hasn't been much progress on this since migration tooling came around 10+ years ago. I know ORMs are leaky abstractions, but I feel everyone reinvents this stuff themselves and every project does these common things in a different way.

Are there any tools people use for this?
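The frustrating part is that at the SQL level the rename itself is trivial; the tooling just can't tell "renamed" apart from "dropped and added" without a hint. With hypothetical names:

    -- What you want (a metadata-only change in Postgres, no data loss):
    ALTER TABLE users RENAME COLUMN username TO login;

    -- What a naive auto-generated diff produces instead (data loss):
    ALTER TABLE users DROP COLUMN username;
    ALTER TABLE users ADD COLUMN login text;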