Ah yes, I too have accidentally committed node_modules.<p>Jokes aside, and coming from a place of ignorance, it's interesting to me that a file count of that size is still a real performance issue for git. I'd have expected something so ubiquitous and core to most of the software world to have seen improvements there by now.<p>Genuine, non-snarky question:
Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made? Or is this a case of it being a large effort and no one has particularly cared enough yet to take it on?
Something I learned about writing robust code is that scalability needs to be tested up-front. Test with 0, 1, and <i>many</i>, where the latter is tens of millions, not just ten.<p>I've seen production databases that had 40,000 tables for <i>valid</i> reasons.<p>I've personally deployed an app that needed 80,000 security groups in a single LDAP domain, just for that one app. I can't remember the total number of groups across everything, but it was a decent chunk of a million.<p>Making something like Git, or a file system, or a package manager? Test what happens with millions of objects! Try <i>billions</i> and see where your app breaks. Fix the issues even if you never think anyone will trigger them.<p>It's not about scaling to some arbitrary number, it's about <i>scaling</i>, period.
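To make the "try <i>many</i>" point concrete, here's a rough sketch of the kind of harness I mean (sizes and paths are illustrative; crank the loop bounds up until something hurts):

```shell
# Rough stress-test sketch (numbers are illustrative, not a benchmark).
repo=$(mktemp -d)
cd "$repo"
git init -q
# 100 dirs x 100 files = 10,000 files; bump the bounds to reach millions
for d in $(seq 1 100); do
  mkdir "dir$d"
  for f in $(seq 1 100); do
    echo x > "dir$d/f$f.txt"
  done
done
time git status --porcelain >/dev/null   # watch this degrade as the counts grow
```

Same idea applies to any tool that walks a tree: generate the pathological input programmatically and measure, don't assume.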
How does a "monorepo" differ from, say, using a master project containing many git submodules[1], perhaps recursively? You would probably need a bit of tooling. But the gain is that git commands in the submodules are speedy, and there is only O(logN) commit multiplication to commit the updated commit SHAs up the chain. Think Merkle tree, not single head commit SHA.<p>Eventually, you may get a monstrosity like Android Repo [2] though. And an Android checkout and build is pushing 1TB these days.<p>But there, perhaps, the submodule idea wins again. Replace most of the submodules with prebuilt variants, and have full source + building only for the module of interest.<p>[1] <a href="https://git-scm.com/book/en/v2/Git-Tools-Submodules" rel="nofollow noreferrer">https://git-scm.com/book/en/v2/Git-Tools-Submodules</a><p>[2] <a href="https://source.android.com/docs/setup/download#repo" rel="nofollow noreferrer">https://source.android.com/docs/setup/download#repo</a>
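For what it's worth, the pinning mechanics look roughly like this (toy example with made-up local paths, purely illustrative):

```shell
# Toy superproject pinning one submodule by SHA (all names are made up).
work=$(mktemp -d)
cd "$work"
git init -q lib-a
git -C lib-a -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "init lib-a"
git init -q super
cd super
# newer git requires explicitly allowing file:// submodule URLs
git -c protocol.file.allow=always submodule add "$work/lib-a" lib-a
git -c user.email=dev@example.com -c user.name=dev \
    commit -q -m "pin lib-a"
git submodule status   # prints the commit SHA the superproject pins for lib-a
```

The superproject's history only records that pinned SHA, which is exactly the Merkle-tree shape described above.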
There's VFS for Git from Microsoft (the linked project, Scalar, is its successor), which I think can solve this problem in a more elegant way: <a href="https://github.com/microsoft/scalar">https://github.com/microsoft/scalar</a>
Bold move to enable the "ours" merge strategy by default! I presume this is a typo for the "-Xours" merge <i>option</i> to `ort` or `recursive`, but that still seems pretty brave.
Hmm, I've read this one:
"These .xlf files are generated and contain translated strings for each locale."<p>So why store them under version control in the first place? I think they're doing it wrong.
Since 70% of the files were xlf files used for translation/localization, couldn't they instead just store all of those in a single SQLite file and solve their problem much more easily? Any of the nuances of the directory structure could be captured in SQLite tables and relationships, and it would be easy to access them for edits by non-coders using a tool like DB Browser.<p>I feel like often people make problems much harder than they need to be by imposing arbitrary constraints on themselves that could be avoided if they approached the problem differently.
Our monorepo is at ~500 megs right now. This is 7 years worth of changes. No signs of distress anywhere, other than a periodic git gc operation that now takes long enough to barely notice.<p>I can't imagine using anything else for my current project. In fact, the only domain within which I would even consider something different would be game development. Even then, only if the total asset set is ever expected to exceed a gigabyte or so. Git is awful with large blobs. LFS is an option, but I've always felt like it was a bandaid and not a fundamental solve.
I'd like to add that the previous discussion was here, <a href="https://news.ycombinator.com/item?id=31762245">https://news.ycombinator.com/item?id=31762245</a>.<p>However, since then we've migrated our engineering blog from medium to a self-hosted stack, so HN doesn't link it to the previous discussion automatically.
Anyone know what the advantage of this is over a big composite repo with several git submodules?<p>I'd think submodules are better suited for separation of concerns and performance, while still achieving the same composite structure as an equivalent monorepo?
This is one of those multipurpose PR articles (not all bad) meant to generate awareness of the company, their product, use case, and developers.<p>><i>At Canva, we made the conscious decision to adopt the monorepo pattern with its benefits and drawbacks. Since the first commit in 2012, the repository has rapidly grown alongside the product in both size and traffic</i><p>While reading it I was having trouble keeping track of where I was in the recursion; it's sort of "Xzibit A" for "yo dawg, we know you use source repositories, so check out our source repository (we keep it in our source repository) while you check out your source repository!"
Don't bother with watchman; it has consistently been so flaky that I simply live with the normal latency.<p>Thankfully, git nowadays has a file system monitor built in on some OSes, and it's much, MUCH better than watchman ever was.
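If you haven't tried the builtin monitor yet (it shipped around git 2.37, macOS and Windows first, so check your version and platform), enabling it per-repo is just:

```shell
# Enable git's builtin file system monitor for the current repo
git config core.fsmonitor true
git config core.untrackedCache true  # complements fsmonitor for untracked files
git status   # first run starts the daemon; later runs skip full tree scans
```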
> A git fetch trace that was captured<p>Anyone know what observability software they're using to visualize the GIT_TRACE details? (Or is the assumption that the UI is Olly as well?)
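No idea what they feed it into, but for capturing, git's trace2 targets can emit structured output that most tracing/observability UIs can ingest, e.g.:

```shell
# Human-readable perf trace of a single fetch
GIT_TRACE2_PERF=/tmp/fetch.perf git fetch origin
# JSON event stream, one object per line -- easy to ship to a tracing backend
GIT_TRACE2_EVENT=/tmp/fetch.json git fetch origin
```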
We learned they were 70% autogenerated, so they probably shouldn't have been in git at all, but our build process relied on them, and we didn't want to fix that, so we bodged it.
I am not sure what I'm looking at here. Surely those half million files are for dozens if not hundreds of different apps, libraries and tools and surely those do not all depend on each other, no?<p>Because if so, why not just use one repo per app/library/tool? Sure, if you have a cluster of things that all depend on each other, or a cluster of things that typically is needed in bulk, by all means, put those in a single repo.<p>But putting literally <i>all</i> your code in a single repo is not a very sane technical choice, is it?