Ah yes, I too have accidentally committed node_modules.<p>Jokes aside, and coming from a place of ignorance, it's interesting to me that a file count of that size is still a real performance issue for git. I'd have expected something so ubiquitous and core to most of the software world to have seen improvements there by now.<p>Genuine, non-snarky question:
Are there some fundamental aspects of git that would make it either very difficult to improve that, or that would sacrifice some important benefits if they were made? Or is this a case of it being a large effort and no one has particularly cared enough yet to take it on?
Something I learned about writing robust code is that scalability needs to be tested up-front. Test with 0, 1, and <i>many</i>, where the latter is tens of millions, not just ten.<p>I've seen production databases that had 40,000 tables for <i>valid</i> reasons.<p>I've personally deployed an app that needed 80,000 security groups in a single LDAP domain, just for that one app. I can't remember the total number of groups across everything, but it was a decent chunk of a million.<p>Making something like Git, or a file system, or a package manager? Test what happens with millions of objects! Try <i>billions</i> and see where your app breaks. Fix the issues even if you never think anyone will trigger them.<p>It's not about scaling to some arbitrary number, it's about <i>scaling</i>, period.
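To make the "try <i>many</i>" point concrete, here's a rough sketch of the kind of harness I mean (sizes and paths are illustrative; crank the loop bounds up until something hurts):

```shell
# Rough stress-test sketch (numbers are illustrative, not a benchmark).
repo=$(mktemp -d)
cd "$repo"
git init -q
# 100 dirs x 100 files = 10,000 files; bump the bounds to reach millions
for d in $(seq 1 100); do
  mkdir "dir$d"
  for f in $(seq 1 100); do
    echo x > "dir$d/f$f.txt"
  done
done
time git status --porcelain >/dev/null   # watch this degrade as the counts grow
```

Same idea applies to any tool that walks a tree: generate the pathological input programmatically and measure, don't assume.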
How does a "monorepo" differ from, say, using a master project containing many git submodules[1], perhaps recursively? You would probably need a bit of tooling. But the gain is that git commands in the submodules are speedy, and there is only O(logN) commit multiplication to commit the updated commit SHAs up the chain. Think Merkle tree, not single head commit SHA.<p>Eventually, you may get a monstrosity like Android Repo [2] though. And an Android checkout and build is pushing 1TB these days.<p>But there, perhaps, the submodule idea wins again. Replace most of the submodules with prebuilt variants, and have full source + building only for the module of interest.<p>[1] <a href="https://git-scm.com/book/en/v2/Git-Tools-Submodules" rel="nofollow noreferrer">https://git-scm.com/book/en/v2/Git-Tools-Submodules</a><p>[2] <a href="https://source.android.com/docs/setup/download#repo" rel="nofollow noreferrer">https://source.android.com/docs/setup/download#repo</a>
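For what it's worth, the pinning mechanics look roughly like this (toy example with made-up local paths, purely illustrative):

```shell
# Toy superproject pinning one submodule by SHA (all names are made up).
work=$(mktemp -d)
cd "$work"
git init -q lib-a
git -C lib-a -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "init lib-a"
git init -q super
cd super
# newer git requires explicitly allowing file:// submodule URLs
git -c protocol.file.allow=always submodule add "$work/lib-a" lib-a
git -c user.email=dev@example.com -c user.name=dev \
    commit -q -m "pin lib-a"
git submodule status   # prints the commit SHA the superproject pins for lib-a
```

The superproject's history only records that pinned SHA, which is exactly the Merkle-tree shape described above.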
There's VFS for Git from Microsoft (the linked project, Scalar, is its successor), which I think can solve this problem in a more elegant way: <a href="https://github.com/microsoft/scalar">https://github.com/microsoft/scalar</a>
Bold move to enable the "ours" merge strategy by default! I presume this is a typo for the "-Xours" merge <i>option</i> to `ort` or `recursive`, but that still seems pretty brave.
Hmm, I've read this one:
"These .xlf files are generated and contain translated strings for each locale."<p>So why store them under version control in the first place? I think they're doing it wrong.
Since 70% of the files were xlf files used for translation/localization, couldn't they instead just store all of those in a single SQLite file and solve their problem much more easily? Any of the nuances of the directory structure could be captured in SQLite tables and relationships, and it would be easy to access them for edits by non-coders using a tool like DB Browser.<p>I feel like often people make problems much harder than they need to be by imposing arbitrary constraints on themselves that could be avoided if they approached the problem differently.
Our monorepo is at ~500 megs right now. This is 7 years worth of changes. No signs of distress anywhere, other than a periodic git gc operation that now takes long enough to barely notice.<p>I can't imagine using anything else for my current project. In fact, the only domain within which I would even consider something different would be game development. Even then, only if the total asset set is ever expected to exceed a gigabyte or so. Git is awful with large blobs. LFS is an option, but I've always felt like it was a bandaid and not a fundamental solve.
I'd like to add that the previous discussion was here, <a href="https://news.ycombinator.com/item?id=31762245">https://news.ycombinator.com/item?id=31762245</a>.<p>However, since then we've migrated our engineering blog from medium to a self-hosted stack, so HN doesn't link it to the previous discussion automatically.
Anyone know what the advantage of this is over a big composite repo with several git submodules?<p>I'd think submodules are better suited for separation of concerns and performance, while still achieving the same composite structure as an equivalent monorepo?
This is one of those multipurpose PR articles (not all bad) meant to generate awareness of the company, their product, use case, and developers.<p>><i>At Canva, we made the conscious decision to adopt the monorepo pattern with its benefits and drawbacks. Since the first commit in 2012, the repository has rapidly grown alongside the product in both size and traffic</i><p>While reading it I was having trouble keeping track of where I was in the recursion; it's sort of "Xzibit A" for "yo dawg, we know you use source repositories, so check out our source repository (we keep it in our source repository) while you check out your source repository!"
Don't bother with watchman; it has consistently been so flaky that I simply live with the normal latency.<p>Thankfully, git nowadays has a file system monitor built in on some OSes, and it's much, MUCH better than watchman ever was.
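If you haven't tried the builtin monitor yet (it shipped around git 2.37, macOS and Windows first, so check your version and platform), enabling it per-repo is just:

```shell
# Enable git's builtin file system monitor for the current repo
git config core.fsmonitor true
git config core.untrackedCache true  # complements fsmonitor for untracked files
git status   # first run starts the daemon; later runs skip full tree scans
```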
> A git fetch trace that was captured<p>Anyone know what observability software they're using to visualize the GIT_TRACE details? (Or is the assumption that the UI is Olly as well?)
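No idea what they feed it into, but for capturing, git's trace2 targets can emit structured output that most tracing/observability UIs can ingest, e.g.:

```shell
# Human-readable perf trace of a single fetch
GIT_TRACE2_PERF=/tmp/fetch.perf git fetch origin
# JSON event stream, one object per line -- easy to ship to a tracing backend
GIT_TRACE2_EVENT=/tmp/fetch.json git fetch origin
```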
We learned they were 70% autogenerated, so they probably shouldn't have been in git at all, but our build process relied on them, and we didn't want to fix that, so we bodged it.
I am not sure what I'm looking at here. Surely those half million files are for dozens if not hundreds of different apps, libraries and tools and surely those do not all depend on each other, no?<p>Because if so, why not just use one repo per app/library/tool? Sure, if you have a cluster of things that all depend on each other, or a cluster of things that typically is needed in bulk, by all means, put those in a single repo.<p>But putting literally <i>all</i> your code in a single repo is not a very sane technical choice, is it?