Google stores billions of lines of code in a single repository (2016) [pdf]

173 points by jeremylevy over 2 years ago

22 comments

There's a lot of love for monorepos nowadays, but after more than a decade of writing software, I still strongly believe it is an antipattern.

1. The single version dependencies are asinine. We are migrating to a monorepo at work, and someone bumped the version of an open source JS package that introduced a regression. The next deploy took our service down. Monorepos mean loss of isolation of dependencies between services, which is absolutely necessary for the stability of mission-critical business services.

2. It encourages poor API contracts because it lets anyone import any code in any service arbitrarily. Shared functionality should be exposed as a standalone library with a clear, well-defined interface boundary. There are entire packaging ecosystems like npmjs and pypi for exactly this purpose.

3. It encourages a ton of code churn with very low signal. I see at least one PR every week to code owned by my team that changes some trivial configuration, library call, or build directive, simply because some shared config or code changed in another part of the repo and now the entire repo needs to be migrated in lockstep for things to compile.

I've read this paper, as well as watched the talk on this topic, and am absolutely stunned that these problems are not magnified by 100x at Google scale. Perhaps it's simply organizational inertia that prevents them from trying a more reasonable solution.

yazaddaruvala over 2 years ago

Having worked at both Google and Amazon: honestly, their systems are almost identical. Amazon just creates a monotonically increasing watermark outside the “repo”. Google uses “the repo” to create the monotonically increasing watermark.

Otherwise, Google calls it “merge into g3”; Amazon calls it “merge into live”.

Amazon has the extra vocabulary of VersionSets/Packages/Build files. Google has all the same concepts, but just calls them Dependencies/Folders/Build files.

Amazon’s workflows are “git-like”; Google is migrating to “git-like” workflows (but has a lot of unnecessary vocabulary around getting there - Piper/Fig/Workspace/etc).

I really can’t tell whether the specific difference between “mono-repo” and “multi-repo” makes much practical difference to the devs working on either system.

zdw over 2 years ago

Monorepos are great... but only if you can invest in the tooling scale to handle them, and most companies can't invest in that like Google can. Hyrum Wright-class tooling experts don't grow on trees.

A good article to reference when this topic gets raised: http://yosefk.com/blog/dont-ask-if-a-monorepo-is-good-for-you-ask-if-youre-good-enough-for-a-monorepo.html

dang over 2 years ago

Related:

Why Google Stores Billions of Lines of Code in a Single Repository (2016) - https://news.ycombinator.com/item?id=22019827 - Jan 2020 (121 comments)

Why Google Stores Billions of Lines of Code in a Single Repository (2016) - https://news.ycombinator.com/item?id=17605371 - July 2018 (281 comments)

Why Google stores billions of lines of code in a single repository (2016) - https://news.ycombinator.com/item?id=15889148 - Dec 2017 (298 comments)

Why Google Stores Billions of Lines of Code in a Single Repository - https://news.ycombinator.com/item?id=11991479 - June 2016 (218 comments)

marcrosoft over 2 years ago

I love monorepos. I feel like they are even more helpful for small teams and smaller scale. The productivity of being able to add libraries by creating a new folder or refactor across services is unbeatable.

sn_master over 2 years ago

Just because Google does something doesn't mean it's a good thing for anyone else to do. This kind of infrastructure is very expensive to maintain, and it suffers from many flaws, such as -almost- everyone being stuck using SDKs that are several versions behind the latest production one, even for the internal GCP ones.

chrisa over 2 years ago

Here's a talk version given by Rachel (one of the authors) about the same topic: https://www.youtube.com/watch?v=W71BTkUbdqE

Karellen over 2 years ago

> The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence.

Wait, that's an average of nearly 30 new files per commit. Not 30 files changed per commit, but whatever changes are happening to existing files, plus 30 brand new files. For every single commit.

Although...

> The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, [...]

I'm not quite sure what this is saying.

Is it saying that if `main` contains 1,000 files, and then someone creates a branch called `release`, then the repo now contains 2,000 files? And if someone then deletes 500 files from `main` in the next commit, the repo still contains 2,000 files, not 1,500?

If that's the case, why not just call every different version of every file in the repo a different file? If I have a new repo and in the first commit I create a single 100-line file called `foo.c`, and then I change one line of `foo.c` for the second commit, do I now have a repo with two files?

I mean, if you look at the plumbing for e.g. `git`, yes, the repo is storing two file objects for the repo history. But I don't think I've ever seen someone discuss the Linux git repo and talk about the total number of file objects in the repo object store. And when the linked paper itself mentions Linux, it says "The Linux kernel is a prominent example of a large open source software repository containing approximately 15 million lines of code in 40,000 files" - and in that case it's definitely not talking about the total number of file objects in the store.

I don't think it's entirely clear what the paper even means when it talks about "a file" in a source code repository, or if it even means the same thing consistently. I'm not sure it's using the most obvious interpretation, but I can't understand why it would pick a non-obvious interpretation, especially if it's not going to explain what it means, let alone explain why it chose one meaning over another.
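For anyone who wants to check the arithmetic and the object-store point, here is a minimal sketch. It assumes git's default SHA-1 object format, where a blob's ID is the SHA-1 of the header `blob <size>\0` followed by the file content. The `foo.c` contents and the helper name `git_blob_id` are made up for illustration; only the two totals come from the paper.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID git (SHA-1 format) assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Totals quoted from the paper: about one billion files, 35 million commits.
files_per_commit = 1_000_000_000 / 35_000_000
print(f"average files per commit: {files_per_commit:.1f}")

# Hypothetical 100-line foo.c, then the same file with one line changed.
v1 = b"".join(b"int line%d;\n" % i for i in range(100))
v2 = v1.replace(b"int line42;\n", b"int line42 = 1;\n")

# Different content means a different blob ID, so the object store
# holds two file objects even though the work tree shows one foo.c.
print(git_blob_id(v1) != git_blob_id(v2))
```

So counting blobs rather than paths would count `foo.c` twice after the second commit, which is one possible reading of how a file total could grow far beyond what a checkout shows.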

GreedClarifies over 2 years ago

This is from the golden age of Google.

Of particular note is that they published this many years after it had been shipped to their internal customers. This was not some position paper about "why we focus on ai" after not shipping any of their "breakthroughs".

thwoeriuowie over 2 years ago

Google's code may be a monorepo, but back when I was there you only ever 'checked out' particular projects for editing etc. It's a bit silly to talk about some aspects of Google's setup separately from the whole dev env in there.

rvcdbn over 2 years ago

I really wish they would make this tech available via gcloud. Seems like it would be very popular and a great way to attract other gcloud business away from MS/GitHub which scales horribly.

gardenhedge over 2 years ago

I've never experienced a monorepo like Google's. How does it work? Are Chrome and Gmail in the same repo? I assume they're built separately and pushing code to one doesn't affect the other.

gorgoiler over 2 years ago

Imagine you have two teams in one monorepo and requirements.txt has pinned numpy at 1.22. One team wants to upgrade to 1.24 but the upgrade breaks the other team’s code as it was dependent on an emergent property* in the older version of numpy.

How would you handle this situation as an IC? As a manager of one of the teams? As a skip-level manager of both teams?

As a budding IC on the team that wants the upgrade, you may want to go fix up the other team’s code for them so you can bring them along with the upgrade. Realistically, the further you get from Google’s level of engineering discipline and skill the more likely you are to encounter the following in the needs-1.22 codebase:

- horrible code that is hard to understand and therefore hard to refactor
- code with no tests, making it risky to refactor
- the team that wrote it have all left or been fired and no one is available to help understand it
- they are a remote team with no social relationship to you who interact entirely online, in writing, in the style of an aggressive subreddit mod
- deeply entrenched factions mean that even if you offer them a patch they will refuse it by default, because who are you to work on their codebase, and they don’t need the upgraded numpy, so why should they waste resources on reviewing something they don’t want
- misguided adherence to status-enhancing terms like “audit” and “compliance” means jobsworth ICs refuse to even look at your patch, because someone somewhere once heard of a friend of a friend whose company failed SOC2 because an engineer from floor X made a change to code owned by floor Y and it went against policy

All of these social problems are real ones I have encountered, and if you have solved these then you’re probably happily in a monorepo already.

If instead you work in an org full of teams pointing guns at each other in a fight to the death to stop any kind of cross-org collaboration from sullying the purity of the tribal system, then know this: it gets better, and if you build the right social connections then the technical efficiency of having your monobusiness executing its monomission inside a monorepo is within reach!

*bug
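The technical half of that scenario is easy to surface, even if the social half is not. Below is a rough sketch, assuming each team can state the exact pins it wants in requirements.txt-style `name==version` lines. The team names, the helper names `parse_pins` and `find_conflicts`, and the `examplepkg` package are hypothetical; only the two numpy versions come from the scenario above.

```python
def parse_pins(requirements_text: str) -> dict[str, str]:
    """Parse exact 'name==version' pins from requirements.txt-style text."""
    pins: dict[str, str] = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(team_requirements: dict[str, str]) -> dict[str, dict[str, str]]:
    """Return the packages that different teams want pinned to different versions."""
    wanted: dict[str, dict[str, str]] = {}
    for team, text in team_requirements.items():
        for package, version in parse_pins(text).items():
            wanted.setdefault(package, {})[team] = version
    return {
        package: by_team
        for package, by_team in wanted.items()
        if len(set(by_team.values())) > 1
    }


# Hypothetical teams; the numpy versions are the ones from the scenario.
conflicts = find_conflicts({
    "team_upgrade": "numpy==1.24\nexamplepkg==1.0\n",
    "team_legacy": "numpy==1.22  # relies on old behaviour\nexamplepkg==1.0\n",
})
print(conflicts)
```

A check like this only tells you that the two teams disagree about numpy. Deciding who changes their code is still the people problem described above.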

denvercoder904 over 2 years ago

Is the code for the Search project in the mono repo as well? How does Google handle access control for their mono repos? Where's the secret sauce stored?

dgnemo over 2 years ago

Big fan of the monorepo approach here.

Still, I have recently hit a major issue: Git (and other common version control software) doesn't have per-directory ACLs.

Has anyone dealt with this issue? Which VCS / configuration have you adopted?

randyrand over 2 years ago

iOS and Windows are “monorepos” too.

The software is built daily, and everyone must be on the same version of every library.

Under the hood there are a bunch of repos, and there are exceptions, but it largely operates as a monorepo.

teleforce over 2 years ago

Previous discussions on HN (2020): https://news.ycombinator.com/item?id=22019827

Scubabear68 over 2 years ago

I’d really love to know what the breakdown of those 2 billion lines of code is by product. What a huge number.

KolmogorovComp over 2 years ago

> Google’s codebase is shared by more [...] than 25,000 Google software developers from dozens of offices in countries around the world.

> Access to the whole codebase encourages extensive code sharing and reuse [...]

Doesn't this strategy result in a great risk of massive code leaks from rogue employees? Even if read accesses are logged and the culprit is found, it's too late once the code has been published.

myhf over 2 years ago

(published July 2016)

quantum_state over 2 years ago

something is seriously wrong if Google needs 2B loc to do its things …

deanCommie over 2 years ago

No wonder no one at Google can ship anything if they constantly have to stop development of their feature so they can do mandatory upgrades of their dependencies...
