TechEcho: a tech news platform built with Next.js, providing global tech news and discussions.

Git partial clone lets you fetch only the large file you need

229 points, by moyer, about 5 years ago

14 comments
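For context, the feature under discussion can be exercised with `git clone --filter=...`. A minimal local sketch (repo names, file names, and sizes are hypothetical; the server-side config lines stand in for what a hosting provider would enable):

```shell
set -e
# Stand-in for a hosted repo containing a large binary file.
git init -q src
dd if=/dev/zero of=src/big.bin bs=1024 count=64 2>/dev/null
git -C src add big.bin
git -C src -c user.email=ci@example.com -c user.name=ci commit -qm "add big.bin"
# The server side must allow filters for partial clone to work.
git -C src config uploadpack.allowFilter true
git -C src config uploadpack.allowAnySHA1InWant true
# Blobless clone: commits and trees come down now, file contents on demand
# (here the checkout immediately triggers the on-demand fetch of big.bin).
git clone -q --filter=blob:none "file://$PWD/src" dst
```

A size-based filter such as `--filter=blob:limit=1m` skips only blobs above a threshold instead of all of them.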

beagle3, about 5 years ago
There is one more piece to the puzzle to make git perfect for every use case I can think of: store large files as a list of blobs broken down by some rolling hash, a la rsync/borg/bup.

That would e.g. make it reasonable to check in virtual machine images or ISO images into a repository. Extra storage (and, by extension, network bandwidth) would be proportional to the size of the change.

git has delta compression for text as an optimization, but it's not used on big binary files and is not even online (only when making a pack). This would provide it online for large files.

Junio posted a patch that did that ages ago, but it was pushed back until after the sha1->sha256 transition.
derefr, about 5 years ago
Has anyone used Git submodules to isolate large binary assets into their own repos? Seems like the obvious solution to me. You already get fine-grained control over which submodules you initialize. And, unlike Git LFS, it might be something you’re already using for other reasons.
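The submodule approach above can be sketched with two local repos standing in for hypothetical hosted ones (all names are illustrative):

```shell
set -e
# The heavy-assets repo, kept separate from the code.
git init -q assets
echo "model data" > assets/model.bin
git -C assets add model.bin
git -C assets -c user.email=ci@example.com -c user.name=ci commit -qm "add model"
# The main code repo.
git init -q game
echo "int main(void){return 0;}" > game/main.c
git -C game add main.c
git -C game -c user.email=ci@example.com -c user.name=ci commit -qm "add code"
# Isolate the assets as a submodule; recent Git needs an override
# for the local file protocol in this kind of local test setup.
git -C game -c protocol.file.allow=always submodule add "$PWD/assets" assets
```

Collaborators who don't need the assets simply never run `git submodule update --init assets`, so the large repo is never downloaded.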
vvanders, about 5 years ago
Also known as workspace views in P4.

It's interesting to see the wheel reinvented. We used to run a 500 GB art sync / 200 GB code sync with a ~2 TB back-end repo back when I was in gamedev. P4 also has proper locking; it really is the right tool if you've got large assets that need to be coordinated and versioned.

Only downside, of course, is that it isn't free.
scarecrow112, about 5 years ago
This is interesting and could be a savior for Machine Learning (ML) engineering teams. In a typical ML workflow, there are three main entities to be managed: 1. code, 2. data, 3. models. Systems like Data Version Control (DVC) [1] are useful for versioning 2 & 3. DVC improves on usability by residing inside the project's main git repo while maintaining versions of the data/models on a remote. With Git partial clone, it seems like the gap between 1 and 2/3 could be reduced further.

[1] - https://dvc.org/
itroot, about 5 years ago
Also, --reference (or --shared) is a good parameter to speed up cloning (for builds, for example) if you have your repository cached somewhere else. I used it a long time ago when I was working on a system that required cloning 20-40 repos to build. This approach decreased clone times by an order of magnitude.
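A small local sketch of the `--reference` approach (the cache repo here stands in for one kept warm on a build machine):

```shell
set -e
# Upstream repo.
git init -q src
echo hello > src/a.txt
git -C src add a.txt
git -C src -c user.email=ci@example.com -c user.name=ci commit -qm "init"
# A bare local cache of the upstream, populated once.
git clone -q --bare "file://$PWD/src" cache.git
# New clones borrow objects from the cache instead of re-downloading them;
# Git records the borrowed store in .git/objects/info/alternates.
git clone -q --reference "$PWD/cache.git" "file://$PWD/src" work
```

One caveat worth knowing: a `--reference` clone (without `--dissociate`) depends on the cache repo staying intact, since objects are shared rather than copied.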
microtherion, about 5 years ago
That seems quite useful, though Git LFS mostly does the job.

One of my biggest remaining pain points is resumable clone/fetch. I find it near impossible to clone large repos (or fetch, if there were lots of new commits) over a slow, unstable link, so almost always I end up cloning a copy to a machine closer to the repo and rsyncing it over to my machine.
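Git still has no true resumable clone, but one partial mitigation on a flaky link is to take a shallow clone first and then deepen history in small increments, so each retryable transfer is small. A local sketch:

```shell
set -e
# Upstream with a few commits of history.
git init -q src
for i in 1 2 3; do
  echo "rev $i" > src/f.txt
  git -C src add f.txt
  git -C src -c user.email=ci@example.com -c user.name=ci commit -qm "commit $i"
done
# Grab only the tip first; this transfer is small and quick.
git clone -q --depth 1 "file://$PWD/src" dst
# Then pull history in increments; each fetch can be retried on its own.
git -C dst fetch -q --deepen=1
git -C dst fetch -q --deepen=1
```

This does not help with a single huge blob, which still has to come down in one fetch, hence the clone-nearby-and-rsync workaround in the comment above.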
shaklee3, about 5 years ago
This is great. We use git lfs extensively, and one of the biggest complaints we have is that users have to clone 7 GB of data just to get the source files. There's a workaround in that you don't have to enter your username and password for the lfs repo and can let it time out, but that's a kludge.
danbolt, about 5 years ago
In the AAA games industry git has been a bit slower on the uptake (although that’s changing quickly) as large warehouses of data are often required (eg: version history of video files, 3D audio, music, etc.). It’s nice to see git have more options for this sort of thing.
jniedrauer, about 5 years ago
This could actually be a really good solution to the maximum supported size of a Go module. If you place a go.mod in the root of your repo, then every file in the repo becomes part of the module. There's also a hardcoded maximum size for a module: 500 MB. The problem is, I've got 1 GB+ of vendored assets in one of my repos, so I had to trick Go into thinking the vendored assets were a different Go module [0]. Go would have to add support for this, but it would be a pretty elegant solution to the problem.

[0]: https://github.com/golang/go/issues/37724
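The trick described above relies on the rule that a directory containing its own go.mod is carved out of the enclosing module. A minimal sketch with hypothetical module paths and directory names:

```shell
set -e
mkdir -p myproject/bigassets
# The enclosing module at the repo root.
printf 'module example.com/myproject\n\ngo 1.16\n' > myproject/go.mod
# A go.mod of its own excludes bigassets/ from the enclosing module,
# so its vendored files no longer count toward the 500 MB module cap.
printf 'module example.com/myproject/bigassets-ignored\n' > myproject/bigassets/go.mod
```

The excluded directory is then invisible to `go build` and to module packaging, which is exactly the workaround, and exactly why native support would be cleaner.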
krupan, about 5 years ago
I started a project recently, and for the first time ever I've wanted to keep large files in my repo. I looked into git LFS and was disappointed to learn that it requires either third-party hosting or setting up a git LFS server myself. I looked into git annex and it seems decent. This, once it is ready for prime time, will hopefully be even better.
nikivi, about 5 years ago
Is it possible, given a git repo (hosted on, say, GitHub), to only 'clone' (download) certain files from it? Without `.git`?
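One hedged answer: a blobless, sparse clone downloads only the contents of the paths you ask for, although a (mostly empty) `.git` directory still comes along. A local sketch, with the server-side config standing in for what a host would enable:

```shell
set -e
# Upstream with a small docs tree and a large media tree.
git init -q src
mkdir -p src/docs src/media
echo "readme" > src/docs/README.md
dd if=/dev/zero of=src/media/video.bin bs=1024 count=16 2>/dev/null
git -C src add .
git -C src -c user.email=ci@example.com -c user.name=ci commit -qm "init"
git -C src config uploadpack.allowFilter true
git -C src config uploadpack.allowAnySHA1InWant true
# Blobless + sparse: only top-level files are checked out at first...
git clone -q --filter=blob:none --sparse "file://$PWD/src" dst
# ...then pull in just the directory you care about.
git -C dst sparse-checkout set docs
```

For a genuinely `.git`-free single file, downloading it directly from the hosting service (e.g. a raw-file URL) remains the simpler route.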
vicosity, about 5 years ago
I'm still unconvinced. Will this provide a user-friendly approach to managing design assets?
piliberto, about 5 years ago
> One reason projects with large binary files don't use Git is because, when a Git repository is cloned, Git will download every version of every file in the repository.

Wrong? There's a --depth option for the git fetch command which allows the user to specify how many commits to fetch from the repository.
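For reference, the `--depth` mechanism the comment refers to, sketched locally; note that a shallow clone trims history but still downloads the current version of every file, which is why partial clone's blob filters are complementary rather than redundant:

```shell
set -e
# Upstream with five versions of a file.
git init -q src
for i in 1 2 3 4 5; do
  echo "version $i of a large file" > src/asset.bin
  git -C src add asset.bin
  git -C src -c user.email=ci@example.com -c user.name=ci commit -qm "v$i"
done
# Only the most recent commit (and thus one version of each file) is fetched.
git clone -q --depth 1 "file://$PWD/src" shallow
```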
smitty1e, about 5 years ago
In AWS, it's worth considering putting those large files in an S3 bucket.