> Fortunately, we can fix the fact that tar.gz files are unindexed and unseekable, while still keeping the file a valid tar.gz, by taking advantage of the fact that two gzip streams can be concatenated and still be a valid gzip stream. So you can just make a tar file where each tar entry is its own gzip stream.

I'm surprised nobody came up with this idea until now. It's brilliantly simple.
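Here's a minimal sketch of that property using only Go's standard library (this is not the stargz code itself, just a demonstration): two independently compressed gzip members, concatenated back to back, decode as a single valid gzip stream.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// gzipMember compresses data as one self-contained gzip member.
func gzipMember(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil { // writes this member's trailer
		log.Fatal(err)
	}
	return buf.Bytes()
}

func main() {
	// Compress two chunks as independent gzip members and concatenate them.
	stream := append(gzipMember([]byte("entry one\n")), gzipMember([]byte("entry two\n"))...)

	// A standard gzip reader decodes the concatenation transparently
	// (multistream mode is the default in compress/gzip).
	zr, err := gzip.NewReader(bytes.NewReader(stream))
	if err != nil {
		log.Fatal(err)
	}
	out, err := io.ReadAll(zr)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", out) // prints: entry one, then entry two
}
```

Because each gzip member then starts at a known byte offset, an index mapping file names to offsets lets a reader decompress just the one entry it needs, without touching the rest of the archive.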
Author here.

I just moved this to https://github.com/google/crfs if people want to track that repo instead of Go's build system (which is relatively boring for most people).
This is very cool! I've been waiting for someone to make tar.gz files seekable so they could serve as object bundles in remote blob storage that a client mounts and reads on demand by byte range (so you could treat data in a similar fashion to containers, or like a Mac DMG file with an open standard for remote mounting).
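For anyone curious what that remote read pattern looks like, here's a minimal sketch of the kind of request such a client would issue; the blob URL and the offset/length (which would come from the archive's index) are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Hypothetical blob URL in some remote object store.
	req, err := http.NewRequest("GET", "https://example.com/blobs/sha256/abc123", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask the server for just the bytes of the one entry we need, e.g. a
	// gzip member the index says lives at offset 4096 with length 1024.
	req.Header.Set("Range", "bytes=4096-5119")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		log.Fatalf("server did not honor the range request: %s", resp.Status)
	}
	member, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("fetched %d bytes of one entry\n", len(member))
}
```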
Conceivably this could be leveraged to allow Docker for Mac to push only deltas to the build virtual machine when running docker build, correct?

Currently docker build compresses everything in the working directory on every build. This is fine for building images for deploy/upload but is annoying in a local dev situation where you're frequently rebuilding.

Seems like it wouldn't be too hard to write an alternate docker build that checks a previously built "stargz" and just sends the additional files? (There would be some complexity here reassembling a valid tar within hyperkit.) A rough sketch of the delta step is below.

I might be missing something here, and I might be misplacing the bottleneck during build, but every time I'm annoyed by this problem it seems part of the issue is the single fat tar that needs to be created every time.

edit: this strategy could also work with docker-machine building on remote machines
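A rough sketch (not an actual docker integration) of that delta step: hash every file in the build context and keep only the ones whose digest differs from the previous build. `prevDigests` here is a hypothetical stand-in for whatever per-file checksums a previously built stargz index would give you.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

// fileDigest returns the hex SHA-256 of a file's contents.
func fileDigest(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

// changedFiles walks the build context and returns paths whose content
// digest is missing from, or differs from, the previous build's index.
func changedFiles(root string, prevDigests map[string]string) ([]string, error) {
	var changed []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		sum, err := fileDigest(path)
		if err != nil {
			return err
		}
		if prevDigests[path] != sum {
			changed = append(changed, path) // only these need to be re-sent
		}
		return nil
	})
	return changed, err
}

func main() {
	prev := map[string]string{} // hypothetical: per-file digests from the last build
	files, err := changedFiles(".", prev)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d files would be sent to the builder\n", len(files))
}
```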
In the introduction:

> Currently, however, starting a container in many environments requires doing a pull operation from a container registry to read the entire container image from the registry and write the entire container image to the local machine's disk. It's pretty silly (and wasteful) that a read operation becomes a write operation.

What's silly is to claim that this is the problem. Any read is going to be a write operation, at multiple levels, thanks to systems of transparent caching: to a nearby CDN, to local disk, to local memory, to your CPU cache, etc. These are optimizations; they aren't making your container startup any slower.

The real problem, which this tool indeed helps to solve, is that reading the entire image must complete before you can start the processes that read specific parts of it. It has nothing to do with "reads causing writes".
If this eliminated the bottleneck of pulling, it means the test runs didn't need to access most of the image, right? I wonder what this says about carrying unnecessary stuff, or about test coverage. Especially since the base distro layers were probably cached.

Edit: "For isolation and other reasons, we run all our containers in single-use fresh VMs." So they had no caching for the base layers unless those were primed in the VM image?