A common mistake that's not covered in this article is failing to perform your add & remove operations in the same RUN command. Doing them separately creates two separate layers, which inflates the image size.<p>For example, this creates two image layers - the first layer has all the added foo, including any intermediate artifacts. The second layer removes the intermediate artifacts, but that removal is saved only as a diff against the previous layer:<p><pre><code> RUN ./install-foo
RUN ./cleanup-foo
</code></pre>
Instead, you need to do them in the same RUN command:<p><pre><code> RUN ./install-foo && ./cleanup-foo
</code></pre>
This creates a single layer which has only the foo artifacts you need.<p>This is why the official Dockerfile best practices show[1] the apt cache being cleaned up in the same RUN command:<p><pre><code> RUN apt-get update && apt-get install -y \
package-bar \
package-baz \
package-foo \
&& rm -rf /var/lib/apt/lists/*
</code></pre>
[1] <a href="https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run" rel="nofollow">https://docs.docker.com/develop/develop-images/dockerfile_be...</a>
There's more to consider with the latest BuildKit frontend for Docker; check it out here: <a href="https://hub.docker.com/r/docker/dockerfile" rel="nofollow">https://hub.docker.com/r/docker/dockerfile</a><p>In particular, cache mounts (RUN --mount=type=cache) can help with the package manager cache size issue (see the sketch below), and heredocs are a game-changer for inline scripts. Forget all that && nonsense; write clean multiline RUN commands:<p><pre><code> RUN <<EOF
apt-get update
apt-get install -y foo bar baz
etc...
EOF
</code></pre>
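For the cache mounts, here's a minimal sketch of the apt pattern - package names are placeholders, and note that Debian's docker-clean apt config may still prune the cache unless you disable it:<p><pre><code> # syntax=docker/dockerfile:1
 FROM debian:stable-slim
 # keep apt's caches in BuildKit cache mounts instead of baking them into layers
 RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
     --mount=type=cache,target=/var/lib/apt,sharing=locked \
     apt-get update && apt-get install -y foo bar baz
</code></pre>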
All of this works in the plain old desktop Docker you have installed right now; you just need to use the buildx command (the BuildKit engine) and reference the Docker Labs BuildKit frontend image above. Unfortunately it's barely mentioned in the docs or anywhere else other than their blog right now.
There are other base images from Google that are smaller than the standard base images and come in handy when deploying applications that run as a single binary.<p>> Distroless images are very small. The smallest distroless image, gcr.io/distroless/static-debian11, is around 2 MiB. That's about 50% of the size of alpine (~5 MiB), and less than 2% of the size of debian (124 MiB).<p><a href="https://github.com/GoogleContainerTools/distroless" rel="nofollow">https://github.com/GoogleContainerTools/distroless</a>
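As an illustration (a hypothetical Go service, not something from the linked repo), a multi-stage build onto the static distroless base might look like:<p><pre><code> # build stage: compile a static binary
 FROM golang:1.21 AS build
 WORKDIR /src
 COPY . .
 RUN CGO_ENABLED=0 go build -o /server .

 # final stage: just the binary on top of the ~2 MiB distroless base
 FROM gcr.io/distroless/static-debian11
 COPY --from=build /server /server
 ENTRYPOINT ["/server"]
</code></pre>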
This app is great for discovering waste<p><a href="https://github.com/wagoodman/dive" rel="nofollow">https://github.com/wagoodman/dive</a><p>I've found 100MB fonts and other waste.<p>All the tips are good, but until you actually inspect your images, you won't know why they are so bloated.
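Basic usage is just pointing it at a tag (the image name here is only an example); there's also a CI mode if you want to fail builds on wasted space:<p><pre><code> # interactively explore each layer and the waste it adds
 dive myapp:latest

 # non-interactive mode for CI pipelines
 CI=true dive myapp:latest
</code></pre>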
If you really want to optimize image size, use Nix!<p>Ex: <a href="https://gist.github.com/sigma/9887c299da60955734f0fff6e2faeee0" rel="nofollow">https://gist.github.com/sigma/9887c299da60955734f0fff6e2faee...</a><p>Since it captures exact dependencies, it becomes easier to put just what you need in the image. Prior to Nix, my team (many years ago) built a redis image that was about 15MB in size by tracking the used files and removing unused ones. Nix does that reliably.
For my two cents, if your image requires anything not vanilla, you may be better off stomaching the larger Ubuntu image.<p>Lots of edge cases around specific libraries come up that you don't expect. I spent hours tearing my hair out trying to get Selenium and Python working on an Alpine image, when they worked out-of-the-box on the Ubuntu image.
A very common mistake I see (though not related to image size per se) when running Node apps is to do CMD ["npm", "run", "start"]. First, this wastes memory, as npm runs as the parent process and forks node to run the main script. The bigger problem is that the npm process does not forward signals down to its child, so SIGINT and SIGTERM are not passed from npm into node, which means your server may not be gracefully closing connections.
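A minimal sketch of the fix (the entry file name is just an example): invoke node directly so it runs as PID 1 and actually receives SIGTERM/SIGINT:<p><pre><code> # instead of CMD ["npm", "run", "start"]
 CMD ["node", "server.js"]
</code></pre>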
I also liked this one:<p><a href="https://fedoramagazine.org/build-smaller-containers/" rel="nofollow">https://fedoramagazine.org/build-smaller-containers/</a><p>I don't avoid large images because of their size, I avoid them because it's an indicator that I'm packaging much more than is necessary. If I package a lot more than is necessary then perhaps I do not understand my dependencies well enough or my container is doing too much.
> 1. Pick an appropriate base image<p>Starting with: Use the ones that are supposed to be small. Ubuntu does this by default, I think, but debian:stable-slim is 30 MB (down from the non-slim 52MB), node has slim and alpine tags, etc. If you want to do more intensive changes that's fine, but start with the nearly-zero-effort one first.<p>EDIT: Also, where is the author getting these numbers? They've got a chart that shows Debian at 124MB, but just clicking that link lands you at a page listing it at 52MB.
The article doesn't seem to do much... in the 'why'. I'm inundated with <i>how</i>, though.<p>I've been on both sides of this argument, and I really think it's a case-by-case thing.<p>A highly compliant environment? As minimal as possible. A hobbyist/developer that wants to debug? Go as big of an image as you want.<p>It shouldn't be an expensive operation to update your image base and deploy a new one, regardless of size.<p>Network/resource constraints (should) be becoming less of an issue. In a lot of cases, a local registry cache is all you need.<p>I worry partly about how much time is spent on this quest, or secondary effects.<p>Has the situation with name resolution been dealt with in musl?<p>For example, something like /etc/hosts overrides not taking proper precedence (or working at all). To be sure, that's not a great thing to use - but it <i>does</i>, and leads to a lot of head scratching
You might not need to care about image size at all if your image can be packaged as stargz.<p>stargz is a gamechanger for startup time.<p>kubernetes and podman support it, and docker support is likely coming. It lazy loads the filesystem on start-up, making network requests for things as needed and therefore can often start up large images very fast.<p>Take a look at the startup graph here:<p><a href="https://github.com/containerd/stargz-snapshotter" rel="nofollow">https://github.com/containerd/stargz-snapshotter</a>
I like this article, and there is a ton of nuance in base image choice and how you should pick the appropriate one. I also like how they cover only copying the files you actually need; particularly with things like vendor or node_modules, you might be better off just doing a volume mount instead of copying them into the image (see the sketch below).<p>The only thing they didn't seem to cover is considering your target. My general policy is that dev images are almost always going to be whatever lets me do the following:<p>- Easily install the tools I need<p>- All things being equal, if multiple base OS's satisfy the above, I go with alpine, 'cause it's smallest<p>One thing I've noticed is that simple purpose-built images are faster, even when there are a lot of them (big docker-compose user myself for this reason), rather than stuffing a lot of services inside a single container or even "fewer" containers.<p>EDIT: spelling, nuisance -> nuance
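For the volume-mount approach in dev, a rough sketch (image and command are illustrative):<p><pre><code> # mount the source tree (node_modules and all) instead of COPYing it into the image
 docker run --rm -it -v "$PWD":/app -w /app node:20-slim npm test
</code></pre>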
I always feel helpless with Python containers - it seems there's never much savings to be eked out of multi-stage builds and the other strategies that are typically suggested. Docker container size really has made compiled languages more attractive to me.
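For what it's worth, the usual multi-stage pattern for Python looks roughly like this (paths and filenames are assumptions); the savings are modest because the interpreter and site-packages still ship in the final stage:<p><pre><code> # build stage: install dependencies into a virtualenv
 FROM python:3.11-slim AS build
 WORKDIR /app
 COPY requirements.txt .
 RUN python -m venv /venv && /venv/bin/pip install --no-cache-dir -r requirements.txt

 # final stage: copy only the venv and the app code
 FROM python:3.11-slim
 COPY --from=build /venv /venv
 COPY . /app
 ENV PATH="/venv/bin:$PATH"
 CMD ["python", "/app/main.py"]
</code></pre>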
Nobody mentioned <a href="https://github.com/docker-slim/docker-slim" rel="nofollow">https://github.com/docker-slim/docker-slim</a> yet.<p>So here it is.
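Rough usage, if you haven't seen it (image name is an example; the output image is typically tagged with a .slim suffix):<p><pre><code> # minify an existing image by observing what the container actually uses at runtime
 docker-slim build --http-probe myapp:latest
</code></pre>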
There is some strange allure to spending time crafting Dockerfiles. IMO it's over-glorified - for most situations the juice is not worth the squeeze.<p>As a process for getting stuff done, a standard buildpack will get you a better result than a manual Dockerfile for all but the most extreme end of advanced users. Even those users are typically advanced in a single domain (e.g. image layering, but not security). While buildpacks are not available for all use cases, when they are I can't see a reason to use a manual Dockerfile for prod packaging.<p>For our team of 20+ people, we actively discourage Dockerfiles for production usage. There are just too many things to be an expert on; buildpacks get us a pretty decent (not perfect) result. Once we add the buildpack to the build toolchain, it becomes a single command to get an image that has most security considerations factored in, with layer and cache optimization done far better than a human would manage. No need for 20+ people to be trained as packaging experts, no need to hire additional build engineers who become a global bottleneck, etc. I also love that our ops team could, if they needed, write their own buildpack to participate in the packaging process, and we could slot it in without a huge amount of pain.
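For reference, the buildpack flow is a single command against the source tree, something like this (the builder name is just an example):<p><pre><code> # one command from source to OCI image, no Dockerfile needed
 pack build myapp --builder paketobuildpacks/builder:base
</code></pre>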
Somewhat tangentially related to the topic of this post: does anyone know any good tech for keeping an image "warm"? For instance, I like to spin up separate containers for my tests vs development so they can be "one single process" focused, but it is not always practical (due to system resources on my local dev machine) to just keep my test runner in "watch" mode, so I spin it down and have to spin it back up, and there's always some delay - even when cached. Is there a way to keep this "hot" without running a process as a result? I generally try to do watch mode for tests, but with webdev I've got a lot of file watchers running, and this can cause a lot of overhead with my containers (on macOS, for what it's worth)<p>Is there anything one can do to help this issue?
One way to simply optimize Docker image size is to use <a href="https://github.com/GoogleContainerTools/distroless" rel="nofollow">https://github.com/GoogleContainerTools/distroless</a><p>Supports Go, Python, Java, out of the box.
For Java, JIB on distroless works pretty well. It's small, fast and secure.<p>- <a href="https://github.com/GoogleContainerTools/jib" rel="nofollow">https://github.com/GoogleContainerTools/jib</a><p>- <a href="https://github.com/GoogleContainerTools/distroless" rel="nofollow">https://github.com/GoogleContainerTools/distroless</a>
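Rough usage, assuming the jib-maven-plugin is already configured in the pom:<p><pre><code> # build straight to the local Docker daemon
 mvn compile jib:dockerBuild

 # or build and push to a registry without needing a Docker daemon at all
 mvn compile jib:build
</code></pre>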
The analyzer product this post is content marketing for looks interesting, but I would want to run it locally rather than connect my image repo to it.<p>Am I being paranoid? Is it reasonable to connect my images to a random third party service like this?
When I want to run a containerized service I just look for the dockerhub image or github repo that requires the least effort to get running. In these cases is it very common to write dockerfiles and try to optimize them?
I've heard that using alpine over a base image like debian makes it harder for current vulnerability scanners to find problems. Is this still true?