Efficiently caching Docker builds in CI (Dockerhash)

This post presents a way to speed up your Docker-based CI builds with a new caching technique called “Dockerhash”.

Introduction

Docker caches each layer as an image is built, and a layer is only re-built if it or a layer above it has changed since the last build. Building non-trivial Dockerfiles without this cache can therefore take quite some time. If you have to do so frequently, as with automated CI builds, it can eat up a considerable amount of a developer's daily workflow. Especially when working with larger images (for example in a monolithic system architecture), this becomes an incredibly costly factor in software development.

Several approaches to this problem exist and can lead to better performance:

  • Using a persistent Docker cache on each k8s node: As fast as local cache, but many cache misses due to autoscaled CI runners. Practically unusable in our case.
  • Using the registry for Docker layer caching: Reduces build times a bit, but is still too slow for a single image.
  • Using Kaniko for caching: About the same speed as GitLab’s layer-caching approach (too slow).

As of today (04.01.2020) there are also some more promising tools being built, such as Docker’s buildx and Google’s CRFS, but they are still under development and not ready for production use. When the implementation of Dockerhash started in October 2018, none of those repositories existed yet, so a new solution to this problem had to be found.

Implementation

Docker itself calculates a hash for every layer it creates and can thereby determine whether that layer has to be re-built on the next docker build, which accelerates subsequent builds. But if no Docker cache is present locally, which may be the case in a Docker-in-Docker (DinD) environment or when it has been cleared or expired by an automated/scheduled process, the build will always take the same (large) amount of time.
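To illustrate this behaviour, the following minimal sketch (hypothetical image name, classic builder output) contrasts repeated builds on one host with a build in a fresh DinD job:

#!/bin/bash
# first build on this host: every instruction is executed
docker build -t cache-demo .

# second build on the same host: unchanged steps are taken from the local
# layer cache (the classic builder prints "---> Using cache" for them)
docker build -t cache-demo .

# in a fresh docker:dind job (or after `docker system prune -a`) there is
# no local cache, so the same build executes every instruction again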

However, it is possible to calculate a hash for every step in a given Dockerfile without having to build or download anything beforehand. This can be achieved by applying a hash algorithm to all input parameters of the Dockerfile. In the following we simply refer to the resulting hash of this operation as the Dockerhash.

Implementing such a Dockerhash for your CI pipelines may not be easy and you might encounter several obstacles with growing complexity, but in the end the procedure is fairly simple and can be described as follows:

Let D_i stand for all Dockerfile inputs, defined as the quadruple D_i := (D, A, O, C) with the Dockerfile D, the ARGs A consisting of all build arguments, the parent’s ONBUILD statements* O, and all COPY/ADD statements C (including those derived from O). Then the Dockerhash D_h is the result of H(D_i) = D_h for a hashing function H. Now, if D_h does not match that of a previous build, the image is built and D_h is stored in a persistent storage (=> cache miss), so that it can be checked against by later builds.

The following list gives an example of which steps would be necessary in practice (a simplified sketch follows the list):

  1. Acquire aforementioned input params (includes parsing of Dockerfile)
  2. Replace every $ARG with its corresponding value
  3. For every FROM, check if the parent image contains ONBUILD statements* and append them to the currently viewed Dockerfile
  4. Replace all commit hashes in FROMs if a Dockerhash already exists for that image-commit-combination, so we don’t need to re-build images using a commit hash as a tag for their parent’s image
  5. Calculate a hash for each file listed in ADD or COPY (that’s what we need the context for)
  6. Calculate a hash for the whole Dockerfile itself
  7. Make a list of all hashes and sort it, then hash the list itself. (The outcome of this operation is the Dockerhash)
  8. (Optional) If the Dockerhash does not already exist: Build the image and store the Dockerhash in a persistent storage (=> Cache miss)
  9. (Optional) Link the current commit hash to the Dockerhash in the storage
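The following is a minimal bash sketch of steps 5 to 9 under simplifying assumptions: a single-stage Dockerfile in the current directory, no ARG substitution or ONBUILD handling, COPY/ADD sources that are plain files without flags or wildcards, SHA-256 as the hash function, and a local directory standing in for the persistent storage. Names like myimage and DOCKERHASH_CACHE are hypothetical.

#!/bin/bash
# Minimal Dockerhash sketch (steps 5-9), not the full implementation.
set -euo pipefail

cache_dir="${DOCKERHASH_CACHE:-/tmp/dockerhash-cache}"   # hypothetical storage
mkdir -p "$cache_dir"

hashes=()

# 5. hash every file referenced by a COPY or ADD instruction
while read -r src; do
  hashes+=("$(sha256sum "$src" | awk '{print $1}')")
done < <(awk 'toupper($1) == "COPY" || toupper($1) == "ADD" {print $2}' Dockerfile)

# 6. hash the Dockerfile itself
hashes+=("$(sha256sum Dockerfile | awk '{print $1}')")

# 7. sort the list of hashes and hash it again -> the Dockerhash
dockerhash=$(printf '%s\n' "${hashes[@]}" | sort | sha256sum | awk '{print $1}')
echo "Dockerhash: $dockerhash"

# 8. + 9. build only on a cache miss and remember the hash for later builds
if [[ -e "$cache_dir/$dockerhash" ]]; then
  echo "Cache hit - skipping build"
else
  echo "Cache miss - building image"
  docker build -t "myimage:$dockerhash" .    # image name is hypothetical
  touch "$cache_dir/$dockerhash"
fi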

* Quick HOWTO on getting the ONBUILD statements of a Docker image (authentication may vary for private registries):

Courtesy of Ciro S. Costa
#!/bin/bash

image="library/sentry"
tag="9.1.2-onbuild"

# request a pull token for the image from Docker Hub's auth service
token=$(curl -s \
  "https://auth.docker.io/token?scope=repository:$image:pull&service=registry.docker.io" \
  | jq -r '.token')

# fetch the manifest to obtain the digest of the image configuration
digest=$(curl -s \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  -H "Authorization: Bearer $token" \
  "https://registry-1.docker.io/v2/$image/manifests/$tag" \
  | jq -r '.config.digest')

# download the configuration blob and print its ONBUILD statements
curl -s -L \
  -H "Authorization: Bearer $token" \
  "https://registry-1.docker.io/v2/$image/blobs/$digest" \
  | jq -r '.container_config.OnBuild'

Example

The following Dockerfile gives an example of what happens when Dockerhash is applied to it:

ARG TEST
FROM sentry:9.1.2-onbuild
COPY test.txt /
RUN echo "$TEST"

After steps 1 to 4 are done, the copy of the Dockerfile might look like this:

ARG TEST
FROM sentry:9.1.2-onbuild
COPY test.txt /
RUN echo "test-arg" # replaced ARG variable
COPY . /usr/src/sentry # derived from parent image

Then the hashes of all input files and of the Dockerfile itself are put into a (sorted) list:

hashes = [
'3797bf0afbbfca4a7bbba7602a2b552746876517a7f9b7ce2db0ae7b', # "test"
'49d53081deb3afbd9cf2ecc170309c58c019a899933bfa86444b8dc6', # Dockerfile content
'f4f6779e153c391bbd29c95e72b0708e39d9166c7cea51d1f10ef58a' # "foo"
]

This list is itself hashed again, resulting in the Dockerhash:

dockerhash = 'fd981e198c1d176fbba679194a2dcacb2c2731f3d8bf9c8d7c6e8cd0'

Benchmarks

As described before, the Docker image of a monolithic application can take very long to build. The table below compares build times using different caching mechanisms for the following example image:

FROM alpine:3.11.2
RUN dd if=/dev/urandom of=/test.bin bs=64M count=32 iflag=fullblock # emulate image size
RUN sleep 60 # emulate compiling/dependency fetching/etc.
| Caching Mechanism                          | First build  | Following builds (Ø) | Speedup |
|--------------------------------------------|--------------|----------------------|---------|
| None                                       | 3 min 35 sec | 3 min 35 sec         | -       |
| Pulling layers from registry before build  | 3 min 35 sec | 54 sec               | ~4x     |
| Dockerhash                                 | 3 min 35 sec | 12 sec               | ~18x    |

Another important note has to be made when looking at these results. What they cannot show is that build time complexity is O(n) in the image size n for the other methods, but O(1) in the case of Dockerhash, so when using it one does not have to worry about image size at all.

Usage in production

Because you do not want to fall back to having no cache at all, you shouldn’t rely on Dockerhash alone, but use a combination of caching mechanisms in your pipelines. This might look like so:

Dockerhash > Pulling Layers > Application build cache > Cache miss

The application build cache is really just an image in which the package manager’s and compiler’s caches are stored, so that dependencies do not have to be downloaded from the package registry again and, in the case of compiled languages, builds can reuse previously compiled intermediate artifacts.
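A hedged sketch of such a cascade for a CI job script might look as follows. The registry URL, image name and the compute_dockerhash helper (e.g. the sketch shown in the Implementation section) are hypothetical, and the Dockerhash “storage” here is simply an extra image tag in the registry:

#!/bin/bash
# Sketch of the caching cascade described above, not the exact production setup.
set -euo pipefail

IMAGE="registry.example.com/group/app"   # hypothetical registry/image
DOCKERHASH=$(compute_dockerhash .)       # hypothetical helper, see earlier sketch

# 1. Dockerhash: if an image for this hash already exists, reuse it as-is
if docker pull "$IMAGE:dh-$DOCKERHASH" 2>/dev/null; then
  docker tag "$IMAGE:dh-$DOCKERHASH" "$IMAGE:latest"
  docker push "$IMAGE:latest"
  exit 0
fi

# 2. Pulling layers: warm the local layer cache from the last pushed image
docker pull "$IMAGE:latest" 2>/dev/null || true

# 3./4. Application build cache / cache miss: build, reusing pulled layers
# where possible, then record the new Dockerhash tag for later pipelines
docker build --cache-from "$IMAGE:latest" \
  -t "$IMAGE:latest" -t "$IMAGE:dh-$DOCKERHASH" .
docker push "$IMAGE:latest"
docker push "$IMAGE:dh-$DOCKERHASH"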

Authors

Dockerhash and this post were created cooperatively by Fabian Beuke and myself.