# Superior caching with dagger
@kjuulh · 2023-08-02
Dagger is an up-and-coming CI/CD orchestration tool as code. This may sound abstract, but it is quite simple; read on to learn more.
## Introduction
This post is about me finding a solution to a problem I've faced for a while with Rust: caching for Docker images. I ran into it while building a new tool I am working on called `cuddle-please` (a release manager inspired by `release-please`). I will start with a brief introduction to Dagger, then present the problem and how Dagger solves it, in comparison to Docker.
## What is Dagger?
If you already know what Dagger is, feel free to skip ahead. I will briefly explain what it is and give a short example.

Dagger is a tool where you can define your pipelines as code. Dagger doesn't aspire to replace your tools, such as bash, CLIs, APIs and whatnot; it wants to let you orchestrate them to your heart's content, while bringing proper engineering principles along, such as testing, packaging, and ergonomics.

Dagger allows you to write your pipelines in one of the supported languages (a list that is rapidly expanding).

The official languages, maintained by the Dagger team, are:
- Go
- Python
- Typescript
Community-based ones are:

- Rust (I am currently the author and maintainer of this one, but I don't work for Dagger)
- Elixir
- Dotnet (in progress)
- Java (in progress)
- Ruby, etc.
Dagger at its simplest is an API on top of Docker, or rather BuildKit, but it brings with it so much more. You can kind of think of Dagger as a juiced-up `Dockerfile`, but with more interactivity and programmability. It even has elements of `docker-compose` as well. I personally call it Programmatic Orchestration.
Anyways, a sample pipeline could be:
```rust
#[tokio::main]
async fn main() -> eyre::Result<()> {
    let client = dagger_sdk::connect().await?;

    let output = client
        .container()
        .from("alpine")
        .with_exec(vec!["echo", "hello-world"])
        .stdout()
        .await?;

    println!("stdout: {output}");

    Ok(())
}
```
Now simply build and run it:

```bash
cargo run
```
This will go ahead and download the image and run the `echo "hello-world"` command, whose output we can then extract and print. This is a very basic example. The equivalent `Dockerfile` would look like this:

```dockerfile
FROM alpine
RUN echo "hello-world"
```
The only prerequisite is a newer version of `docker`, but you can also install the `dagger` CLI as well, for better ergonomics and output.
However, Dagger, as its namesake suggests, runs on DAGs (directed acyclic graphs). This means that where you would normally use multi-stage Dockerfiles:

```dockerfile
FROM alpine as base

FROM base as builder
RUN ...

FROM base as production
COPY --from=builder /mnt/... .
```
This forms a DAG when you run `docker build .`, where:

- `base` runs first because `builder` depends on it,
- after `builder` is done, `production` runs because it depends on `builder`.
Dagger does the same thing behind the scenes, but with a much more capable API. In Dagger you can easily share sockets, files, folders, containers, stdout, etc., all from a programming language instead of a recipe-like declarative file such as a `Dockerfile`.
It should be noted that Dagger transforms your code into a declarative manifest behind the scenes, kind of like Pulumi, though it is still interactive. Think SQL, where each query is a declarative command/query.
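To make that concrete, here is a minimal sketch of my own (not from the Dagger docs) of the multi-stage example above, written with `dagger_sdk`; the paths and commands are placeholders, and it assumes `with_file`/`file` work like the `with_directory`/`directory` calls used later in this post:

```rust
// A rough dagger_sdk equivalent of the multi-stage Dockerfile above.
// Paths and commands are placeholders for illustration.
let base = client.container().from("alpine");

// "builder" stage: derived from base, produces an artifact.
let builder = base.with_exec(vec!["sh", "-c", "echo built > /artifact"]);

// "production" stage: pulls the artifact out of builder, like COPY --from.
let production = base.with_file("/artifact", builder.file("/artifact").id().await?);

// Everything above is lazy; only this final query actually runs the DAG.
let out = production.with_exec(vec!["cat", "/artifact"]).stdout().await?;
println!("{out}");
```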
## Why orchestration matters
Dagger is a paradigm shift, because you can now apply engineering on top of your pipelines. Normally in Dockerfiles you would download all sorts of CLIs to manage your package managers, plus tooling such as `jq` and whatnot, to perform small changes to your scripts and transform them into something compatible with `docker build`.
## The problem
A good example is building production images for Rust. Building CI Docker images for Rust is a massive pain. This is because when you run `cargo build`, or any of its siblings, you refresh the package registry if needed, download dependencies, form the dependency chain between crates, and build the final crates/binaries. This is very bad for caching, because you can't tell `cargo` to only fetch dependencies and compile them, but leave your own crates alone.
This generally means that you will cache-bust your dependencies each time you make a code change to your crates, no matter how small. A `Dockerfile`, or rather BuildKit on its own, isn't able to properly split the cache between these steps, because from its point of view it is all a single atomic command.
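For contrast, expressed with the same `dagger_sdk` calls used later in this post (a sketch of my own; `src` stands for the full source directory), the naive build looks like this:

```rust
// The naive build: load the full source first, then build. Because the
// full source is an input to with_exec, any code change invalidates its
// cache, and every dependency is recompiled from scratch.
let naive = client
    .container()
    .from("rustlang/rust:nightly")
    .with_workdir("/mnt/src")
    .with_directory("/mnt/src", src.id().await?)
    .with_exec(vec!["cargo", "build"]);

naive.exit_code().await?;
```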
Existing solutions have you download tools to handle it for you, but those are cumbersome and, to be honest, unreliable. For example, `cargo-chef`. Cargo chef lets you create a `recipe.json` file containing a list of all your dependencies, which you can copy from a planner stage into your build stage and use to cache the dependencies. I've honestly found this really flaky, as the earlier `recipe.json`-producing stage would cache-bust all the time.
```dockerfile
FROM lukemathwalker/cargo-chef:latest-rust-1 AS chef
WORKDIR /app

FROM chef AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
# Build dependencies - this is the caching Docker layer!
RUN cargo chef cook --release --recipe-path recipe.json
# Build application
COPY . .
RUN cargo build --release --bin app

# We do not need the Rust toolchain to run the binary!
FROM debian:buster-slim AS runtime
WORKDIR /app
COPY --from=builder /app/target/release/app /usr/local/bin
ENTRYPOINT ["/usr/local/bin/app"]
```
The above is the original example, but it has some flaws: it relies on the checksum of the `recipe.json` staying the same. If you make a change in one of your crates, it will bust the hash of the `recipe.json`, because we just load all the files with `COPY . .`.
Instead, what we would like to do is load in just the `Cargo.toml` and `Cargo.lock` files for our workspace, as well as for any crates we've got, and then dynamically construct empty `main.rs` and `lib.rs` files to act as the binaries. This is the simplest approach, but very bothersome in a `Dockerfile`:
```dockerfile
FROM rustlang/rust:nightly as base

FROM base as dep-builder
WORKDIR /mnt/src
COPY **/Cargo.toml .
COPY **/Cargo.lock .
RUN echo "fn main() {}" >> crates/<some-crate>/src/main.rs
RUN echo "fn main() {}" >> crates/<some-crate>/src/lib.rs
RUN echo "fn main() {}" >> crates/<some-other-crate>/src/main.rs
RUN echo "fn main() {}" >> crates/<some-other-crate>/src/lib.rs
# ...
RUN cargo build # refreshes registry, fetches deps, compiles them, and links them into a dummy binary

FROM base as builder
WORKDIR /mnt/src
COPY --from=dep-builder target target
COPY **/Cargo.toml .
COPY **/Cargo.lock .
COPY crates crates
RUN cargo build # compiles user code and links everything together, reusing the incremental cache from the previous build
```
This is very cumbersome, as you have to remember to update the `echo` lines above whenever your crates change. You can script your way out of it, but it is just an ugly approach that is hard to maintain and grok.
## The solution built in Dagger
Instead, what we can do in Dagger is use a proper programmatic tool for the job:
```rust
// Some stuff omitted for brevity

// 1
let mut rust_crates = vec![PathBuf::from("ci")];

// 2
let mut dirs = tokio::fs::read_dir("crates").await?;
while let Some(entry) = dirs.next_entry().await? {
    if entry.metadata().await?.is_dir() {
        rust_crates.push(entry.path())
    }
}

// 3
fn create_skeleton_files(
    directory: dagger_sdk::Directory,
    path: &Path,
) -> eyre::Result<dagger_sdk::Directory> {
    let main_content = r#"fn main() {}"#;
    let lib_content = r#"fn some() {}"#;

    let directory = directory.with_new_file(
        path.join("src").join("main.rs").display().to_string(),
        main_content,
    );
    let directory = directory.with_new_file(
        path.join("src").join("lib.rs").display().to_string(),
        lib_content,
    );

    Ok(directory)
}

// 4
let mut directory = directory;
for rust_crate in rust_crates.into_iter() {
    directory = create_skeleton_files(directory, &rust_crate)?;
}
```
You can find this in `cuddle-please`, which uses Dagger as part of its CI. Anyways, for those not versed in Rust, which most people probably aren't, here is what is happening, in rough terms:
1. We create a list of known crates. In this case `ci` is added by hand, because it is a bit special.
2. We list all folders in the `crates` folder and add them to `rust_crates`.
3. An inline function is created, which can add new files to an existing directory; in this case it adds both a `main.rs` and a `lib.rs` file with some dummy content at a given path.
4. We apply these files to all the crates we found above.
This is roughly equivalent to what we had above, but this time we can test individual parts of the code, or even share them. For example, I could create a Rust library containing this functionality, which I could reuse across all of my projects. This is a game-changer!
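To illustrate the testing point, a test for the skeleton helper could look something like this (a rough sketch of my own, not from the `cuddle-please` repo; it assumes `dagger_sdk::connect()` is usable inside a test and that `Directory::file`/`File::contents` behave as named):

```rust
// Hypothetical isolated test for create_skeleton_files; not from the
// cuddle-please repo.
#[tokio::test]
async fn adds_stub_entrypoints() -> eyre::Result<()> {
    let client = dagger_sdk::connect().await?;

    // Start from an empty directory and apply the helper to a fake crate.
    let dir = create_skeleton_files(client.directory(), Path::new("crates/example"))?;

    // Reading the file back proves the stub was written where we expect.
    let main_rs = dir.file("crates/example/src/main.rs").contents().await?;
    assert_eq!(main_rs, "fn main() {}");

    Ok(())
}
```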
Note that Rust is a bit more verbose than the other SDKs, especially in comparison to the dynamic ones, such as Python or Elixir. But to me this is a plus, because it allows us to work in the language we're most comfortable with, which in my case is Rust.
You can look at the rest of the file, but now if I actually build using `cargo run -p ci`, it will first do everything while it builds its cache, and afterwards, if I make a code change in any of the files, only the binary will be recompiled and linked.
This is mainly because of these two file imports (which are equivalent to `COPY` in Dockerfiles):
```rust
// 1
let dep_src = client.host().directory_opts(
    args.source
        .clone()
        .unwrap_or(PathBuf::from("."))
        .display()
        .to_string(),
    dagger_sdk::HostDirectoryOptsBuilder::default()
        .include(vec!["**/Cargo.toml", "**/Cargo.lock"])
        .build()?,
);

// 2
let src = client.host().directory_opts(
    args.source
        .clone()
        .unwrap_or(PathBuf::from("."))
        .display()
        .to_string(),
    dagger_sdk::HostDirectoryOptsBuilder::default()
        .exclude(vec!["node_modules/", ".git/", "target/"])
        .build()?,
);
```
1. Loads in only the Cargo files; this allows us to only cache-bust if any of those files change.
2. Loads in everything except some ignored paths; this is a mix of `COPY` and `.dockerignore`.
Now we simply load them at different times and execute builds in between:
```rust
// 1
let rust_build_image = client.container().from(
    args.rust_builder_image
        .as_ref()
        .unwrap_or(&"rustlang/rust:nightly".into()),
);

// 2
let target_cache = client.cache_volume("rust_target");

// 3
let rust_build_image = rust_build_image
    .with_workdir("/mnt/src")
    .with_directory("/mnt/src", dep_src.id().await?)
    .with_exec(vec!["cargo", "build"])
    .with_mounted_cache("/mnt/src/target/", target_cache.id().await?)
    .with_directory("/mnt/src/crates", src.directory("crates").id().await?);

// 4
let rust_exe_image = rust_build_image.with_exec(vec!["cargo", "build"]);

// 5
rust_exe_image.exit_code().await?;
```
1. Do a `FROM` equivalent, creating a base container.
2. Build a cache volume. This is extremely useful, because you can set up a shared cache pool for these volumes, so that you don't have to rely on BuildKit layer caching (what is normally used in Dockerfiles).
3. Here we build the image:
   - First we set the workdir,
   - then load in the directory fetched above, which includes the Cargo files as well as the stub `main.rs` and `lib.rs` files,
   - next we fire off a normal build with `with_exec`, which functions like a `RUN`; here we build the stub, with a refreshed registry and downloaded and compiled dependencies,
   - finally we load in the rest of the source and replace `crates` with our own crates; this loads in the proper `.rs` files.
4. We now build the actual binary.
5. We trigger `exit_code` to actually run the DAG. Everything before this has been lazy, so if we didn't fire off `exit_code`, or perform another such action on it, we wouldn't actually execute anything. Now Dagger will figure out the most optimal way of running our pipeline for maximum performance and cacheability.
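One step the pipeline above doesn't show is producing a slim runtime image, like the final stage of the cargo-chef example. A hedged sketch of how that could look (the binary name `cuddle-please` and the `cp` step are my assumptions; the target directory lives in a cache mount, so the binary has to be copied into the image filesystem first):

```rust
// Hypothetical follow-up step: the target dir is a cache mount, so copy
// the binary into the container filesystem before extracting it.
let rust_exe_image = rust_exe_image.with_exec(vec![
    "cp",
    "target/debug/cuddle-please",
    "/usr/local/bin/cuddle-please",
]);

// Mirror the Dockerfile runtime stage: a slim base image plus the binary.
let runtime = client
    .container()
    .from("debian:bullseye-slim")
    .with_file(
        "/usr/local/bin/cuddle-please",
        rust_exe_image.file("/usr/local/bin/cuddle-please").id().await?,
    )
    .with_entrypoint(vec!["/usr/local/bin/cuddle-please"]);

// Force evaluation of the lazy DAG by running the binary once.
runtime.with_exec(vec!["cuddle-please", "--help"]).exit_code().await?;
```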
## This is very verbose
Rust is a bit more verbose than other languages, especially in comparison to scripting languages. In the future, I would probably package this up and publish it as a crate I can depend on myself. That would be super nice, and would make it quite easy to share this across all of my projects.
That project, like in my previous post, could serve as a singular component, which could be tested in isolation and serve as a proper API, and tool in general. This is something very hard, if not impossible, with regular `Dockerfile`s (without templating).
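As a sketch of the idea (every name here is made up; no such crate exists yet), the reusable crate's surface could be as small as:

```rust
// Entirely hypothetical API for the imagined crate; nothing here exists.
pub struct CachedRustBuild {
    pub builder_image: String,
    pub source: std::path::PathBuf,
}

impl CachedRustBuild {
    /// Would wire up the skeleton files, cache volume, and two-phase build
    /// from this post, returning the built container for further steps.
    pub async fn run(
        &self,
        client: &dagger_sdk::Query,
    ) -> eyre::Result<dagger_sdk::Container> {
        todo!("package the pipeline from this post behind one call")
    }
}
```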
## Conclusion
I've shown a rough outline of what Dagger is, why it is useful, and how you can do things with it that aren't possible using a `Dockerfile` proper. The code examples show some contrived code, but they highlight that you can solve real problems using this new paradigm of mixing code with orchestration; in this case, an unholy union of Rust and BuildKit through Dagger.