Counting to 3 with a new builder processing 50M+ monthly builds
For most of Railway's history, a "build" was the same thing as docker buildx build. A pool of GCP VMs would scale up, a Temporal worker on each box would shell out to buildx, and a few minutes later you had an image in our registry. It worked. We shipped a lot of features on top of it.
It also bled egress, couldn't isolate noisy neighbours, and gave us no good story for putting builds on the bare-metal hardware we'd been buying. By the time we were running 50 build nodes at peak in us-west1 alone, the cracks were obvious. We had to scale out at 20% CPU utilisation just to keep QoS reasonable. We couldn't push overflow anywhere useful. And every time someone ran a 30-minute monorepo build, the box it landed on became unusable for everyone else who happened to be sharing it.
A year later, none of that is true anymore. Builds run inside microVMs on a static pool of bare-metal hosts, with a 1-of-3-on-the-ring scheduler that keeps your BuildKit cache warm across runs. Last week we did 66,000 builds per hour at peak.
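(Since "1-of-3-on-the-ring" is doing a lot of work in that sentence, here's a minimal, hypothetical sketch of what that kind of placement looks like: hash the build's cache key onto a consistent-hash ring, take the next three cells, pick the least loaded. Illustrative code, not our actual scheduler.)

```go
package scheduler

import (
	"hash/fnv"
	"sort"
)

// Cell is one build cell on the ring; Load is whatever utilisation
// signal the scheduler trusts. Both are illustrative.
type Cell struct {
	ID   string
	Load float64
}

type Ring struct {
	points []uint32         // sorted hash points on the ring
	cells  map[uint32]*Cell // point -> cell
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(cells []*Cell) *Ring {
	r := &Ring{cells: map[uint32]*Cell{}}
	for _, c := range cells {
		p := hash(c.ID)
		r.points = append(r.points, p)
		r.cells[p] = c
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick hashes the build's cache key onto the ring, walks to the next
// three cells, and returns the least loaded of them. The same key keeps
// landing on the same few cells, which is what keeps BuildKit caches warm.
func (r *Ring) Pick(cacheKey string) *Cell {
	if len(r.points) == 0 {
		return nil
	}
	start := sort.Search(len(r.points), func(i int) bool {
		return r.points[i] >= hash(cacheKey)
	})
	var best *Cell
	for i := 0; i < 3 && i < len(r.points); i++ {
		c := r.cells[r.points[(start+i)%len(r.points)]]
		if best == nil || c.Load < best.Load {
			best = c
		}
	}
	return best
}
```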
This week I shaved another 10 seconds off the average build by deleting code that decompressed image layers we already had the metadata for. (For fun.)
This is the story of how we got here. It's not a clean story.
The old system? Straightforward, I would say.
Which was both its strength and its problem. A push from git or the Railway CLI landed a code snapshot in a regional bucket: either a tarball pulled from a GitHub clone through the API, or a chunked gRPC upload streamed from the CLI (think `railway up`).
Alongside the tarball we generated metadata about what was inside the snapshot by unpacking the tarball on the host and running Nixpacks's provider detection over the code. Our control plane took that metadata back, kicked off a Temporal workflow, and the workflow picked a builder VM to run on.
Picking the VM was a Redis lookup. Each build carried a PreviousImageTag; we'd look up which builder built that image last time, and if it was still alive the new build went to the same box. Cache hit. If not, anywhere with capacity.
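In sketch form, assuming go-redis and a hypothetical key scheme (the helpers here are stand-ins, not our control plane's actual code):

```go
package main

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Hypothetical liveness and fallback helpers.
func builderIsAlive(ctx context.Context, id string) bool         { return id != "" }
func anyBuilderWithCapacity(ctx context.Context) (string, error) { return "builder-42", nil }

// pickBuilder is the old affinity lookup: reuse whichever builder
// built PreviousImageTag last time, if it's still around.
func pickBuilder(ctx context.Context, rdb *redis.Client, prevImageTag string) (string, error) {
	builderID, err := rdb.Get(ctx, "builder:last-built:"+prevImageTag).Result()
	if err == nil && builderIsAlive(ctx, builderID) {
		return builderID, nil // cache hit: same box, warm layer cache
	}
	if err != nil && err != redis.Nil {
		return "", err // Redis itself failed, not just a missing key
	}
	return anyBuilderWithCapacity(ctx) // cache miss: anywhere with room
}
```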
The builder process itself was a thin wrapper around docker buildx with minimal isolation. Linux did whatever it wanted and we paid the egress bill to keep snapshots and registry traffic flowing across regions. This hit limits, as you can imagine.
By the back half of 2024 we'd hit enough of those limits that "buy more GCP" stopped being a real answer. Users saw stuck builds, flaky pushes, and oncall pages, but the shape of the fix lived deeper, in what a build was on Railway. So we started work on Builder v3.
Big(ger) machines. Better runtime.
Now, a fleet of 256 vCPU, 512 GB Railway Metal hosts each gets split into 8 build cells. Each cell is a microVM with 32 vCPU and 64 GB of RAM, running buildkitd inside. Cells are long-lived: when we need to ship new builder code, we drain the cell, stop it, update its image, and start it again. The VM's disk persists across stop/start, so the BuildKit cache survives the upgrade.
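That drain/stop/update/start loop is the whole upgrade story, so here it is as a sketch. The Cell interface is hypothetical; the real operations live in our orchestrator:

```go
package main

import "context"

// Cell is an illustrative handle on one build cell.
type Cell interface {
	Drain(ctx context.Context) error                // stop accepting builds, let in-flight ones finish
	Stop(ctx context.Context) error                 // stop the microVM; its disk persists
	SetImage(ctx context.Context, img string) error // swap in the new builder image
	Start(ctx context.Context) error                // boot again with the old BuildKit cache intact
}

// upgradeCell ships new builder code to one long-lived cell without
// throwing away its cache.
func upgradeCell(ctx context.Context, cell Cell, image string) error {
	if err := cell.Drain(ctx); err != nil {
		return err
	}
	if err := cell.Stop(ctx); err != nil {
		return err
	}
	if err := cell.SetImage(ctx, image); err != nil {
		return err
	}
	return cell.Start(ctx)
}
```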
That's the model. Easy, right?
Below is what happens when you try to ship it.
Builder v3 started life as code that lived in two places. The build workflow, the Temporal worker (Railway uses Temporal for our queue), the activity router, and the build controller all ran on the host. The actual BuildKit client lived inside the guest VM. The two halves talked over a gRPC bridge tunnelled through vsock.
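Host-side, dialing gRPC through vsock looks roughly like this (using github.com/mdlayher/vsock; the guest CID and port are made up):

```go
package main

import (
	"context"
	"net"

	"github.com/mdlayher/vsock"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialGuest opens a gRPC connection into the guest over vsock instead
// of TCP. CID and port are illustrative.
func dialGuest(cid, port uint32) (*grpc.ClientConn, error) {
	return grpc.Dial(
		"vsock", // the target string is unused; the custom dialer does the work
		grpc.WithContextDialer(func(ctx context.Context, _ string) (net.Conn, error) {
			return vsock.Dial(cid, port, nil)
		}),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
}
```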
However, I couldn't update build logic without redeploying both halves of the system across the entire VM fleet.
Every line change in the build workflow turned into a fleet-wide rollout. The guest-side init was setting up BuildKit dirs and 32 GiB of swap on every VM we ever booted, builder or not. The host-side runner was carrying the Temporal SDK and builder-specific workflows around for VMs that would never run a build. And running builders outside our own VM stack, for example, on GCP directly, wasn't really possible.
So in March I started on what became the "standalone" refactor. It's the part of the project I'm most proud of, partly because the diff was satisfying and partly because it forced a much cleaner picture of what a builder actually is.
Around this time, we were running both systems simultaneously, which was confusing us… and our users.
The new shape: a single standalone builder binary that runs as the main container inside the VM. BuildKit moves to a guest-init extension that just brings up buildkitd with a cgroup, a bind mount, and a config file. Everything else (the Temporal worker, the activities, the BuildKit client, the log shipping) collapses into one process.
The old path was:

workflow (host) → activity → gRPC/vsock → handler (guest) → operation → buildkit

The new path is:

workflow → activity → operation → buildkit

Four layers became one process, with no gRPC hop. Telemetry goes direct to our log pipeline over TCP instead of tunnelling through the host. By itself, removing the unnecessary Temporal round-trips and gRPC bridges cut about 20 seconds off a fully-cached build.
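Concretely, a build step in the standalone binary is just a Temporal activity calling the BuildKit client in-process. A stripped-down sketch, with the request shape and solve options elided (both illustrative):

```go
package main

import (
	"context"

	bkclient "github.com/moby/buildkit/client"
)

// BuildRequest is a placeholder for the snapshot ref, env, etc.
type BuildRequest struct{}

func consumeStatus(ch chan *bkclient.SolveStatus) {
	for range ch { // forward progress to telemetry over TCP in the real thing
	}
}

// BuildActivity runs inside the same process as the Temporal worker.
// buildkitd listens on a local socket in the same microVM, so the old
// workflow -> activity -> gRPC/vsock -> handler chain collapses into a
// function call plus one local socket hop.
func BuildActivity(ctx context.Context, req BuildRequest) error {
	c, err := bkclient.New(ctx, "unix:///run/buildkit/buildkitd.sock")
	if err != nil {
		return err
	}
	defer c.Close()

	statusCh := make(chan *bkclient.SolveStatus)
	go consumeStatus(statusCh)

	_, err = c.Solve(ctx, nil, bkclient.SolveOpt{
		// Frontend attrs, local dirs, and exports elided; this is a
		// sketch, not our real option plumbing.
		Frontend: "dockerfile.v0",
	}, statusCh)
	return err
}
```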
I started rolling the standalone build out to two builders in Singapore on April 13.
Upgraded 16 VMs on the 15th and 16th. By the 18th I had to roll it back.
The problem wasn't the build code.
It was networking.
Around this time, Railway needed about 3x more compute than we had at the start of the year. Long story. So… supporting public clouds was suddenly non-optional.
In the old shape, the Temporal worker ran on the host, where networking is straightforward: it has full access to our log pipeline, to Temporal, to the orchestrator. In the new shape, the worker runs inside the VM, where networking goes through the host's bridge. And on a small percentage of VMs, that bridge was wedging just often enough that workflows would silently fail to start.
The actual fix took longer than the rollback.
So the plan became: run standalone on GCP and AWS first (where VM networking is whatever the cloud provider gives us, not our own bridge), kill v1 entirely, then come back and fix our microVM platform's networking story. That's roughly what happened. We finished the migration back onto our microVM platform almost a month later, on May 12.
In the meantime the GCP rollout surfaced its own collection of small disasters.
Example: DNS via 1.1.1.1 on our bare-metal partner's hosts turned out to be unreliable enough that I ended up running CoreDNS as a library inside each VM, with local cache and upstream to our internal resolver on metal.
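Embedding CoreDNS rather than shipping it as a sidecar keeps the VM image to one binary. The standard pattern for a custom CoreDNS build is to blank-import the built-in plugin set and hand coremain a Corefile; something like this, where the resolver address is a placeholder:

```go
package main

import (
	"os"

	_ "github.com/coredns/coredns/core/plugin" // registers cache, forward, and friends
	"github.com/coredns/coredns/coremain"
)

// A Corefile that answers from a local cache and forwards misses to
// our internal resolver on metal. 10.0.0.2 is a placeholder address.
const corefile = `.:53 {
    cache 30
    forward . 10.0.0.2
}
`

func main() {
	if err := os.MkdirAll("/etc/coredns", 0o755); err != nil {
		panic(err)
	}
	if err := os.WriteFile("/etc/coredns/Corefile", []byte(corefile), 0o644); err != nil {
		panic(err)
	}
	os.Args = []string{"coredns", "-conf", "/etc/coredns/Corefile"}
	coremain.Run()
}
```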
Another one: some VMs were drifting in clock time enough to break cert validation, which we fixed by running NTP inside the VMs. Obvious in retrospect. Snapshot uploads to R2 occasionally failed because the downloaded size and the uploaded size disagreed, and since we send the expected size as a header, any mismatch fails the upload.
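That size check is easy to get wrong in any streaming pipeline. Here's the shape of the guard (a hypothetical helper, not our client code): count what you actually send, compare it against what you declared.

```go
package main

import (
	"fmt"
	"io"
)

// countingReader tallies bytes as they stream through, so we can
// compare what we actually sent against the size we declared.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

// uploadSnapshot streams body to object storage, declaring expected
// bytes up front. put stands in for whatever S3-compatible client call
// actually does the upload.
func uploadSnapshot(body io.Reader, expected int64, put func(io.Reader, int64) error) error {
	cr := &countingReader{r: body}
	if err := put(cr, expected); err != nil {
		return err
	}
	if cr.n != expected {
		return fmt.Errorf("snapshot size mismatch: sent %d bytes, declared %d", cr.n, expected)
	}
	return nil
}
```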
And we mitigated a privilege-escalation CVE on the GCP builder hosts ("Dirty Frag") before it ever fired.
By April 29 every flag was at 100%. On the 30th we marked it GA. We were still seeing reliability issues, mostly stuck builds, but the core migration was done.
Here's the bug that probably did the most user-visible damage.
buildkitd and the per-build runc workers share a microVM. If you give them no isolation, the workers can starve buildkitd of CPU and memory. When that happens, buildkitd stops responding to its Temporal heartbeats. The build doesn't fail, it just stops making progress.
From the outside it looks like the build is "stuck." (Ed. note, we’re sorry.)
The fix is conceptually trivial: put the daemons (buildkitd and our builder process) into one cgroup v2 slice and the per-build runc processes into a sibling slice. Give the control processes a guaranteed CPU and memory floor, and let the workers fight over what's left.
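A sketch of what the guest init writes, assuming direct cgroupfs manipulation (paths and numbers are illustrative; the real floors are tuned):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

const cgroupRoot = "/sys/fs/cgroup"

func writeCg(slice, file, val string) error {
	return os.WriteFile(filepath.Join(cgroupRoot, slice, file), []byte(val), 0o644)
}

// setupSlices gives the control processes (buildkitd + our builder) a
// guaranteed floor and leaves the per-build runc workers to fight over
// the rest.
func setupSlices(controlPID int) error {
	for _, s := range []string{"control", "workers"} {
		if err := os.MkdirAll(filepath.Join(cgroupRoot, s), 0o755); err != nil {
			return err
		}
	}
	// Enable the cpu and memory controllers for the child slices.
	if err := os.WriteFile(filepath.Join(cgroupRoot, "cgroup.subtree_control"),
		[]byte("+cpu +memory"), 0o644); err != nil {
		return err
	}
	// CPU: weights are proportional, so 800 vs 100 means the control
	// slice wins when both are saturated but workers use idle cycles freely.
	if err := writeCg("control", "cpu.weight", "800"); err != nil {
		return err
	}
	if err := writeCg("workers", "cpu.weight", "100"); err != nil {
		return err
	}
	// Memory: memory.min is a hard floor the kernel won't reclaim below.
	if err := writeCg("control", "memory.min", fmt.Sprint(8<<30)); err != nil { // 8 GiB
		return err
	}
	// Move the control process in; runc workers get spawned into "workers".
	return writeCg("control", "cgroup.procs", strconv.Itoa(controlPID))
}
```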
In practice this was about a week of work.
The microVM kernel had to support cgroup v2 properly (we were already on 6.x for unrelated reasons), and Jake Cooper also had to fix up our bare-metal builders, which had come from the factory with RAID1 set up wrong: the vendor had put the EFI partition into the RAID array, which is impressively wrong.
I had to plumb the cgroup parents through the proto so the guest init knew where to put what. The first canary on a GCP builder behaved well enough that I rolled the change across all of GCP, then waited a few days, then took it to the metal fleet.
After cgroup isolation landed, the population of "build is stuck for 20 minutes with no progress" collapsed. I had to write a Datadog monitor specifically for the remaining cases, because most of the obvious symptoms went away.
This one's recent enough that I'm still rolling it out.
When we build, we need metadata about every layer in the source image — diff IDs, sizes, things that BuildKit uses to compute its cache key and plan the build graph. The old code path got this metadata by decompressing each layer — reading raw bytes through gzip just to look at the metadata.
The fix is to read the same metadata from the OCI image manifest and config, which already carry all of it.
This is about 15× faster than decompression on average, which translates to roughly 10 seconds shaved off the average build.
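The metadata lives in two small JSON blobs: compressed sizes and digests in the manifest, uncompressed diff IDs in the image config. Reading it with the standard OCI types looks like this (blob fetching stubbed out):

```go
package main

import (
	"encoding/json"
	"fmt"

	v1 "github.com/opencontainers/image-spec/specs-go/v1"
)

// layerMetadata pulls everything a build planner needs out of the
// manifest + config, without touching a single layer tarball.
// fetchBlob is a stand-in for however you read blobs from your registry.
func layerMetadata(manifestJSON []byte, fetchBlob func(v1.Descriptor) ([]byte, error)) error {
	var manifest v1.Manifest
	if err := json.Unmarshal(manifestJSON, &manifest); err != nil {
		return err
	}

	configJSON, err := fetchBlob(manifest.Config)
	if err != nil {
		return err
	}
	var config v1.Image
	if err := json.Unmarshal(configJSON, &config); err != nil {
		return err
	}
	if len(config.RootFS.DiffIDs) != len(manifest.Layers) {
		return fmt.Errorf("manifest/config layer count mismatch")
	}

	// manifest.Layers[i] lines up with config.RootFS.DiffIDs[i]:
	// compressed digest + size on one side, uncompressed diff ID on the other.
	for i, layer := range manifest.Layers {
		fmt.Printf("layer %d: %s (%d bytes compressed), diffID %s\n",
			i, layer.Digest, layer.Size, config.RootFS.DiffIDs[i])
	}
	return nil
}
```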
If you're building something that touches OCI internals, this is the kind of thing it pays to audit. We had this code for years and nobody had looked at it!
A normal week on Builder v3, as of mid-May, runs at about 66,000 builds per hour at peak, up from around 60,000 the week before, which was already a record.
The fleet is split across GCP and our bare-metal partner, all of it standalone. Draining and restarting a builder is a one-line operation in our internal tool, and the runbook fits in a paragraph.
The egress problem is, as far as I can tell, solved.
The longer-term direction is away from running builds at all.
The fastest, cheapest, most reliable build is the one that never happens — and the more time I spend optimising the build pipeline, the more obvious it becomes that the real win is making most deploys skip it entirely. We're starting to sketch out what a buildless path looks like for the workloads where it's possible. Builder v3 is what we needed to get here.
The next post may be about how we shrink it back down to nothing.
For now: Builder v3 is live and standalone, v1 is dead, and the egress bleed is patched. Users are happy. I'm taking PTO on Friday.
Editor's note: Ed did, in fact, take PTO after writing this.
