
Zero-Touch Bare Metal at Scale
We’ve all gotten used to clicking a button and getting a Linux machine running in the cloud. But when you’re building your own cloud, you’ve got to build the button first.
Lately we’ve been writing about building out our Metal infrastructure one rack at a time.
In our last post, we covered the trials of standing up the physical infrastructure. In this episode, we talk about how we operationalize the hardware once it’s installed.
You’ve built your dream server with dual-redundant NICs and multiple redundant NVMe drives for resilience. You’ve ordered 100 units and got them all racked and wired up to match your detailed diagrams. You go to the DC with your USB stick and reboot into your favorite Linux distro’s installer, only to be greeted by the “Choose your Network Interface” screen and a dozen or so incomprehensible interface names.
Herein lies our first hurdle — how do you map the physical arrangement of hardware to what your operating system sees?

A Supermicro server with 12 NVMe Bays - Guess which bay is /dev/nvme0n2 as seen by Linux? (That’s a trick question).
It helps first to take a step back and discuss how Linux names devices.
When a host boots Linux, the OS enumerates the attached hardware. Most commonly, devices are attached to the PCIe bus, and Linux enumerates them according to the hierarchical structure of the bus. When Linux encounters a device during this traversal, the udev daemon gets an event and associates a number of identifiers with the device; it then uses these identifiers to formulate a name, which it assigns to the device nodes it creates in /dev and elsewhere.
The consequence of this approach is that device names can be very unstable, especially if the hardware layout changes between boots or the enumeration order is non-deterministic. If you used Linux in the olden days, you’d know the pain of plugging in a new PCIe card and booting, only to discover that your networking broke. Despite the many critiques that could be leveled against systemd, it has addressed this problem for network interfaces since v197, which introduced predictable interface names. Storage device naming, however, is still a crapshoot, and drives are better addressed by serial number.
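In practice, that means resolving drives through the stable /dev/disk/by-id symlinks udev creates rather than trusting enumeration-order names. A minimal sketch (the serial number here is made up):

```go
// Minimal sketch: resolve an NVMe drive by its serial number via the stable
// /dev/disk/by-id symlinks instead of the enumeration-order /dev/nvmeXnY names.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func deviceBySerial(serial string) (string, error) {
	const byID = "/dev/disk/by-id"
	entries, err := os.ReadDir(byID)
	if err != nil {
		return "", err
	}
	for _, e := range entries {
		// udev names these links nvme-<model>_<serial> (plus -partN for partitions).
		name := e.Name()
		if strings.HasPrefix(name, "nvme-") && strings.HasSuffix(name, serial) {
			return filepath.EvalSymlinks(filepath.Join(byID, name))
		}
	}
	return "", fmt.Errorf("no NVMe device with serial %q", serial)
}

func main() {
	dev, err := deviceBySerial("S123ABC456") // hypothetical serial number
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(dev) // e.g. /dev/nvme3n1
}
```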
Our approach to addressing this unpredictability is to lean on Redfish, an HTTP API for the Baseboard Management Controllers (BMCs) attached to server motherboards. Redfish APIs can enumerate the hardware on a board, detailing PCIe cards, NVMe drives, their serial numbers and/or MAC addresses, and their physical locations.

An Extract of a Redfish System object scraped from one of our storage nodes
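The scrape itself is just authenticated HTTP against the BMC. As a hedged sketch (the endpoint path, credentials, and IP are placeholders, and the exact set of fields varies by vendor), pulling the top-level System object looks something like this:

```go
// Hedged sketch of a Redfish scrape: fetch the ComputerSystem resource from a
// BMC and pull out a few of the standard fields. A real scrape walks further
// into EthernetInterfaces, Storage, and OEM resources.
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
)

type redfishSystem struct {
	Manufacturer  string `json:"Manufacturer"`
	Model         string `json:"Model"`
	SerialNumber  string `json:"SerialNumber"`
	MemorySummary struct {
		TotalSystemMemoryGiB float64 `json:"TotalSystemMemoryGiB"`
	} `json:"MemorySummary"`
	ProcessorSummary struct {
		Count int    `json:"Count"`
		Model string `json:"Model"`
	} `json:"ProcessorSummary"`
}

func scrapeSystem(bmcIP, user, pass string) (*redfishSystem, error) {
	// BMCs ship with self-signed certificates; skip verification on the
	// isolated management network.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, err := http.NewRequest("GET", "https://"+bmcIP+"/redfish/v1/Systems/1", nil)
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(user, pass)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var sys redfishSystem
	if err := json.NewDecoder(resp.Body).Decode(&sys); err != nil {
		return nil, err
	}
	return &sys, nil
}

func main() {
	sys, err := scrapeSystem("10.0.0.42", "ADMIN", "password") // placeholders
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %s (S/N %s), %d CPUs, %.0f GiB RAM\n",
		sys.Manufacturer, sys.Model, sys.SerialNumber,
		sys.ProcessorSummary.Count, sys.MemorySummary.TotalSystemMemoryGiB)
}
```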
Our very first step once a rack is installed is to build a CSV of identifiers for the equipment in the rack - hostname, BMC MAC address, BMC password, and a few other details. We then push this data via gRPC to an internal control plane called MetalCP.
MetalCP runs a Temporal worker that implements a Host Import workflow. For each server, we kick off a workflow that runs through the following steps:
- Match the device to its representation in our internal DCIM tool (Railyard)
- Connect to the datacenter's management network via Tailscale
- Connect to the management router at the datacenter and identify the DHCP lease assigned to the BMC (via its MAC)
- Connect to the BMC of the server via this IP and scrape all available data
- Create an internal Protobuf representation of the hardware layout
- Create static DHCP leases for the BMC and for the management NIC on the host using discovered MAC addresses from the scrape
- Update a DB with all the details about the server
An import workflow takes less than a minute to complete in most cases, and Temporal ensures recovery from any transient failures.
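To give a sense of the shape of this, here is a rough sketch of such a workflow with the Temporal Go SDK; the activity names and types are illustrative, not our actual MetalCP code:

```go
// Rough sketch of a Host Import workflow using the Temporal Go SDK. Activity
// names and types are illustrative; each activity wraps one of the steps above.
package metalcp

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

type HostImportInput struct {
	Hostname    string
	BMCMAC      string
	BMCPassword string
}

// HostSpec stands in for the Protobuf representation of the hardware layout.
type HostSpec struct {
	NICs   []string
	Drives []string
}

func HostImportWorkflow(ctx workflow.Context, in HostImportInput) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		// Temporal retries transient failures (flaky BMCs, DHCP races) for us.
		StartToCloseTimeout: 2 * time.Minute,
	})

	// Match the device against Railyard, our DCIM tool.
	if err := workflow.ExecuteActivity(ctx, "MatchRailyardDevice", in.Hostname).Get(ctx, nil); err != nil {
		return err
	}

	// Find the DHCP lease the management router handed to this BMC.
	var bmcIP string
	if err := workflow.ExecuteActivity(ctx, "LookupBMCLease", in.BMCMAC).Get(ctx, &bmcIP); err != nil {
		return err
	}

	// Scrape Redfish and build the hardware description.
	var spec HostSpec
	if err := workflow.ExecuteActivity(ctx, "ScrapeRedfish", bmcIP, in.BMCPassword).Get(ctx, &spec); err != nil {
		return err
	}

	// Pin static DHCP leases for the BMC and management NIC, then persist.
	if err := workflow.ExecuteActivity(ctx, "CreateStaticLeases", spec).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, "PersistHost", in.Hostname, spec).Get(ctx, nil)
}
```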

The timeline of a Host Import workflow
The database record stored by the Host Import workflow contains all the information you could want about a server. We generate:
- A list of Network Interface Cards, their Physical Location (Slot), and MAC addresses for each Port
- A list of NVMe Drives, their Serial Numbers, Model Numbers, and Physical Slot IDs
- System stats such as CPU core counts, RAM size, and hardware identifiers
We then match this hardware specification against a list of known configurations. These configurations encode details such as network interface names assigned to specific PCIe slots and NVMe drive bay identifiers. A hardware configuration is as simple as a set of conditionals in Golang that match the key distinguishing factors of a specific type of server, and a config object containing stable interface and drive names.

An example hardware identification function; pb.SystemInfo is an encoding of the raw data we’ve scraped from Redfish
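Roughly, such a matcher looks like the following; the struct fields, model strings, and names below are invented for illustration and stand in for the real pb.SystemInfo schema:

```go
// Rough sketch of a hardware identification function. SystemInfo stands in
// for the pb.SystemInfo Protobuf message; its fields are invented here.
package hwconfig

import "fmt"

type SystemInfo struct {
	BoardModel string
	NVMeCount  int
	NICSlots   []int
}

// HardwareConfig carries the stable names we want Ansible to use.
type HardwareConfig struct {
	Name           string
	InterfaceNames map[int]string // PCIe slot -> interface name
	DriveBays      map[int]string // bay number -> bay identifier
}

var storageNode12Bay = HardwareConfig{
	Name:           "storage-12-bay",
	InterfaceNames: map[int]string{1: "ens1f0np0", 2: "ens2f0np0"},
	DriveBays:      map[int]string{0: "DiskBay0", 1: "DiskBay1"}, // ...and so on
}

// IdentifyHardware matches the scraped system against known configurations
// and fails loudly when nothing matches (missing NIC, missing DIMM, etc.).
func IdentifyHardware(sys SystemInfo) (*HardwareConfig, error) {
	switch {
	case sys.BoardModel == "H13SSL-EXAMPLE" && sys.NVMeCount == 12 && len(sys.NICSlots) == 2:
		return &storageNode12Bay, nil
	default:
		return nil, fmt.Errorf("unknown hardware layout for board %q", sys.BoardModel)
	}
}
```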
A custom plugin exposes this Hardware Config object to Ansible, allowing us to reference NVMe disks and network interface names with Jinja template expressions. For example, an NVMe drive in Bay 0 can be uniquely addressed as /dev/disk/by-id/nvme-{{ drive_bays.DiskBay0.device_model }}_{{ drive_bays.DiskBay0.serial_number }}.
This approach lets us build config in any shape we want without leaving anything to chance. The import workflow also flags faults in the hardware; if a server is not reporting a NIC or a DIMM of RAM, the workflow will fail since the hardware won’t match a known configuration. We’ve thus far identified servers with faulty RAM and servers with NICs installed in the wrong slots through this mechanism.

Configuration is one step, but before Ansible can do anything, an OS still needs to be installed. So how do we one-click ourselves out of installing Linux in the first place? The answer involves a pinch of AI 🪄.
When we first started provisioning servers, we did it by hand: 20 web KVM tabs open in Chrome and manual interaction with the debian-installer. Over time we’ve evolved to require less and less human intervention.
The Debian Installer can network boot via PXE, and Pixiecore from Dave Anderson wraps all the PXE complexity in a few HTTP calls. We use MetalCP as a backend to Pixiecore and return a simple JSON payload describing the netboot kernel, initramfs, and kernel command line. Debian can also accept a preseed file over HTTP once networking is configured.
We use our knowledge of the PXE-booting server’s MAC address, plus the system info we’ve scraped from Redfish, to create a kernel command line and preseed file tailored to the booting machine.
These are all exposed as HTTP APIs proxied by Pixiecore to the PXE-booting machine.

The function that generates our PXE boot response
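In outline, a Pixiecore API-mode backend is just an HTTP handler: Pixiecore asks the backend for /v1/boot/<MAC> and expects JSON naming a kernel, initrds, and a command line. The sketch below uses placeholder URLs and hostnames; the real handler also tailors the preseed to the Redfish data for that MAC.

```go
// Hedged sketch of a Pixiecore API-mode backend. The URLs and the preseed
// endpoint are placeholders.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

type bootSpec struct {
	Kernel  string   `json:"kernel"`
	Initrd  []string `json:"initrd"`
	Cmdline string   `json:"cmdline"`
}

func bootHandler(w http.ResponseWriter, r *http.Request) {
	mac := strings.TrimPrefix(r.URL.Path, "/v1/boot/")

	// Look up the previously imported host by the MAC that is PXE booting,
	// and point debian-installer at a preseed generated for that machine.
	spec := bootSpec{
		Kernel: "http://boot.internal/debian/linux",
		Initrd: []string{"http://boot.internal/debian/initrd.gz"},
		Cmdline: fmt.Sprintf(
			"auto=true priority=critical url=http://metalcp.internal/preseed/%s", mac),
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(spec)
}

func main() {
	http.HandleFunc("/v1/boot/", bootHandler)
	http.ListenAndServe(":8080", nil)
}
```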
Getting a host to a PXE-bootable state requires us to reboot the server, but we don’t want this reboot to happen on a server that may be running user code. To guard against this, we implement a logical state machine for each host in the provisioning process and orchestrate the OS install via another Temporal workflow.
But how does the Workflow know which state the server is in during the install? Redfish APIs tell us that it’s powered on, but little else.
Since it’s 2025, we just ask Claude.

An example of Claude happily recognizing a server POST screen
Supermicro introduced a CaptureScreen OEM API in their Redfish 1.14 release; with this API we can obtain a near real-time image of the server screen. With a basic prompt to Claude, we can then get a JSON payload that describes the state of the server at that point. Combining this with the PXE boot automation above into a Temporal workflow, we can install and provision an OS with one gRPC API call.

The Temporal workflow polls the screen and monitors the install to completion, updating an internal DB at each step
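A simplified sketch of that loop, again with the Temporal Go SDK; the activity names, states, and polling interval are illustrative:

```go
// Simplified sketch of the install-monitoring loop: capture the console via
// the BMC, have Claude classify the screenshot, advance a per-host state
// machine, and record progress. Activity names and states are illustrative.
package metalcp

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func OSInstallWorkflow(ctx workflow.Context, hostID string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})

	state := "REBOOTING"
	for state != "INSTALL_COMPLETE" {
		// Supermicro's CaptureScreen OEM Redfish API gives us a near
		// real-time image of the server console.
		var screenshot []byte
		if err := workflow.ExecuteActivity(ctx, "CaptureScreen", hostID).Get(ctx, &screenshot); err != nil {
			return err
		}

		// A basic prompt asks Claude to classify the image (POST screen,
		// installer progress, login prompt, ...) into a small JSON payload.
		if err := workflow.ExecuteActivity(ctx, "ClassifyScreen", screenshot).Get(ctx, &state); err != nil {
			return err
		}

		if err := workflow.ExecuteActivity(ctx, "RecordHostState", hostID, state).Get(ctx, nil); err != nil {
			return err
		}

		// Polling once a minute keeps the Claude bill negligible.
		if err := workflow.Sleep(ctx, time.Minute); err != nil {
			return err
		}
	}
	return nil
}
```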
There are probably more effective methods of achieving the same result, but it costs us less than a dollar to provision 50 servers with Claude screen-scraping every minute during the install.
Now that an OS is installed, some duct tape and Ansible will get some basic software running on the machine. Bringing up networking, however, is another typical annoyance.
All the solutions we’ve discussed thus far have relied on a Management (or Out-of-Band) network. This is a dedicated Gigabit Ethernet network that links a management NIC, the BMCs, and other support infrastructure inside the cage. This network isn’t built to scale: it’s limited to a few hundred hosts at most and uses off-the-shelf routers, DHCP, and VLANs for isolation, with low fault tolerance. The network that carries user traffic, which we term the dataplane network, has very different requirements.

The dataplane network is a CLOS topology with redundant switches and redundant links
For starters, the dataplane has multiple redundant links and must tolerate link failures and maintenance of redundant network switches. This requires running a routing protocol between switches and servers that can route around device or link failures. Typically this routing must be configured for every point-to-point link, but that scales poorly in large deployments because the config needs to be customized for each rack.
Unlike typical BGP, where the peer relationship must be defined between two explicitly configured IPv4 addresses for each router-to-router or router-to-server link, BGP unnumbered allows the use of autogenerated IPv6 link-local addresses as peer addresses and next-hops for both IPv4 and IPv6 BGP routing.

With FRR, BGP adjacencies can be declared on interfaces, and FRR will pick the IPv6 link-local address of the peer connected to that interface as its next-hop for routing
This makes the routing setup on switches and servers uniform. Add all the interfaces connecting to a given type of device (e.g. Top-of-Rack switch uplinks, server downlinks, or Spine switch uplinks) to a BGP peer-group and configure BGP as normal; the same config can then be shipped to every equivalent device in the cluster.
At Railway, we have one BGP config template per kind of device and roll them all out to our switches via Ansible. We do not need to reconfigure any network gear as we scale a rack or cage.

Every possible port that can connect to a Top-of-Rack switch is marked as such on a Spine switch, regardless of whether those racks are populated. This makes expansion plug-and-play.
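As a hedged sketch, the rendered per-device config on a server looks something like this (interface names, the AS number, and the peer-group name are placeholders):

```
! Sketch of a server-side frr.conf in this scheme; interface names, AS
! numbers, and the peer-group name are placeholders.
frr defaults datacenter
!
router bgp 4200000101
 bgp bestpath as-path multipath-relax
 ! One peer-group per kind of neighbor; peers are declared on interfaces,
 ! so the same config ships to every equivalent device.
 neighbor FABRIC peer-group
 neighbor FABRIC remote-as external
 neighbor eth0 interface peer-group FABRIC
 neighbor eth1 interface peer-group FABRIC
 !
 address-family ipv4 unicast
  ! Advertise routes installed in the kernel routing table (more on this below).
  redistribute kernel
  neighbor FABRIC activate
 exit-address-family
```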
With the switch fabric configured in this way, and with FRR running the same configuration all the way down to the servers, we build a full L3 network with ECMP across redundant links. ECMP here needs bgp bestpath as-path multipath-relax and specific AS numbering to avoid unintended consequences from BGP loop detection. FRR also provides a set of “datacenter” defaults that adjust timers for fast eBGP convergence.
When we want to add a new IP to a host on the network, we have a small agent insert a route into the Linux kernel routing table. A preconfigured FRR daemon picks this up and propagates it through the rest of the network, as long as the routed prefix is within one of the subnets assigned to the site.
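That agent boils down to a netlink call. A minimal sketch using the vishvananda/netlink package (the prefix and interface are placeholders, and it assumes FRR is set to redistribute kernel routes, as in the config sketch above):

```go
// Minimal sketch of the route-injecting agent: install a /32 for a new
// workload IP into the kernel routing table so FRR can pick it up and
// announce it to the fabric. Prefix and interface are placeholders.
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// The /32 we want the rest of the network to learn.
	dst, err := netlink.ParseIPNet("10.64.3.25/32")
	if err != nil {
		log.Fatal(err)
	}

	// Anchor the route to a local interface (e.g. the loopback).
	lo, err := netlink.LinkByName("lo")
	if err != nil {
		log.Fatal(err)
	}

	route := &netlink.Route{
		Dst:       dst,
		LinkIndex: lo.Attrs().Index,
	}
	if err := netlink.RouteAdd(route); err != nil {
		log.Fatal(err)
	}
}
```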

IPv4 routes with IPv6 nexthops. These IPv6 addresses are Link-Local addresses discovered via IPv6 Neighbor Discovery
Building Railway Metal, we’re more conscious than ever that we need to invest in tooling to deliver the best Metal experience we can. We’re finding off-the-shelf solutions lacking or outdated in various ways, and the tooling we’re building for Railway Metal is proving to be an entire software vertical in itself.

MetalCP is branching into Network Automation and is already our in-house RANCID/Oxidized replacement
We’ll continue to write more about our exploits, but in the interim - if you find any of this interesting or fun, we’re hiring!
Pop on over to railway.com/careers and check out our open roles.