Mark Imbriaco

The next generation of Metal: Railway Metal Gen 2

If you've deployed something on Railway in the last few weeks, there's a non-trivial chance your build, your stacker, or your storage is sitting on hardware that didn't exist on our network six months ago.

We've been cutting over a new generation of Metal underneath the platform.

5x more compute capacity

5x more network throughput

And two more nines of capacity stability, meaning less hot spotting for demanding workloads.

We onboarded more compute in Q1 2026 than in all of 2025.

This is what it is, why we did it, and where it's landed so far.

Many people on Railway are aware that we run our own hosts on our own hardware. We’ve been doing this since 2024, and at scale since 2025, and it’s a practice we plan to continue.

In the beginning, Railway was on a single GCP region in 2020, then 4 by 2023. But, as our founder Jake Cooper likes to remind people, you can’t build a cloud on another cloud. We were losing $20 for every dollar that came in.

We care about being a fundamentally good business, both so that we can avoid the mistakes of other PaaSes and so that we can deliver the best experience for our customers.

Our Gen 1 hosts were dual-socket Intel Xeons with terabytes of DDR4 and modest CPU. Also, let’s be honest, Charith and Christian ordered these from a couch at our 2024 Mallorca retreat based on some napkin math. We didn’t know how they’d perform until we ran them in production. This was before I joined Railway, but I can only salute the commitment to forward progress.

In 2024 the typical Railway service was a Node webapp or a small Python service that wanted lots of RAM and not much else.

Then 2025 happened … and we had three new issues to deal with.

  1. More serious workloads

Think vector databases, scrapers, agentic loops, and inference proxies. The median Railway service now wants a real fraction of a CPU. Our internal CPU:RAM demand ratio shifted from 1:20 toward 1:8. The Xeons were not built for that, and their AVX-512 path was nerfed (read: only two AVX compute units for the entire machine) badly enough that LLM workloads in particular were a non-starter on Gen 1. We apologize to anyone who wanted to run Llama 3.

  2. The memory market

From October 2025 to January 2026, RAM prices roughly tripled. Vendors stopped honoring orders. The compute we had on the floor was suddenly worth a lot more than what it cost to put there, and the compute we still wanted to buy was either unavailable or three times the unit price we'd planned for.

  3. Demand curve going parabolic

Then agents arrived: in Q1 2026 we onboarded 4x more demand in one quarter than we did in the entirety of 2025. To put that in perspective, a fully loaded new Gen 1 site buys us only about three months of capacity runway.

Just three months. We were going to run out before we could build the next one.
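To make the runway math concrete, here is a minimal sketch. It assumes demand grows by a constant amount each month, which is optimistic given the curve described above, and every number in it is a hypothetical unit chosen so that a Gen 1 site works out to three months; none of them are real Railway figures. The 4x and 5x multipliers reflect the "four to five times larger" site size mentioned in the recap later in this post.

```python
def runway_months(spare_capacity: float, monthly_demand_growth: float) -> float:
    """Months until spare capacity is used up, assuming constant linear growth."""
    if monthly_demand_growth <= 0:
        raise ValueError("monthly_demand_growth must be positive")
    return spare_capacity / monthly_demand_growth


# Hypothetical units: these are illustrative assumptions, not Railway data.
GEN1_SITE_CAPACITY = 300.0  # capacity added by one fully loaded Gen 1 site
MONTHLY_GROWTH = 100.0      # new demand arriving per month

gen1_runway = runway_months(GEN1_SITE_CAPACITY, MONTHLY_GROWTH)
# A site four to five times larger, at the same (assumed) growth rate.
gen2_runway_low = runway_months(4 * GEN1_SITE_CAPACITY, MONTHLY_GROWTH)
gen2_runway_high = runway_months(5 * GEN1_SITE_CAPACITY, MONTHLY_GROWTH)

print(f"Gen 1 site runway: {gen1_runway:.0f} months")
print(f"Larger site runway: {gen2_runway_low:.0f} to {gen2_runway_high:.0f} months")
```

If growth keeps accelerating, the real runway is shorter than this linear estimate, which is the whole problem.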

So we redesigned what a Railway site is, and started designing the next generation of Metal.

Around June 2025, after the success of our Gen 1 Railway Metal deployment and the price cut we delivered to users, we were confident that we would be able to fully book further capacity. So, much like building a custom gaming rig that would run Unreal at max specs, Charith went shopping.

The logistics of this are challenging. Supply chains for all of the components of a modern server are stretched these days, meaning prices go up and delivery schedules get murky. And once the servers are ready to be delivered, you need to have enough space and power in your datacenters to install and run them.

Our Gen 1 sites were tapped out, so we worked to secure 4 new sites near our existing Metal regions, with multiples of the power and space of our original sites. We were around a year into operating our own sites, so we had plenty of learnings to incorporate into our own datacenter design and operational processes.

At the end of 2025, we knew that demand was growing but even we didn’t fully appreciate how rapidly the curve would become a vertical line. That said, we made the decision to focus on right-sizing density for our customers to reduce the risk of noisy neighbors, provide maximum IOPS and I/O for our customers, and radically expand the networking throughput of our machines.

(All based off of feedback from our Enterprise customers)

We don’t usually publish specs, but we moved to the latest generation AMD Zen 5c EPYC CPUs with 96 cores (192 threads) with DDR5, 5x more storage than Gen 1, and dual 100G ConnectX-6 NICs. All in the same chassis as our Gen 1 storage server, so we get to consolidate from four SKUs down to two. Having run both Supermicro's X13 and H13 platforms for over a year, we were privy to the failure rates and firmware quirks of both platforms - the H13s were rock solid and far more efficient, so we doubled down. We learnt the hard way to stay a generation behind on platform after our then brand new X13s needed a VRM firmware update that required physical jumpering on the mainboard.

We also used this as an opportunity to improve our storage offering for persistent volumes, by moving to the latest generation AMD Zen 5c EPYC CPUs, 12 NVMe drives per box, and 4x 100G NICs. That comes to 750TB of raw storage per rack.
All routed with a Tomahawk 3 ASIC-based fabric: 12.8Tbps of switching, 400G uplinks to the spines, and dozens of 100G and 400G ports at the border.

Gen 2 pods are denser, with roughly 2x the server-to-switch bandwidth and 4x the rack-to-rack bandwidth of Gen 1 with less oversubscription across the fabric. Despite the improved performance, we still reduced network capex per host by about 30%, because the fixed switching cost spreads over many more servers.
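Two bits of arithmetic sit behind that paragraph: oversubscription (server-facing bandwidth divided by uplink bandwidth on a switch) and fixed switching cost spread across hosts. The sketch below illustrates both under stated assumptions. The port counts, cost units, and host counts are hypothetical values picked to show the shape of the calculation, not our actual pod layout or pricing.

```python
def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Server-facing bandwidth divided by uplink bandwidth (1.0 means non-blocking)."""
    if uplink_gbps <= 0:
        raise ValueError("uplink_gbps must be positive")
    return downlink_gbps / uplink_gbps


def capex_per_host(fixed_switch_cost: float, hosts: int) -> float:
    """Fixed switching cost for a pod, spread evenly across its hosts."""
    if hosts <= 0:
        raise ValueError("hosts must be positive")
    return fixed_switch_cost / hosts


# Hypothetical leaf switch: 48 x 100G server ports, with 4 or 8 x 400G uplinks.
downlink = 48 * 100
fewer_uplinks = oversubscription_ratio(downlink, 4 * 400)  # more oversubscribed
more_uplinks = oversubscription_ratio(downlink, 8 * 400)   # less oversubscribed

# Hypothetical cost units: a pricier fabric can still be cheaper per host
# when it serves twice as many hosts.
old_cost = capex_per_host(fixed_switch_cost=100.0, hosts=20)
new_cost = capex_per_host(fixed_switch_cost=140.0, hosts=40)
reduction = 1 - new_cost / old_cost

print(f"Oversubscription: {fewer_uplinks:.1f}:1 vs {more_uplinks:.1f}:1")
print(f"Capex per host: {old_cost:.2f} vs {new_cost:.2f} ({reduction:.0%} lower)")
```

The point of the second function is simply that per-host cost falls whenever host count grows faster than fabric cost, which is what denser pods buy you.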

More important is the interconnect.

Since each Gen 2 site is built beside its Gen 1 sibling rather than on top of it, we link the two over our own dark fiber: 400G links, across 4 diverse paths, with DWDM where we need more wavelengths. And in most regions we can also reach the nearest hyperscaler region in well under a millisecond, which matters for our Enterprise customers who keep one foot in AWS or GCP.

For now, this is the largest hardware order in Railway's history, and the orders are only going up from here.

This phase was hundreds of new servers across our four regions. Over a thousand new shards' worth of headroom and petabytes of NVMe storage to feed them. Most of those orders were placed against forecasts in Q4 2025, before our funding round closed, on the bet that locking in lead times would matter more than waiting for prices to soften.

Prices have not softened.

Another lesson we learned the hard way is that a purchase order placed three months out isn't binding in any real way.

By the time the parts are built the price has moved anywhere from 20 to 300 percent, and the vendor calls near the end to say the NIC you specified is gone, the drives aren't available, or there's a different CPU stepping if you'll take it.

We’ve had to re-qualify parts on short notice several times while holding the core platform spec steady throughout.

You later encounter some cosmically hilarious issues. The laptop you were balancing on a ladder (because who thinks to order a folding table) fell and broke the serial cable. Thank goodness you paranoia-ordered 4 of them. You have to be resilient in this game.

Gen 2 Metal is live in three of four regions today: US West, US East, and Amsterdam. However, Gen 2 being live doesn’t mean that our rollout is done. The hardware keeps arriving in batches, so each site fills over a few weeks rather than all at once.

We're provisioning ninety more hosts as I write this.

For this generation, we’re going to be solely focused on capacity. As a customer, you get the benefit of faster deployment times because scheduling workloads is that much easier, and the pressure on our compute and networking is spread across a much larger surface area. Beyond capacity, this opens up new features that we’re excited to announce soon (…like a box full of sand).

Because of the demand spike (erm, wall), we had to emergency-provision cloud capacity, meaning our hosts are now multi-cloud across GCP, AWS, and Metal. As it stands today, 40 percent of our workloads are on cloud burst (as opposed to 100 percent Metal at the end of 2025). Over time, we plan to increase the share of workloads that run on Metal.

To do this, we plan to keep our Gen 1 sites and keep the same SKU for our customers.

A Gen 1 site doesn't get torn down when we build its Gen 2 partner. It keeps serving the workloads it has, gets the new site as its network upstream, and over time the new builds and new stateful workloads migrate to the Gen 2 capacity.

We made it a point to keep our sites as close as possible given the availability we were able to secure. For cloud interconnect, we’ve been able to keep latency down to 3-4ms in the worst case, and lower in most regions. Though we’re just announcing it today, we’ve actually been running Gen 2 hardware for about a month without customers noticing. Which is exactly as intended.

To recap: Each Gen 2 site sits next to its Gen 1 predecessor in the same region, has a high performance network, and is four to five times larger with five to six times the shards, four to five times the power, and a CPU-to-memory ratio that matches the workloads of today and tomorrow.

We don't list it as a new region. Your US West deploys just have more room now, with nothing for you to change. What you should expect is a more rock solid platform, now backed by hundreds of years of combined experience running networking, storage, and compute hardware for your benefit.

About 13% of the containers running on Railway are on Gen 2 as of June 1, up week over week, and gaining on the share we run across GCP and AWS combined.

With five to six times the capacity per site, we're not running close to the edge of capacity and we have the relationships secured to deploy more compute as demand is sustained.

The scheduling contention some of you hit earlier this year came out of that squeeze, and it's easing. If you're still on Gen 1, you aren't missing much day to day: those sites were nearly always constrained on memory, not CPU, but you should see more consistent performance.

We have not raised prices on Railway Metal, and we are trying our best to not raise them now. We know a lot of people around the world are feeling the squeeze. We may look into new SKUs that will help you take advantage of the underlying performance.

That's harder than it sounds. Everything is in shortage right now. You can read Framework’s excellent updates on how they are handling the RAM shortage.

Editor's note: I was in Shenzhen last month. Mark isn't exaggerating.

At the moment, RAM is still the most contended. Storage is catching up. None of it is ending next quarter. In the last twelve months every line item on a server BoM has increased in price.

Despite all of that, the bill you got in May 2026 was lower than the bill you got in May 2025, because we passed on the egress savings we achieved as part of this move. We are absorbing the upgrade. Unless the market changes and memory triples again, we’ll try to keep things as affordable as possible.

If your service feels faster on Railway than it did in October, the most likely explanation is that we have grown in our capacity to scale our operations, and the second most likely explanation is that the build that produced it landed on our new systems.

I want infrastructure to feel amazing no matter what machine it’s on. From a customer's perspective: it gets faster, the bill doesn't change, and you don't have to know about any of it.

But if you read this far, now you do.