Avatar of Phin Walton
Phin Walton

How to build a 100M RPS CDN in 30 days with Rust and WASM

Railway has technically ran an “edge” network for over 3 years, although it didn’t really live up to the “edge” name - it ran on our 4 Railway regions across the world, nowhere near 100+ POPs you expect when thinking about edge computing.

However, starting this week, the CDN we built (which served you this very page!) is becoming available to all Railway customers at the click of a button.

In this post, we will talk about the journey it took to get here today - and how we decided to build a CDN for the modern internet.

💡
Note from the author: I know the title says 100M RPS - that’s our theoretical benchmarked throughput under ideal conditions, however we operate this in production with 1M RPS at peak hours today. Under sustained load and DDoS attacks, we have withheld spikes to ~30m RPS with no disruption.

You may be reading this, asking yourself why we’d go through all the effort of building out such a large system that we could just buy off the shelf from another provider. The reality is, we initially did just that and tried to integrate it with our product, but quickly realized a few things:

  • Our internal development velocity outpaces the rate at which these legacy platforms can iterate
  • A lot of our users have unique, non-standard request & connection behaviors which we’ve supported for years, which other providers are not engineered to support

But the thing I actually care about is what owning the stack lets us build into the product. If you look at our support threads, network related issues made up around 20 percent of all ticket volume. CDN configuration and nameservers and the like.

During the initial scoping of what a CDN may look like if we built it ourself, we had a realization:

We’re a cloud computing platform, not just a CDN - and that unlocks some unique benefits that traditional CDNs find very hard to execute correctly;

We operate both the edge AND the origins - which means we can send traffic through the internet and backbone links directly to your workload, rather than trying to guess where the origin is located. Traditional CDNs find this very hard, and the datapoints they use to try to make this guess are very muddy - for example, even if an “origin” is 10ms away on layer 3, the origin may proxy an asset that lives 200ms away. Luckily for us, we already know exactly where your app runs - to the longitude, latitude and altitude. This give us a unique routing superpower.

We can offer performant defaults that other CDNs are too scared to offer - for example, did you know that Cloudflare doesn’t cache HTML or JSON by default, even if your website’s Cache-Control header says to do so?

I believe this is because they’re too scared of confusing the developer: the developer expects website updates to appear when they refresh, and Cloudflare doesn’t know when you make updates to your website. However, with the Railway CDN, we respect HTML cache control and automatically purge HTML cache when you deploy - best of both worlds!

Sorry, I forgot to mention - we call this new edge network and CDN Hikari, the Japanese word for “light” or “fiber optics”, also the name of the fastest Tokaido Shinkansen (bullet train), and that one Blue Archive train conductor.

Since I dive into internals, it’s easier if I use the internal name for the blog.

I won’t get too much into the specifics of the server configuration or datacenter & provider lease agreements - that could be an entire post itself! But, to give you the rundown, we started procuring these a few months before we started building Hikari, and now have 60 POPs (*we are still waiting for the delivery of ~40% of these locations), over 180 CDN nodes and tens of terabits of network capacity. Each node is a 16 core EPYC with 256G of memory, 8TB of NVME storage and 100G networking.

Our CDN POPs (so far!) as displayed by our internal DCIM tooling

Our CDN POPs (so far!) as displayed by our internal DCIM tooling

Usually when building a huge project like this, you think about the implementation, system and business logic first, and deployment later. But, with over 180 nodes, handling ~1 million RPS at peak hours, we knew we had to design this differently. To put it straight, we have been scared to perform upgrades to our existing origin proxies because of the blast radius - we use Ansible and have been burned in the past by the lossy, brittle, hard to audit behaviors that Ansible exposes - it’s also a slow tool to work with when working with hundreds of servers.

When scoping the initial project work with the team in Kyoto, we aligned on the fact that the entire deployment experience had to be first class. We need to be able to quickly iterate on the product at Railway speed, without the anxiety of potentially impacting 1M RPS at once.

Server-side WASM (WASI) and wasmtime quickly came to mind for the core request path. Our thought was - if we can expose core business logic to an extremely efficient sandboxing engine like WASI, we could invoke that binary per request. One node could serve requests under many versions of that binary, so we could easily apply rolling updates fleet-wide without effecting any existing connections. More on this later.

We effectively ended up splitting up this entire project into 4 services:

  • edge-cp - The edge control plane which all nodes connect to; handles rollouts, node health, configuration syncing, and exposes controls used by our internal tooling
  • hikari-keeper - The sidecar service which sits on every node, handles BGP, dataplane upgrades and reports node status to edge-cp
  • hikari - The core dataplane service which terminates TLS/HTTP2, invokes WASI per request, and manages cache
  • hikari-guest - The actual WASM guest binary invoked per request, which handles business logic like “should this asset be cached”, enforces limits based on the domain’s configuration, WAF rules, and more
Everything is observable and configurable in real-time from our internal tooling

Everything is observable and configurable in real-time from our internal tooling

When we want to onboard a new node, we just click a button on our internal tooling, which gives us an authenticated curl to run on the node, which installs hikari-keeper - that says hello to our edge-cp control plane, and starts bootstrapping. The bootstrap process is effectively a bit of Rust that defines a version and a list of Steps to reconcile against.

pub enum StepState {
    InSync,
    NeedsChange { reason: String },
}

#[async_trait]
pub trait Step: Send + Sync {
    fn name(&self) -> &str;
    async fn check(&self) -> Result<StepState, StepError>;
    async fn apply(&self) -> Result<(), StepError>;
}

// The list of steps we should run when the node starts up
// or receives an upgrade signal
pub fn steps(state: &State, dynamic_config: &Arc<DynamicConfigStore>) -> Vec<Box<dyn Step>> {
    let mut s: Vec<Box<dyn Step>> = vec![
      b(steps::DisableUnattendedUpgrades),
      b(steps::InstallAptPackage { name: "bird2" }),
      b(steps::InstallAptPackage { name: "parted" }),
      // -- networking: forwarding + listen backlog --
      b(steps::SetSysctl { key: "net.ipv4.tcp_syncookies", value: "1" }),
      b(steps::SetSysctl { key: "net.ipv4.ip_forward", value: "1" }),
      b(steps::InstallHikari),
      // ...etc
    ];
}

When we want to make an update to the Linux state, we just change this file, push to git, and the fleet reconverges within a 30 minute time period, gated on any errors that occur. hikari-keeper calls itself to update every minute. The entire upgrade process, and any errors are synced with our edge-cp control plane so we can see it happen live on our internal dashboards, and automatically stop the upgrade if necessary.

hikari-keeper also handles upgrades to the dataplane binary (hikari) - once it receives an upgrade signal from the control plane, it sets up a socket for the old binary to hand over file descriptors to the new binary, so that connections stay active during a (rare) upgrade to the hikari binary itself.

Once a node is up and ready, hikari-keeper constantly healthchecks the local hikari dataplane service - and as long as it’s in an expected state, keeper will install the anycast route on the loopback lo interface, which gets picked up by our local BGP daemon - BIRD, and propagated to the internet.

To be clear, most of these systems I’m describing are solid reconciliation loop state machines that are very easy to audit and understand, both for us and our agents. This means that even if a message is lost between a system or process, the reconciliation loop picks it up on the next run and aligns the system state with the expected state. (We are an orchestration platform after all).

All of our BGP configuration, including communities for traffic engineering, is dynamically configurable from our internal tooling. It all propagates when hikari-keeper reconciles with edge-cp, every 5 seconds.

Ah, there’s another fun hint in that screenshot about the reality of operating a large anycast network, too: Operating anycast at global scale is hard.

BGP was never really designed for anycast, it’s a side effect of the protocol. BGP does not understand geography or latency. It sees the internet as a graph of networks (”autonomous systems”). The “best network path” is whatever can be reached in less autonomous systems. But, a lot of these networks and autonomous systems span the entire globe, like our own network AS400940 and larger transit providers like NTT.

That distinction matters. As a CDN operator, you want user traffic to terminate at the closest CDN node to them by latency, which is usually the geographically closest node, thanks to the speed of light (modern switches can forward packets in nanoseconds).

But, an end user’s ISP may peer with a global transit network, and from the BGP perspective, that transit network is only one ASN (autonomous system) away from our CDN prefix (e.g. 69.46.46.0/24) - but we may only peer with that large transit network in only one location on the other side the globe. This means the user’s ISP takes the traffic on a world tour before it reaches our network! Not ideal.

Luckily, there is a slight knob that BGP exposes to operators called Communities. This allows you to “tag” IP announcements with behaviors such as “Do not export this route outside of Japan”. We’ve built automatic support for BGP community tagging right into Hikari, so we can optimize our network for both major and smaller ISPs in realtime. There are some outlier ISPs that make it very hard for us to engineer traffic - for example Liberty Global, a major European transit network, exposes no communities, likely because they want content networks like us to pay for transit with them.

💡
You can see how internet networks see each other using tools called Looking Glasses. The bgp.tools looking glass lets you query a lot of internet routers at once, check it out!

As hinted earlier in the post, we chose to take a bet on WASI and wasmtime - an extremely fast server runtime for WASM bytecode, which (importantly) is extremely cheap to invoke - so cheap that we do it on every request. It’s a battle tested technology, used by products like Fastly Compute in production.

If you're not familiar with WASM/WASI, to summarize briefly: WASM is a portable bytecode format - you can compile a program in almost any language to it, and run it inside another process. It's sandboxed by default; the guest can't see the host's memory, files, or network, only a set of imports and functions you explicitly hand it.

WASI is the standardized version of those imports for server-side use. The property we cared about most: a single host process can have many versions of a guest loaded at the same time, and pick which one to call per-request. That maps cleanly to what we wanted - hikari (the long-lived dataplane) terminates TLS and holds connections open forever; hikari-guest (the per-request brain) is a versioned blob we can swap underneath it without disturbing a single byte of in-flight traffic.

We can make updates to the hikari-guest Rust binary and have it live on the edge in seconds - which gives us the ability to write WAF rules, DDoS mitigations and request/response behavior dynamically - we expose well defined callbacks which hand actual IO work back up to the host.

To be honest, we'll probably write more in-depth blog posts about how we operate WASI at scale - but we were surprised by how performant wasmtime was straight out of the box. To cold instantiate a WASM guest VM, it's about ~10μs - cheap, but at our request rate we'd rather not pay it on the hot path, so we pool warm guest instances and hand them out per-request.

The instinct here is to reach for a shared concurrent pool - one big pile of warm guests sitting behind a lock-free queue (crossbeam::ArrayQueue, flume, whatever), and every worker dips into it. That works, but it means an atomic op on every acquire and release, and worse, the pool's cache lines bounce between CPU cores as workers fight over them. At a few hundred thousand requests per second per node, that adds up.

We sidestep all of it with a thread-per-core model: Each Hikari worker is a single-threaded tokio runtime pinned to one CPU core, with no work-stealing between them. A task spawned on that worker stays on that core for its entire lifetime - including across the .await inside the guest call. Which means each worker can keep its pool entirely thread-local: a HashMap<VersionId, VecDeque<GuestInstance>> sitting in a thread_local!, one queue per live guest version (we may be running several at once mid-rollout). Acquire and release are a pop_front and a push_back on a deque only one thread ever sees.

The 10μs cold-instantiate cost only shows up when a worker's local queue for the version it's about to call happens to be empty. The kernel pins each connection to one worker (SO_REUSEPORT + NIC RSS), each version's queue is capped at a PER_WORKER_POOL_CAP, and the pool warms up in the first few seconds after a deploy - so in steady state, it's pop/push from then on.

You may be asking then, after all this work… how can I use it?

It’s been live for everyone since 7 days ago from writing this. It serves now 100 percent of all traffic. This plus the private networking visualizations means I think Railway has one of the best networking experiences out there.

That said, ~40% of the PoPs are still in transit, so the map that you saw will expand over the next few weeks. While we add more sites, our tiered caching and BGP traffic engineering get tuned basically every day, and the big one on the roadmap is domains and the CDN
becoming a single thing in the product.

We're also nowhere near done optimizing.

If you're seeing latency that feels off, or there's a city or region where you wish we had a PoP closer to your users, tell us. New PoP suggestions, weird edge cases, optimizations we haven't thought of yet, we're all ears.


The CDN is live now. Read the docs here [secret debugging endpoint]