Incident Report: July 2, 2026 — US East Services Outage

Railway experienced a Major Outage concentrated in one of our US East availability zones on July 2, 2026.

A network degradation in one of the ISPs connecting our datacenters to the rest of the internet caused elevated latency and packet loss for traffic between our US regions. While rerouting traffic away from the degraded ISP, a change at one of our US East availability zones briefly left the site without a stable route to the internet.

That routing instability exposed hidden bugs that silently pushed storage traffic onto a slow internal management network and left some private networking tunnels pointing at the wrong address, degrading disk performance and private networking in US East for roughly two hours.

Impact

On July 2, 2026 between roughly 07:44 UTC and 12:01 UTC, users may have experienced increased response times and intermittent connectivity issues on traffic between US regions, including private networking. Some workloads in one of our US East availability zones additionally saw degraded disk performance and disrupted private networking for roughly two hours.

Incident Timeline

All times are UTC on July 2, 2026.

07:44 — We started observing packet loss in our US East region affecting user traffic. A public incident was declared on our status page. We traced the packet loss to one of our upstream network carriers
07:44~08:32 — We disconnected from the degraded network carrier at all US border routers. Traffic was successfully rerouted through other carriers. Conditions improved across most US paths, but latency and packet loss into US East had not fully recovered
08:39 — Paths through a secondary network carrier at the affected US East zone were still showing packet loss, as it was handing traffic back through the degraded carrier on the return leg. We disconnected from the secondary carrier there as well. Unknown to us at the time, that was the only carrier still supplying that site's default route (the catch-all path a network uses to reach the internet). The degraded primary carrier had already been disconnected, and the connection to our only remaining carrier there does not supply a default route. This left the zone without a stable route to the internet for roughly 20 minutes. The disruption was at its most severe during this window: traffic into and out of US East saw failed connections and heavily degraded private networking
08:59 — We reconnected our secondary carrier. Routing stabilized, US East connectivity began recovering, and the storage cluster returned to a coordinated, healthy state
09:00~10:45 — Storage performance in the zone remained degraded despite routing looking healthy. Throughput stayed pinned at roughly a third of capacity
10:45 — Root cause of degraded storage performance identified. A significant number of storage connections had been established over a slow internal management network during the routing instability and remained stuck there after recovery
10:45~11:00 — We terminated the stuck connections across all storage and compute hosts in the zone. They reconnected over the correct network within seconds
11:04 — I/O wait across the zone returned to baseline (58% → under 5%). Storage throughput surged as the cluster caught up on backlogged writes, then settled at normal levels. At the same time, we identified that private networking tunnels had latched onto an incorrect address during the routing instability and never corrected themselves
11:49 — A fleet-wide restart of the mesh networking agents in the zone forced all tunnels to re-establish with correct addresses. Private networking fully recovered and full connectivity was restored across US regions; we continued monitoring
12:01 — With latency, packet loss, private networking, and volume performance stable at normal levels, the incident was marked resolved

The full incident is available on our Status Page.

What Happened?

A few things went wrong that led to unintended cascading effects across our systems. The common thread across the failure cases was stale connections that captured a bad path during a brief window of instability and held onto it after the network recovered, because nothing in the system re-asserted the correct state.

1) Upstream ISP Degradation (US Regions)

Datacenters connect to the internet by buying connectivity from transit providers: carriers that operate long-haul fiber and agree to deliver your traffic to any destination in the world. The largest of these are called Tier 1 ISPs. Railway connects every Metal datacenter to at least three of them, so that any single carrier can fail without taking us offline.

On July 2, one of our carriers suffered a network degradation somewhere in its US backbone. Traffic it normally carried on other paths spilled onto the route that carried our traffic between US West and US East, saturating that route and causing higher latency and packet loss.

Our own routers showed no errors and healthy connections to every carrier, which told us the problem was upstream. Internal probes caught the degradation nearly two hours before it became visible to user traffic, giving us time to trace the lossy paths, all of which ran through the degraded network provider, while retries and redundant routing absorbed the loss.

When the loss began reaching user traffic, we declared a public incident and disconnected from the degraded provider at all US borders. Traffic rerouted through other providers and packet loss returned to baseline on most US paths.

2) Storage Performance Degradation (US East)

At 08:39, we also disconnected from a secondary carrier at the affected US East zone. Paths through the secondary carrier were still showing packet loss, because it was handing traffic back through the primary carrier’s degraded network on the return leg, and disconnecting a carrier on our side does not control the route traffic takes coming back to us.

In hindsight, this change should not have been made without first verifying the behavior of default routes on our core switches. The zone is one of our first-generation sites, and unlike our newer datacenters, it gets its default route (the catch-all path to the internet) from its carriers rather than generating one itself.

The primary network carrier was already disconnected at that site, and the only remaining carrier there does not supply a default route. Therefore, disconnecting the secondary carrier removed the last one. For the roughly 20 minutes until we reconnected, the site had no stable route to the internet. This window was the most severe part of the incident for users: traffic into and out of US East saw failed connections and degraded private networking until routing stabilized at 08:59.
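The missing pre-change check can be illustrated with a short sketch. It is not Railway's actual tooling: the `CarrierSession` model, the carrier names, and the assumption that the primary carrier supplied a default route before it was disconnected are all hypothetical, chosen to mirror the state of the zone described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarrierSession:
    """One upstream carrier connection at a site (hypothetical model)."""
    name: str
    connected: bool
    supplies_default_route: bool


def default_route_sources(sessions):
    """Names of connected carriers that currently supply a default route."""
    return [s.name for s in sessions if s.connected and s.supplies_default_route]


def safe_to_disconnect(sessions, carrier_name, site_generates_default=False):
    """True if the site still has a default route after dropping carrier_name."""
    if site_generates_default:
        # Newer-style site: the default route does not depend on any carrier.
        return True
    remaining = [s for s in sessions if s.name != carrier_name]
    return len(default_route_sources(remaining)) > 0


# Assumed state of the affected zone just before 08:39.
zone = [
    CarrierSession("primary", connected=False, supplies_default_route=True),
    CarrierSession("secondary", connected=True, supplies_default_route=True),
    CarrierSession("third", connected=True, supplies_default_route=False),
]

print(safe_to_disconnect(zone, "secondary"))  # last default route would be lost
print(safe_to_disconnect(zone, "secondary", site_generates_default=True))
```

In this model, a site that generates its own default route passes the check regardless of which carrier is removed, which is the property the newer datacenters have.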

That instability exposed a hidden bug in how our servers behave when their primary network path disappears. Each server has two networks: a high-bandwidth fabric that carries production traffic, and a slow management network used for administrative access.

When the fabric's default route vanished, the servers' operating systems fell back to the only route left: the one on the management network. A default Linux behavior then allowed servers to answer for their storage addresses on that network too, so storage traffic began flowing over a path with a small fraction of the fabric's capacity.
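The fallback can be illustrated with a simplified model of route selection: longest matching prefix first, then lowest metric. This is a sketch under assumptions, not a description of the affected hosts. The addresses, metrics, and interface names are hypothetical, and real Linux route selection involves more than this.

```python
import ipaddress
from typing import NamedTuple, Optional


class Route(NamedTuple):
    prefix: ipaddress.IPv4Network
    interface: str
    metric: int


def lookup(table, destination) -> Optional[Route]:
    """Pick the route for destination: longest prefix, then lowest metric."""
    dst = ipaddress.ip_address(destination)
    candidates = [r for r in table if dst in r.prefix]
    if not candidates:
        return None  # no route at all: the connection attempt fails immediately
    return max(candidates, key=lambda r: (r.prefix.prefixlen, -r.metric))


# Hypothetical routing table for one server.
FABRIC_LOCAL = Route(ipaddress.ip_network("10.10.0.0/24"), "fabric", 0)
MGMT_LOCAL = Route(ipaddress.ip_network("192.168.0.0/24"), "mgmt", 0)
FABRIC_DEFAULT = Route(ipaddress.ip_network("0.0.0.0/0"), "fabric", 100)
MGMT_DEFAULT = Route(ipaddress.ip_network("0.0.0.0/0"), "mgmt", 600)

healthy = [FABRIC_LOCAL, MGMT_LOCAL, FABRIC_DEFAULT, MGMT_DEFAULT]
# The fabric default route vanishes; the management default is all that is left.
degraded = [r for r in healthy if r is not FABRIC_DEFAULT]
# Alternative design: the management network never carries a default route.
fail_closed = [FABRIC_LOCAL, MGMT_LOCAL]

STORAGE_PEER = "10.20.0.5"  # hypothetical storage address reached via the default route
print(lookup(healthy, STORAGE_PEER).interface)
print(lookup(degraded, STORAGE_PEER).interface)
print(lookup(fail_closed, STORAGE_PEER))
```

The `fail_closed` table shows the alternative behavior: with no fallback default route, the lookup returns nothing, so the failure is immediate and visible rather than slow and silent.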

Network connections don't re-check their route once established; they keep using the path they started on until they close. So the storage connections created during that 20-minute window stayed stuck on the slow management network even after the fabric was fully restored. The routing tables looked correct, and the storage cluster reported healthy, but storage throughput was capped at roughly a third of normal, and two thirds of servers in the zone sat waiting on disk while the cluster tried to push its backlog through.

Once we found the connections coming from management-network addresses, we terminated those connections across every storage and compute server in the zone. They reconnected over the correct fabric within seconds, and I/O wait dropped from 58% to under 5% in about 15 minutes.
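A check of this kind can be sketched in a few lines. The management subnet, the addresses, and the shape of the connection list are assumptions made for illustration. On a real host the list would come from the operating system's socket table, which is not shown here.

```python
import ipaddress

# Hypothetical management-network subnet.
MGMT_NET = ipaddress.ip_network("192.168.0.0/24")


def on_management_network(conn, mgmt_net=MGMT_NET):
    """True if either end of a (local_ip, remote_ip) pair is a management address."""
    local, remote = (ipaddress.ip_address(addr) for addr in conn)
    return local in mgmt_net or remote in mgmt_net


def stuck_connections(conns, mgmt_net=MGMT_NET):
    """Storage connections that were established over the management network."""
    return [c for c in conns if on_management_network(c, mgmt_net)]


# Hypothetical storage connections observed on one host after routing recovered.
observed = [
    ("10.10.0.7", "10.20.0.5"),      # fabric to fabric: healthy
    ("192.168.0.7", "10.20.0.5"),    # sourced from a management address: stuck
    ("10.10.0.7", "192.168.0.31"),   # peer reached on its management address: stuck
]

for local, remote in stuck_connections(observed):
    print(f"terminate {local} -> {remote}")
```

Because an established connection never re-checks its route, closing the flagged connections is what moves the traffic: the replacement connections are routed afresh and land on the fabric.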

3) Private Networking Degradation (US East)

Railway's private networking runs over encrypted tunnels between servers, and each tunnel learns its peer's address from the packets it receives.

During the routing disturbance, tunnel traffic was briefly funneled through a device that rewrites the source address of traffic passing through it. Thousands of tunnels learned that device's address as their peer's address, and kept it after routing was rolled back. The mesh only re-verifies peer addresses when its membership changes, and because these tunnels sit silent when idle, a broken one never sends the packet that would have fixed it.
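The latching behavior can be modeled with a minimal sketch. The `Tunnel` class, the addresses, and the way a restart recovers the correct address are hypothetical simplifications, not the mesh agent's actual implementation.

```python
class Tunnel:
    """Simplified tunnel that learns its peer's address from received packets."""

    def __init__(self, peer_id, configured_endpoint):
        self.peer_id = peer_id
        self.configured_endpoint = configured_endpoint
        self.endpoint = configured_endpoint  # where outbound packets are sent

    def receive(self, peer_id, source_addr):
        """Adopt the source address of any packet received from the peer."""
        if peer_id == self.peer_id:
            self.endpoint = source_addr

    def restart(self):
        """Re-establish the tunnel from its configured address (assumed behavior)."""
        self.endpoint = self.configured_endpoint


def rewrite_source(_original_source, device_addr):
    """Model a device that rewrites the source address of forwarded traffic."""
    return device_addr


PEER_ADDR = "10.10.0.2"    # hypothetical real address of the peer
DEVICE_ADDR = "10.99.0.1"  # hypothetical address of the rewriting device

tunnel = Tunnel("host-b", PEER_ADDR)

# During the instability, the peer's packet arrives with a rewritten source.
tunnel.receive("host-b", rewrite_source(PEER_ADDR, DEVICE_ADDR))
latched = tunnel.endpoint

# Routing recovers, but the tunnel is idle: no new packet arrives to correct
# the endpoint, so it keeps pointing at the device until the agent restarts.
tunnel.restart()
print(latched, "->", tunnel.endpoint)
```

Nothing in the model ever re-checks `endpoint` on its own, which is the gap described above: the only corrections are a fresh packet from the peer or a restart.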

At peak, roughly 20,000 host-to-host private network links were blackholed. This included inter-region services communicating with services in US East. Recovery required restarting the mesh networking agents across the fleet, forcing every tunnel to re-establish with the correct addresses.

Preventative Measures

We have already rolled out the following:

Disconnected the degraded network carrier at all US borders. Our fleet is currently operating on other Tier 1 carriers with full headroom. We will reconnect the degraded carrier once their backbone has recovered and we have verified path health
Cleared all stuck management-network storage connections across every storage and compute host in the affected zone
Restarted the mesh networking agents fleet-wide in the zone, re-establishing all private networking tunnels with correct addresses

We are additionally working on:

Migrating first-generation sites to self-generated default routes. Our newer datacenters generate their own default route at the border rather than depending on carriers to supply one. At those sites, losing any single carrier is a minor path change, not a site-wide routing event. The affected zone predates this design, which is why it was vulnerable today. Bringing all remaining first-generation sites onto this pattern is the biggest structural fix from this incident
Correcting host fallback behavior so that production traffic never takes the management network as a fallback path. Both networks are internal to our infrastructure and within the same trust boundary — traffic that crossed the management network stayed inside our own equipment, and private networking traffic remained encrypted end-to-end — so this was a path-selection and capacity problem. We will fix this by making a missing fabric route fail cleanly and immediately instead of silently taking a slower path, which will give us the ability to detect and resolve it in minutes
Alerting on management network utilization and blackholed private network links, so that either failure mode pages us immediately instead of surfacing as degraded performance
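As a rough illustration of the alerting item above, the sketch below derives link utilization from two byte-counter samples and pages when either signal crosses a threshold. The thresholds, the link speed, and the function names are assumptions, not our production alert rules.

```python
def link_utilization(bytes_before, bytes_after, interval_s, link_bps):
    """Fraction of link capacity used between two byte-counter samples."""
    bits_sent = (bytes_after - bytes_before) * 8
    return bits_sent / (interval_s * link_bps)


def should_page(mgmt_utilization, blackholed_links,
                util_threshold=0.5, link_threshold=0):
    """Page if the management network is busy or any private link is blackholed."""
    return mgmt_utilization >= util_threshold or blackholed_links > link_threshold


# Hypothetical samples: a 1 Gbps management link measured over 60 seconds.
utilization = link_utilization(
    bytes_before=0,
    bytes_after=3_750_000_000,
    interval_s=60,
    link_bps=1_000_000_000,
)
print(utilization, should_page(utilization, blackholed_links=0))
```

A management network normally carries only administrative traffic, so sustained utilization on it is a strong signal that production traffic has taken the wrong path.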

A carrier failure is a normal event on the internet, and our multi-carrier design handled it the way it should. The degradation that followed came from our side: an older site design that depended on carriers for its default route, a disconnection made without checking what routing would remain, and systems that captured a bad path during the instability and held onto it silently after the network recovered.

We apologize for this outage and are actively working to prevent similar issues from happening again. Each of the fixes above targets one link in that chain of failures, so that the next carrier failure ends where this one should have: with traffic quietly taking a different road.