Ray Chen

Incident Report: April 23rd, 2025

We recently experienced an outage that affected our networking control plane. During this outage, public and private networking on Railway was unavailable for a portion of users.

When a Major Outage occurs, it is Railway’s policy to share the public details of what happened.

This outage impacted our Edge Network, Private Network, and TCP Proxy across all regions.

If you were visiting a Railway-hosted domain, or if your application was communicating over our private network, your connections may have been highly unstable.

We were also unable to provision new domains and SSL certificates during this period. If you tried adding a new domain to your Railway service, you would have encountered an error.

Our Asia region was the most heavily impacted. All other regions saw significantly lower impact, with network connectivity only intermittently unavailable for a small subset of users. We also observed that newer deployments had the highest incidence rate.

Incident on our Status Page

  • 09:10 UTC: Our on-call engineers were paged for network monitoring failures, and started investigating immediately
  • 09:15 UTC: We began seeing partial recovery in user reports, and continued investigating the root cause
  • 09:28 UTC: We started seeing full recovery across the platform
  • 10:36 UTC: Incident resolved

At 09:10 UTC, we started noticing severe networking degradation on our Public TCP/HTTP monitors.

At 09:11 UTC, we saw the same failures for our Private Network monitors.

Our monitors actively probe various networking endpoints to verify the health of public and private networking on Railway. When a probe fails, the on-call engineer is paged immediately.
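As a rough illustration (not our actual monitoring code), a probe loop along these lines could drive this behavior; the endpoint URLs, interval, and the paging hook here are hypothetical:

```go
// probe.go: a minimal, hypothetical sketch of an active network health probe.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Illustrative endpoints only; the real monitors cover public and private networking.
var endpoints = []string{
	"https://edge-check.example.com/health",
	"http://private-check.example.internal:8080/health",
}

// probe issues a single HTTP check against one endpoint.
func probe(url string) error {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	for range time.Tick(30 * time.Second) {
		for _, url := range endpoints {
			if err := probe(url); err != nil {
				// In production, a failed probe pages the on-call engineer.
				fmt.Printf("probe failed for %s: %v\n", url, err)
			}
		}
	}
}
```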

The root cause of this outage was traced to our networking control plane. The control plane is responsible for provisioning and maintaining all user networking resources on Railway. For example:

  • When you create a domain on Railway, the control plane provisions SSL certificates and informs our platform about its existence;
  • When you issue a new deploy on Railway, the control plane is responsible for updating its public and private routing information across our fleet;
  • When you visit a Railway domain, the control plane is consulted for its routing information so we know where to send traffic;
  • etc.

This control plane uses Postgres as its persistent data store. We use it to persist traffic routing information (among other types of critical data) that tells us where within our fleet to send traffic. This routing information is also cached in-memory.
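A simplified sketch of that lookup path might look like the following; the RouteStore type and the table and column names are illustrative assumptions, not our actual schema:

```go
// routing.go: hypothetical sketch of an in-memory cache over Postgres-backed routes.
package routing

import (
	"database/sql"
	"sync"
)

// RouteStore caches routing information in memory, backed by Postgres.
type RouteStore struct {
	db    *sql.DB
	mu    sync.RWMutex
	cache map[string]string // domain -> upstream address within the fleet
}

// Lookup returns the upstream for a domain, consulting the in-memory cache
// first and falling back to the persistent store on a miss.
func (s *RouteStore) Lookup(domain string) (string, error) {
	s.mu.RLock()
	upstream, ok := s.cache[domain]
	s.mu.RUnlock()
	if ok {
		return upstream, nil
	}

	// Cache miss: read from Postgres and populate the cache.
	err := s.db.QueryRow(
		"SELECT upstream FROM routes WHERE domain = $1", domain,
	).Scan(&upstream)
	if err != nil {
		return "", err
	}

	s.mu.Lock()
	s.cache[domain] = upstream
	s.mu.Unlock()
	return upstream, nil
}
```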

The Postgres service lives on Google Cloud SQL and is configured to be highly available with a read replica.

At 09:07 UTC, Google Cloud SQL started maintenance on this Postgres service, including the read replica, causing all connections to be terminated.

At 09:08 UTC, we started seeing database-related errors across all instances of our network control plane, indicating that the read replica had to sync with the primary instance.

At 09:17 UTC, the maintenance initiated by Google Cloud SQL completed. It lasted 10 minutes, during which our network control plane’s read-only database replica was unreachable.

Due to improper error handling in its code, some instances of our control plane crashed when the database went offline. While they were configured to restart automatically, each restart evicted all routing information from memory, and it had to be re-fetched from Postgres, which was still offline.
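Continuing the hypothetical RouteStore sketch above, the graceful behavior we expected looks roughly like this: a failed refresh is logged and retried while the instance keeps serving its last-known routing data, rather than crashing and losing its in-memory state:

```go
// refreshLoop periodically refreshes the in-memory routing cache. A failed
// refresh must never take the process down; the instance keeps routing traffic
// from stale (but usable) data until the database is reachable again.
func (s *RouteStore) refreshLoop() {
	for range time.Tick(time.Minute) {
		rows, err := s.db.Query("SELECT domain, upstream FROM routes")
		if err != nil {
			// Degrade gracefully: keep the existing cache and retry later.
			log.Printf("routing refresh failed, serving stale cache: %v", err)
			continue
		}

		fresh := make(map[string]string)
		for rows.Next() {
			var domain, upstream string
			if err := rows.Scan(&domain, &upstream); err != nil {
				log.Printf("scan failed: %v", err)
				continue
			}
			fresh[domain] = upstream
		}
		rows.Close()

		s.mu.Lock()
		s.cache = fresh
		s.mu.Unlock()
	}
}
```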

At around 09:15 UTC, we started seeing partial recovery in user reports, along with blips in our monitors suggesting the issue was intermittent. The healthy instances of our control plane continued serving traffic normally, so our networking was in a degraded state where a subset of user traffic remained routable while the rest was not.

After Postgres was back online, our control plane began rebooting successfully and resuming normal operations. The boot process took longer than expected because post-boot operations (warming its local cache, etc.) delayed it from being able to serve traffic.

At 09:28 UTC, we started seeing full recovery after the control plane and database became fully healthy and operational.

There are multiple parts of our control plane that should have degraded gracefully instead of failing outright.

We have prioritized fixing this as an important project for Q2 2025. As part of this work, we’re going to move from a single global control plane to regional control planes, which will contain the blast radius of control plane failures and allow us to mitigate any single-region failure by failing over route-information lookups to other regions.

On top of improving the architecture of our control plane, we’re going to:

  • Fix the bug that causes the control plane to crash when it encounters a database connection error
  • Introduce a layer of persistent cache on routing information as a secondary fallback (see the sketch after this list)
  • Lower the start-up time of the control plane so we can recover from similar failures faster
  • Revisit our Google Cloud SQL Postgres configuration to ensure its primary and replica aren’t under maintenance at the same time, and to notify us of upcoming maintenance so we can closely monitor its progress and impact
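As a sketch of the persistent-cache fallback idea, again building on the hypothetical RouteStore above: the routing table could be periodically snapshotted to local disk so a restarted instance can serve traffic even if Postgres is unreachable. The file path and JSON format here are assumptions, not a finalized design:

```go
const snapshotPath = "/var/cache/control-plane/routes.json" // hypothetical path

// SaveSnapshot writes the current in-memory routing table to disk.
func (s *RouteStore) SaveSnapshot() error {
	s.mu.RLock()
	data, err := json.Marshal(s.cache)
	s.mu.RUnlock()
	if err != nil {
		return err
	}
	return os.WriteFile(snapshotPath, data, 0o600)
}

// LoadSnapshot restores the last snapshot on boot as a secondary fallback; the
// cache is refreshed from Postgres as soon as the database is reachable again.
func (s *RouteStore) LoadSnapshot() error {
	data, err := os.ReadFile(snapshotPath)
	if err != nil {
		return err
	}
	var cache map[string]string
	if err := json.Unmarshal(data, &cache); err != nil {
		return err
	}
	s.mu.Lock()
	s.cache = cache
	s.mu.Unlock()
	return nil
}
```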

Railway is committed to providing a best-in-class cloud experience, and any downtime is unacceptable to us. We apologize for the inconvenience this caused, and we are going to work towards eliminating the entire class of issues that contributed to this incident.