
Incident Report: September 22nd, 2025
We recently experienced an outage that impacted our dashboard and deployment pipeline.
When a Major Outage occurs, it is Railway’s policy to share the public details of what happened.
You may have experienced an infinite loading state when trying to load your projects’ canvas, delayed deployments, or a “Limited Access” message when initiating new deployments.
All running deployments and platform-level networking features remained online throughout this period. Users who did not access the dashboard or initiate deployments during the incident window experienced no service disruption.
Some users may have encountered “Invalid database URL” errors or missing Private Networking during this period. These issues have since been fixed.
- 11:05 UTC - Initial reports received of Railway project canvas failing to load for users
- 11:14 UTC - Incident declared as Partial Outage affecting dashboard functionality; investigation commenced
- 11:35 UTC - Root cause identified and remediated; partial service restoration confirmed
- 11:45 UTC - Full service restoration achieved; incident marked as resolved
- 11:48 UTC - Control Plane exhibited abnormal resource consumption and elevated API latency; new incident initiated. Deployment capabilities suspended for non-Pro tier users (Free, Trial, Hobby), who encountered "Limited Access" notifications when attempting to deploy
- 12:14 UTC - Correlation established between dashboard and Control Plane issues; original incident reopened
- 12:25 UTC - Control Plane degradation intensified; all deployments suspended across all tiers (including Pro)
- 12:44 UTC - Deployment functionality restored for Pro users
- 12:50 UTC - Deployment functionality restored for non-Pro users
- 13:05 UTC - Rolled back deployment functionality for non-Pro users due to persistent Control Plane instability; Pro tier deployments remained operational
- 14:01 UTC - Full service recovery confirmed; deployment capabilities restored for all user tiers
- 21:27 UTC - Retroactively fixed a rare edge case where private networks and their endpoints were no longer visible on the dashboard
For further reference, please refer to the incident’s live updates on our Status Page: “Dashboard may fail to load for some users” and “Control plane instability”.
The Railway dashboard and project canvas rely on our API to fetch data. During a code cleanup, we made a database schema change to remove an unused column. However, a subsequent code change unintentionally reverted part of this cleanup, leading to failures in a core API that our frontend relies on.
Because the project canvas relies on this API, users started seeing an infinite loading state that hid the underlying error. We declared an incident and deployed a fix within 15 minutes, successfully resolving the immediate issue.
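As a simplified illustration of this failure mode (the table, column, and query below are hypothetical, not Railway’s actual schema or code), a backend query that still references a dropped column fails at the database layer, and a frontend that swallows that error shows only a loading spinner:

```ts
// Hypothetical illustration only: the table, column, and query are not Railway's actual schema or code.
import { Client } from "pg";

export async function getProjectCanvas(projectId: string) {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  try {
    // A migration dropped "legacy_layout", but a later code change
    // unintentionally restored a query path that still selects it.
    const { rows } = await db.query(
      "SELECT id, name, legacy_layout FROM projects WHERE id = $1",
      [projectId]
    ); // fails with: column "legacy_layout" does not exist
    return rows[0];
  } finally {
    await db.end();
  }
}
```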
However, this failure cascaded globally across our backend infrastructure, resulting in:
- Deployment failures across all interfaces (Dashboard, GitHub Webhooks, CLI)
- Further dashboard loading failures
After the resolution of the initial incident, a substantial backlog of queued deployments was triggered to start simultaneously. This surge overwhelmed our Control Plane, manifesting as elevated database errors on our end.
We traced the root cause of this down to our Control Plane. Because the dashboard depends on Control Plane APIs for critical data, such as service domains and private network endpoints, it failed to load properly during the outage. We started a second incident, re-opened the original incident, and began remediation efforts immediately.
To manage the increased deployment backlog, we implemented a phased response:
- Initially suspended deployments for non-Pro tier users. This allowed all Pro users to deploy normally and helped us drain the deployment queue so it could gradually recover (a minimal sketch of this tier-based gating follows this list).
- Temporarily suspended all deployments during peak instability. This suspended new deployments globally (including for Pro users) for up to 20 minutes to help us drain the queue further.
- Significantly scaled up our Control Plane's database resources to handle increased load. This included increasing CPU and memory allocations, optimizing connection limits, and expanding the database's IOPS capacity to ensure it could handle the anticipated traffic surge once deployments resumed.
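The first two steps amount to a tier-based deployment gate. The sketch below shows the general idea; the plan names and flag store are illustrative assumptions, not Railway’s actual implementation:

```ts
// Hypothetical sketch of a tier-based deployment gate.
// Plan names and the flag source are illustrative assumptions, not Railway's actual implementation.
type Plan = "free" | "trial" | "hobby" | "pro";

interface DeploymentGate {
  pausedPlans: Set<Plan>; // e.g. loaded from a feature-flag or config store
}

function canDeploy(plan: Plan, gate: DeploymentGate): { allowed: boolean; reason?: string } {
  if (gate.pausedPlans.has(plan)) {
    // Surfaced to users as the "Limited Access" message.
    return { allowed: false, reason: "Deployments are temporarily limited during an incident." };
  }
  return { allowed: true };
}

// Phase 1: suspend non-Pro tiers only, so Pro deployments continue while the queue drains.
const phase1: DeploymentGate = { pausedPlans: new Set<Plan>(["free", "trial", "hobby"]) };
// Phase 2: briefly suspend all tiers during peak instability.
const phase2: DeploymentGate = { pausedPlans: new Set<Plan>(["free", "trial", "hobby", "pro"]) };

console.log(canDeploy("hobby", phase1)); // { allowed: false, reason: ... }
console.log(canDeploy("pro", phase1));   // { allowed: true }
console.log(canDeploy("pro", phase2));   // { allowed: false, reason: ... }
```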
Despite these database optimizations, performance issues persisted. Investigation revealed that the root cause was our connection pooling layer (PgBouncer), which was consuming far more CPU and memory than nominal:

Control Plane’s API Server CPU and Memory spikes during the incident
Further investigation revealed the issue stemmed from a PgBouncer version upgrade. Our Control Plane database relies on PgBouncer as a connection pooler, a common choice for managing Postgres connections at scale.
We had been using an older Bitnami image, and following Bitnami's deprecation of their public images, we migrated to our internal mirrored repository. This migration advanced us several PgBouncer versions ahead.
Under load, the upgraded version exhibited abnormally high CPU and memory consumption, manifesting as increased latency and database connection timeouts. We identified the likely cause as a breaking default configuration change in PgBouncer.
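One way to surface a change like this is to compare the effective settings each PgBouncer version actually runs with. The sketch below (the connection details are placeholders) dumps them via PgBouncer’s admin console so the old and new versions can be diffed:

```ts
// Sketch: dump PgBouncer's effective settings so two versions can be diffed.
// The connection string is a placeholder; SHOW CONFIG is a PgBouncer admin-console command.
import { Client } from "pg";

export async function dumpPgbouncerConfig(adminUrl: string): Promise<Record<string, string>> {
  // adminUrl points at PgBouncer's special "pgbouncer" admin database, e.g.
  // postgres://admin:secret@pgbouncer-host:6432/pgbouncer
  const admin = new Client({ connectionString: adminUrl });
  await admin.connect();
  try {
    const { rows } = await admin.query("SHOW CONFIG"); // one row per setting (key, value, ...)
    return Object.fromEntries(rows.map((row) => [row.key, row.value]));
  } finally {
    await admin.end();
  }
}

// Capturing this output on the old and new versions and diffing it highlights changed defaults.
```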
Upon discovering this, we immediately rolled back to our previously stable PgBouncer version rather than adjusting the configuration, in order to prioritize stability and eliminate any risk of additional version-related issues. Following the rollback, we saw a 100x performance improvement, with all metrics returning to nominal values, and we re-enabled deployments for all non-Pro users.
At the start of Q3, we started a project to re-engineer parts of our Control Plane to be more resilient. This includes moving large parts of its storage layer away from Postgres. This work is intended to defend against single-region Control Plane failures impacting the platform, and the upcoming system will allow for several orders of magnitude higher IOPS compared to our current Postgres implementation.
In the meantime, we will be making these changes to our development practices to prevent any disruption(s) of this class from happening again:
- Enhanced CI pipeline validation to prevent disruptive schema changes from reaching production, and automated schema compatibility checks before backend deployments. We believe that if the breaking schema change had not reached production, this cascading failure scenario would not have occurred
- Monitoring and alerting for PgBouncer node performance metrics (CPU, memory, connection pool utilization). Had we known about this performance regression earlier, we could have recovered from the incidents much faster (a minimal polling sketch follows this list)
- Phased rollout strategy for PgBouncer upgrades. By incrementally updating PgBouncer instances rather than performing a global upgrade, combined with the enhanced monitoring above, we would have identified the version-specific performance issues before full deployment.
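As a rough sketch of the monitoring item above (the threshold, admin URL, and alerting hook are illustrative assumptions, not our actual tooling), a small poller could watch PgBouncer’s pool metrics and flag saturation:

```ts
// Sketch: poll PgBouncer pool metrics via its admin console and flag saturation.
// The threshold, admin URL, and alert hook are illustrative assumptions.
import { Client } from "pg";

const CLIENT_WAIT_THRESHOLD = 50; // queued clients per pool before we alert (assumed value)
const adminUrl =
  process.env.PGBOUNCER_ADMIN_URL ?? "postgres://admin:secret@pgbouncer-host:6432/pgbouncer";

async function checkPgbouncerPools(): Promise<void> {
  const admin = new Client({ connectionString: adminUrl });
  await admin.connect();
  try {
    // SHOW POOLS reports, per database/user pool, active and waiting clients and servers.
    const { rows } = await admin.query("SHOW POOLS");
    for (const pool of rows) {
      const waiting = Number(pool.cl_waiting);
      if (waiting > CLIENT_WAIT_THRESHOLD) {
        // Replace with a real pager or alerting integration.
        console.error(`pgbouncer pool ${pool.database}/${pool.user}: ${waiting} clients waiting`);
      }
    }
  } finally {
    await admin.end();
  }
}

// Poll every 30 seconds.
setInterval(() => checkPgbouncerPools().catch(console.error), 30_000);
```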
Railway is committed to providing a best-in-class cloud experience, and any downtime is unacceptable to us. We apologize for the inconvenience this caused, and we will work towards eliminating the entire class of issues that contributed to this incident.