Ray Chen

Incident Report: October 28th, 2025

We recently experienced an outage that impacted our backend API.

When a Major Outage occurs, it is Railway’s policy to publicly share the details of what happened.

This outage rendered the dashboard inaccessible, caused CLI operations relying on our API to fail (e.g. railway up), delayed GitHub-based deployments due to webhook processing failures, and disrupted all Railway Public API operations.

All running deployments and platform-level features remained online throughout this period. Users who did not access the dashboard, use the CLI, or initiate a new GitHub-based deploy during the incident window experienced no service disruption.

On October 28th, 2025:

  • 18:23 UTC - A database change containing a schema modification was introduced
  • 18:34 UTC - Database change went live in production
  • 18:36 UTC - Internal monitoring triggered alerts, engineering team paged
  • 18:41 UTC - Incident declared
  • 18:56 UTC - Root cause identified
  • 19:00 UTC - Gradual service recovery confirmed
  • 19:15 UTC - Full service restoration confirmed

For further detail, see the incident’s live updates on our Status Page under Dashboard and API Service Disruption.

Railway’s client-facing interfaces (Dashboard, CLI, Public API, etc.) depend on our backend, which relies on Postgres as its primary datastore.

A routine change to this Postgres database added a new column, with an index, to a table containing approximately 1 billion records. This table is critical to our backend API and is used by nearly all API operations.

The index creation did not use Postgres’ CONCURRENTLY option, causing an exclusive lock on the entire table. During the lock period, all queries against the database were queued behind the index operation. Our API servers continued accepting requests, each attempting database connections that immediately blocked on the locked table.
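
To illustrate the difference, here is a minimal sketch of the two migration patterns, shown side by side for comparison and using hypothetical table and column names. The CONCURRENTLY form builds the index without blocking normal reads and writes; the trade-offs are a slower build and the fact that it cannot run inside a transaction block.

-- Hypothetical table and column names, for illustration only.
ALTER TABLE deployment_events ADD COLUMN region_id bigint;

-- The pattern that ran during the incident (simplified): a plain
-- CREATE INDEX locks the table for the entire build, and per the
-- timeline above, queries queued behind it for roughly 30 minutes.
CREATE INDEX deployment_events_region_id_idx
  ON deployment_events (region_id);

-- The safer alternative: CONCURRENTLY builds the index without
-- blocking normal reads and writes. It is slower and cannot run
-- inside a transaction block, so it must be issued standalone.
CREATE INDEX CONCURRENTLY deployment_events_region_id_idx
  ON deployment_events (region_id);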

This triggered a cascading failure: our PgBouncer pools were configured to allow more connections than the underlying Postgres limit, exhausting all available connection slots, including administrative ones. Manual attempts to terminate the index creation failed with:

FATAL: remaining connection slots are reserved for roles with privileges of the "pg_use_reserved_connections" role

With no ability to establish an administrative connection to Postgres, we could not execute commands to terminate the problematic index creation. While we were exploring other recovery options, the migration completed successfully after approximately 30 minutes, automatically releasing the table lock. This allowed all queued operations to process and connection pools to return to normal levels.
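
Had an administrative connection been available, stopping the build would have been straightforward, since cancelling a plain (non-concurrent) CREATE INDEX simply rolls it back. A rough sketch of that intervention, with an illustrative filter on the query text:

-- Find the backend running the index build (filter is illustrative).
SELECT pid, state, query
FROM pg_stat_activity
WHERE query ILIKE 'CREATE INDEX%';

-- Terminate it; the partially built index is rolled back.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE 'CREATE INDEX%';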

We’re going to introduce several changes to prevent errors of this class from happening again:

  • In CI, we will enforce CONCURRENTLY usage for all index creation operations, blocking non-compliant pull requests before merge.
  • PgBouncer connection pool limits will be adjusted so that pooled connections cannot exceed the underlying Postgres instance's connection capacity.
  • Database user connection limits will be configured to guarantee administrative access during incidents, ensuring maintenance operations remain possible under all conditions (see the sketch below this list).
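
As a rough sketch of the last item, with hypothetical role names: Postgres can cap connections per role and, on Postgres 16 and later, reserve slots for members of the pg_use_reserved_connections role named in the error above, so an operator can still open a session when application pools are saturated.

-- Role names and limits here are hypothetical.
-- Cap the application role so pooled connections cannot consume every
-- slot; the limit should sit safely below max_connections.
ALTER ROLE railway_api CONNECTION LIMIT 400;

-- On Postgres 16+, allow an operations role to use the reserved slots
-- governed by reserved_connections, the mechanism referenced in the
-- error message above.
GRANT pg_use_reserved_connections TO railway_ops;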

Railway is committed to providing a best-in-class cloud experience, and any downtime is unacceptable to us. We apologize for the disruption this caused, and we will work to eliminate the entire class of issues that contributed to this incident.