Noah Dunnagan

Incident Report: December 8th, 2025

We recently experienced an outage that impacted our backend and downstream systems.

When a Major Outage occurs, it is Railway’s policy to publicly share the details of what happened.

This incident impacted our backend API, disrupting dashboard access, CLI operations, GitHub-based deployment processing, login, and API functionality.

All running deployments remained online throughout this period. Users who didn't interact with the dashboard or the aforementioned systems during the incident window experienced no disruption.

On December 8th, 2025:

  • 15:03 UTC - Engineers merged a change to our database schema involving a data migration
  • 15:17 UTC - Database migration began applying
  • 15:18 UTC - Lock conflicts in the database caused backend replicas to become unhealthy and fail to serve traffic
  • 15:26 UTC - Data migration finished and all backend replicas recovered

For further details, see the incident’s live updates on our Status Page.

Railway operations depend on a backend connected to a shared PostgreSQL database, with PgBouncer handling connection pooling. Multiple replicas serve the backend, each relying on periodic health checks to stay in rotation. If a replica can't obtain a database connection, it's marked unhealthy and stops receiving traffic.
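For illustration, a health check of this shape might look like the sketch below, assuming an Express-style Node.js service and the node-postgres (`pg`) pool. This is not our actual implementation; the endpoint, port, and timeout are hypothetical.

```ts
import express from "express";
import { Pool } from "pg";

// Connection pool; connectionTimeoutMillis bounds how long we wait for a
// free connection before treating the replica as unhealthy.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  connectionTimeoutMillis: 2000,
});

const app = express();

// Health endpoint: the replica reports healthy only if it can check out a
// database connection and run a trivial query. If PgBouncer/Postgres has no
// free connections, this fails and the replica is pulled from rotation.
app.get("/health", async (_req, res) => {
  try {
    const client = await pool.connect();
    try {
      await client.query("SELECT 1");
      res.status(200).send("ok");
    } finally {
      client.release();
    }
  } catch {
    res.status(503).send("unhealthy");
  }
});

app.listen(3000);
```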

We deployed a migration adding a nullable column to a heavily used table with ~1 billion rows. A long-running query on this table held locks that the migration's ALTER TABLE needed, stalling the migration. While the ALTER TABLE waited for its exclusive lock, new queries against the table queued behind the pending lock request, and as connection attempts accumulated, PgBouncer exceeded the database's connection limit. Replicas failed their health checks and were removed from rotation.
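One common way to bound the blast radius of a blocked DDL statement is to set a `lock_timeout` and retry, so a waiting ALTER TABLE gives up quickly instead of queueing traffic behind it. Below is a minimal sketch using node-postgres with hypothetical table and column names; it is not our migration tooling.

```ts
import { Client } from "pg";

// Sketch of a schema change guarded by lock_timeout: if the ALTER TABLE
// cannot get its lock quickly, it errors out and is retried later instead of
// sitting in the lock queue and blocking every other query on the table.
// Table and column names here are hypothetical.
async function addNullableColumn(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    for (let attempt = 1; attempt <= 5; attempt++) {
      try {
        await client.query("BEGIN");
        // SET LOCAL limits the timeout to this transaction only.
        await client.query("SET LOCAL lock_timeout = '5s'");
        // Adding a nullable column without a default does not rewrite the
        // table, but it still needs a brief ACCESS EXCLUSIVE lock.
        await client.query(
          "ALTER TABLE deployments ADD COLUMN IF NOT EXISTS example_col text"
        );
        await client.query("COMMIT");
        return;
      } catch (err) {
        await client.query("ROLLBACK").catch(() => {});
        console.warn(`migration attempt ${attempt} timed out or failed`, err);
        // Back off before retrying so we are not hammering the lock queue.
        await new Promise((resolve) => setTimeout(resolve, 2_000 * attempt));
      }
    }
    throw new Error("could not acquire table lock after 5 attempts");
  } finally {
    await client.end();
  }
}

addNullableColumn(process.env.DATABASE_URL ?? "").catch(console.error);
```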

After the long-running query completed and released its locks, the migration applied immediately. Replicas were then able to reconnect, pass health checks, and return to rotation. Our backend stabilized and resumed accepting traffic.

  • We will be increasing the database's connection limit and reconfiguring PgBouncer to operate within it (see the sketch below)
  • The long-running query will be fixed so it no longer holds locks that can stall migrations
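As a rough illustration of the first item, a headroom check comparing active connections against Postgres's `max_connections` might look like the following sketch. The 80% alert threshold is hypothetical, and this is not our actual tooling.

```ts
import { Client } from "pg";

// Sketch of a connection-headroom check: compare active backends against
// Postgres's max_connections so PgBouncer's pool sizes can be tuned to fit.
async function checkConnectionHeadroom(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const maxResult = await client.query("SHOW max_connections");
    const usedResult = await client.query(
      "SELECT count(*)::int AS in_use FROM pg_stat_activity"
    );
    const maxConnections = Number(maxResult.rows[0].max_connections);
    const inUse = usedResult.rows[0].in_use as number;
    const utilization = inUse / maxConnections;
    console.log(
      `connections in use: ${inUse}/${maxConnections} (${(utilization * 100).toFixed(1)}%)`
    );
    if (utilization > 0.8) {
      // Hypothetical alert threshold: warn well before the server limit.
      console.warn("connection usage is above 80% of max_connections");
    }
  } finally {
    await client.end();
  }
}

checkConnectionHeadroom(process.env.DATABASE_URL ?? "").catch(console.error);
```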

We apologize for this outage and are actively working to prevent similar issues from happening again. We understand reliability is central to your workflow, and this disruption is unacceptable.