Incident Report: November 25th, 2025
We recently experienced an outage that impacted deployments and parts of the dashboard.
When a Major Outage occurs, it is Railway’s policy to share the public details of what happened.
This incident affected our task queue system. All deployments across Free, Trial, and Hobby were temporarily paused. Pro deployments continued, but with delays. Service-level actions, such as configuration changes, environment creation, and deployment removals, were also impacted.
All running deployments and platform-level features remained online throughout this period. Users who didn't interact with the dashboard or trigger a new deploy during the incident window experienced no disruption.
On November 25th, 2025:
- 22:47 UTC - Engineers were paged as deploy throughput dropped sharply
- 22:50 UTC - We observed elevated error rates on multiple systems dependent on the task queue
- 23:04 UTC - Free and Trial deployments temporarily disabled
- 23:10 UTC - Hobby deployments paused to alleviate back pressure and prevent additional load
- 23:16 UTC - Issue identified and mitigation put in place
- 00:06 UTC - Several fixes were pushed to reallocate resources in an attempt to stabilize the system
- 00:21 UTC - Previously queued deployments began to be picked up and processed
- 00:48 UTC - Queue processing resumed and caught up. All delayed deployments finished successfully
- 00:50 UTC - Hobby deployments re-enabled
- 01:08 UTC - Free and Trial deployments re-enabled
- 01:22 UTC - Incident resolved
For further reference, see the incident’s live updates on our Status Page under “Deploys and configuration changes delayed.”
Railway runs critical operations, such as deployments, configuration changes, and resource limit updates, through a task queue backed by Temporal. When you push code, update variables, or change service limits, a workflow is created in this queue to process that action.
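To make that flow concrete, here’s a minimal sketch of what enqueuing such an action could look like with Temporal’s TypeScript SDK. The workflow type (`deployService`), task queue name, and connection address are illustrative placeholders, not Railway’s actual identifiers.

```typescript
// Hypothetical sketch: enqueuing a deploy action as a Temporal workflow.
// The names used here are illustrative, not Railway's real ones.
import { Client, Connection } from '@temporalio/client';

async function enqueueDeploy(serviceId: string, commitSha: string) {
  // Connect to the Temporal cluster backing the task queue.
  const connection = await Connection.connect({ address: 'temporal.internal:7233' });
  const client = new Client({ connection });

  // Each user action becomes a workflow execution on a task queue;
  // workers poll that queue and run the workflow's activities.
  const handle = await client.workflow.start('deployService', {
    taskQueue: 'deployments',
    workflowId: `deploy-${serviceId}-${commitSha}`,
    args: [{ serviceId, commitSha }],
  });

  return handle; // handle.workflowId identifies the queued deployment
}
```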
Around 19:30 UTC, we observed GitHub API calls slowing to nearly 4x their usual p95 latency. This happened during peak deployment hours, and the slowdown cascaded into a growing backlog of tasks in the queue and delayed processing. We saw no signs that GitHub's API itself was unhealthy. Workers handling GitHub API calls began consuming elevated resources, eventually hitting Out-Of-Memory failures and crashing. As those workers went offline, new tasks were shifted to the remaining workers, increasing load and intensifying system pressure.
With this elevated pressure, the remaining workers ended up crashing and new tasks began piling up faster than workers could process them. As additional workers came online, they were immediately overloaded by the backlog and hit Out-Of-Memory failures as well. Free, Trial, and Hobby deployments were temporarily disabled to reduce strain on the task queue.
To relieve pressure on the system, worker CPU and memory were increased. We also adjusted worker parameters that had been causing workers to request more tasks than they could handle.
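For context, Temporal workers expose options that cap how much work they pull concurrently. The sketch below (TypeScript SDK, with made-up values and hypothetical `./activities` and `./workflows` modules) shows the kind of limits involved; it is not Railway’s actual configuration.

```typescript
// Illustrative only: a Temporal worker with explicit concurrency caps so it
// cannot poll more tasks than its memory budget can absorb.
import { Worker } from '@temporalio/worker';
import * as activities from './activities'; // hypothetical activities module

async function run() {
  const worker = await Worker.create({
    taskQueue: 'deployments',
    workflowsPath: require.resolve('./workflows'), // hypothetical workflows module
    activities,
    // Cap concurrent activity executions (e.g. GitHub API calls) per worker.
    maxConcurrentActivityTaskExecutions: 50,
    // Cap concurrent workflow tasks so a large backlog can't flood one worker.
    maxConcurrentWorkflowTaskExecutions: 20,
  });
  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```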
After these fixes, we saw gradual recovery. The task queue began clearing the backlog of deployments, and as load decreased, we re-enabled deployments in stages: first Hobby, then Trial and Free. After additional monitoring, we observed full recovery and declared the incident resolved.
We’re making several changes to prevent errors of this class from happening again:
- We’ve implemented an auto-tuning algorithm that should prevent our fleet of workers from starving themselves if a similar thundering herd scenario appears again (a rough sketch of the general idea follows this list).
- We’ve scaled our task queue’s resources up across the board to account for additional load.
- We’re working to remove external API dependencies from critical workflows to limit the impact of similar outages.
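The auto-tuner mentioned above is internal to Railway, but the general idea can be sketched as a feedback loop that shrinks a worker’s concurrency cap when memory headroom gets tight and grows it back slowly once pressure eases. The thresholds, step sizes, and memory budget below are invented for illustration.

```typescript
// Rough sketch of resource-based auto-tuning: back off concurrency before the
// worker OOMs, recover it gradually when memory pressure is low.
// Not Railway's implementation; all numbers are made up.
class AdaptiveConcurrencyLimit {
  constructor(
    private current: number,
    private readonly min: number,
    private readonly max: number,
  ) {}

  get limit(): number {
    return this.current;
  }

  // Called periodically with the fraction of the memory budget currently used.
  adjust(memoryUtilization: number): void {
    if (memoryUtilization > 0.85) {
      // Shed load aggressively before hitting an Out-Of-Memory failure.
      this.current = Math.max(this.min, Math.floor(this.current / 2));
    } else if (memoryUtilization < 0.6) {
      // Recover capacity one slot at a time to avoid another thundering herd.
      this.current = Math.min(this.max, this.current + 1);
    }
  }
}

// Example: re-evaluate the cap every few seconds from process memory stats,
// assuming a 2 GiB memory budget per worker.
const limiter = new AdaptiveConcurrencyLimit(50, 4, 200);
const memoryBudgetBytes = 2 * 1024 ** 3;

setInterval(() => {
  const usedBytes = process.memoryUsage().rss;
  limiter.adjust(usedBytes / memoryBudgetBytes);
}, 5_000);
```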
We apologize for this outage and are actively working to prevent similar issues from happening again. We understand reliable deployments are central to your workflow, and this disruption is unacceptable.
