Noah Dunnagan

Incident Report: November 20th, 2025

We recently experienced an outage that impacted our deployments.

When a Major Outage occurs, it is Railway’s policy to share the public details of what happened.

All deployments on Railway were temporarily delayed during the outage due to an issue with our deployment task queue.

All running deployments and platform-level features such as public and private networking remained online throughout the incident. Users who did not push new code or trigger a redeploy did not experience any service disruption.

On November 20th, 2025:

  • 16:54 UTC - Our engineers noticed an unusually low volume of GitHub webhooks for Push Events. Users reported deployments stuck at the “Initializing” step.
  • 17:29 UTC - Webhook traffic from GitHub surged by 10x, causing a flood of concurrent deployment initializations.
  • 17:32 UTC - Engineering team was paged for a growing deployment backlog.
  • 17:41 UTC - An incident was declared. Free, Trial, and Hobby deployments were disabled to reduce pressure on the deploy queue.
  • 18:19 UTC - Pro deployments temporarily disabled.
  • 18:47 UTC - Recovery observed. Pro deployments re-enabled.
  • 19:01 UTC - Additional recovery confirmed. Hobby deployments re-enabled.
  • 19:05 UTC - Free and Trial deployments re-enabled.
  • 19:18 UTC - Full recovery confirmed. All deployments had been re-enabled and the incident was resolved.

For more information, please refer to the incident’s live updates on our Status Page.

Railway processes deployments asynchronously via a task queue:

  • When you push code to GitHub, we receive webhook events from GitHub that tell us to create jobs to deploy your code.
  • When you perform a deployment through the CLI (railway up), we enqueue the deployment job.
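
To make this flow concrete, here is a minimal, illustrative sketch of the two enqueue paths. The names and the in-memory queue are assumptions made for the example; the real queue is a durable, internal system.

```typescript
// Minimal sketch of the two enqueue paths described above
// (names are illustrative, not Railway's actual internals).

type DeployTrigger = "github_push" | "cli_up";

interface DeployJob {
  projectId: string;
  commitSha: string;
  trigger: DeployTrigger;
  enqueuedAt: Date;
}

// Stand-in for the real task queue, which would be backed by a durable store.
const deployQueue: DeployJob[] = [];

function enqueueDeploy(job: Omit<DeployJob, "enqueuedAt">): void {
  deployQueue.push({ ...job, enqueuedAt: new Date() });
}

// Path 1: GitHub delivers a Push Event webhook and we translate it into a job.
function handleGithubPushWebhook(payload: { repoProjectId: string; after: string }): void {
  enqueueDeploy({
    projectId: payload.repoProjectId,
    commitSha: payload.after,
    trigger: "github_push",
  });
}

// Path 2: `railway up` uploads code and the API enqueues the job directly.
function handleCliUp(projectId: string, commitSha: string): void {
  enqueueDeploy({ projectId, commitSha, trigger: "cli_up" });
}

handleGithubPushWebhook({ repoProjectId: "proj_123", after: "abc123" });
handleCliUp("proj_456", "def456");
console.log(`${deployQueue.length} deploy jobs queued`);
```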

After observing a sudden dip in GitHub webhook delivery, we saw a surge of events delivered to us in a short span of time. This triggered a massive wave of deployment creation that overwhelmed our deployment processing pipeline. During this time, we noticed that some of our workers were locking up under memory pressure. To remedy this, we increased the number of workers and cycled the existing ones by restarting them.
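
As an illustration of what cycling workers under memory pressure can look like (this is a sketch, not Railway's actual worker code; the threshold and job-processing logic are placeholders), a queue worker can watch its own resident set size and recycle itself before it locks up, relying on its supervisor to start a fresh process:

```typescript
// Hedged sketch: a worker that recycles itself under memory pressure
// instead of locking up. Threshold and processing logic are illustrative.

const MEMORY_LIMIT_BYTES = 1.5 * 1024 * 1024 * 1024; // illustrative 1.5 GiB cap

async function processNextJob(): Promise<void> {
  // Placeholder for pulling one deploy job off the queue and processing it.
  await new Promise((resolve) => setTimeout(resolve, 100));
}

async function workerLoop(): Promise<void> {
  for (;;) {
    await processNextJob();

    // Exit cleanly before memory pressure causes a lockup, and let the
    // supervisor bring up a replacement process with a clean heap.
    const { rss } = process.memoryUsage();
    if (rss > MEMORY_LIMIT_BYTES) {
      console.warn(`rss=${rss} exceeds limit; recycling worker`);
      process.exit(0);
    }
  }
}

workerLoop().catch((err) => {
  console.error("worker crashed", err);
  process.exit(1);
});
```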

With jobs still piling up, we had to gracefully degrade service. First, we halted Free, Trial, and Hobby deployments to reduce load. When that wasn’t enough, we stopped deployments for Pro users. Railway degrades service gradually in the event of a major outage, prioritizing our Enterprise and Pro users first.
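
As a rough sketch of how this kind of tier-based degradation can be wired (everything here is an assumption for illustration, not our actual flag system), a per-plan kill switch lets operators shed load from the lowest-priority tier up:

```typescript
// Illustrative per-plan kill switch for shedding deploy load tier by tier.

type Plan = "free" | "trial" | "hobby" | "pro" | "enterprise";

// Order matters: lower-priority plans are disabled first during an incident.
const degradationOrder: Plan[] = ["free", "trial", "hobby", "pro", "enterprise"];

const deploysEnabled: Record<Plan, boolean> = {
  free: true,
  trial: true,
  hobby: true,
  pro: true,
  enterprise: true,
};

// Shed load by disabling the N lowest-priority tiers.
function degradeToLevel(tiersToDisable: number): void {
  degradationOrder.forEach((plan, i) => {
    deploysEnabled[plan] = i >= tiersToDisable;
  });
}

function canEnqueueDeploy(plan: Plan): boolean {
  return deploysEnabled[plan];
}

// 17:41 UTC equivalent: disable Free, Trial, and Hobby deployments.
degradeToLevel(3);
console.log(canEnqueueDeploy("hobby")); // false
console.log(canEnqueueDeploy("pro"));   // true

// 18:19 UTC equivalent: also disable Pro deployments.
degradeToLevel(4);
console.log(canEnqueueDeploy("pro"));   // false
```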

We started seeing recovery once the new workers came online, and subsequently re-enabled deployments for Pro users. Once additional recovery was confirmed, we re-enabled Hobby deployments.

Once full recovery was confirmed, we re-enabled deployments for Free and Trial users. At this point, deployments were re-enabled for all users across the platform.

We’re introducing multiple changes to prevent failures like this from happening again:

  • We are going to segment and improve our alerts related to the deployment processing queue so that mitigating action can be taken faster.
  • We are adding more internal monitoring for unusual deployment spikes that signal this type of failure (a rough sketch of what such a check could look like follows this list).
  • We are working on fixing the root cause of workers locking up under memory pressure, which will prevent the processing queue from stalling in the future.
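
As a hypothetical example of the kind of spike check mentioned above (the window size, threshold, and paging hook are illustrative, not our production monitoring), a simple approach is to compare the latest window of deploy creations against a rolling baseline:

```typescript
// Illustrative spike check: compare the current window's deploy-creation count
// against a rolling baseline. All values here are assumptions for the example.

const WINDOW_MINUTES = 5;
const SPIKE_FACTOR = 5; // page if the latest window is 5x the baseline

// Deploy creations counted per window, oldest first (would come from metrics).
function isSpike(countsPerWindow: number[]): boolean {
  if (countsPerWindow.length < 2) return false;

  const latest = countsPerWindow[countsPerWindow.length - 1];
  const history = countsPerWindow.slice(0, -1);
  const baseline = history.reduce((sum, n) => sum + n, 0) / history.length;

  // Guard against a zero baseline (e.g. the dip in webhooks before the surge).
  return latest > Math.max(baseline, 1) * SPIKE_FACTOR;
}

// Example: steady traffic, a dip, then a 10x burst of deploy creations.
const recentCounts = [120, 115, 20, 1200];
if (isSpike(recentCounts)) {
  console.warn(
    `Deploy creations spiked over the ${WINDOW_MINUTES}-minute baseline; paging on-call`
  );
}
```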

Railway is committed to providing a best-in-class cloud experience, and any downtime is unacceptable to us. We apologize for any inconvenience this caused, and we will work toward eliminating the entire class of issues that contributed to this incident.