Chandrika Khanduri

Incident Report: October 16th, 2025

We recently experienced an outage that affected our Edge Network connectivity. Some users may have experienced intermittent issues reaching their services via public endpoints for a few minutes.

This outage impacted our Edge Network across all regions. Users accessing Railway-hosted domains over HTTP and TCP may have experienced intermittent connection failures or timeouts for approximately 2-3 minutes.

Private networking was not impacted during this incident. All running deployments remained operational, though some users were unable to reach their services via public endpoints during the outage window. Users might have seen this as HTTP 404 errors, HTTP 522 errors (Connection Timed Out), or Cloudflare error pages indicating upstream connectivity issues.

Incident on our Status Page: Brief disruption to edge network connectivity

  • 20:50 UTC - We noticed a spike in external traffic across the proxy fleet
  • 20:51 UTC - Our internal monitoring probes began failing; on-call engineers paged
  • 20:51 UTC - Routing services reached high memory utilization under sustained request volume; user-facing outage began
  • 20:52 UTC - Load balancers stopped forwarding traffic to the proxies as health checks failed
  • 20:53 UTC - Routing service worked through the traffic spike and returned to previous throughput levels; partial recovery observed
  • 20:53 UTC - Full service restoration across all regions

We experienced increased load on our edge network that exhausted available memory in a supporting service. This coincided with a routine edge network service upgrade, which temporarily increased load as traffic shifted between nodes.

The affected internal service was operating at high utilization when the traffic spike occurred. The combination of the increased load, high utilization, and service reloading triggered the incident.

New edge network services couldn’t initialize properly while the supporting service was under strain. The proxies began failing their health checks, causing our load balancers to stop routing traffic to them. This is when users started seeing connection failures to their services.
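
To make that failure mode concrete, here is a minimal sketch of the health-check behavior described above, assuming a simple probe loop: the load balancer checks each proxy periodically and stops forwarding traffic after a few consecutive failures. The package name, /healthz endpoint, thresholds, and intervals are illustrative assumptions, not Railway's actual implementation.

```go
package edgehealth

import (
	"context"
	"net/http"
	"sync/atomic"
	"time"
)

// backend is a single proxy instance behind the load balancer.
type backend struct {
	addr     string      // host:port of the proxy
	failures int32       // consecutive failed probes
	healthy  atomic.Bool // the load balancer only routes to healthy backends
}

const failureThreshold = 3 // assumed value: consecutive failures before removal

// probe performs one health check and updates the backend's state.
func probe(ctx context.Context, b *backend, client *http.Client) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://"+b.addr+"/healthz", nil)
	if err != nil {
		return
	}
	resp, err := client.Do(req)
	if err != nil || resp.StatusCode != http.StatusOK {
		if resp != nil {
			resp.Body.Close()
		}
		// After enough consecutive failures, the balancer stops routing here.
		if atomic.AddInt32(&b.failures, 1) >= failureThreshold {
			b.healthy.Store(false)
		}
		return
	}
	resp.Body.Close()
	atomic.StoreInt32(&b.failures, 0)
	b.healthy.Store(true)
}

// healthCheckLoop probes every backend on a fixed interval until ctx is done.
func healthCheckLoop(ctx context.Context, backends []*backend) {
	client := &http.Client{Timeout: 2 * time.Second}
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, b := range backends {
				probe(ctx, b, client)
			}
		}
	}
}
```

In a setup like this, once a proxy's healthy flag is cleared it is skipped during backend selection, which is why requests stopped reaching services even though the underlying deployments were still running.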

The service recovered as caching mechanisms kicked in to handle the increased load. During this window, users may have experienced intermittent connectivity issues over the public network to their services.
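
As a rough illustration of how caching can absorb that kind of load, the sketch below shows a read-through cache with a TTL in front of the supporting service, so repeated lookups for the same route are served from memory instead of hitting the strained backend on every request. The types, names, and TTL are assumptions for illustration only.

```go
package routecache

import (
	"sync"
	"time"
)

// entry is a cached lookup result with an expiry time.
type entry struct {
	value   string
	expires time.Time
}

// Cache is a minimal read-through TTL cache in front of a slower lookup.
type Cache struct {
	mu      sync.RWMutex
	entries map[string]entry
	ttl     time.Duration
	lookup  func(key string) (string, error) // fallback to the supporting service
}

func New(ttl time.Duration, lookup func(string) (string, error)) *Cache {
	return &Cache{entries: make(map[string]entry), ttl: ttl, lookup: lookup}
}

// Get returns a cached value while it is still fresh; otherwise it falls back
// to the lookup function and stores the result.
func (c *Cache) Get(key string) (string, error) {
	c.mu.RLock()
	e, ok := c.entries[key]
	c.mu.RUnlock()
	if ok && time.Now().Before(e.expires) {
		return e.value, nil // cache hit: no load on the backend
	}

	v, err := c.lookup(key)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.entries[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}
```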

We take the stability of your workloads seriously. Our infrastructure team has already implemented immediate fixes and is working to eliminate the entire class of issues that caused this incident and significantly increase the resilience of our edge network.

Immediate measures already taken:

  • We’ve increased internal routing service memory limits to provide sufficient headroom for increased load
  • We’ve improved our memory utilization monitoring with tiered alerts that warn us when utilization is high and escalate to critical pages before service degradation
  • We are adding safeguards to prevent the internal routing service from being overwhelmed during traffic anomalies (a minimal sketch of one such safeguard follows this list)
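
As a minimal sketch of the kind of safeguard mentioned in the last bullet, the middleware below caps the number of in-flight requests and sheds excess load with a fast 503 instead of letting memory grow without bound. The wrapper name and limit are assumptions, not Railway's actual code.

```go
package routingguard

import "net/http"

// WithLoadShedding wraps a handler with a simple in-flight request limit.
// Requests beyond maxInFlight are rejected immediately rather than queued.
func WithLoadShedding(next http.Handler, maxInFlight int) http.Handler {
	sem := make(chan struct{}, maxInFlight) // counting semaphore

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // slot available: handle the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // at capacity: shed load with a retryable error
			http.Error(w, "service overloaded, please retry", http.StatusServiceUnavailable)
		}
	})
}
```

Rejecting quickly when the semaphore is full keeps memory bounded during a spike, at the cost of some requests failing fast instead of waiting in a queue.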

On the roadmap (Q4 2025):

Our team was already rebuilding our internal routing service before this incident occurred. The work underway will eliminate this entire class of failures:

  • Our infrastructure engineers are completely rewriting the routing service to be far more memory-efficient and handle substantially higher traffic volumes
  • We plan to migrate to a multi-region, distributed architecture that will scale orders of magnitude beyond our current capacity and isolate failure to individual regions instead of affecting the entire platform

Railway is committed to providing a best-in-class cloud experience. Any downtime is unacceptable to us. We apologize for any inconvenience caused by this incident, and we are working to eliminate the entire class of issues that contributed to it.