Incident Report: December 16th, 2024
We recently experienced an outage that affected inbound traffic on Google Cloud across all regions of our network.
During this outage, Google Cloud Edge servers were unable to route inbound requests to their applications. The outage lasted approximately 10 minutes.
Private networking, outbound requests, Metal Hosts, and Dedicated Instances were unaffected.
When production outages occur, it is Railway’s policy to share the public details of what occurred.
NOTE: All times below are on December 17th (UTC)
[12:47am] - A change is merged modifying IP allow-list rules
[12:49am] - Railway engineer is notified via automated alerting
[12:54am] - Change reverted
[12:58am] - Revert applied
[12:59am] - Our automated monitoring recovered
Over the last 2 years, Railway has maintained both a static and dynamic block list.
Over this period, as we’ve gotten better at dealing with DDoS, we’ve transitioned entirely to the dynamic solution.
A customer reached out to Railway to report that their application was experiencing intermittent timeouts when communicating with Railway.
It turned out their IP was being statically blocked by a list from our “pre-dynamic” era.
We began removing these IPs from the static block list by emptying the array.
However, the Google Cloud Terraform Provider treats source_ranges as an optional list. When nothing is provided, the rule's action applies to everything.
When a GCP firewall rule does not specify a source IP range, it selects all IPs by default. When supplied with an empty list, Terraform removes the source IP list from the underlying GCP firewall rule, changing the meaning of the rule as read in Terraform from “block this empty list of IPs” to “block everything”.
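The failure mode above can be illustrated with a small model. This is only a sketch of the semantics described in this report, not the provider's or GCP's actual code; the names effective_sources, is_blocked and MATCH_ALL are hypothetical, and 0.0.0.0/0 is used as an assumed stand-in for "all IPs".

```python
from ipaddress import ip_address, ip_network
from typing import Optional, Sequence, Tuple

# Assumption: "all IPs" is modelled as the IPv4 catch-all range.
MATCH_ALL: Tuple[str, ...] = ("0.0.0.0/0",)


def effective_sources(source_ranges: Optional[Sequence[str]]) -> Tuple[str, ...]:
    """Return the ranges a deny rule ends up matching.

    An empty or missing list is treated as "no source filter",
    which selects every address instead of none.
    """
    if not source_ranges:
        return MATCH_ALL
    return tuple(source_ranges)


def is_blocked(client_ip: str, source_ranges: Optional[Sequence[str]]) -> bool:
    """True if a deny rule with these source ranges would block client_ip."""
    addr = ip_address(client_ip)
    return any(addr in ip_network(cidr) for cidr in effective_sources(source_ranges))
```

With a populated list, only the listed ranges are blocked. With an empty list, every client is blocked, which is the opposite of what “block this empty list of IPs” suggests when reading the configuration.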
Railway’s policy is to “Revert first, then figure out the underlying issue”.
We could have been faster in a couple of places:
- Our page fired to the current on-call engineer, waking them up, instead of the engineer who merged the change
- Our bastion Terraform runner ran a plan before the apply, causing 3 minutes of “delay” to resolve
As such, our short-term resolutions are:
- Automatically apply a “page override” when merging changes (for faster notification)
- Skip the “plan” phase during a revert for Terraform
- Safely remove the offending firewall rule and audit all remaining legacy Google Cloud firewall rules for loose specificity
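As a rough illustration of what such an audit could look like, the sketch below scans a list of firewall rules for ingress deny rules that have no source ranges or that include the catch-all range. It assumes the rules are available as JSON using the GCP API field names (name, direction, denied, sourceRanges), for example as exported by `gcloud compute firewall-rules list --format=json`. The function name find_loose_deny_rules is hypothetical, and rules that rely on source tags or service accounts are not considered here.

```python
import json
from typing import Any, Dict, List

CATCH_ALL = "0.0.0.0/0"


def find_loose_deny_rules(rules: List[Dict[str, Any]]) -> List[str]:
    """Return names of ingress deny rules that effectively match every source."""
    loose = []
    for rule in rules:
        # Only ingress rules filter on source; treat a missing direction as ingress.
        if rule.get("direction", "INGRESS") != "INGRESS":
            continue
        # Skip allow-only rules; this audit targets deny rules.
        if not rule.get("denied"):
            continue
        sources = rule.get("sourceRanges") or []
        # No sources at all, or an explicit catch-all, both mean "everything".
        if not sources or CATCH_ALL in sources:
            loose.append(rule["name"])
    return loose


def audit_from_json(text: str) -> List[str]:
    """Parse a JSON array of firewall rules and return the loose ones."""
    return find_loose_deny_rules(json.loads(text))
```

A check like this could run against the live rule set or as a gate in CI, so that a rule that silently widens to all IPs is flagged before it is applied.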
Our move to metal is aimed at building new primitives that we fully control, and building predictable behaviour into those primitives. Every layer of this system, from the diverse mix of ISPs we contract with, to the BGP-based L3 fabric running across our switches, to the design of our virtualisation stack, is aimed at resiliency and fault isolation.
By maintaining support for GCP, adding AWS (currently trialing with limited customers), and investing in multiple metal datacenters in each region, we are not just building resilient new infrastructure but also adding support for a diverse mix of compute, while making incidents like this one a thing of the past.