Incident Report: December 16th, 2024

We recently experienced an outage that affected inbound traffic on Google Cloud in all regions of our network.

During this outage, Google Cloud Edge servers were unable to route inbound requests to their applications. The outage lasted approximately 10 minutes.

Private networking, outbound requests, Metal Hosts, and Dedicated Instances were unaffected.

When production outages occur, it is Railway’s policy to publicly share the details of what occurred.

Incident Response Timeline

NOTE: All times are on December 17th (UTC)

[12:47am] - A change is merged modifying IP allow-list rules

[12:49am] - Railway engineer is notified via automated alerting

[12:54am] - Change reverted

[12:58am] - Revert applied

[12:59am] - Our automated monitoring recovered

What Happened

Over the last 2 years, Railway has maintained both a static and dynamic block list.

Over this period, as we’ve gotten better at dealing with DDoS, we’ve transitioned entirely to the dynamic solution.

A customer reported to Railway that their application was experiencing intermittent timeouts when communicating with Railway.

It turned out their IP was being statically blocked by a list from our “pre-dynamic” era.

We began removing this list from the blocked IPs by emptying the array.

However, the Google Cloud Terraform Provider treats source_ranges as an optional list. When nothing is provided, it defaults to applying the action to everything.

💡 Terraform, being written in Go, makes no distinction here between an “empty array” and nil.

When a GCP firewall rule does not specify a source IP range, it selects all IPs by default. Terraform, when supplied with an empty list, changes the underlying GCP firewall rule to remove the source IP list. This essentially changes the meaning of the rule as read in Terraform from “block this empty list of IPs” to “block everything”.
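To illustrate how an emptied list can end up looking like “not specified”, here is a minimal Go sketch. It is not the provider's actual code: the firewallRule type, its field names, and the encode helper are hypothetical, and the use of `omitempty` is an assumption chosen only to show one common way the distinction gets lost. Go itself can tell a nil slice from an empty one, but once the value is serialized this way, both produce the same request body.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// firewallRule is a hypothetical, simplified request body. It is not the
// real GCP or Terraform provider type.
type firewallRule struct {
	Name         string   `json:"name"`
	Action       string   `json:"action"`
	SourceRanges []string `json:"sourceRanges,omitempty"`
}

// encode returns the JSON that this sketch would send for the given ranges.
func encode(ranges []string) string {
	b, err := json.Marshal(firewallRule{
		Name:         "legacy-blocklist",
		Action:       "deny",
		SourceRanges: ranges,
	})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	var unset []string    // nil: the field was never specified
	emptied := []string{} // non-nil, but contains no elements

	// The language distinguishes the two values...
	fmt.Println(unset == nil, emptied == nil)

	// ...but omitempty drops the field in both cases, so the receiver
	// cannot tell "block nothing" from "no range given".
	fmt.Println(encode(unset))
	fmt.Println(encode(emptied))

	// A populated list is sent as expected.
	fmt.Println(encode([]string{"203.0.113.7/32"}))
}
```

In this sketch the second and third lines of output are identical, which mirrors the failure mode described above: the intent “this list is now empty” never reaches the API.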

Short Term Resolution

Railway’s policy is to “Revert first, then figure out the underlying issue”.

We could have been faster in a couple of places:

Our page fired to the current on-call engineer, waking them up, instead of the engineer who merged the change
Our bastion Terraform runner ran a plan before the apply, adding 3 minutes of “delay” to the resolution

As such, our short term resolution is to:

Automatically apply a “page override” when merging changes (for faster notification)
Skip the “plan” phase during a revert for Terraform
Safely remove the offending firewall rule and audit all remaining legacy Google Cloud firewall rules for loose specificity
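As a rough idea of what auditing rules for loose specificity could look like, here is a small Go sketch. The rule type, the matchesEverything and audit helpers, and the rule names are all hypothetical and are not Railway's tooling. It assumes, as described earlier, that a rule with no source range applies to every IP, and it only flags deny rules.

```go
package main

import "fmt"

// rule is a hypothetical, simplified view of a firewall rule.
type rule struct {
	Name         string
	Action       string   // "allow" or "deny"
	SourceRanges []string // CIDR ranges the rule matches
}

// matchesEverything reports whether a rule would apply to all source IPs,
// either because no range is set or because it lists 0.0.0.0/0.
func matchesEverything(r rule) bool {
	if len(r.SourceRanges) == 0 {
		return true
	}
	for _, cidr := range r.SourceRanges {
		if cidr == "0.0.0.0/0" {
			return true
		}
	}
	return false
}

// audit returns the names of deny rules that match every source.
func audit(rules []rule) []string {
	var flagged []string
	for _, r := range rules {
		if r.Action == "deny" && matchesEverything(r) {
			flagged = append(flagged, r.Name)
		}
	}
	return flagged
}

func main() {
	rules := []rule{
		{Name: "legacy-blocklist", Action: "deny", SourceRanges: []string{}},
		{Name: "allow-web", Action: "allow"},
		{Name: "block-one", Action: "deny", SourceRanges: []string{"203.0.113.7/32"}},
	}
	fmt.Println(audit(rules)) // prints [legacy-blocklist]
}
```

A check like this could run before an apply and refuse to proceed when any rule is flagged, so an emptied list fails loudly instead of silently widening the rule.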

Long Term Mitigation

Our move to metal is aimed at building new primitives that we fully control, and at building predictable behaviour into those primitives. Every layer of this system, from the diverse mix of ISPs we contract with, to the BGP-based L3 fabric running across our switches, to the design of our virtualisation stack, is aimed at resiliency and fault isolation.

By maintaining support for GCP, adding AWS (currently in trial with a limited set of customers), and investing in multiple metal datacenters in each region, we are not just building resilient new infrastructure but also adding support for a diverse mix of compute, while making this kind of incident a thing of the past.
