Incident Report: August 27th, 2024
We recently experienced an outage on our platform which affected the new edge network.
During this outage, roughly 30% of traffic (requests served by the new proxy or the public TCP proxy, mostly from newer customers) entered a state where requests would time out. It took approximately 30 minutes for customers to see recovery, with elevated latency, and approximately 60 minutes to restore all systems to a healthy state.
Timeline available here
[10:04 PM UTC] A pull request was merged by a Railway Engineer to add more proxies.
[10:08 PM UTC] The Platform on-call was paged for synthetic testing failures in production related to TCP proxy probes.
[10:13 PM UTC] The on-call identified that the pull request merged at 10:04 PM UTC had recreated all instances of the new Railway proxy at once. The on-call immediately proceeded with the runbook to re-bootstrap these instances.
Context: The same proxy is used to power TCP workloads and HTTP workloads.
[10:44 PM UTC] Instances had been re-bootstrapped and customers began to see partial recovery.
[10:49 PM UTC] Recovery of a handful of proxies in each region was deemed sufficient to update the status page.
[11:17 PM UTC] US West had been fully recovered.
[11:32 PM UTC] Singapore had been fully recovered.
[11:34 PM UTC] Europe and US East had been fully recovered.
[11:40 PM UTC] The incident was declared resolved.
Railway is in the process of rolling out a new proxy to eventually cover 100 percent of traffic in accordance with previously announced milestones.
We are performing this migration to increase network capacity, since Railway is handling ever-greater amounts of traffic, and to enable end-user network monitoring. As a result, the platform engineering team is provisioning additional machines to support these goals.
The issue boils down to a pull request recreating proxy machines that were serving live traffic.
This was a routine pull request, modifying the number of instances of type “newproxy” within a Terraform file.
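For illustration, the change was of roughly this shape; the resource layout, machine type, zone, count, and disk size below are assumptions made for the sketch, not our actual configuration.

```hcl
# Hypothetical sketch only: names and values are illustrative, not Railway's
# real configuration.
resource "google_compute_instance" "newproxy" {
  count        = 8 # the pull request bumped this count to add more proxies
  name         = "newproxy-${count.index}"
  machine_type = "n2-standard-8"
  zone         = "us-west1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
      size  = 100 # GB; this field becomes relevant below
    }
  }

  network_interface {
    network = "default"
  }
}
```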
Railway utilizes an external vendor for IaC orchestration, which helps the team manage resources. The vendor chiefly lints pull requests and comments on them, projecting what the change to the IaC config will do and whether it is valid.
The boot disks on these instances needed to grow to accommodate our retention window, given increasing traffic as we migrate from our old proxy to our new one. Due to an open issue on the Google Terraform Provider, the boot disk could not be resized through Terraform without recreating the instance. Since resizing via Google's dashboard can be done with zero downtime and without draining the instances, the engineer elected to resize the boot disks there and create a pull request recording the required boot disk size increases in the Terraform configuration.
This pull request was created on Friday and remained open.
As a result, when the pull request modifying the number of “newproxy” instances was merged on Tuesday, the external vendor pulled the older configuration with the stale boot disk size. Reconciling that size difference requires recreating the instance, so it proceeded to recreate machines that were serving live traffic, causing the outage.
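To make the failure mode concrete, here is a hedged excerpt of the illustrative resource above showing the divergence; the disk sizes are made up.

```hcl
# Excerpt of the illustrative "newproxy" resource above. The Friday PR
# recording the console resize was still open, so the configuration applied
# on Tuesday still carried the stale size.
boot_disk {
  initialize_params {
    image = "debian-cloud/debian-12"
    size  = 100 # stale: the live disks had already been resized (say, to 200)
                # through the Google dashboard
  }
}
# Because the provider cannot apply this size change in place (the open issue
# above), reconciling the mismatch is planned as a destroy-and-recreate of the
# whole instance rather than an in-place resize.
```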
The Railway platform team has a number of safeguards in place for its infrastructure; however, a few of those safeguards failed, leading to this incident.
The IaC orchestration platform we use for managing our Terraform stack plans changes to the infrastructure within the pull request. It comments on these pull requests, telling the author what changes will be made.
In this case, the IaC platform commented there were “4 to add, 29 to change, 0 to destroy”
However, the Terraform plan on the linked run stated there were “4 to add, 12 to change, 17 to destroy”
The above discrepancy was buried by the IaC platform.
As a result, no checks failed, no alarms were tripped, and the pull request was able to merge.
Our internal providers make use of a feature commonly known as “Deletion Protection”. This is standard across cloud providers and is designed to prevent exactly this issue: it is a configurable field on a variety of resources across AWS, GCP, etc. (See: Example)
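As a minimal sketch of what that looks like on GCP, using the same illustrative resource as above (the deletion_protection field is the Google provider's; the other values remain made up):

```hcl
resource "google_compute_instance" "newproxy" {
  name                = "newproxy-0"
  machine_type        = "n2-standard-8"
  zone                = "us-west1-a"
  deletion_protection = true # an apply that tries to delete this instance is
                             # rejected by the GCP API until the flag is unset

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }
}
```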
However, this feature is only used in our production environments; staging providers usually don't configure it.
Because the option is only set in the production environment, and because our internal service template does not enable this field by default, deletion protection was skipped for the newproxy.
It’s likely that deletion protection was omitted from the newproxy’s configuration files when it was promoted from staging to production.
This incorrect configuration made it difficult to ensure the behavior in testing would be the same in production.
This incident is obviously unacceptable. Railway is critical infrastructure for the internet, upon which hundreds of thousands of businesses and users rely, 24/7/365, to power their operations.
We are making the following changes:
- Modifying all Terraform policies internally, including our examples, to ensure they have deletion protection rules where necessary in the production environment (see the sketch after this list).
- This will help guard against situations where a malformed configuration would remove an active resource in production.
- Removing the ability to destroy resources from our “Terraform Coordinate” boxes.
- Additionally, we have added supplemental alerting on these failures.
- With this change, the only way to destroy a resource at Railway is via a privileged escalation request internally.
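As a rough sketch of the first item, assuming templates default the cloud-side flag on and pair it with Terraform's own lifecycle guard (again on the illustrative resource, not our real one):

```hcl
resource "google_compute_instance" "newproxy" {
  name                = "newproxy-0"
  machine_type        = "n2-standard-8"
  zone                = "us-west1-a"
  deletion_protection = true # cloud-side: the GCP API refuses the delete call

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }

  lifecycle {
    prevent_destroy = true # Terraform-side: any plan that would destroy this
                           # resource errors out before reaching the provider
  }
}
```

The two guards are complementary: the lifecycle rule stops the plan itself, while the cloud flag rejects the delete call even if a destructive plan slips through.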
We are additionally looking into creating machine images for the new proxies. We currently image other systems in our stack so that they can be brought up immediately instead of having to be re-bootstrapped just-in-time.
Meanwhile, the Railway customer account teams are working closely with our customers to help mitigate business impact. Companies on Railway are encouraged to get in touch with our team and are being offered Slack Connect channels so we can proactively help if any issues arise. We also appreciate any and all feedback on our incident response; you are encouraged to share your thoughts here.