Incident Report: June 11th, 2024
We recently experienced an outage on our platform which affected our US and EU fleet.
During this outage, about 20% of our instances entered a degraded state, meaning they were slow to serve requests. In some rare cases, a handful of boxes had to be fully cycled, resulting in full user and application downtime.
When a production outage affects availability, it is Railway’s policy to publish a report about it.
At 1:41 AM UTC on June 11th, an engineer was paged for IO pressure resulting in slower response rates on a single Europe instance. Upon inspecting the machine, the on-call engineer found a service using ~1TB of ephemeral storage. Once ephemeral storage utilization reaches a threshold, the service is redeployed and the user is notified. Upon redeploying this workload, IO pressure returned to nominal.
The above was considered “resolved” at 2:31 AM UTC.
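As an aside, the ephemeral-storage guardrail mentioned above works roughly as sketched below. This is a minimal illustration only; the names, types, and the exact threshold are assumptions, not Railway’s actual implementation.

```go
package main

import "fmt"

// ephemeralStorageLimitBytes is an assumed threshold; the real limit and
// enforcement path are internal and not shown here.
const ephemeralStorageLimitBytes = 1 << 40 // ~1 TB

// serviceInstance is a hypothetical view of a running instance.
type serviceInstance struct {
	ID                    string
	EphemeralStorageBytes uint64
}

// enforceEphemeralStorageLimit redeploys an instance that exceeds the limit
// and notifies its owner, mirroring the behaviour described in the report.
func enforceEphemeralStorageLimit(inst serviceInstance, redeploy func(string) error, notify func(string, string)) error {
	if inst.EphemeralStorageBytes < ephemeralStorageLimitBytes {
		return nil
	}
	notify(inst.ID, "ephemeral storage limit exceeded; instance will be redeployed")
	return redeploy(inst.ID)
}

func main() {
	inst := serviceInstance{ID: "svc-123", EphemeralStorageBytes: 1 << 40}
	err := enforceEphemeralStorageLimit(inst,
		func(id string) error { fmt.Println("redeploying", id); return nil },
		func(id, msg string) { fmt.Println("notify owner of", id+":", msg) },
	)
	if err != nil {
		fmt.Println("redeploy failed:", err)
	}
}
```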
At 7:53 AM UTC on June 11th, we were again notified of higher-than-normal IO pressure, this time on multiple instances across the fleet in US-West.
At 8:24 AM UTC, after identifying an IO-locked box, we called an incident.
Throughout this period, some machines continued to serve workloads, but in a degraded state. We monitored the fleet and restarted any machine that became unresponsive.
Between becoming unresponsive and being redeployed, some workloads saw up to 20 minutes of downtime. In situations where multiple services were affected, this downtime may have occurred multiple times.
At 9:03 AM UTC we disabled Hobby provisioning to prioritize restoring Pro plan workloads.
By 10:45 AM UTC, the last machine had been restarted.
By 11:57 AM UTC, we confirmed that instances were up and running. Additionally, we identified and fixed an issue affecting ~5% of machines where private networking did not initialize correctly.
By 12:29 PM UTC, we had restored private networking features.
By 1:00 PM UTC, we had declared the incident “Resolved”.
We identified the root cause as an errant migration workflow, which exacerbated IO pressure by calling an IO hot path concurrently with other retry queues.
When Railway makes a scheduling decision, it queries machines across the fleet. Based on their responses, we compute the “optimal” placement using a variety of factors. Part of this status response involves a pathway on the machine that checks the status of every instance running on it. This pathway can read many files, which normally results in a small and uneventful amount of IO load.
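To make the shape of this pathway concrete, here is a minimal sketch of a status fan-out followed by placement scoring. The names, the scoring factors, and the fetchStatus callback are assumptions for illustration; the real scheduler weighs many more signals.

```go
package main

import (
	"fmt"
	"sort"
)

// machineStatus is a hypothetical shape of the per-machine status response.
// Producing it is where the IO-heavy instance listing happens on the host.
type machineStatus struct {
	Machine       string
	Region        string
	RunningCount  int
	FreeMemoryMiB int
}

// pickPlacement is a simplified stand-in for the scheduler: it fans out a
// status request to every candidate machine, then scores the responses.
func pickPlacement(machines []string, fetchStatus func(string) (machineStatus, error)) (string, error) {
	var statuses []machineStatus
	for _, m := range machines {
		st, err := fetchStatus(m) // this call hits the IO-heavy status pathway
		if err != nil {
			continue // skip machines that fail to answer
		}
		statuses = append(statuses, st)
	}
	if len(statuses) == 0 {
		return "", fmt.Errorf("no machine returned a usable status")
	}
	// Score by fewest running instances, then most free memory; the real
	// scheduler considers many more factors.
	sort.Slice(statuses, func(i, j int) bool {
		if statuses[i].RunningCount != statuses[j].RunningCount {
			return statuses[i].RunningCount < statuses[j].RunningCount
		}
		return statuses[i].FreeMemoryMiB > statuses[j].FreeMemoryMiB
	})
	return statuses[0].Machine, nil
}

func main() {
	fake := map[string]machineStatus{
		"us-west-1": {Machine: "us-west-1", Region: "us-west", RunningCount: 40, FreeMemoryMiB: 8192},
		"us-west-2": {Machine: "us-west-2", Region: "us-west", RunningCount: 25, FreeMemoryMiB: 4096},
	}
	best, err := pickPlacement([]string{"us-west-1", "us-west-2"}, func(m string) (machineStatus, error) {
		return fake[m], nil
	})
	fmt.Println(best, err)
}
```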
Railway receives almost 100k deployment requests per day, and it is not unusual to have hundreds of deployments happening at the same time. These deployments are queued by our build system as they go out.
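A rough sketch of that queueing behavior, assuming a simple fixed-size worker pool; the real build system is more sophisticated, and all names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// deployRequest is a hypothetical unit of work in the build queue.
type deployRequest struct{ DeploymentID string }

// runDeployQueue drains deploy requests with a fixed number of workers, so
// hundreds of concurrent requests are smoothed out rather than hitting the
// fleet all at once.
func runDeployQueue(requests <-chan deployRequest, workers int, deploy func(deployRequest)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range requests {
				deploy(req)
			}
		}()
	}
	wg.Wait()
}

func main() {
	reqs := make(chan deployRequest, 8)
	for i := 0; i < 8; i++ {
		reqs <- deployRequest{DeploymentID: fmt.Sprintf("deploy-%d", i)}
	}
	close(reqs)
	runDeployQueue(reqs, 2, func(r deployRequest) { fmt.Println("deploying", r.DeploymentID) })
}
```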
However, Railway additionally performs host migrations from time to time. That is, we drain instances on a machine and replace it. We do this for a variety of reasons — internal change rollouts, security updates, upgrading machines, etc.
Unfortunately, a variety of issues occurred at the same time:
- Standard deployments happened, causing the deployment queue to fill up
- Crons fired on the hour, causing the cron queue to fill up
- Some European placements failed due to status latency, causing deploys to pile up in the secondary queue
- The migration script additionally fired at the same time as all of the above
The last item was the largest source of problems. We discovered a few issues:
- The migration workflow had an issue where it would call a redeploy for every replica, not every deployment (see the sketch after this list)
- This meant that if the workflow called redeploy on a deployment with, say, 50 replicas, 3 of which sat on the same box, it would trigger 150 redeploys
- The status endpoint, which is called for each deployment, is IO-heavy
- The status endpoint is called by the scheduling algorithm in all regions, even if those regions aren’t valid for placement
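Here is a minimal sketch of the replica-versus-deployment defect and the deduplicated fix. The types and function names are hypothetical; only the counting behavior is meant to mirror the report.

```go
package main

import "fmt"

// replica is a hypothetical view of one instance placed on a machine that is
// being drained.
type replica struct {
	DeploymentID string
}

// buggyDrain mirrors the defect described above: it calls redeploy once per
// replica on the machine, so a deployment with several replicas on the same
// box is redeployed several times over.
func buggyDrain(replicas []replica, redeployDeployment func(string)) {
	for _, r := range replicas {
		redeployDeployment(r.DeploymentID) // one full redeploy per replica
	}
}

// fixedDrain deduplicates by deployment, so each affected deployment is
// redeployed exactly once regardless of how many replicas live on the box.
func fixedDrain(replicas []replica, redeployDeployment func(string)) {
	seen := map[string]bool{}
	for _, r := range replicas {
		if seen[r.DeploymentID] {
			continue
		}
		seen[r.DeploymentID] = true
		redeployDeployment(r.DeploymentID)
	}
}

func main() {
	// Three replicas of the same 50-replica deployment on one machine.
	onBox := []replica{{"dep-A"}, {"dep-A"}, {"dep-A"}}

	buggy, fixed := 0, 0
	buggyDrain(onBox, func(string) { buggy += 50 }) // each call redeploys all 50 replicas
	fixedDrain(onBox, func(string) { fixed += 50 })
	fmt.Println("replica redeploys (buggy):", buggy) // 150
	fmt.Println("replica redeploys (fixed):", fixed) // 50
}
```

With the fix, the number of redeploys is bounded by the number of affected deployments rather than the number of replicas on the machine.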
When the migration workflow fired, it stacked up jobs in its queue. Any one of these queues retrying on its own is usually not a problem, and even a couple retrying together is manageable, but with everything happening in concert, the status endpoint was hit with 4x its normal load. This did not take the system down; it merely caused the service degradation described above.
However, the above made parts of the platform unschedulable, because we could not retrieve status results from the instances. This caused retries across the queues, which pushed queued status requests to the workers in all regions, making other regions unschedulable and filling up their retry queues in turn.
We quickly canceled all running migrations, crons, and deployments. However, we had already queued the requests to the instances. From there, we simply had to wait for workloads to stop serving requests, or for the box to recover on its own.
To prevent a recurrence, we are making the following changes:
- Cache status responses from the machines themselves to prevent high-IO operations (listing instances); a sketch of this approach follows the list
- Our new V2 Runtime does this by default; it does not read files, and instead uses an in-memory database
- Add global rate limits to the worker processes
- Have the scheduler only dial workloads within the region, to prevent cross cluster write amplification from queued retries
- Tune memory/cpu/etc reaper processes to prioritize evicting Hobby stateless workloads, which can be spun up elsewhere with zero downtime using our “scale to zero” engine
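As an illustration of the first item, here is a minimal sketch of a TTL-based status cache that serves a recent snapshot instead of walking files on every scheduler request. The structure and names are assumptions; the V2 Runtime’s in-memory database approach is not shown here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// statusSnapshot is a hypothetical cached view of a machine's instances,
// refreshed periodically so that scheduler status requests do not trigger a
// file-system walk on every call.
type statusSnapshot struct {
	TakenAt      time.Time
	RunningCount int
}

// statusCache serves the last snapshot if it is fresh enough and only falls
// back to the expensive listing when the snapshot has expired.
type statusCache struct {
	mu       sync.Mutex
	snapshot statusSnapshot
	ttl      time.Duration
	listIO   func() int // the expensive, IO-heavy instance listing
}

func (c *statusCache) Get() statusSnapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.snapshot.TakenAt) < c.ttl {
		return c.snapshot // cheap path: no file reads
	}
	c.snapshot = statusSnapshot{TakenAt: time.Now(), RunningCount: c.listIO()}
	return c.snapshot
}

func main() {
	calls := 0
	cache := &statusCache{
		ttl:    5 * time.Second,
		listIO: func() int { calls++; return 42 },
	}
	for i := 0; i < 1000; i++ {
		cache.Get()
	}
	// 1000 status requests, but the expensive listing ran only once.
	fmt.Println("listing calls:", calls)
}
```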
The Railway Infra team has now implemented a number of fixes from the solutions listed above, in addition to the triage performed to recover workloads. Meanwhile, the Railway team is working closely with customers whose businesses were impacted to help further harden their applications. Companies on Railway are encouraged to get in touch with our team; we are offering Slack Connect channels so that we can proactively help if any issues arise. We also appreciate any and all feedback on our incident response; you are encouraged to share your thoughts here.