Not Everything Is Google’s Fault (Just Most Things)
Railway uses Google Cloud Platform products such as Google’s Compute Engine to power our application development platform.
At 16:40 UTC, one by one, a subset of machines in our us-west fleet became unresponsive but did not fail over. Individual instances were offline for ~10 minutes at a time, on a rolling basis.
By 20:53 UTC, the issue had been resolved, all workloads failed over successfully, and service was subsequently restored.
Through our findings, we believe that there is a potentially fatal interaction in userspace-to-kernel memory transfer on GCP guests which causes soft lockups in rare cases under resource pressure.
Over the last 18 months, we’ve had more than a handful of issues with Google. I was waiting to write about them until AFTER we got off Google, but now seems as good a time as any.
In 2022, we experienced continual networking blips from Google’s cloud products. After escalating to Google on multiple occasions, we got frustrated.
So we built our own networking stack: a resilient eBPF/IPv6 WireGuard network that now powers all our deployments.
Suddenly, no more networking issues.
In 2023, Google randomly throttled our quota on their Artifact Registry down to nothing.
This caused our builds to be delayed as throughput of image distribution was cut substantially. Again, we got frustrated.
Following this, we built our own registry product.
Voila, no more registry throughput issues.
After the issues above, I was fuming. How could Google do this? We paid them multiple millions of dollars per year, yet they simply couldn’t be bothered to care that their actions were affecting user workloads (both ours and many. other. customers).
So, I did what any self-respecting founder does: I started tweeting.
Google reached out, and I took it upon myself to sit down with a few of their VPs to get to the bottom of what happened, so it never happened to anybody else.
As it turns out, a Google engineer was able to arbitrarily change the quota on GCP.
I expressed to the VPs that this wasn’t acceptable. They agreed and said “We’re digging in heavily to this. It’s going all the way to the top!”
That was June. To this day, I’m still following up to get that retro, an official response, and a policy added to prevent arbitrary quota changes.
They also said they’d get back to us on that. 🦗
After citing Steve Yegge’s Platform Rant to any VP that would listen, I felt defeated.
Last quarter, we made the decision internally to sunset all Google Cloud services and move to our own bare metal instances. We got our first one up a couple weeks ago, and will be moving all our instances in 2024.
In our experience, Google isn’t the place for reliable cloud compute, and it’s sure as heck not the place for reliable customer support.
That leads us to today’s incident.
Yesterday, November 30th, at 21:41 UTC, a box dropped offline, caused by Google restarting a machine. While this isn’t unexpected, it is rare. As such, we have automated systems in place to detect and resolve this. We’re notified in Discord and, if the box doesn’t become healthy after failover, we’re paged. This box failed over successfully. No issue, no page, we all slept great.
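For context, the shape of that automation is roughly the following. This is a minimal sketch, not our actual code: the host list, health endpoint, Discord webhook, and paging hook are all hypothetical placeholders.

```python
# Minimal sketch of a host-health watcher (hypothetical hosts, endpoints, and hooks).
# Idea: notify Discord when a box drops, page a human only if the box is still
# unhealthy after the failover window.
import time
import requests

HOSTS = ["us-west-01.example.internal"]                   # hypothetical host list
DISCORD_WEBHOOK = "https://discord.com/api/webhooks/..."  # placeholder webhook URL
FAILOVER_WINDOW_SECONDS = 600                             # ~10 minutes, per the incident

def is_healthy(host: str) -> bool:
    try:
        return requests.get(f"http://{host}/healthz", timeout=5).ok
    except requests.RequestException:
        return False

def notify_discord(message: str) -> None:
    requests.post(DISCORD_WEBHOOK, json={"content": message}, timeout=5)

def page_oncall(message: str) -> None:
    # Placeholder for a PagerDuty/Opsgenie-style escalation.
    print(f"PAGE: {message}")

while True:
    for host in HOSTS:
        if not is_healthy(host):
            notify_discord(f"{host} is unresponsive; waiting for failover")
            deadline = time.time() + FAILOVER_WINDOW_SECONDS
            while time.time() < deadline and not is_healthy(host):
                time.sleep(30)
            if not is_healthy(host):
                page_oncall(f"{host} did not recover within the failover window")
    time.sleep(60)
```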
Today, December 1st, at 16:52 UTC, a box dropped offline and became inaccessible. This time, instead of automatically coming back after failover, it stayed down.
Our primary on-call engineer was paged and dug in. While digging in, another box fell offline and didn’t come back.
We started manually failing over these boxes (~10 minutes of downtime each), but soon enough there were a dozen of these boxes and half the company was called in to go through runbooks.
Given our experience with Google Cloud, and given that what we saw in the serial logs lined up with what we’d seen before from Google Cloud’s automated live-migration, we assumed this was a routine restart from Google that had gone awry. I emailed the Googlers who said they’d “Support us, day or night.” I immediately received an OOO email from Google.
So, as we were manually failing these over, we kept digging.
Our first reaction was to look through the serial console logs — these logs come straight from the kernel via a virtualized serial device.
When we scrubbed through the serial console logs, we noticed soft-locked CPU cores, as well as stack traces for the locked CPUs showing entries such as kvm_wait and __pv_queued_spin_lock_slowpath.
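To illustrate, this is roughly the kind of scan we ran over the serial output. The log file path is a placeholder; on GCE, the serial output itself can be pulled with gcloud compute instances get-serial-port-output.

```python
# Rough sketch: scan a dump of the serial console output for soft-lockup
# markers and the stack-trace entries that kept appearing.
# "serial-console.log" is a placeholder path for the dumped output.
import re

MARKERS = [
    r"soft lockup",
    r"kvm_wait",
    r"__pv_queued_spin_lock_slowpath",
]
pattern = re.compile("|".join(MARKERS))

with open("serial-console.log", errors="replace") as f:
    for line_no, line in enumerate(f, start=1):
        if pattern.search(line):
            print(f"{line_no}: {line.rstrip()}")
```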
The last time we had seen similar logs and behavior within the serial console logs was during a Google-initiated restart which occurred on three boxes last year, also in December.
As we dug into this more, we found additional kernel errors which lined up with a few threads on GCP’s nested kernel virtualization causing soft lockups. You can see Google acknowledge this bug here. Additionally, other users have complained about it here and here. All on GCP.
Because we don’t use virtualization ourselves on these hosts (yet), these kvm and paravirtualization messages relate to the guest kernel code that interacts with the GCP hypervisor.
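A quick way to sanity-check that framing is to confirm the host itself is a guest on Compute Engine, i.e. that the kvm_*/paravirt symbols in the traces belong to guest-side kernel code rather than anything we run. A rough sketch using standard Linux sysfs/procfs paths (nothing Railway-specific):

```python
# Rough sketch: confirm this machine is itself a hypervisor guest on GCE.
from pathlib import Path

def read(path: str) -> str:
    try:
        return Path(path).read_text()
    except OSError:
        return ""

# DMI product name reads "Google Compute Engine" on GCE guests.
product = read("/sys/class/dmi/id/product_name").strip()

# The CPUID "hypervisor" bit shows up as a flag in /proc/cpuinfo on guests.
has_hv_flag = any(
    "hypervisor" in line.split(":", 1)[1].split()
    for line in read("/proc/cpuinfo").splitlines()
    if line.startswith("flags") and ":" in line
)

print("DMI product name:", product or "<unavailable>")
print("hypervisor CPU flag:", has_hv_flag)
```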
The users in the 3 issues above seemed to experience the same thing, all on GCP. GCP seems to have dismissed this as “not reproducible,” but we strongly believe we experienced the same thing here.
Specifically, we believe that there is a potentially fatal interaction in userspace-to-kernel memory transfer on GCP guests which causes soft lockups in rare cases under resource pressure.
More accurately, we believe this is related to the paravirtualized memory management and how these pages are mapped and remapped on the hypervisor under certain kinds of resource pressure. The one common factor across the reports we’ve seen is that nearly all of them come from GCP users.
If the above is true, then much like the quota issue we experienced prior, there is an arbitrary, unposted speed limit/threshold/condition at which your boxes soft-lock, despite being well below their limits on all observed telemetry (CPU/mem/IOPS). These machines were at around 50% of their posted resource limits, which is in line with this article about GCP nested virtualization.
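One gap worth calling out: plain utilization metrics don’t capture stalls. Linux pressure stall information (PSI) does. Here’s a minimal sketch of reading it, assuming a kernel built with CONFIG_PSI; this is an illustration, not something from our incident runbook.

```python
# Minimal sketch: read Linux PSI (pressure stall information), which tracks
# time spent stalled on a resource rather than raw utilization.
# Assumes a kernel with CONFIG_PSI enabled.
from pathlib import Path

for resource in ("cpu", "memory", "io"):
    path = Path("/proc/pressure") / resource
    if not path.exists():
        print(f"{resource}: PSI not exposed by this kernel")
        continue
    for line in path.read_text().splitlines():
        # e.g. "some avg10=0.00 avg60=0.00 avg300=0.00 total=12345"
        print(resource, line)
```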
After a manual reboot, we disabled some internal services to decrease resource pressure on the affected instances, and the instances became stable after that point.
During manual failover of these machines, there was roughly 10 minutes of downtime per host. However, since many people run multi-service workloads, that downtime could compound as additional boxes went offline; for example, a project with services spread across three affected hosts could have seen up to ~30 minutes of cumulative disruption.
For all of our users, we’re deeply sorry.
This is obscenely frustrating for us and, as mentioned, we’re moving to our own bare metal to give you all increased reliability.