Railway’s 8-hour outage exposes multi-cloud’s hidden weak spot

2026-06-04

Railway, a cloud deployment platform, went offline for roughly eight hours on May 19 after Google Cloud suspended its production account through an automated process. What makes the incident worth examining is not the suspension itself, which Google reversed within seven minutes of receiving an emergency support ticket. It is what stayed broken long after account access returned.

Customer workloads running on AWS and Railway’s own hardware kept running throughout the entire outage. Nobody could reach them anyway.

The reason comes down to how traffic actually finds its destination. Every request hitting a Railway-hosted application passes through edge proxies that consult a routing control plane to figure out where to send traffic. That control plane lived on Google Cloud. When GCP suspended the account, the control plane went dark. Cached routing data bought about 35 minutes before it expired. After that, every workload, regardless of which cloud physically hosted it, started returning 404 errors.
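The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not Railway's implementation: the `EdgeProxy` class, the `lookup` callback, the `serve_stale` flag, and the backend address are all hypothetical names invented for illustration. Only the 35-minute cache lifetime comes from the incident.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


class ControlPlaneDown(Exception):
    """Raised by the hypothetical control-plane client when it is unreachable."""


@dataclass
class CachedRoute:
    backend: str       # where the workload lives (illustrative address)
    fetched_at: float  # when the route was last confirmed, in seconds


class EdgeProxy:
    """Toy edge proxy that resolves hostnames through a routing control plane."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float,
                 serve_stale: bool = False) -> None:
        self.lookup = lookup
        self.ttl_seconds = ttl_seconds
        self.serve_stale = serve_stale
        self.cache: Dict[str, CachedRoute] = {}

    def route(self, host: str, now: float) -> Tuple[int, Optional[str]]:
        """Return (status, backend); 200 stands in for a forwarded request."""
        entry = self.cache.get(host)
        if entry is not None and now - entry.fetched_at < self.ttl_seconds:
            return 200, entry.backend  # cached route is still fresh
        try:
            backend = self.lookup(host)
        except ControlPlaneDown:
            if entry is not None and self.serve_stale:
                return 200, entry.backend  # fall back to the expired entry
            return 404, None  # no usable route, even if the workload is healthy
        self.cache[host] = CachedRoute(backend, now)
        return 200, backend


# Simulated control plane that can be switched off.
control_plane_up = {"value": True}


def lookup(host: str) -> str:
    if not control_plane_up["value"]:
        raise ControlPlaneDown(host)
    return "aws:10.0.0.5"


proxy = EdgeProxy(lookup, ttl_seconds=35 * 60)
before = proxy.route("app.example", now=0)            # fills the cache
control_plane_up["value"] = False                     # the control plane goes dark
during_ttl = proxy.route("app.example", now=34 * 60)  # still served from cache
after_ttl = proxy.route("app.example", now=36 * 60)   # cache expired, lookup fails
```

In this model the backend never changes state, yet the last request fails because the proxy can no longer learn where to send it. Setting `serve_stale=True` keeps expired routes usable while the control plane is down, at the cost of occasionally pointing at a workload that has since moved. That is one possible design trade-off, not a fix Railway has announced.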

Running workloads across multiple clouds did not help because every one of those workloads still depended on a single layer to direct traffic. That layer was not redundant.

A second problem piled on. The volume of failed login retries triggered GitHub’s rate limiting, which blocked users from logging in or starting new deployments even as other services gradually recovered.
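Retry storms like this are commonly dampened with capped exponential backoff plus random jitter, so that clients spread their retries out instead of hammering an upstream service in lockstep. The sketch below shows that general technique only. The report summarized here does not say how Railway's clients retried, and the function name and default values are assumptions chosen for illustration.

```python
import random
from typing import Iterator, Optional


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 60.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one wait time in seconds per retry (full jitter, capped)."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        # The upper bound doubles each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Picking a random point below the bound desynchronizes clients.
        yield rng.uniform(0.0, ceiling)


# Seeded generator so the example is repeatable.
delays = list(backoff_delays(8, rng=random.Random(0)))
```

A caller would sleep for each yielded delay before the next login attempt and give up once the iterator is exhausted.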

Full restoration took until 07:58 UTC on May 20. Along the way, storage volumes came back first, then compute and networking, then the dashboard, then deployments. Terms-of-service records were also reset during recovery, so users had to re-accept the terms at their next login, adding friction to an already difficult recovery.

Railway published a detailed post-incident report acknowledging the architectural gap without softening the conclusion. The committed fixes include removing the hard dependency on GCP for routing decisions, extending database redundancy across AWS and Railway Metal, and keeping Google Cloud off the critical path for live traffic going forward.

One line from Railway’s report is worth sitting with: “Your customers don’t care whether the failure was Google or Railway; they see your product.”

For any business running on a platform like Railway, the practical question this incident raises is straightforward. If your hosting provider’s primary cloud account disappeared tomorrow, which layer would actually stop traffic from reaching your users?
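One way to start answering that question is to list each layer on the request path alongside the providers that can serve it independently, then check what remains when one provider is removed. The sketch below is a simplified illustration under that assumption. The inventory is hypothetical and does not describe Railway's real topology, and the function names are invented for this example.

```python
from typing import Dict, List

# Hypothetical inventory: each layer on the request path mapped to the
# providers that can serve it on their own. Not any real platform's topology.
REQUEST_PATH: Dict[str, List[str]] = {
    "dns": ["provider_a", "provider_b"],
    "edge_proxies": ["provider_b", "own_hardware"],
    "routing_control_plane": ["provider_a"],
    "compute": ["provider_a", "provider_b", "own_hardware"],
}


def layers_lost_if(path: Dict[str, List[str]], provider: str) -> List[str]:
    """Return the layers left with no provider once `provider` disappears."""
    return [layer for layer, providers in path.items()
            if providers and set(providers) <= {provider}]


def traffic_survives(path: Dict[str, List[str]], provider: str) -> bool:
    """Traffic flows only if every layer still has at least one provider."""
    return not layers_lost_if(path, provider)


lost = layers_lost_if(REQUEST_PATH, "provider_a")
```

In this made-up inventory, compute spans three providers, but losing `provider_a` still takes out the routing control plane, so `traffic_survives` returns `False`. A real audit would also need to cover shared credentials, billing accounts, and third-party login providers, which this sketch ignores.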