This is The Internet Report, where we analyze outages and trends across the Internet through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or listen to the podcast for firsthand commentary.
Internet Outages & Trends
Service delivery chains often have a longer string of dependencies than you might expect. Sometimes the root cause of an outage you're experiencing lies not in your own systems, or even in a third-party provider you rely on, but in another provider that they, in turn, depend on.
We saw this phenomenon in action recently when some Cloudflare services were impacted by an outage ultimately caused by Google Cloud issues.
Read on to learn more about what happened at Google Cloud and Cloudflare, and to explore takeaways from a recent OpenAI outage.
Google Cloud Outage
On June 12, Google Cloud experienced an outage that affected many applications that rely on Google Cloud services, including Spotify and Fitbit. ThousandEyes first observed the outage around 18:00 UTC, and it was mostly resolved by 20:40 UTC.
Impacted companies experienced problems such as HTTP server errors, timeouts, and elevated response times, indicating that the problems stemmed from the underlying Google services, rather than network issues.
Explore the Google Cloud outage further on the ThousandEyes platform (no login required).


Google Cloud confirmed that the issues stemmed from an invalid automated update to its API management system. This problem impacted Google’s Identity and Access Management (IAM) functionality, impairing Google Cloud’s ability to authorize requests and to determine which actions authenticated users and services were permitted to take. The result was a substantial cascading effect: services that depend on Google Cloud were unable to obtain proper authorization.
For more insight on the Google Cloud outage, see ThousandEyes’ full deep dive.
Google Cloud & the Cloudflare Outage
One notable example of these ripple effects is the related Cloudflare outage, which affected many Cloudflare services, including Workers KV, WARP, Access, Gateway, Images, Stream, Workers AI, Turnstile and Challenges, AutoRAG, Zaraz, and parts of the Cloudflare Dashboard.
Cloudflare’s Workers KV is a globally distributed key-value storage service that serves as a critical dependency for many Cloudflare products. It keeps data in a limited number of centralized data centers and caches that data in Cloudflare's distributed data centers after it is accessed. It appears that some of this centralized backend storage relied on Google Cloud resources, so when Google's Service Control system failed because of the corrupted policy data, it could no longer authorize access to those storage resources.
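Cloudflare hasn't published Workers KV's internals, but the read-through caching pattern described above can be sketched roughly as follows; the class and callable names here are hypothetical and illustrate only the shape of the design, not Cloudflare's actual implementation.

```python
import time

class ReadThroughKV:
    """Minimal read-through cache sketch: serve a key from the local (edge)
    cache when it's fresh; otherwise fetch it from centralized backend
    storage and cache the result for subsequent reads."""

    def __init__(self, backend_get, ttl_seconds=60):
        self.backend_get = backend_get   # callable that reads the central store
        self.ttl = ttl_seconds
        self.cache = {}                  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]              # cache hit at the "edge"
        value = self.backend_get(key)    # miss: read from centralized storage
        self.cache[key] = (value, time.time() + self.ttl)
        return value

# Usage sketch: the backend read is the step that fails when the storage
# provider cannot authorize the request, as happened in this incident.
central_store = {"config:gateway": "v42"}
kv = ReadThroughKV(central_store.__getitem__)
print(kv.get("config:gateway"))          # fetched from backend, then cached
print(kv.get("config:gateway"))          # served from the local cache
```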
Consequently, several issues emerged. Backend storage became inaccessible, causing 90.22% of Workers KV requests to fail. The failure then cascaded: because Workers KV holds crucial configuration details and authentication tokens for numerous Cloudflare services, its failure disrupted multiple other Cloudflare products.
This situation had such a major impact because Workers KV is not just a simple storage solution; it is essential for managing authentication tokens and configuration data required by other Cloudflare services. When access to the Google Cloud backend was blocked due to authorization issues, Cloudflare had trouble retrieving the data necessary to authenticate users or configure its services. This created a chain reaction, turning a single point of failure—in this case, Google's policy system—into a cascading impact that affected multiple layers of Internet infrastructure.
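One common way to blunt this kind of chain reaction (a general resilience pattern, not necessarily what Cloudflare does) is to keep a last-known-good copy of critical configuration locally and fall back to it when the backing store is unreachable. A minimal sketch, with a hypothetical snapshot path:

```python
import json
import logging
import pathlib

SNAPSHOT = pathlib.Path("/var/cache/myservice/config-snapshot.json")  # hypothetical path

def load_config(fetch_remote):
    """Fetch configuration from the remote store; on failure, fall back to
    the snapshot written during the last successful fetch."""
    try:
        config = fetch_remote()                   # e.g., a key-value or object-store read
        SNAPSHOT.parent.mkdir(parents=True, exist_ok=True)
        SNAPSHOT.write_text(json.dumps(config))   # refresh the local snapshot
        return config
    except Exception as exc:
        logging.warning("config store unreachable (%s); using stale snapshot", exc)
        return json.loads(SNAPSHOT.read_text())   # still fails if no snapshot exists yet
```

Stale data is not always acceptable (authentication tokens, for instance, may expire), so the right fallback depends on what is stored; the point is to choose that behavior deliberately rather than inherit a hard dependency.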
This Cloudflare disruption is a reminder that dependency chains are often longer than you think, and that an issue can manifest differently at various points in the chain, ranging from partial to total service failure, depending on the architecture affected. Cloudflare depended on Google Cloud; when Google Cloud experienced issues, Cloudflare was impacted, which in turn affected services that depend on Cloudflare. As a result, users experienced outages in services that had no direct relationship with Google.
"Independent" services may not be truly independent. Even companies that pride themselves on being cloud-agnostic often still have some dependencies on major cloud providers for critical components like authentication or storage backends.
This outage demonstrates that Internet infrastructure, while robust in many ways, has evolved some single points of failure that can cascade far beyond their original scope. IT teams must have deep visibility across their full service delivery chain to proactively identify potential issues and their source—and comprehensive backup plans in place to mitigate impacts on users when outages do happen.
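As a rough illustration of that visibility point, even a simple periodic probe of each upstream dependency can help distinguish a failure in your own systems from one further down the chain. The endpoints below are placeholders, not real provider URLs:

```python
import time
import urllib.request

# Placeholder dependency health endpoints; substitute your own providers' URLs.
DEPENDENCIES = {
    "cdn-edge": "https://cdn.example.com/health",
    "auth-backend": "https://auth.example.com/health",
    "object-storage": "https://storage.example.com/health",
}

def probe(name, url, timeout=5):
    """Return (dependency name, HTTP status or error, response time in ms)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed_ms = round((time.monotonic() - start) * 1000)
            return name, resp.status, elapsed_ms
    except Exception as exc:
        return name, f"error: {exc}", None

if __name__ == "__main__":
    for name, url in DEPENDENCIES.items():
        print(probe(name, url))
```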
OpenAI Outage
On June 9 and 10, OpenAI experienced a more than 15-hour outage that impacted its API and ChatGPT services, causing elevated error rates and latency.
While an unexpected spike in traffic may seem a likely cause for an outage at a popular service like ChatGPT, examining the outage’s characteristics suggests that load alone was not the primary culprit.
The outage started around 6:30 AM UTC on a Tuesday (11:30 PM PDT on Monday), which probably isn’t a typical peak usage period. It’s unlikely that there would have been a sudden, unprecedented jump in traffic at that specific time that the infrastructure couldn't handle.
The incident’s length also suggested that it was not purely load-related. Load issues often either resolve quickly once traffic patterns normalize, or are mitigated within hours through scaling or load balancing. A sustained outage of that length strongly suggested a more fundamental issue, likely related to backend changes. Additionally, the increased error rates and latency that users experienced during the outage suggested that the problem may have been related to a performance or capacity issue.
OpenAI confirmed the outage did indeed arise from a capacity issue that was caused by backend changes: “a routine update to the host Operating System on our cloud-hosted GPU servers caused a significant number of GPU nodes to lose network connectivity. This led to a drop in available capacity for our services.”
This routine update was “a daily scheduled system update” that “inadvertently restarted the network management service (systemd-networkd) on affected nodes, causing a conflict with a networking agent that [OpenAI runs] on production nodes. This resulted in all routes being removed from impacted nodes, effectively making these nodes lose network connectivity.”
ThousandEyes observations align with OpenAI’s reported cause. We saw degraded performance specifically in the initial page load and component population process—where multiple backend services and APIs need to coordinate to get the interface ready—which is consistent with the issue being in OpenAI's service mesh.

An increase in page load times (the page taking longer to render) without any noticeable change in latency or response time, and with consistent network conditions across all paths, strongly indicated that the bottleneck was within OpenAI's infrastructure rather than in the network.
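A rough way to reproduce that kind of diagnosis yourself (the target host below is a placeholder) is to time the TCP connection separately from the full HTTP exchange: if connect times stay flat while total fetch times grow, the bottleneck is on the server side rather than in the network path.

```python
import socket
import time
import urllib.request

HOST = "www.example.com"                  # placeholder target
URL = f"https://{HOST}/"

def tcp_connect_time(host, port=443, timeout=5):
    """Network-level signal: time to establish a TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start

def full_fetch_time(url, timeout=30):
    """End-to-end signal: time to receive the complete HTTP response body,
    which includes the server's own processing time."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start

if __name__ == "__main__":
    print(f"TCP connect: {tcp_connect_time(HOST) * 1000:.0f} ms")
    print(f"Full fetch:  {full_fetch_time(URL) * 1000:.0f} ms")
```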
To address the issue, OpenAI’s engineering teams initiated “a large-scale re-imaging of the affected GPU nodes.” Recovery took longer than perhaps anticipated because OpenAI didn’t have “break-glass tooling that would have enabled OpenAI engineers to bypass normal deployment pipelines and directly access production systems to implement emergency fixes, to restore network connectivity on affected nodes,” meaning that additional measures were required to get the impacted nodes back online, according to OpenAI’s post-incident report.
OpenAI reached almost full system recovery at 3 PM (UTC) and declared services fully restored at 10 PM (UTC).
The company has reported that it is taking several steps to prevent similar issues in the future. These include auditing VM configurations across its fleet to identify and address similar risks; prioritizing improvements in recovery speed, especially for critical infrastructure components such as GPU VMs and clusters; and organizing regular disaster recovery drills.
OpenAI has also already disabled automatic daily updates on GPU VMs and updated system configurations to guard against conflicts between systemd-networkd and its networking agent.
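Beyond those configuration changes, the specific failure mode here (nodes silently losing their routes) is straightforward to watch for. The following is a hedged sketch using the standard iproute2 `ip route` command on Linux, not OpenAI's actual tooling:

```python
import subprocess
import sys

def has_default_route() -> bool:
    """Return True if this Linux node still has a default route,
    using `ip route show default` from iproute2."""
    result = subprocess.run(
        ["ip", "route", "show", "default"],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())

if __name__ == "__main__":
    if not has_default_route():
        # In a real fleet this check would feed a monitoring or paging system.
        print("WARNING: default route missing; node may have lost connectivity")
        sys.exit(1)
    print("default route present")
```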
This OpenAI outage offers helpful takeaways for IT operations teams. First, it’s important to remember that load is rarely the root cause in mature systems. When investigating outages, look first at recent changes, deployments, and infrastructure modifications.
Second, outage duration matters more than initial severity. A prolonged outage like this one, lasting more than 15 hours, can reveal the need to improve your organization’s incident response strategies, making sure you have backup plans to minimize user impact when quick fixes aren’t possible. You may also need to identify and resolve fundamental architectural issues that prevent you from addressing problems efficiently. As OpenAI noted, its incident response was prolonged by its lack of “break-glass tooling to rapidly restore network connectivity on affected nodes.”
By the Numbers
Let’s close by taking a look at some of the global trends ThousandEyes observed over four recent weeks (May 19 - June 15) across ISPs, cloud service provider networks, collaboration app networks, and edge networks.
Global Outages
- After peaking in mid-May, global outages declined over the subsequent weeks. From May 19-25, ThousandEyes recorded 383 global outages, representing a 29% decrease from the previous week's high of 536. This downward trend continued into the following period (May 26 - June 1), when outages dropped further to 241, a substantial 37% decline.
- However, this downward trend reversed during the week of June 2-8, with global outages increasing to 304, a 26% rise from the previous week. The upward momentum continued into June 9-15, when outages climbed further to 376, representing a 24% increase and bringing levels closer to those observed in late May.
U.S. Outages
- The United States followed a similar overall pattern during the period from May 19 - June 15. From May 19-25, U.S. outages remained relatively stable at 147, showing minimal change from the previous week's 149. During the week of May 26 - June 1, U.S. outages decreased significantly to 84, representing a 43% drop. However, while global outages started to rise the week of June 2-8, U.S. outages continued declining slightly, to 77, an 8% decrease. During the week of June 9-15, U.S. outages began to mirror the upward trend seen globally, increasing to 135, a substantial 75% surge.
- Over the four weeks from May 19 - June 15, the United States displayed varying levels of representation in global network outages. During the week of May 19-25, the U.S. accounted for approximately 38% of all observed network outages. This proportion decreased to 35% from May 26 - June 1, dropped further to 25% during the week of June 2-8, but then increased to 36% from June 9-15. This trend reflects the pattern observed since the April 7-20 period, where U.S.-centric outages represented less than 40% of all recorded outages.
Month-over-month Trends
- Examining the month-over-month trends, global network outages experienced modest growth from April to May 2025, increasing 2% from 1,804 to 1,843 incidents. This pattern stands in sharp contrast to seasonal trends observed in previous years. In 2024, the April-to-May increase was more significant, rising from 687 to 822 outages, a 20% increase. Meanwhile, 2023 saw an even more dramatic seasonal surge from 1,024 to 1,304 incidents, reflecting a 27% increase. Compared to the substantial double-digit increases witnessed in prior years, the relatively muted 2% growth in 2025 suggests either improved global network stability or more effective seasonal planning.
- As in 2024, the U.S. deviated notably from the seasonal trends typically observed at this time of year. In the past, U.S. outages typically increased from April to May; 2023, for example, showed the expected seasonal increase, with outages rising from 451 to 597 incidents, a 32% surge. However, in both 2024 and 2025, outages instead decreased slightly. In 2024, U.S. outages declined 4%, from 299 in April to 287 in May. In 2025, outages decreased from 531 in April to 516 in May, a 3% drop. The consistent slight decline in U.S. outages over the past two years, in contrast to the significant seasonal increase observed in 2023, may point to reduced maintenance activity that has altered traditional spring outage patterns.
