ThousandEyes actively monitors the reachability and performance of thousands of services and networks across the global Internet, which we use to analyze outages and other incidents. The following analysis is based on our extensive monitoring, as well as ThousandEyes’ global outage detection service, Internet Insights. Read on to see how the outage unfolded.
Outage Analysis
Updated on June 17, 2025
On June 12, Google Cloud experienced an outage that impacted applications that use Google Cloud services, including Spotify and Fitbit, among many others. First observed around 18:00 UTC, the outage lasted over 2.5 hours, and was mostly resolved by 20:40 UTC. Google Cloud reported that the problem “stemmed from an incorrect change to [its] API endpoints, which caused a crash loop and affected [its] global infrastructure, impacting all services.”
What happened during the Google Cloud outage?
At around 18:00 UTC, ThousandEyes started detecting performance degradation and availability issues affecting some applications that rely on Google Cloud services. Symptoms included HTTP server errors, timeouts, and elevated response times, suggesting the problems stemmed from the underlying Google services, rather than network issues.
Google Cloud confirmed that it experienced issues due to an invalid automated update to its API management system. This problem affected Google’s Identity and Access Management (IAM) functionality, essentially hindering its ability to authorize requests and making it difficult to determine the actions that authenticated users and services could perform. As a result, there was a significant ripple effect, particularly impacting services relying on Google Cloud, which were unable to obtain proper authorization.
What led up to this outage?
On May 29, 2025, Google rolled out a new code implementation intended to enhance Service Control's functionality by introducing "additional quota policy checks." However, this particular piece of code remained dormant during the initial rollout phase, as it required a specific policy modification to be activated and tested effectively.
On June 12, 2025, a critical event triggered the dormant code when a policy change was made to the regional Spanner tables that Service Control depends on. Spanner, which is Google’s globally distributed database system, is engineered to replicate data in real time across multiple data centers worldwide.
This policy change inadvertently introduced “unintended blank fields” into the Spanner tables—fields that were either empty or not properly initialized. When Service Control attempted to execute the new code in conjunction with these altered policies, it encountered a significant issue: a code path that led to a null pointer dereference. In other words, the code tried to use data that didn’t exist, causing it to crash—like following directions to a location that doesn’t exist. This error resulted in a crash loop, causing services to repeatedly fail and restart, ultimately disrupting the system's functionality.
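Google has not published the code involved, but the failure mode is easy to picture. The following minimal Go sketch (hypothetical types and names, not Google's actual implementation) shows how a policy row with a blank optional field can crash an unguarded code path:

```go
package main

import "fmt"

// QuotaPolicy is a hypothetical stand-in for a Service Control policy row
// read from Spanner. Limits is optional and may be nil if the row was
// written with a "blank" field.
type QuotaPolicy struct {
	Name   string
	Limits *QuotaLimits
}

type QuotaLimits struct {
	RequestsPerMinute int
}

// checkQuota mirrors the unguarded code path: it assumes Limits is always
// populated and dereferences it directly.
func checkQuota(p *QuotaPolicy) int {
	return p.Limits.RequestsPerMinute // panics when Limits is nil
}

func main() {
	// A policy row replicated with a blank Limits field.
	p := &QuotaPolicy{Name: "projects/example/quotaPolicy"}

	// This dereference panics with a nil pointer error. In a serving binary
	// that restarts after a crash and then re-reads the same corrupted row,
	// the result is the crash loop described above.
	fmt.Println(checkQuota(p))
}
```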
Where did the failure occur?
Google identified a failure in its distributed API management and control plane infrastructure. Its Service Control system, which integrates Identity and Access Management (IAM) functions as core components in every API request processing pipeline, is responsible for authorization, policy enforcement, and quota management for all API requests. However, it became non-functional when corrupted policy data reached an unguarded code path that could not handle null values.
Google’s API management and control planes act as distributed IAM enforcement points, with Service Control serving as the authorization engine that validates each API request. These control planes handle the complete IAM workflow, which includes authenticating requests, retrieving access policies, enforcing permissions, and applying quota controls before redirecting requests to their respective endpoints.
The failure was traced back to corruption in the IAM Policy Store. Corrupted policy configuration data was globally replicated through the Spanner tables that support the API management control planes. These control planes depend on real-time policy data to make IAM decisions. When blank fields appeared in the policy metadata, the authorization decision-making process across all regional API gateways broke down.
Additionally, there was a control plane authorization cascade. When the IAM components within the API management control planes encountered null policy data, they crashed instead of handling the error gracefully. This led to a global failure where the control planes were unable to authorize any API requests, regardless of the specific service being accessed.
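By way of contrast, here is a sketch of what handling the error gracefully could look like: the same hypothetical policy check as before, but with the blank field detected and surfaced as a recoverable error rather than a crash.

```go
package main

import "fmt"

// Same hypothetical policy shape as in the earlier sketch; Limits may be nil.
type QuotaPolicy struct {
	Name   string
	Limits *QuotaLimits
}

type QuotaLimits struct {
	RequestsPerMinute int
}

// checkQuotaSafe treats a nil Limits field as an explicit, recoverable error
// instead of dereferencing it, so the process keeps serving other requests
// while the bad policy row is logged and investigated.
func checkQuotaSafe(p *QuotaPolicy) (int, error) {
	if p == nil || p.Limits == nil {
		return 0, fmt.Errorf("quota policy has blank limits field")
	}
	return p.Limits.RequestsPerMinute, nil
}

func main() {
	if _, err := checkQuotaSafe(&QuotaPolicy{Name: "projects/example/quotaPolicy"}); err != nil {
		fmt.Println("handled gracefully:", err) // no crash loop
	}
}
```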
What is Identity and Access Management, and why does it matter?
Identity and Access Management (IAM) governs which identities (users and services) can access which resources and what actions they are allowed to perform. Backend services require authentication to access resources such as storage and databases, even for "public" endpoints.
During the June 12 outage, a failure in the IAM functionality disrupted the service-to-service authentication mechanism. As a result, services were unable to authenticate with their Google Cloud dependencies. This failure prevented them from retrieving the necessary data to serve the accounts URL, leading to timeouts and "service unavailable" messages. In some cases, ThousandEyes also observed 401 errors because the service could not verify permissions to access the required backend resources.
Although the service-to-service authentication mechanism experienced issues, the root cause of the Google outage was ultimately an authorization issue (i.e., what you are allowed to do), rather than an authentication (i.e., who you are) issue. Google's systems could still verify users’ identities, but they couldn't determine what those authenticated users were permitted to access or do because the policy system was corrupted.
Notably, users were not forced to re-authenticate or log in again. Instead, ThousandEyes observed persistent authorization failures for active requests. These failures occurred because services could not verify permissions as a result of the policy corruption issue. For example, there were issues verifying user permissions for Spotify as well as service-to-service permissions for Cloudflare.
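The distinction can be made concrete with a small Go sketch of a request handler (the helper functions are illustrative placeholders, not Google's APIs): authentication succeeds, so no login prompt appears, but the authorization lookup fails and the request is denied.

```go
package main

import (
	"errors"
	"net/http"
)

// verifyIdentityToken stands in for the authentication step (who are you?).
// During the outage this step kept working.
func verifyIdentityToken(token string) (string, error) {
	if token == "" {
		return "", errors.New("missing credentials")
	}
	return "user@example.com", nil
}

// lookupPermissions stands in for the authorization step (what are you
// allowed to do?). This is the step that failed once the policy data was
// corrupted: it either errored out or reported no permissions.
func lookupPermissions(user, path string) (bool, error) {
	return false, errors.New("policy store returned blank policy")
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
	user, err := verifyIdentityToken(r.Header.Get("Authorization"))
	if err != nil {
		w.WriteHeader(http.StatusUnauthorized) // identity not established
		return
	}

	allowed, err := lookupPermissions(user, r.URL.Path)
	if err != nil || !allowed {
		// Identity is known, but permissions cannot be verified; services
		// surfaced this as 401 or 403 depending on their conventions.
		w.WriteHeader(http.StatusForbidden)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handleRequest)
	http.ListenAndServe(":8080", nil)
}
```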
How did the outage manifest for users?
The outage affected several services, and ThousandEyes observed a variety of error conditions and status codes that varied based on the location of the request and the point in the authorization chain where the corrupted policy issue was encountered.
The geographic location determined which regional Service Control instance processed the request. Some instances had crashed, others were processing corrupted data, and a few remained temporarily healthy. The specific point in the authorization chain where the corruption was encountered also influenced the type of error returned. Requests that never reached policy evaluation resulted in timeouts, while those that entered the policy processing logic but crashed mid-execution returned 500 errors. Instances that successfully processed blank policy fields but misinterpreted them as denials returned 401 or 403 errors, and requests that hit overwhelmed instances resulted in 503 errors.
This combination of geographic distribution and authorization pipeline stage created a matrix of failure modes where the same logical request could produce completely different error responses depending on these two factors. This confirmed that the authorization policy corruption was manifesting differently across Google's global infrastructure as corrupted data propagated at varying rates and was handled inconsistently by Service Control instances in different states of failure and recovery.
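The observed matrix can be summarized as a simple mapping from failure point to client-visible outcome. The sketch below is purely illustrative of the pattern ThousandEyes observed, not of Google's code.

```go
package main

import "fmt"

// failureStage labels where in the authorization pipeline a request ran
// into the corrupted policy data.
type failureStage string

// observedOutcome maps each stage to the client-visible result described
// above; "timeout" means no HTTP response was returned at all.
func observedOutcome(s failureStage) string {
	switch s {
	case "instance crash-looping, request never evaluated":
		return "timeout"
	case "crashed mid-way through policy evaluation":
		return "HTTP 500"
	case "blank policy fields read as an explicit denial":
		return "HTTP 401 or 403"
	case "instance overwhelmed by policy processing failures":
		return "HTTP 503"
	}
	return "HTTP 200"
}

func main() {
	for _, s := range []failureStage{
		"instance crash-looping, request never evaluated",
		"crashed mid-way through policy evaluation",
		"blank policy fields read as an explicit denial",
		"instance overwhelmed by policy processing failures",
	} {
		fmt.Printf("%-52s -> %s\n", s, observedOutcome(s))
	}
}
```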

When observing Fitbit, the tests showed a mix of timeouts and 503 errors from identical requests sent from various regions worldwide, while some areas were receiving 200 OK responses. This pattern indicates regional failures in the Service Control system, where Google's globally distributed authorization infrastructure was displaying uneven propagation and recovery of corrupted data across different geographic locations.
The regions that received 200 OK responses were likely interacting with Service Control instances that had not yet been impacted by the corrupted policy data replication, had successfully processed the corrected policy updates, or were temporarily bypassing the failed quota checks. In contrast, the regions experiencing timeouts and 503 errors appeared to be hitting Service Control instances either overwhelmed by policy processing failures (resulting in 503 "Service Unavailable" errors) or trapped in crash loops caused by null pointer exceptions (leading to timeouts). A key issue was that Service Control instances crashed whenever they encountered null pointer exceptions caused by corrupted policy data, entering repeated crash loops. When a critical dependency fails in this way, 503 errors are an expected response.
Furthermore, many services will throw 503 errors if they cannot access their backend dependencies, such as Google Cloud APIs, indicating they are operational but unable to provide services due to the unavailability of those dependencies. Similarly, load balancers will respond with 503 errors when backend services become inaccessible due to authorization issues, rather than allowing requests to hang indefinitely.
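A minimal sketch of this pattern, assuming a hypothetical dependency call: the handler bounds the call to its backend dependency with a timeout and returns 503 rather than letting the request hang.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"time"
)

// callBackendDependency stands in for a call to a backend API the service
// depends on (for example, a cloud authorization or storage API).
func callBackendDependency(ctx context.Context) error {
	return errors.New("dependency unavailable") // simulated outage
}

func handler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
	defer cancel()

	if err := callBackendDependency(ctx); err != nil {
		// The service itself is running, but it cannot do useful work,
		// so it fails fast with 503 instead of hanging indefinitely.
		http.Error(w, "backend dependency unavailable", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```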
This geographic inconsistency in failure patterns for identical requests confirms that the root cause of the issue was the corruption of the authorization policies, rather than a problem with the network or load balancer. If it were an infrastructure-related issue, we would expect to see consistent failures across the globe. Instead, the selective regional failures demonstrate that the corrupted policy data impacted various Service Control deployments at different times and in different ways as it spread through Google's global Spanner replication system, substantiating that the problem lay specifically within the distributed policy evaluation system.

Looking at Spotify tests, ThousandEyes observed a consistent pattern of 401 responses throughout the entire outage, which is quite revealing. This strongly suggests that the issue was due to a failure in the authorization policy rather than a general system outage. Authentication was functioning normally, as there were no login screens appearing; however, authorization consistently failed. Service Control successfully received the authenticated requests but was unable to determine permissions because of corrupted policy data.
The repetitive 401 responses indicated that requests were reaching Service Control instances, which were struggling with blank or null policy fields. These fields were interpreted as "the user has no permissions for anything," resulting in immediate and predictable denials of access. This steady stream of 401 responses reinforced that the problem stemmed from corrupted policy data affecting authorization decisions rather than a broader infrastructure failure, and it ruled out problems with the backend network or load balancer: a networking or routing problem would not have produced consistent authorization denials. Instead, the steady denials indicated that the request pipeline was functioning end to end, with the exception of the flawed policy processing logic itself.
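A blank policy producing consistent denials is easy to illustrate: if a permission lookup treats a nil or empty permission list as "no permissions," every request is denied immediately and predictably. The sketch below is illustrative only and does not reflect Google's actual policy format.

```go
package main

import "fmt"

// isAllowed treats a nil or empty permission list the same way: no match,
// so no permission. A blank policy field therefore produces immediate,
// predictable denials rather than errors or timeouts.
func isAllowed(perms []string, action string) bool {
	for _, p := range perms {
		if p == action {
			return true
		}
	}
	return false
}

func main() {
	var blankPolicy []string // a "blank" permission list
	fmt.Println(isAllowed(blankPolicy, "spotify.playlists.read")) // always false
}
```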

During the outage, we also observed occurrences where the same logical request triggered a variety of error responses across different systems, often in no particular order. We observed transitions from timeouts to HTTP 403 errors, and then sometimes back to timeouts or even to HTTP 500 errors. This erratic behavior likely stemmed from the service infrastructure, which spanned multiple regions, load balancers, and service instances that were failing independently.

On occasion, it seemed that the Service Control instance was completely unresponsive, caught in a crash loop caused by a null pointer exception, resulting in timeouts where requests appeared to disappear. Another instance appeared to process requests but crashed mid-authorization due to corrupted policy data, returning an HTTP 500 error, effectively saying, "I tried to help but broke." Meanwhile, a third instance appeared to successfully process requests with blank policy fields but interpreted them as explicit denials, generating an HTTP 403 error, effectively saying, "I processed your request, and you're definitely not allowed."
The random cycling of errors points to an authorization issue rather than a problem with load balancers or backend routing. If it were a routing issue, we would expect to see consistent patterns based on which backends were healthy or unhealthy. Instead, the erratic cycling among timeouts, HTTP 500, and HTTP 403 errors shows that the core authorization system itself was broken. Each Service Control instance failed to process the same corrupted policy data in a different way (some crashed, others threw exceptions, and some misinterpreted blank fields as denials), regardless of which instance handled the request.
This unpredictable failure pattern across all instances points to the root cause lying within the authorization policy processing logic itself, not in the infrastructure routing requests to those instances. If it were a routing issue, we would see predictable patterns, with some backends consistently performing well while others consistently faltered, rather than the same unpredictable failures across all instances as each attempted to process the corrupted policy data.
A “Lights Dimming” Effect
From ThousandEyes’ analysis of the impacts experienced by various services during the Google Cloud outage, we see the presence of a "lights dimming" effect rather than a “lights on, lights off” outage. In other words, the issues propagated across various regions and services, rather than suddenly impacting everyone at the same time.
This “lights dimming” effect makes sense given that the issues originated in the IAM services. Different services experienced the impacts at varying times, depending on their specific verification patterns and timing. Additionally, the types of errors observed were influenced by how each individual service handles IAM failures. For example, ThousandEyes observed timeouts for services prone to timeouts (i.e., those with longer retry logic or deeper dependency chains), 503 Service Unavailable errors for services with good circuit breaker patterns that fail gracefully, and 401 errors for services that fail fast when authorization breaks.
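These three behaviors roughly correspond to three client-side handling styles, sketched below with hypothetical names: deep retry loops that eventually time out, a circuit breaker that fails fast with 503, and a fail-fast path that surfaces 401 immediately.

```go
package main

import "fmt"

// dependencyDown simulates the broken IAM dependency.
func dependencyDown() error { return fmt.Errorf("authorization backend unavailable") }

// retryDeeply keeps retrying a failing dependency; with enough attempts and
// a deep dependency chain, the caller gives up only when its own deadline
// expires, which the end user sees as a timeout.
func retryDeeply(attempts int) string {
	for i := 0; i < attempts; i++ {
		if dependencyDown() == nil {
			return "200 OK"
		}
	}
	return "timeout"
}

// circuitBreaker fails gracefully: once the failure threshold is reached,
// it stops calling the dependency and returns 503 immediately.
type circuitBreaker struct{ failures, threshold int }

func (cb *circuitBreaker) call() string {
	if cb.failures >= cb.threshold {
		return "503 Service Unavailable"
	}
	if err := dependencyDown(); err != nil {
		cb.failures++
		return "503 Service Unavailable"
	}
	cb.failures = 0
	return "200 OK"
}

// failFast surfaces the IAM failure to the client immediately as 401.
func failFast() string {
	if err := dependencyDown(); err != nil {
		return "401 Unauthorized"
	}
	return "200 OK"
}

func main() {
	fmt.Println(retryDeeply(3))
	cb := &circuitBreaker{threshold: 3}
	fmt.Println(cb.call())
	fmt.Println(failFast())
}
```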
The diverse errors and “lights dimming” behavior provide further evidence that IAM issues sparked the outage. If all services had instead failed simultaneously with the same error code, that would have pointed to a different issue source.
When did the Google Cloud outage end?
The outage appeared to be mostly resolved by 20:40 UTC. Google reported that they had “identified the root cause and applied appropriate mitigations.” They noted that “the underlying dependency” had recovered in all locations except the us-central1 region, where the quota policy database was overloaded, leading to a much longer recovery time. Google also acknowledged that “several products had moderate residual impact (e.g., backlogs) for up to an hour after the primary issue was mitigated and a small number were still recovering after that.” The incident officially ended at 20:49 UTC, according to Google.
To guard against the issue happening again, Google is taking steps to prevent its “API management platform from failing due to invalid or corrupt data,” and to keep metadata from propagating globally without proper protection, testing, and monitoring. They also plan to “improve system error handling and comprehensive testing for handling of invalid data.”
Lessons Learned
The Google Cloud outage serves as a reminder that, when diagnosing outages, it’s important for IT operations teams to efficiently assess the scope of the impact by identifying any common factors related to location, services, or networks. This assessment can help teams discover possible dependencies on third-party services that might be the source of the outage. For faster diagnosis, it’s also key to understand error conditions and their origins across the various components in the service delivery chain, such as CDNs and backend APIs. Taking these steps will empower teams to resolve issues more quickly and improve resilience. Additionally, organizations should maintain an understanding of where critical services are deployed and communicate infrastructure dependencies openly to users during incidents.
Want more insights like this? Tune in to The Internet Report podcast.
Previous Updates
[June 12, 2025, 2:00 PM PT]
As of 20:40 UTC, the Google Cloud incident is mostly resolved. ThousandEyes data indicates multiple services were impacted by the June 12 GCP outage, including Spotify and Fitbit, among others. Follow the links for an interactive view of these incidents in the ThousandEyes platform—no login required.

[June 12, 2025, 1:00 PM PT]
At around 18:00 UTC, ThousandEyes began detecting performance degradation and availability issues impacting some applications that use Google Cloud services. The incident does not appear to be network related. Instead, some Google Cloud customers are experiencing HTTP server errors, timeouts, and elevated response times, suggesting problems with the underlying Google services they rely on.