On August 9, 2022, at approximately 01:15 UTC (6:15 PM PDT on August 8), widely used Google services, such as Google Search and Google Maps, became unavailable to users around the world. Attempts to reach these services resulted in error messages from Google’s edge servers, including HTTP 500 and 502 server responses that generally indicate internal server or application issues.
The outage lasted nearly 60 minutes, during which users were either completely unable to load sites such as www.google.com, or could load certain Google sites but could not successfully execute functions such as search.
Beginning at approximately 01:15 UTC on August 9, 2022, Google experienced a widespread outage that affected many of its globally distributed locations and impacted a variety of Google sites and services. Google Search was among the first services that ThousandEyes observed to be impacted. Within approximately 15 minutes of the start of the event, ThousandEyes observed impacts to Google Maps, and the service subsequently became unavailable in many locations.
As shown in figure 2 below, ThousandEyes Internet Insights: Application Outages observed the progressive onset of the incident, culminating with both of these Google services being unavailable across many global locations.
ThousandEyes vantage points located in multiple countries captured the incident as it unfolded. In the next image, you can see how the issue spread to other vantage points around the world after only a few minutes.
The full global impact could best be seen in the application layer, with ThousandEyes vantage points seeing increasing numbers of web connection attempt failures. When looking at the corresponding collected network layer metrics from these same vantage points, no packet loss was observed on the paths to the Google servers; this further highlights this incident as an application layer problem.
Instead, during the outage, Google web servers responded with HTTP 500 Internal Server Error messages, which indicate an error on the server side that prevents it from completing the request.
At around 01:35 UTC, some ThousandEyes vantage points began experiencing connection timeout errors, meaning that the server did not respond within an acceptable amount of time. Other vantage points received HTTP 502 status code responses, indicating that the web server received an invalid response from an internal server dependency while attempting to fulfill the request.
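The reasoning above (no packet loss on the path, but HTTP 500s, HTTP 502s, and timeouts at the server) can be sketched as a small classifier. This is an illustrative sketch only, not how ThousandEyes implements its analysis: the function name `classify_probe`, its parameters, and the category strings are all hypothetical, and treating any packet loss as a network-layer signal is a simplifying assumption.

```python
from typing import Optional


def classify_probe(status: Optional[int], timed_out: bool, packet_loss_pct: float) -> str:
    """Map a single (hypothetical) probe result to a rough failure category."""
    # Assumption: any packet loss on the path points to the network layer first.
    if packet_loss_pct > 0.0:
        return "network: packet loss on path"
    # The server was reachable but did not answer within the allowed time.
    if timed_out:
        return "application: server did not respond in time"
    if status is None:
        return "unknown: no HTTP response recorded"
    # HTTP 500: the server hit an error that prevented it from completing the request.
    if status == 500:
        return "application: internal server error"
    # HTTP 502: the server received an invalid response from an upstream dependency.
    if status == 502:
        return "application: bad response from upstream dependency"
    if 500 <= status <= 599:
        return "application: other server error"
    if 200 <= status <= 399:
        return "ok"
    return "client or other error"
```

Applied to the responses described in this incident, a 500 or 502 with zero packet loss lands in an application category, which matches the conclusion that the network paths to Google were healthy.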
Approximately 35 minutes after the onset of the incident, services appeared to be restored for many global users; however, a second brief outage at around 02:00 UTC again disrupted service availability, with Google servers returning identical errors for impacted services.
Google services were eventually fully restored by 02:10 UTC for global users (with the exception of China, whose connectivity failures with Google are unrelated to this incident) as seen in figure 10.
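As a quick check on the timeline, the approximate timestamps reported in this analysis can be run through a few lines of date arithmetic. The variable names are illustrative, the times are the approximate values quoted above, and PDT is modeled as a fixed UTC-7 offset.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
PDT = timezone(timedelta(hours=-7), "PDT")  # fixed offset, assumed for August 2022

# Approximate times taken from the analysis above.
onset = datetime(2022, 8, 9, 1, 15, tzinfo=UTC)
restored = datetime(2022, 8, 9, 2, 10, tzinfo=UTC)

# Total time from first impact to full restoration, in whole minutes.
duration_minutes = int((restored - onset).total_seconds() // 60)

# The same onset expressed in Pacific Daylight Time.
onset_local = onset.astimezone(PDT)

print(duration_minutes, onset_local.isoformat())
```

The result, 55 minutes, is consistent with the "nearly 60 minutes" figure, and the onset falls on the evening of August 8 in Pacific time.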
Google eventually issued a statement acknowledging the outage and identifying the root cause as a software update gone wrong.
Conclusions and Recommendations
This event brought forth two interesting aspects that warrant consideration for IT professionals. First, it highlights the fact that even the most stable of services, such as Google Search, a service for which we rarely experience issues or hear of outages, are still subject to the same forces that can bring down any complex digital system. Second, the event revealed how ubiquitous some software systems can be: they are woven through the many digital services we consume on a daily basis, yet we are often unaware of these software dependencies.
The importance of end-to-end monitoring cannot be overstated. Having independent, data-based information is critical, especially in the absence of timely acknowledgements or updates by the provider. For those systems that rarely go down, seeing a dashboard full of green is a good thing; and when these applications or their dependencies become unavailable, knowing the when, where, and why is essential for timely problem isolation and remediation.
The second component of this is the need to monitor your application dependencies. The criticality of Google Search and its use as a software function by many other Internet applications, such as Google Maps and Google Images, means that when this function stops working, so do many other applications. Being able to measure the performance of not just your critical application service front ends, but also of any known and measurable dependencies of that digital ecosystem, is necessary to ensure an outstanding user experience.
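To make the dependency point concrete, the sketch below propagates a failure through a simple dependency map to find everything that would be affected. It is a minimal illustration under stated assumptions: the function `impacted_services` and the example map are hypothetical, and the map is not a description of Google's actual architecture.

```python
from typing import Dict, List, Set


def impacted_services(depends_on: Dict[str, List[str]], failed: Set[str]) -> Set[str]:
    """Return the failed services plus every service that depends on them,
    directly or transitively."""
    impacted = set(failed)
    changed = True
    # Repeat until a full pass adds no new service to the impacted set.
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in impacted and any(d in impacted for d in deps):
                impacted.add(service)
                changed = True
    return impacted


# Hypothetical dependency map, for illustration only.
example_map = {
    "maps": ["search"],
    "images": ["search"],
    "storefront": ["maps"],
    "mail": [],
}
print(sorted(impacted_services(example_map, {"search"})))
```

In this example, a failure in "search" also marks "maps" and "images" as impacted, and "storefront" through its dependency on "maps", while "mail" is unaffected. Keeping such a map alongside your monitoring tests makes it easier to decide which dependencies deserve their own checks.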