The Internet Report

Analyzing X’s Livestream & GitHub, Google Outages

By Mike Hicks | 18 min read

Summary

Explore the recent Google Cloud and GitHub outages, plus insights on the August 12 X livestream event featuring Elon Musk and Donald Trump.


This is The Internet Report, where we analyze recent outages and trends across the Internet, through the lens of ThousandEyes Internet and Cloud Intelligence. I’ll be here every other week, sharing the latest outage numbers and highlighting a few interesting outages. As always, you can read the full analysis below or tune in to the podcast for firsthand commentary.


Internet Outages & Trends

When a service or application appears unresponsive, it's easy to conclude that its problems are the result of unexpected traffic patterns. While that may be true in some cases, the atypical traffic patterns are more often simply an effect rather than a cause.

Identifying the signals associated with a service's or system's condition, and recognizing when expected signals are absent, played a part in understanding three events in the past fortnight.

In the case of Google Cloud, a power issue at a point of presence in one of its European regions disrupted connectivity into the region, impacting a number of services and pieces of networking equipment, as well as some Partner Interconnect connections and associated routes between other Google regions.

When delayed access to a broadcast on X's Spaces service was attributed to a network load-related issue, the lack of a specific signal pattern suggested a different technical explanation.

And finally, GitHub.com experienced a brief but complete outage when a database configuration change caused some critical services to lose connectivity unexpectedly.

Read on to learn more about these events, or jump ahead to the section that most interests you.


Google Cloud Outage

On August 12, a Google Cloud outage occurred at a European Google point of presence (PoP) after a substation switchgear failure took down both the primary and backup power feeds. The outage impacted a significant portion of the Google Front Ends (GFEs) in the europe-west2 region and affected the availability of Cloud CDN, Cloud Load Balancing, Hybrid Connectivity, Virtual Private Cloud (VPC), and associated services that leverage them. According to a status update, the outage also impacted services like Gmail, Google Calendar, Google Chat, Google Docs, Google Drive, and Google Meet.

The issue was first observed around 1:20 PM (UTC), with elevated packet loss seen within nodes located in Google's European region. As a result, some users within this region experienced intermittent timeouts when attempting to connect to these services.


Explore this outage further in the ThousandEyes platform (no login required).

Figure 1. Outage in Google’s network impacting customers and partners

The major disruption to service accessibility appeared to last from 1:20 PM (UTC) to around 1:50 PM (UTC), when some availability was restored. However, this partial recovery was followed by elevated latency for a portion of users.

Figure 2. Service availability restored for some Google services, followed by a period of increased latency

As the outage continued, traffic was rerouted to reach services hosted in the impacted region, which likely resulted in the increased latency we observed. Google's edge network in London went offline, temporarily removing the Internet routes advertised by Google from networks connected to Google's network. Consequently, these routes were automatically replaced by alternative routes that did not rely on the affected networking equipment.

Figure 3. Outage in Google’s U.K. network affecting traffic path
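To illustrate the automatic failover described above, here is a minimal, hypothetical sketch (in Python) of how best-path selection might shift once a set of advertised routes is withdrawn. The prefix, next hops, and preference values are invented for illustration and are not Google's actual routing policy.

```python
# Minimal sketch of best-path re-selection after a route withdrawal.
# The prefix, next hops, and preference values are all hypothetical.

routes = {
    "203.0.113.0/24": [
        {"next_hop": "london-pop", "as_path_len": 2, "pref": 200},
        {"next_hop": "frankfurt-pop", "as_path_len": 4, "pref": 100},
    ]
}

def best_path(candidates):
    """Pick the highest-preference, shortest-AS-path route still advertised."""
    live = [r for r in candidates if not r.get("withdrawn")]
    if not live:
        return None  # no path left: the destination becomes unreachable
    return max(live, key=lambda r: (r["pref"], -r["as_path_len"]))

prefix = "203.0.113.0/24"
print("before withdrawal:", best_path(routes[prefix])["next_hop"])  # london-pop

# The PoP loses power and its routes are withdrawn; traffic automatically
# shifts to the remaining, longer path, at the cost of higher latency.
routes[prefix][0]["withdrawn"] = True
print("after withdrawal: ", best_path(routes[prefix])["next_hop"])  # frankfurt-pop
```

The longer alternative path keeps the destination reachable, which is consistent with the elevated latency seen in Figure 2.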

Google acknowledged that the issue was within its europe-west2 region. The biggest impact appeared to affect users with workloads or services within that region, but it also seemed to have some impact on global connectivity. Loss was observed on a number of Google intra-cloud paths, although it did not appear to impact multi-cloud connectivity.

Google's production backbone is a global network that enables connectivity for all user-facing traffic via points of presence (PoPs) or Internet exchanges. A failure of a component or system on a path from the europe-west2 region in Google’s production backbone could lead to a decrease in available network bandwidth and suboptimal routing.

Figure 4. Loss observed in Google intra-region paths

There was also an impact on customers using some Partner Interconnect connections in London, which resulted in connectivity loss to downstream Google Cloud services.

Figure 5. Customers using Partner Interconnect impacted by Google outage

A post-incident report attributed the outage to a loss of both primary and backup power feeds at a Google PoP “due to a substation switchgear failure.” The PoP “hosts about one-third of serving first-layer Google Front Ends located in europe-west2 and some distributed networking equipment for that region,” the report stated, adding that the power outage also “caused Internet routes advertised by Google to be withdrawn in networks connected to Google’s network.”

In general, when architecting for resilient cloud operations, distributing resources across multiple zones and regions helps lower the risk of a single infrastructure outage affecting all resources at the same time. In many cases, the key to resilience is to have cost-effective duplicate services or alternative paths, combined with independent visibility across the entire digital delivery chain, so that problems can be identified and located at the earliest stages, before they cause larger issues.

Operating at scale introduces many complexities; however, it's not always possible to protect against every combination of issues when planning for possible scenarios. The biggest challenges often come from unknown dependencies and unforeseeable edge cases.

To maintain end-to-end service delivery, it's crucial to understand all the components and dependencies in your services so that you can troubleshoot efficiently when something unexpected does go wrong. This knowledge not only speeds up fault domain identification, but also helps in determining which adjacent services, users, and areas will be affected, allowing you to take appropriate mitigation steps to maintain business continuity when an incident occurs.
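As a simple illustration of the redundancy-plus-visibility principle above, the sketch below shows the basic pattern of preferring a primary region but failing over when health checks fail. The region names, endpoints, and health-check logic are hypothetical and not tied to any particular provider's API.

```python
import urllib.request

# Hypothetical per-region endpoints for the same service; in practice these
# would be independent deployments behind separate load balancers.
REGION_ENDPOINTS = {
    "europe-west2": "https://eu-west.service.example.com/healthz",
    "europe-west1": "https://eu.service.example.com/healthz",
    "us-east1": "https://us-east.service.example.com/healthz",
}

def is_healthy(url, timeout=2):
    """Very rough health check: any HTTP 200 within the timeout counts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def pick_region(preferred_order):
    """Return the first region whose endpoint passes the health check."""
    for region in preferred_order:
        if is_healthy(REGION_ENDPOINTS[region]):
            return region
    return None  # every region failed; page a human

# Prefer the closest region, but fall back if it (or its power feed) is down.
active = pick_region(["europe-west2", "europe-west1", "us-east1"])
print("routing traffic to:", active)
```

In a real deployment the health checks would run continuously, and from multiple independent vantage points, rather than inline at request time.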

GitHub Outage

On August 14, between 11:02 PM (UTC) and 11:38 PM (UTC), GitHub.com experienced a complete service outage. The cause of this disruption was an incorrect configuration change that affected the traffic routing within its database infrastructure, “resulting in critical services losing connectivity unexpectedly.”

The change appeared to be deployed to all GitHub.com databases, which affected the databases’ ability to respond to health check pings from the routing service. Consequently, “the routing service was unable to identify healthy databases to redirect application traffic to,” causing widespread disruption on GitHub.com.

When the change was rolled back, database connectivity was restored.
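To make the failure mode concrete, here is a minimal, hypothetical sketch of the pattern GitHub described: a routing layer that forwards application traffic only to databases that answer health checks. The class and data are invented; the point is that when a bad change stops every replica from looking healthy, the router has nowhere left to send traffic, even though the network itself is fine.

```python
import random

class DatabaseRouter:
    """Toy routing layer: forwards queries only to replicas that pass health checks."""

    def __init__(self, replicas):
        self.replicas = replicas  # {name: responds_to_health_checks (bool)}

    def healthy_replicas(self):
        return [name for name, ok in self.replicas.items() if ok]

    def route(self, query):
        healthy = self.healthy_replicas()
        if not healthy:
            # Roughly the state GitHub described: no replica looks healthy,
            # so every request fails even though the network is reachable.
            raise RuntimeError("503: no healthy databases to route to")
        return f"sending {query!r} to {random.choice(healthy)}"

router = DatabaseRouter({"db-primary": True, "db-replica-1": True})
print(router.route("SELECT 1"))

# A config change deployed to *all* databases breaks their health-check
# responses at once, taking the whole service down rather than one replica.
router.replicas = {name: False for name in router.replicas}
try:
    router.route("SELECT 1")
except RuntimeError as err:
    print(err)
```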


Explore this outage further in the ThousandEyes platform (no login required).

Figure 6. The outage impacted the United Kingdom as well as other regions

During the outage, the issue appeared as a 503 "service unavailable" error, and ThousandEyes observed unavailability at the HTTP server layer. ThousandEyes also observed a significant drop in page load times, suggesting that pages loaded faster because not all components or data could be retrieved. This was likely caused by the inability to fetch information and elements from GitHub's backend infrastructure, which aligns with GitHub's acknowledgment of a database issue. It also suggests that the network itself was not the problem: reachability to GitHub's environment was good, as indicated by the server-side 503 error code being returned rather than a simple timeout.

Figure 7. ThousandEyes observed 503 “service unavailable” errors during the outage
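The distinction between a server-returned 503 and a plain timeout is straightforward to check from a vantage point. Below is a minimal sketch, using only Python's standard library, of how a probe might separate the two cases; the URL and timeout are placeholders rather than the actual ThousandEyes test configuration.

```python
import socket
import urllib.error
import urllib.request

def classify(url, timeout=5):
    """Rough triage: is the failure at the network layer or the application layer?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK ({resp.status})"
    except urllib.error.HTTPError as err:
        # The server was reachable and answered; a 503 here points at the
        # application/backend, not the network path.
        return f"application-layer error (HTTP {err.code})"
    except (urllib.error.URLError, socket.timeout) as err:
        # No HTTP response at all: DNS failure, connection refused, or timeout,
        # which is more consistent with a network or reachability problem.
        return f"network-layer problem ({err})"

print(classify("https://github.com"))
```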

It is highly improbable that a distributed application would lose all of its network connectivity at once, so the GitHub outage seems more likely linked to a single point of failure resulting from the configuration change that GitHub identified.

Figure 8. The GitHub outage impacted regions around the globe

X Spaces Insights

On August 12, a livestream discussion between Elon Musk and former President Donald Trump hosted on X’s audio streaming platform Spaces kicked off 40 minutes late, with error messages displayed to people trying to tune in at the scheduled start time.

The problems were attributed to “a massive DDoS [distributed denial-of-service attack] on X.” While we cannot definitively state the underlying cause of the event, ThousandEyes did not observe traffic conditions typically present during a DDoS attack, such as network congestion, packet loss, and elevated latency.

About DDoS Attacks

For context, a DDoS is an attack launched from multiple connected devices with the intention of making a service inaccessible or unavailable. This is usually accomplished by causing the entire site or application to become inoperable, rather than targeting a single site resource or specific user.

There are three main DDoS attack types: volumetric attacks, protocol attacks, and application attacks.

  • Volumetric attacks: These attacks use a huge amount of traffic to saturate the target's bandwidth, which can completely block access to the website or service. Volumetric attacks are not application specific; they simply target any site or application capable of accepting requests.

  • Protocol attacks: This type of attack renders a target inaccessible by exploiting a weakness in the Layer 3 and Layer 4 protocol stack, consuming all the processing capacity of the target or critical resources like firewalls.

  • Application attacks: Application layer DDoS attacks differ from volumetric attacks in that they target the application layer itself, exploiting specific vulnerabilities. For example, an attacker may overload an application by inundating it with compute-intensive GET or POST requests, disrupting the delivery of content to users. Essentially, application layer attacks aim to compromise specific applications.

A NoneXistent Name Server Attack (NXNSAttack) has characteristics of both volumetric and application attacks. The NXNSAttack specifically targets recursive DNS resolvers, which are part of the name lookup service, and can crash a resolver, rendering it unable to respond to requests. As a result, user DNS requests go unanswered, the target destination cannot be resolved, and no part of the website will load.

If a company experiences a sudden spike in malicious traffic from attackers, it may be necessary to take action to mitigate the attack. Possible strategies can include rerouting, blackholing, sinkholing, or scrubbing traffic. Regardless of the specific approach, each of these mitigation techniques will reset and disrupt the connection for legitimate users. This means they will have to reconnect and re-authenticate to access the service.

What ThousandEyes Observed During the Spaces Broadcast

X Spaces is an audio-only streaming service operating as a resource within the x.com domain. Throughout the entirety of the Musk-Trump stream, there were no significant issues or variations in performance on the x.com domain itself or other associated services. ThousandEyes didn’t observe any signals normally associated with a DDoS attack, such as a high loss rate, DNS failures, and so on.

Additionally, the Space itself was reachable and appeared to be serving content, as evidenced by the music that was played before the broadcast began. From ThousandEyes’ observations, there were also no resets or drops that would typically be associated with DDoS mitigation.

Figure 9. x.com services did not experience any issues during the broadcast
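As a rough illustration of the kinds of signals being referred to, the snippet below checks three of them from a single vantage point: packet loss (via the system ping command), name resolution, and whether the front end answers over HTTP. It is a hypothetical sketch, not the ThousandEyes test configuration, and assumes Linux or macOS ping syntax.

```python
import socket
import subprocess
import urllib.request

TARGET = "x.com"  # domain under investigation

def packet_loss(host, count=10):
    """Send ICMP echoes via the system ping and report the percentage lost."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    received = out.count("bytes from")  # one reply line per echo received
    return 100 * (count - received) / count

def dns_resolves(host):
    """Can the local resolver still answer for this name?"""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        return False

def http_answers(host):
    """Does the front end return a non-5xx response within a few seconds?"""
    try:
        with urllib.request.urlopen(f"https://{host}", timeout=5) as resp:
            return resp.status < 500
    except Exception:
        return False

print(f"packet loss: {packet_loss(TARGET):.0f}%")
print(f"DNS resolves: {dns_resolves(TARGET)}")
print(f"HTTP answers: {http_answers(TARGET)}")
# Heavy loss, failed name resolution, or an unresponsive front end would be
# the kind of signal expected during a volumetric or DNS-focused attack; none
# of these signals were observed during the broadcast.
```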

Regarding the Space stream used for the broadcast, there were some intermittent increases in page load and response time, indicative of congestion within the application. However, there were no issues with completion, as the stream appeared consistent, with no resets or drops in transmission observed and no obvious loss rate.

Figure 10. During the event, ThousandEyes observed a variable increase in page load and response time for the specific X Space used for the broadcast, but no service interruptions or significant loss rate were observed

During the broadcast, there were no signs of network issues impacting the reachability of the specific space used for the event.

Figure 11. Network connectivity to the specific X Space resource showed no signs of network congestion

Name resolution was also consistently responsive during the event. Authoritative DNS servers for x.com were up and available.

Figure 12. No DNS issues were observed during the broadcast

All in all, while we can’t definitively state the underlying cause of the issue, the ThousandEyes team didn’t see conditions typically present during a DDoS attack.


By the Numbers

Let’s close by taking a look at some of the global trends ThousandEyes observed across ISPs, cloud service provider networks, collaboration app networks, and edge networks over the past two weeks (August 5-18):

  • Outages increased throughout the August 5-18 period, rising 11% in the first week, from 183 to 204. This upward trend continued into the following week, with outages increasing from 204 to 211 between August 12-18, a 3% increase compared to the previous week (see the quick calculation after this list).

  • The United States did not follow this pattern. An initial increase was observed in the first week of this period (August 5-11), with outage numbers rising 32%. However, the following week saw a decrease, with outages dropping 23% from August 12-18.

  • Despite U.S. outages decreasing in the second week of this period, U.S.-centric outages still comprised over 40% of all global outages. From August 5 to August 18, 44% of network outages occurred in the United States, compared to 35% in the previous fortnight (July 22 to August 4). This indicates a return to the long-term trend observed for the majority of 2024, where U.S.-centric outages accounted for at least 40% of all observed outages.
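
For reference, the week-over-week percentages above are simple relative changes against the prior week's outage count, for example:

```python
def pct_change(prev, curr):
    """Week-over-week change as a percentage of the previous week's count."""
    return round(100 * (curr - prev) / prev)

print(pct_change(183, 204))  # 11 -> the ~11% rise from August 5-11
print(pct_change(204, 211))  # 3  -> the ~3% rise from August 12-18
```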

Figure 13. Global and U.S. network outage trends over the past eight weeks (June 24 through August 18, 2024)

